Greedy algorithms for NLP such as transition-based parsing are prone to error propagation. One way to overcome this problem is to allow the algorithm to backtrack and explore an alternative solution in cases where new evidence contradicts the solution explored so far. In order to implement such a behavior, we use reinforcement learning and let the algorithm backtrack in cases where such an action gets a better reward than continuing to explore the current solution. We test this idea on both POS tagging and dependency parsing and show that backtracking is an effective means to fight against error propagation.

Transition-based parsing has become a major approach in dependency parsing, since the work of Yamada and Matsumoto (2003) and Nivre et al. (2004), for it combines linear time complexity and high linguistic performances. The algorithm follows a local and greedy approach to parsing that consists of selecting at every step of the parsing process the action that maximizes a local score, typically computed by a classifier. The action selected is greedily applied to the current configuration of the parser and yields a new configuration.

At training time, an oracle function transforms the correct syntactic tree of a sentence into a sequence of correct (configuration, action) pairs. These pairs are used to train the classifier of the parser. The configurations that do not pertain to the set of correct configurations are never seen during training.

At inference time, if the parser predicts and executes an incorrect action, it produces an incorrect configuration, with respect to the sentence being parsed, which might have never been seen during training, yielding a poor prediction of the next action to perform. Additionally, the parser follows a single hypothesis by greedily selecting the best scoring action. The solution built by the parser can be sub-optimal for there is no guarantee that the sum of the scores of the actions selected maximizes the global score.

These are well-known problems of transition-based parsing, and several solutions have been proposed in the literature to overcome them; they are briefly discussed in Section 2. The solution we propose in this article consists of allowing the parser to backtrack. At every step of the parsing process, the parser has the opportunity to undo its n previous actions to explore alternative solutions. The decision whether to backtrack is made each time a new word is considered, before trying to process it, by giving the current configuration to a binary classifier that assigns a score to the backtracking action. Traditional supervised learning is not suited to learning such a score, since the training data contains no occurrences of backtrack actions. In order to learn in which situations a backtrack action is worthwhile, we use reinforcement learning. During training, the parser has the opportunity to try backtracking actions and the training algorithm responds to this choice by granting it a reward. If the backtracking action is the adequate move to make in the current configuration, it receives a positive reward, and the parser thereby learns in which situations backtracking is adequate.

The work presented here is part of a more ambitious project which aims at modeling eye movements during human reading. More precisely, we are interested in predicting regressive saccades: eye movements that bring the gaze back to a previous location in the sentence.

There is much debate in the psycholinguistic literature concerning the reasons for such eye movements (Lopopolo et al., 2019). Our position with respect to this debate is the one advocated by Rayner and Sereno (1994), for whom part of these saccades are linguistically motivated and happen in situations where the reader’s incremental comprehension of the sentence is misguided by an ambiguous sentence start, until a new word is reached whose integration proves incompatible with the current understanding and triggers a regressive saccade, as in garden path sentences. Our long-term project is to model regressive saccades with backtracking actions. Although this work is part of this long-term project, the focus of this article is on the nlp aspects of this program, and we propose a way to implement backtracking in the framework of transition-based parsing. We only mention some preliminary studies on garden path sentences in the Conclusion.

In order to move in the direction of a more cognitively plausible model, we add two constraints to our model.

The first one concerns the text window around the current word that the parser takes into account when predicting an action. This window can be seen as an approximation of the sliding window introduced by McConkie and Rayner (1975) to model the perceptual span of a human reader.1 Transition-based parsers usually allow taking into account the right context of the current word. The words in the right context constitute a rich source of information and yield better predictions of the next action to perform.

In our model, the parser does not have access to this right context, simulating the fact that a human reader has only limited access to the right context (a few characters to the right of the fixation point (McConkie and Rayner, 1976)). It is only after backtracking that a right context is available, since it was uncovered before the backtrack took place.

The second constraint is incrementality. When performing several tasks, such as pos tagging and parsing, as will be done in the tagparser described in Section 3, these tasks are organized in an incremental fashion. At each step, a word is read, pos tagged, and integrated in the syntactic structure, which is a more cognitively plausible behavior than a sequential approach where the whole sentence is first pos tagged and then parsed.

The structure of the paper is the following: In Section 2, we compare our work to other approaches in transition-based parsing which aim at proposing solutions to the two problems mentioned above. In Section 3, we describe the model that we use to predict the linguistic annotation and introduce the notion of a back action. In Section 4 we show how backtracking actions can be predicted using reinforcement learning. Section 5 describes the experimental part of the work and discusses the results obtained. Section 6 concludes the paper and presents different directions in which this work will be extended.

Several ways to overcome the two limits of transition-based parsing mentioned in the Introduction, namely, training the parser with only correct examples and exploring only a single hypothesis at inference time, have been explored in the literature.

##### Beam Search

The standard solution to the single hypothesis search is beam search, which allows considering a fixed number of solutions in parallel during parsing. Beam search is a general technique that has been applied to transition-based parsing in many studies, including Zhang and Clark (2008), Huang and Sagae (2010), and Zhang and Nivre (2012). They show that exploring a fixed number of solutions increases the linguistic performances over a single hypothesis parser. In this work, we do not use a beam search algorithm, but we do explore several solutions in a non-parallel fashion, using backtracking.

##### Dynamic Oracles

A first way to overcome the lack of exploration during training is the proposal by Goldberg and Nivre (2012) to replace the standard oracle of transition-based parsing by a dynamic oracle that is able to determine the optimal action a to perform for an incorrect configuration c. During training, the dynamic oracle explores a larger part of the configuration space than the static oracle and produces for an incorrect configuration c an optimal action a. The pair (c,a) is given as a training example to the classifier, yielding a more robust classifier that is able to predict the optimal action in some incorrect configurations. Ballesteros et al. (2016) show that the principle of the dynamic oracle can be adapted to train the greedy Stack-LSTM dependency parser of Dyer et al. (2015), improving its performance.

Yu et al. (2018) describe a method for training a small and efficient neural network model that approximates a dynamic oracle for any transition system. Their model is trained using reinforcement learning.

In this paper, we use dynamic oracles in two different ways. First, as baselines for models allowing some exploration during training. Second, in the definition of the immediate reward function, as explained in Section 4.

##### Reinforcement Learning

A second answer to the lack of exploration problem is reinforcement learning. The exploitation–exploration trade-off of reinforcement learning allows, at training time, the model to explore some incorrect configurations and to learn a policy that selects in such cases the action that maximizes the long-term reward of choosing this action. Reinforcement learning is well suited for transition-based algorithms.

To the best of our knowledge, two papers directly address the idea of training a syntactic transition-based parser with reinforcement learning: Zhang and Chan (2009) and Lê and Fokkens (2017).

Zhang and Chan (2009) cast the problem, as we do, as a Markov Decision Process, but use a Restricted Boltzmann Machine to compute the scores of actions and the SARSA algorithm to compute an optimal policy. In our case, we use deep Q-learning based on a multilayer perceptron, as described in Section 4. In addition, their immediate reward function is based on the number of arcs in the gold tree of a sentence that are absent from the tree being built. We also use an immediate reward function, but it is based on the number of arcs in the gold tree that can no longer be built given the current tree being built, an idea introduced in the dynamic oracle of Goldberg and Nivre (2012).

Two major differences distinguish our approach and the work of Lê and Fokkens (2017). The first is the idea of pre-training a parser in a supervised way and then fine-tuning its parameters using reinforcement learning. In our case, the parser is trained from scratch using reinforcement learning. The reason for this difference is related to our desire to learn how to backtrack: It is difficult to make the parser learn to backtrack when it has been initially trained not to do so (using standard supervised learning). The second major difference is the use of a global reward function, which is computed after the parser has parsed a sentence. In our case, as mentioned above, we use an immediate reward. The reason for this difference is linked to pretraining. Since we do not pre-train our parser, and allow it to backtrack, granting a reward at the end of the sentence does not allow the parser to converge since reaching the end of the sentence is almost impossible using a non pre-trained parser. Other less fundamental differences distinguish our approach, such as the use of Q-learning, in our case, to find an optimal policy, where they use a novel algorithm called Approximate Learning Gradient, based on the fact that their model’s output is a probability distribution over parsing actions (as in standard supervised learning). Another minor difference is the exploration of the search space in the training phase. They sample the next action to perform using the probability distribution computed by their model, while we use an adaptation of ε-greedy that we describe in Section 4.

Reinforcement learning has also been used to train a transition-based model for other tasks. Naseem et al. (2019) present a method to fine-tune the Stack-LSTM transition-based AMR parser of Ballesteros and Al-Onaizan (2017) with reinforcement learning, using the Smatch score of the predicted graph as reward. For semantic dependency parsing, Kurita and Søgaard (2019) found that fine-tuning their parser with a policy gradient allows it to develop an easy-first strategy, reducing error propagation.

The fundamental difference between all papers cited above and our work is the idea of adding backtracking to a greedy transition-based model. We use reinforcement learning as a means to achieve the exploration necessary for this goal. In terms of parsing performance, our method fares lower than the state of the art in transition-based parsing because our parser is constrained not to see words beyond the current word. This constraint comes from the fact that our long-term goal is not to improve parsing performance but to find a natural way to encourage a parser to simulate the regressive saccades observed during human reading.

Our model is an extension of the Reading Machine, a general model for nlp proposed in Dary and Nasr (2021) that generalizes transition-based parsing to other nlp tasks. A Reading Machine is a finite automaton whose states correspond to linguistic levels. There can be, for example, one state for pos tagging, one state for lemmatization, one state for syntactic parsing, and so on. When the machine is in a given state, an action is predicted, which generally writes on an output tape a label corresponding to the prediction just made.2 There are usually as many output tapes as there are levels of linguistic predictions and at least one input tape that usually contains the words of the sentence to process.3 Predictions are realized by classifiers that take as input the configuration of the machine and compute a probability distribution over the possible actions. Configurations are complex objects that describe the aspects of the machine that are useful in order to predict the next action to perform. Among the important elements of a configuration for the rest of the paper, we can cite the current state of the machine, the word index (noted as wi) that is the position of the word currently processed, and the history, a list of all actions performed so far.

The text is read word by word: a window of arbitrary size centered on wi defines the part of the text that can be used to make predictions on the current word.

Figure 1 shows the architecture of three simple machines. The two machines in the top part of the figure realize a single task. The machine on the left part realizes pos tagging. It is made of a single state and has a series of transitions that loop on the state. The machine has one input tape from which words are read and one output tape, on which predicted pos are written. Each transition is labeled with a tagging action of the form pos(p) that simply writes the pos tag p on the pos output tape at the word index position.

Figure 1:

Three simple reading machines. The top left machine performs pos tagging; the top right one performs unlabeled dependency parsing; and the bottom one performs the two tasks simultaneously.

The machine on the right implements an arc-eager transition-based parser that produces unlabeled dependency trees. It has the same simple structure as the tagging machine. Its transitions are labeled with the four standard arc-eager actions of unlabeled transition-based parsing: left, reduce, shift, and right. The machine has two input tapes, one for words and one for pos tags, and one output tape on which it writes the index of the governor of the current word when left or right actions are predicted.
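
As a rough illustration of the four actions named above, the following sketch applies unlabeled arc-eager transitions to a stack and a word index. The `Config` class and `apply_action` function are hypothetical names introduced for this example, preconditions on the actions are omitted, and the code is not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Simplified parser configuration (illustrative assumption)."""
    n: int                                     # number of words in the sentence
    wi: int = 0                                # index of the current word
    stack: list = field(default_factory=list)  # indices of stored words
    gov: dict = field(default_factory=dict)    # dependent index -> governor index


def apply_action(c: Config, action: str) -> Config:
    """Apply one unlabeled arc-eager action (preconditions are not checked)."""
    if action == "shift":        # store the current word and move to the next one
        c.stack.append(c.wi)
        c.wi += 1
    elif action == "right":      # the stack top governs the current word
        c.gov[c.wi] = c.stack[-1]
        c.stack.append(c.wi)
        c.wi += 1
    elif action == "left":       # the current word governs the stack top
        c.gov[c.stack.pop()] = c.wi
    elif action == "reduce":     # discard the stack top
        c.stack.pop()
    else:
        raise ValueError(f"unknown action: {action}")
    return c


# Three-word example in which each word is governed by the word to its right.
c = Config(n=3)
for a in ["shift", "left", "shift", "left", "shift"]:
    apply_action(c, a)
```

Only shift and right advance the word index, which matches the description of the tagparser transitions given below.
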

The machine on the bottom part of the figure, which we call a tagparser, realizes the two tasks simultaneously in an incremental fashion. When in state pos, the machine tags the current word then control jumps to the parser in order to attach the current word to the syntactic structure built so far or to store it in a stack. Once this is done, control returns to state pos to tag the next word. The reason why the transitions labeled right and shift lead to state pos is that these two actions increase the word index wi and initiate the processing of a new word. The machine has one input tape for words and two output tapes, one for pos tags and one for the governor position.
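
The control flow of the tagparser described above can be sketched as a small transition function. The function name `next_state` and the action strings are assumptions made for this illustration; the state names pos and syn follow the text.

```python
def next_state(state: str, action: str) -> str:
    """Return the state reached after `action` in the (non-backtracking) tagparser."""
    if state == "pos":
        return "syn"   # after tagging, control jumps to the parser
    if action in ("right", "shift"):
        return "pos"   # these actions increase wi, so the next word must be tagged
    return "syn"       # left and reduce keep working on the same word
```
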

We augment the machines described above in order to allow them to undo some preceding actions. This ability relies on three elements: (a) the definition of a new action, called back, that undoes a certain number of actions; (b) the history of the actions performed so far in order to decide which actions to undo; and (c) the definition of undoing an action.

Undoing an action amounts to recovering the configuration that existed before the action was performed. This is quite straightforward for tagging and parsing actions.4

Three backtracking machines, based on the machines of Figure 1, are shown in Figure 2. They all have an extra state, named back, and two extra transitions. When in state back, the machine predicts one of the two actions back or ¬back. When action ¬back is selected, control jumps either to state pos or syn, depending on the machine, and the machine behaves like the simple machines of Figure 1. Action ¬back does not modify the configuration of the machine. This is not the only possible architecture for backtracking machines; this point is briefly addressed in the Conclusion.

Figure 2:

Three backtracking machines based on the machines of Figure 1.

If action back is selected, the last actions of the history are undone until a ¬back action is reached. This definition of action back allows undoing all actions that are related to the previous word. After back has been applied, the configuration of the machine is almost the same as the configuration it was in before processing the current word. There is, however, a major difference: It now has access to the following word. Otherwise, the machine would deterministically predict the action it had predicted before. One can notice that the transition labeled back in the machine loops on a single state; this feature allows the machine to perform several successive back actions.
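
A minimal sketch of this undo mechanism is given below for a tagger. It assumes that the history stores a snapshot of the configuration taken before each action and that the ¬back action reached is itself removed from the history, so that the machine is back in state back for the previous word; both points are assumptions of this illustration, as are the names `Machine`, `tag`, and `NO_BACK`. The extra access to the following word after a back action is not modeled.

```python
import copy

NO_BACK = "noback"  # stands for the ¬back action


class Machine:
    """Toy tagging machine with a history that supports undoing actions."""

    def __init__(self):
        self.config = {"wi": 0, "pos": {}}
        self.history = []  # list of (action, configuration before the action)

    def do(self, action, update=None):
        """Record the action with a snapshot, then apply `update` if given."""
        self.history.append((action, copy.deepcopy(self.config)))
        if update is not None:
            update(self.config)

    def back(self):
        """Undo the latest actions, up to and including the last ¬back."""
        while self.history:
            action, before = self.history.pop()
            self.config = before
            if action == NO_BACK:
                break


def tag(p):
    """Build the update performed by the tagging action pos(p)."""
    def update(config):
        config["pos"][config["wi"]] = p
        config["wi"] += 1
    return update


m = Machine()
m.do(NO_BACK)
m.do("pos(det)", tag("det"))
m.do(NO_BACK)
m.do("pos(adj)", tag("adj"))
m.back()  # undoes pos(adj) and the preceding ¬back
```

Storing full snapshots keeps the sketch short; an implementation could instead define an inverse for each action.
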

Figure 3 shows how the tagparsing machine of Figure 2 would ideally process the sentence the old man the boat, a classical garden path sentence for which two words (old and man) should be re-analyzed after the noun phrase the boat has been read. The figure describes the machine configuration each time it is in state back. Three tapes are represented: the input tape that contains tokens, the pos tape, and the parsing tape that contains the index of the governor of the current word. The figure also represents, at the bottom, the ba array, which is described below, as well as the sequence of actions predicted since the last visit to state back. The current word appears in boldface. The figure shows two successive occurrences of a back action, after the second determiner is read, leading to the re-analysis of the word old that was tagged adj and the word man that was tagged noun.

Figure 3:

Processing the garden path sentence the old man the boat with a backtracking tagparser. After reading the second determiner, the machine backtracks in order to reanalyze the words old and man.

A backtracking machine such as the one described above can run into infinite loops: nothing prevents it from endlessly repeating the same sequence of actions. One can hope that, during training, such behavior leads to poor performance and is ruled out, but there is no guarantee that this will be the case. Furthermore, we would like to prevent the machine from exploring, at inference time, the whole (or a large part of the) configuration space. In order to do so, we introduce a constraint on the number of times a back action is taken when parsing a sentence. A simple way to introduce such a constraint is to limit to a given constant k the number of authorized back actions per word. This feature is implemented by introducing an array ba of size n, where n is the number of words of the sentence to process. Array ba is initialized with zeros, and every time a back action is predicted at position wi, the value of ba[wi] is incremented. When the machine is in state back and ba[wi] is equal to k, performing a back action is not permitted: a ¬back action is forced, bypassing the classifier that decides whether to backtrack or not.
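
The constraint can be sketched as follows. The function name `decide_back` and the boolean `classifier_wants_back` are hypothetical; the latter stands for the decision of the classifier, which is not modeled here.

```python
def decide_back(ba, wi, k, classifier_wants_back):
    """Return True if a back action is taken at word index wi.

    ba[wi] counts the back actions already predicted at position wi.
    When this count has reached k, ¬back is forced and the classifier is bypassed.
    """
    if ba[wi] >= k:
        return False           # forced ¬back
    if classifier_wants_back:
        ba[wi] += 1            # one more back action at this position
        return True
    return False


ba = [0] * 5                   # one counter per word, initialized with zeros
first = decide_back(ba, 3, 1, classifier_wants_back=True)
second = decide_back(ba, 3, 1, classifier_wants_back=True)
```

With k = 1, the first request at position 3 is granted and the second one is refused.
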

The introduction of array ba and the parameter k defines an upper bound on the size of the action sequence for a sentence of length n. This upper bound is equal to 3nk + 2n for the tagger, 4nk + 3n for the parser, and 5nk + 4n for the tagparser.5 As one can notice, linearity is preserved. In our experiments, we chose k = 1.
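
As a worked example of these bounds, take the five-word sentence of Figure 3 (n = 5) with k = 1:
$3nk + 2n = 3 \cdot 5 \cdot 1 + 2 \cdot 5 = 25$ (tagger)
$4nk + 3n = 4 \cdot 5 \cdot 1 + 3 \cdot 5 = 35$ (parser)
$5nk + 4n = 5 \cdot 5 \cdot 1 + 4 \cdot 5 = 45$ (tagparser)
Doubling n to 10 doubles each bound (50, 70, and 90), which illustrates the linearity in n for a fixed k.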

Reading Machines, as introduced by Dary and Nasr (2021), are trained in a supervised learning fashion. Given data annotated at several linguistic levels, an oracle function decomposes it into a sequence of configurations and actions $(c_0, a_0, c_1, a_1, \ldots, c_n, a_n)$. This sequence of configurations and actions constitutes the training data of the classifiers of the machine to train: Pairs $(c_i, a_i)$ are presented iteratively to the classifier during the training stage. A backtracking Reading Machine cannot be trained this way since there are no occurrences of back actions in the data. In order to learn useful occurrences of such actions, the training process should have the ability to generate some back actions and to measure whether these actions were beneficial.

In order to implement such a behavior, we use reinforcement learning (rl). We cast our problem as a Markov Decision Process (mdp). In an mdp, an agent (the parser) is in configuration $c_t$ at time t. It selects an action $a_t$ from an action set (made of the tagging actions, the parsing actions and the back and ¬back actions) and performs it. We note C the set of all configurations and A the set of actions. The environment (the annotated data) responds to action $a_t$ by giving a reward $r_t = r(c_t, a_t)$ and by producing the succeeding configuration $c_{t+1} = \delta(c_t, a_t)$. In our case, configuration $c_{t+1}$ is fully determined by the structure of the machine. The reward function gives high reward to actions that move the parser towards the correct parse of the sentence. The fundamental difference with supervised training is that, during training, the agent is not explicitly told which actions to take, but instead must discover which action yields the most reward through trial and error. This feature gives the opportunity for the parser to try some back actions, provided that a reward can be computed for such actions.

Given an mdp that indicates the reward associated to applying action a in configuration c, the goal is to learn a function q*(c,a) that maximizes the total amount of reward (also called discounted return) that can be expected after action a has been applied. Once this function (or an approximation of it) is computed, one can use it to select which action to choose when in configuration c, by simply picking the action that maximizes function q*. Such behavior is called an optimal policy in the rl literature. Function q* is the solution to the Bellman optimality equation.6
$q^*(c,a) = r(c,a) + \gamma \max_{a'} q^*(\delta(c,a), a')$
where γ ∈ [0, 1] is the discount factor that allows discounting the reward of future actions. The equation expresses the relationship between the value of an action a in configuration c and the value of the best action a′ that can be performed in its successor configuration δ(c,a). This recursive definition is the basis for algorithms that iteratively approximate q*, among which is Q-learning (Watkins, 1989), which approximates q* with a function called Q. In Q-learning, during training, each time an action a is chosen in configuration c, the value of Q(c,a) is updated:
$Q(c,a) \leftarrow Q(c,a) + \alpha \left(Q'(c,a) - Q(c,a)\right)$
where α is a learning rate and Q′(c,a) is a new estimation of Q(c,a):
$Q'(c,a) = r(c,a) + \gamma \max_{a'} Q(\delta(c,a), a')$

It has been proven (Watkins and Dayan, 1992) that this iterative algorithm converges towards the q* function.
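
The update rule above can be sketched in its tabular form as follows. The function name `q_update`, the dictionary-based table, and the default values of the learning rate and discount factor are assumptions made for this example; terminal configurations are not handled.

```python
from collections import defaultdict


def q_update(Q, c, a, r, c_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for action a taken in configuration c.

    Q maps (configuration, action) pairs to values, r is the reward r(c, a),
    and c_next is the successor configuration delta(c, a).
    """
    new_estimate = r + gamma * max(Q[(c_next, b)] for b in actions)
    Q[(c, a)] += alpha * (new_estimate - Q[(c, a)])
    return Q[(c, a)]


Q = defaultdict(float)        # unseen pairs start at 0
Q[("c1", "x")] = 1.0
value = q_update(Q, "c0", "x", r=-1.0, c_next="c1",
                 actions=["x", "y"], alpha=0.5, gamma=0.5)
```

Here the new estimate is −1 + 0.5 · 1 = −0.5, so Q("c0", "x") moves halfway from 0 to −0.5.
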

In order to store the values of Q, the algorithm uses a large table that has an entry for every (configuration, action) pair. In our case, there are far too many configurations to allocate such a table. Instead we use a simple form of deep Q-learning (Mnih et al., 2013) and approximate function Q using a multilayer perceptron $C_Q$.

$C_Q$ takes as input a configuration c and outputs a vector whose dimension is the number of different actions in the action set. The component of the vector corresponding to action a is the approximation of Q(c,a) computed by $C_Q$. It is noted $C_Q(c,a)$.

During training, every time an action a is performed by the parser in configuration c, the parameters of $C_Q$ are updated by gradient descent on a loss function. The loss function is defined so as to minimize the difference between the current value $C_Q(c,a)$ and its updated value $C_Q'(c,a)$. This difference is computed with the smooth $l_1$ loss (Girshick, 2015):
$L(c,a) = l_1\left(C_Q(c,a), C_Q'(c,a)\right)$
where $C_Q'(c,a)$ is computed as follows:
$C_Q'(c,a) = r(c,a) + \gamma \max_{a'} C_Q(\delta(c,a), a')$
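
The loss computation can be sketched with plain Python numbers as follows. The function names are hypothetical, the network is replaced by its already computed outputs, and the smooth l1 loss uses the usual threshold of 1; gradient computation and parameter updates are left out.

```python
def smooth_l1(x, y):
    """Smooth l1 loss: quadratic for small differences, linear for large ones."""
    d = abs(x - y)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def updated_value(r, next_values, gamma):
    """Reward plus discounted value of the best action in the successor configuration."""
    return r + gamma * max(next_values)


def loss(current_value, r, next_values, gamma=0.9):
    """Loss between the current estimate for (c, a) and its updated value."""
    return smooth_l1(current_value, updated_value(r, next_values, gamma))


# Current estimate 0.0, reward -1, successor values 0.5 and 1.0, gamma 0.5.
example = loss(0.0, -1.0, [0.5, 1.0], gamma=0.5)
```

In this example the updated value is −1 + 0.5 · 1 = −0.5, the difference is 0.5, and the loss lies in the quadratic part of the function.
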
##### Reward Functions

In rl, the training process is guided by the reward r(c,a) granted by the environment when action a is performed in configuration c. Defining a reward function for tagging and parsing actions is quite straightforward.

In the case of tagging, the reward should be high when the tag chosen for the current word is correct and low when it is not. A simple reward function is one that associates, for example, value 0 in the first case and −1 in the second. More elaborate reward functions could be defined that penalize some confusions more heavily (e.g., tagging a verb as a preposition).

In the case of parsing, a straightforward reward function will give a reward of zero for a correct action and a negative reward for an incorrect one. We use a slightly more complex function inspired by the dynamic oracle of Goldberg and Nivre (2012). This function, in the case of an incorrect action a, counts the number of dependencies of the correct analysis that can no longer be predicted due to a. The reward for a is the opposite of this number. Actions that cannot be executed, such as popping an empty stack, are granted a reward of −1.5.

Defining the reward function for back actions is more challenging. When a back action is performed, a certain number of actions $a_i, \ldots, a_{i+k}$ are undone. Each of these actions was granted a reward $r_i, \ldots, r_{i+k}$. Let’s call $E = -\sum_{t=0}^{k} r_{i+t}$ the opposite of the sum of these rewards (E ≥ 0). The larger E is, the more errors have been made. Let’s call φ(E) the function that computes the reward for executing a back action, given E. Formally, we want φ to respect the following principles:

1. Don’t execute a back action if there are no errors: φ(0) < 0.

2. φ(E) should be increasing with respect to E: the more errors, the more a back action should be encouraged.

3. φ(E) should not grow too fast with respect to E. Granting too much reward to back actions encourages the system to make errors in order to correct them with a highly rewarded back action.

A function that fulfills these principles is the following:
$\varphi(E) = \begin{cases} -1 & \text{if } E = 0 \\ \ln(E+1) & \text{otherwise} \end{cases}$

It is the function that we use to compute the reward of a back action.
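
The reward of a back action can be computed as in the sketch below. The function names `phi` and `back_reward` are introduced for this example; `undone_rewards` stands for the rewards that were granted to the actions being undone.

```python
import math


def phi(E):
    """Reward of a back action given E, the opposite of the sum of undone rewards."""
    if E < 0:
        raise ValueError("E is expected to be non-negative")
    return -1.0 if E == 0 else math.log(E + 1)


def back_reward(undone_rewards):
    """Compute E from the rewards of the undone actions, then apply phi."""
    E = -sum(undone_rewards)
    return phi(E)


no_error = back_reward([0, 0, 0])     # nothing to repair: the back action is penalized
some_errors = back_reward([0, -1, -2])  # E = 3: the back action is rewarded
```

The logarithm keeps the reward increasing with E while limiting its growth, in line with the three principles above.
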

##### Exploring the Configuration Space

In order to learn useful back actions, the parser should try, in the training phase, to perform some back actions and update the classifier based on the reward it receives from the environment. One standard way to do that is to adopt an ε-greedy policy in which the model selects a random action with probability ε or the most probable action as predicted by the classifier with probability 1 − ε. This setup allows the system to perform exploitation (choosing the best action so far) as well as exploration (randomly choosing an action). We adopt a variant of this policy, based on two parameters ε and β with 0 ≤ ε ≤ 1, 0 ≤ β ≤ 1, and ε + β ≤ 1. As in the standard ε-greedy policy, the agent chooses a random action with probability ε. In addition, it chooses the correct action as given by the oracle7 with probability β, and, finally, it chooses the most probable action as predicted by the classifier with probability 1 − (ε + β). Parameter β has been introduced in order to speed up training. In the beginning of the training process, the system is encouraged to follow the oracle. Then, as training progresses, the system relies more on its own predictions (exploitation increases) and less on the oracle. Figure 4 shows the evolution of these parameters.
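
The selection rule can be sketched as follows. The function name `choose_action`, the dictionary of scores, and the `rng` parameter are assumptions of this example; the schedule of ε and β shown in Figure 4 is not reproduced.

```python
import random


def choose_action(actions, oracle_action, scores, eps, beta, rng=random):
    """Pick the next action during training.

    With probability eps a random action is chosen, with probability beta the
    oracle action, and otherwise the action with the highest predicted score.
    """
    u = rng.random()                      # uniform draw in [0, 1)
    if u < eps:
        return rng.choice(actions)        # exploration
    if u < eps + beta:
        return oracle_action              # follow the oracle
    return max(actions, key=lambda a: scores[a])  # exploitation


actions = ["back", "noback"]
scores = {"back": -0.3, "noback": 0.2}
follow_oracle = choose_action(actions, "back", scores, eps=0.0, beta=1.0)
follow_model = choose_action(actions, "back", scores, eps=0.0, beta=0.0)
```
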

Figure 4:

Probabilities of choosing the next action during RL: at random, following the oracle, or the model.

Three machines were used in our experiments: a tagger, a parser and a tagparser, based on the architectures of Figure 2. Each of the three machines has been trained in three different learning regimes: Supervised Learning (sl) using a dynamic oracle, Reinforcement Learning without back actions (rl), and Reinforcement Learning with back actions (rlb).

### 5.1 Universal Dependencies Corpus

Our primary experiments were conducted on a French Universal Dependencies corpus (Zeman et al., 2021), more specifically the GSD corpus, consisting of 16,341 sentences and 400,399 words. The original split of the data was 88% train, 9% dev, and 3% test. Because the test set was too small to obtain significant results, we decided to use a k-fold strategy, where all the data was first merged, randomly shuffled, and then split into ten folds, each fold becoming the test set of a new train/dev/test split of the data in the following proportions: 80%/10%/10%.
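
The fold construction can be sketched as follows. The text does not say how the dev set of each split is chosen; taking the fold that follows the test fold is an assumption of this example, as are the function name `make_splits` and the fixed random seed.

```python
import random


def make_splits(sentences, n_folds=10, seed=0):
    """Merge, shuffle, and cut the data into folds; build one split per fold.

    Each split is a (train, dev, test) triple in which one fold is the test set,
    the next fold is the dev set (assumption), and the rest is the train set.
    """
    data = list(sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    splits = []
    for t in range(n_folds):
        d = (t + 1) % n_folds
        train = [s for i, fold in enumerate(folds) if i not in (t, d) for s in fold]
        splits.append((train, folds[d], folds[t]))
    return splits


splits = make_splits(range(100))
```

With 100 items and ten folds, each split contains 80 training items, 10 dev items, and 10 test items.
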

Using all ten folds turned out to be unnecessary to obtain significant results; we therefore decided to limit ourselves to three folds. The test set for which results are reported contains 4,902 sentences and 124,560 words.

### 5.2 Experimental Setup

Each machine consists of a single classifier, a multilayer perceptron, with a single hidden layer of size 3200. When the machine realizes several tasks, as in case of the tagparser, the classifier has one decision layer for each task. The output size of each decision layer is the number of actions of its corresponding task. A dropout of 30% is applied to the input vector and to the output of the hidden layer. The output of the hidden layer is given as input to a ReLU function. The structure of the classifier is represented in Figure 5.

Figure 5:

Structure of the classifier.

The features extracted from configurations and their encoding in the input layer of the classifier are detailed in Appendix B.

In the case of Supervised Learning, the machines are trained with a dynamic oracle. At the beginning of the training process, the machines are trained with a static oracle for two epochs. Then, every two epochs, the machines are used to decode the training corpus; for each configuration produced (which may be incorrect), the dynamic oracle selects the optimal action, and these (configuration, action) pairs are used to train the classifier.

Training the machines in the rl regime is longer than training them in the sl regime. In the first case, 200 epochs were needed and 300 in the second. This difference is probably due to the larger exploration of the configuration space.

### 5.3 Results - Performance

The results for the three machines, under the three learning regimes, are shown in Table 1. pos tagging performance is measured with accuracy and displayed in column UPOS. Dependency parsing performance is measured with the unlabeled attachment score (ratio of words that have been attached to the correct governor) and displayed in the UAS column. The p-value next to each score indicates whether the score is significantly better than the one below it (which is why the last line never has a p-value). This p-value has been estimated with a paired bootstrap resampling algorithm (Koehn, 2004) using the script (Popel et al., 2017) of the CoNLL 2018 shared task.
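
The general idea of paired bootstrap resampling can be sketched as follows. This is a simplified illustration and not the shared-task script: the function name, the 0/1 encoding of correct predictions, the resampling unit, and the number of samples are all assumptions.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_samples=1000, seed=0):
    """Estimate how often system A fails to beat system B on resampled test sets.

    correct_a[i] and correct_b[i] are 1 if the system is correct on item i, else 0.
    The same resampled indices are used for both systems (paired comparison).
    """
    rng = random.Random(seed)
    n = len(correct_a)
    not_better = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        score_a = sum(correct_a[i] for i in idx)
        score_b = sum(correct_b[i] for i in idx)
        if score_a <= score_b:
            not_better += 1
    return not_better / n_samples


always_better = paired_bootstrap([1] * 100, [0] * 100)
identical = paired_bootstrap([1, 0] * 50, [1, 0] * 50)
```

A small returned value suggests that the advantage of system A is unlikely to be an artifact of the particular test sample.
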

Table 1: Performance of the tagger, parser, and tagparser under three learning regimes on our French corpus.

| Regime | UPOS (tagger) | p val. | UAS (parser) | p val. |
| --- | --- | --- | --- | --- |
| rlb | 97.65 | 0.000 | 88.21 | 0.000 |
| rl | 96.84 | 0.000 | 86.60 | 0.037 |
| sl | 96.11 | | 86.17 | |

| Regime | UPOS (tagparser) | p val. | UAS (tagparser) | p val. |
| --- | --- | --- | --- | --- |
| rlb | 97.06 | 0.000 | 87.85 | 0.001 |
| rl | 96.73 | 0.090 | 87.12 | 0.211 |
| sl | 96.59 | | 86.94 | |

The table shows the same pattern for the three machines: The rlb regime gets higher results than the simple rl regime, which is itself better than the sl regime. Two important conclusions can be drawn from these results. The first is that the rlb regime is consistently better than sl: Backtracking machines make fewer errors than machines trained in supervised mode. This comparison alone does not tell us whether the superiority comes from reinforcement learning or from the addition of a back action. In fact, previous experiments in Zhang and Chan (2009) and Lê and Fokkens (2017) showed that reinforcement learning (without backtracking) can lead to better results than supervised learning. The second conclusion comes from the comparison of rlb and rl, which shows that most of the performance boost comes from backtracking.

The results of Table 1 also show that the tagparser gets better results than the single-task machines (the tagger and the parser) when trained with supervised learning. Note that this comparison is possible because the parser was not given gold pos tags as input, but instead the ones predicted by the tagparser. These results are in line with the work of Bohnet and Nivre (2012) and Alberti et al. (2015), which shows that joint prediction of pos tags and syntactic trees improves the performance of both. However, this is not true when the machines are trained with reinforcement learning. In this case the parser and the tagger get better results than the tagparser. One reason that could explain this difference is the size of the configuration space of the tagparser, which is an order of magnitude larger than those of the tagger or the parser. We will return to this point in the Conclusion.

### 5.4 Results - Statistics

One can gain a better understanding of the effect of the back actions performed by the three machines from the statistics displayed in Table 2. Each column of the table concerns one machine trained in rlb mode. The first line shows the total number of actions predicted while decoding the test set, the second line shows the number of errors made, and the third line shows the number of back actions predicted. Lines 4 and 5 give the precision and recall of the back actions. The precision is the ratio of back actions that were predicted after an error was made, and the recall is the ratio of errors after which a back action was predicted. These two figures measure the error detection capabilities of the back action prediction mechanism. In the case of the parser, the precision is equal to 76.86%, which means that 76.86% of the back actions were predicted after an error was made, and the recall is equal to 24.52%, which means that 24.52% of the errors provoked a back action prediction. The recall constitutes an upper bound on the proportion of errors that could be corrected. The last four lines break down the back actions predicted into four categories. C$→$C is the case where a back action was predicted after a correct action and did not change the action. E$→$E is the case where a back action was predicted after an error but the error was not corrected: either the same erroneous action or another erroneous one was predicted. E$→$C is the case where a back action was predicted after an error and corrected it, while C$→$E is the case where a correct action was replaced by an incorrect one after a back action.
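The four categories and the precision defined above can be computed from a log of back actions. The sketch below assumes a hypothetical log format, one `(correct_before, correct_after)` pair of booleans per back action; this format and the function name are not part of the original system.

```python
from collections import Counter

CATEGORIES = ("C->C", "E->E", "C->E", "E->C")


def back_action_breakdown(events):
    """Classify each back action and compute the precision of back actions.

    events -- list of (correct_before, correct_after) booleans, one per back
              action: was the undone action correct, and is the action
              predicted after backtracking correct?
    """
    counts = Counter()
    for correct_before, correct_after in events:
        label = ("C" if correct_before else "E") + "->" + ("C" if correct_after else "E")
        counts[label] += 1
    total = sum(counts.values())
    shares = {label: (counts[label] / total if total else 0.0)
              for label in CATEGORIES}
    # Precision: share of back actions that were predicted after an error.
    precision = shares["E->E"] + shares["E->C"]
    return shares, precision


# Toy log: one harmless back, one failed correction, two successful corrections.
shares, precision = back_action_breakdown(
    [(True, True), (False, False), (False, True), (False, True)]
)
```

Recall is not computed here because it also requires the number of errors that were never followed by a back action, which such a log does not contain.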

Table 2: Behavior comparison of three rlb machines.

| | PARSER | TAGGER | TAGPARSER |
| --- | --- | --- | --- |
| #Actions | 115,588 | 79,620 | 153,764 |
| #Errs | 3,597 | 1,323 | 6,249 |
| #Backs | 1,063 | 891 | 4,491 |
| bPrec | 76.86% | 68.46% | 73.48% |
| bRec | 24.52% | 46.49% | 61.72% |
| C$→$C | 18.07% | 28.65% | 56.89% |
| E$→$E | 41.39% | 26.09% | 23.46% |
| C$→$E | 5.15% | 2.79% | 2.81% |
| E$→$C | 35.39% | 42.47% | 16.83% |

Several conclusions can be drawn from these statistics.

First, backtracking corrects errors. For the three machines, there are many more cases where an error is corrected after a back action than cases where an error is introduced (E$→$C $\gg$ C$→$E). This means that the difference in scores that we observed between rl and rlb in Table 1 can indeed be attributed to backtracking.

Second, backtracking is conservative. The number of predicted back actions is quite low (around 1% of the actions for the tagger and the parser and around 3% for the tagparser) and the precision is quite high. The machines do not backtrack very often and they usually do it when errors are made. This is the kind of behavior we were aiming for. It can be modified by changing the reward function φ of the back action.
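The proportions quoted above follow directly from the #Actions and #Backs rows of Table 2. The short computation below reproduces them; the dictionary layout and function name are only a convenience for the example.

```python
# (#Actions, #Backs) per machine, copied from Table 2.
TABLE_2 = {
    "parser": (115_588, 1_063),
    "tagger": (79_620, 891),
    "tagparser": (153_764, 4_491),
}


def back_ratio_percent(num_actions, num_backs):
    """Share of predicted actions that are back actions, in percent."""
    return 100.0 * num_backs / num_actions


ratios = {machine: round(back_ratio_percent(actions, backs), 2)
          for machine, (actions, backs) in TABLE_2.items()}
```

This gives roughly 0.92% for the parser, 1.12% for the tagger, and 2.92% for the tagparser.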

Third, tagging errors are easier to correct than parsing errors. The comparison of columns 2 and 3 (parser and tagger) shows that the tagger has a higher recall than the parser; tagging errors are therefore easier to detect. This comparison also shows that E$→$C is higher for the tagger than it is for the parser; tagging errors are therefore easier to correct.

Lastly, the poor performance of the tagparser does not come from the fact that it does not backtrack. It actually does backtrack around three times as often as the parser or the tagger. But it has a hard time correcting the errors; most of the time, it reproduces errors made before. We will return to this point in the Conclusion.

### 5.5 Results on Other Languages

In order to study the behavior of backtracking on other languages, we have trained and evaluated our system on six other languages of various typological origins: Arabic (ar), Chinese (zh), English (en), German (de), Romanian (ro), and Russian (ru). The experimental setup for these languages is different from the one we have used for French: We have used the original split into train, development, and test sets, as defined in the Universal Dependencies corpora. We report in Table 3 the corpora used for each language as well as the size of the training, development, and test sets. We did not run experiments on the tagparser because it gave poor results in our experiments on French. Additionally, as a sanity check, we have rerun experiments on French data using the original split, in order to make sure that the difference in experimental conditions did not yield important differences in the results. The results of these experiments can be found in Table 4. The table indicates p-values of the difference between one system and the next best performing one. The system with the worst performance is therefore not associated with a p-value.

Table 3: Corpora used for experiments on new languages, with the size of training, development, and test sets (in tokens).

| Lang. | Corpus | Train | Dev | Test |
| --- | --- | --- | --- | --- |
| zh | GSD | 98,616 | 12,663 | 12,012 |
| en | GUM | 81,861 | 15,598 | 15,926 |
| fr | GSD | 364,349 | 36,775 | 10,298 |
| de | HDT | 2,753,627 | 319,513 | 326,250 |
| ro | RRT | 185,113 | 17,074 | 16,324 |
| ru | SynTagRus | 871,526 | 118,692 | 117,523 |

Table 4: Results for seven languages.

| Corpus | Regime | UPOS (tagger) | p val. | UAS (parser) | p val. |
| --- | --- | --- | --- | --- | --- |
| English-GUM | rlb | 94.99 | 0.00 | 79.96 | 0.00 |
| | rl | 93.53 | 0.00 | 72.97 | |
| | sl | 92.63 | | 73.12 | 0.43 |
| French-GSD | rlb | 97.53 | 0.00 | 87.70 | 0.17 |
| | rl | 96.64 | 0.10 | 86.63 | |
| | sl | 96.25 | | 86.97 | 0.33 |
| German-HDT | rlb | 97.88 | 0.00 | 93.00 | 0.00 |
| | rl | 97.29 | 0.47 | 91.26 | |
| | sl | 97.28 | | 91.31 | 0.35 |
| Romanian-RRT | rlb | 97.07 | 0.00 | 85.40 | 0.26 |
| | rl | 96.28 | 0.03 | 84.70 | |
| | sl | 95.77 | | 85.00 | 0.32 |
| Russian-SynTagRus | rlb | 98.41 | 0.00 | 86.59 | 0.00 |
| | rl | 97.93 | 0.01 | 85.25 | |
| | sl | 97.80 | | 85.27 | 0.46 |
| Chinese-GSD | rlb | 93.01 | 0.00 | 71.70 | 0.00 |
| | rl | 91.61 | 0.23 | 64.63 | |
| | sl | 91.30 | | 65.79 | 0.13 |
| Arabic | rlb | 96.43 | 0.16 | 83.81 | 0.47 |
| | rl | 96.20 | 0.03 | 83.77 | |
| | sl | 95.74 | | 83.95 | 0.38 |

The results obtained on French are lower than the results obtained using the k-fold strategy. The drop is moderate, however (0.3% for the tagger and 0.52% for the parser), and could be explained simply by the difference between the test corpora on which the systems were evaluated.

We observe more or less the same pattern for the new languages: The highest performance is reached by reinforcement learning with backtrack (rlb), for both the tagger and the parser. The second-best performing systems for tagging are usually trained with reinforcement learning, but the differences are usually non-significant. In the case of the parser, the second-best performing systems are trained in a supervised regime but, as was the case for tagging, the differences are often non-significant. Arabic behaves differently: No significant advantage was observed when using backtracking. We attribute this to the agglutinative nature of Arabic and to the tokenization conventions of UD, which split off agglutinated pronouns as separate tokens. The effect of this tokenization is to increase the distance between content words. The most common pattern that triggers a backtrack in tagging consists of going back to the previous word in order to modify its part of speech. In the case of Arabic, if the target of the backtrack has an agglutinated pronoun, the tagger has to perform two successive back actions to carry out the correction, a pattern that is more difficult to learn.

The general conclusion that we can draw is therefore that reinforcement learning with backtrack yields the best performance for both the parser and the tagger (with the exception of Arabic), but that there are no notable differences between supervised learning with a dynamic oracle and reinforcement learning without backtrack.

The statistics on the situations in which back actions are performed are displayed for four languages in Table 5. The table reveals some striking differences for two languages: German and Russian. For these languages, the ratio of back actions with respect to the total number of actions predicted is equal to 8.3% for German and to 8.2% for Russian, a figure that is far above what is observed for other languages. A closer look at the results shows that, for these two languages, the machine learns a strategy that consists of provoking errors (it tags linguistic tokens as punctuation) in order to be able to correct them using a back action. This behavior is not due to linguistic reasons but rather to the size of the training corpora. As one can see in Table 3, the training corpora for German and Russian are much larger than they are for other languages. Our hypothesis is that, when trained on a large training corpus, the machine has more opportunities to develop complex strategies such as provoking errors in order to correct them using a back action. Indeed, this phenomenon vanishes when we reduce the size of the train set. In order to fight against this behavior, one can act on the reward function and decrease the reward of the back action as the size of the training corpus increases. More investigation is needed to fully understand this behavior.

Table 5: Statistics for back actions performed during tagging for four languages.

| Lang | en | de | ru | ar |
| --- | --- | --- | --- | --- |
| #Act | 31,852 | 652,500 | 234,658 | 56,528 |
| #Errs | 1,097 | 54,173 | 19,374 | 1,093 |
| #Backs | 691 | 53,573 | 19,435 | 310 |
| bPrec | 72.94% | 95.88% | 95.64% | 58.06% |
| bRec | 45.94% | 94.83% | 95.94% | 16.47% |
| C$→$C | 24.31% | 3.77% | 4.21% | 39.35% |
| E$→$E | 27.06% | 7.35% | 5.40% | 29.03% |
| C$→$E | 2.75% | 0.35% | 0.16% | 2.58% |
| E$→$C | 45.88% | 88.53% | 90.23% | 29.03% |

It is not clear why the tagger chose to make regular mistakes (tagging as punctuation the words that are later corrected). Our hypothesis is that this kind of error is easy to detect, which makes the subsequent back action easy to predict.

The same phenomenon (intentional errors) has been observed, to a lesser degree, on the parser.

## 6 Conclusion

We have proposed, in this article, an extension to transition-based parsing that consists of allowing the parser to undo the last actions performed in order to explore alternative hypotheses. We have shown that this kind of model can be effectively trained using deep reinforcement learning.

This work will be extended in several ways.

The first concerns the disappointing results of the tagparser. As already mentioned in Section 4, many studies have shown that there is usually an advantage in jointly predicting several types of linguistic annotations over predicting them separately. We did not observe this phenomenon in our backtracking tagparser. The problem could come from the structure of the backtracking machines. The structures used, illustrated in Figure 2, are just one possible architecture; others are possible, for example dedicating different back states to the parser and the tagger.

The second direction concerns the integration of other nlp tasks in a single machine. Dary and Nasr (2021) showed that reading machines can take as input raw text and realize word segmentation, pos tagging, lemmatization, parsing, and sentence segmentation. We would like to train such a complex machine with reinforcement learning and backtracking to study whether the machine can backtrack across several linguistic levels.

A third direction concerns the processing of garden path sentences. Such sentences offer examples in which we expect a backtracking machine to backtrack. We have built a corpus of 54 garden path sentences in French, using four syntactic patterns of different complexity levels, organized the sentences in minimal pairs, and tested our machine on it. The results are mixed: In some cases the machine behaves as expected, in others it does not. A detailed analysis showed that the machine usually backtracks on garden path sentences but tends not to reanalyze the sentence as expected. The problem seems to be of a lexical nature. Some words that should be attributed new pos tags in the reanalysis phase resist this reanalysis. This is probably due to their lexical representation. More work is needed to understand this phenomenon and find ways to overcome it.

The last direction is linked to the long-term project of predicting regressive saccades. The general idea is to compare back movements predicted by our model and actual eye movement data and study whether these movements are correlated.

## Appendix A

In arc-eager dependency parsing, a sentence of length n is processed in 2n actions: an action that pushes a word on the stack (shift or right) and an action that removes it from the stack (reduce or left). In our backtracking tagparser, we must add one tagging action and one ¬back per word.

Therefore, without applying any back action, the number of actions needed to process a sentence of size n is 4n: for each word a ¬back, a pos action, a push action and a pop action.

Let $s_i$ be the length of the action sequence taking place when processing the $i$th word of the sentence. The sum of these lengths is also the number of actions needed to process the whole sentence, therefore $\sum_{i=1}^{n} s_i = 4n$.

Now, in the worst case scenario where back is applied k times per word, the total number of actions is 5nk + 4n, which is the sum of:

• k back actions per word: nk.

• Initial application of the $s_i$ sequences: 4n.

• The k re-processings of the sequences: $\sum_{i=1}^{n} k \times s_i = k \sum_{i=1}^{n} s_i = 4nk$.
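The count above can be checked numerically. The function below simply adds the three terms of the derivation; its name and the example sequence lengths are illustrative, the only constraint taken from the text being that the lengths sum to 4n.

```python
def total_actions(sequence_lengths, k):
    """Worst-case number of actions when back is applied k times per word.

    sequence_lengths -- the lengths s_i of the action sequences, one per word
    """
    n = len(sequence_lengths)
    back_actions = n * k                                  # k back actions per word
    initial = sum(sequence_lengths)                       # first pass over each sequence
    reprocessing = sum(k * s for s in sequence_lengths)   # k replays of each sequence
    return back_actions + initial + reprocessing


# Four words whose sequence lengths sum to 4n = 16, with k = 2 backs per word.
lengths = [3, 5, 4, 4]
n, k = len(lengths), 2
worst_case = total_actions(lengths, k)
```

For these values the total is 56, which equals 5nk + 4n with n = 4 and k = 2; the individual lengths do not matter as long as they sum to 4n.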

## Appendix B

The input layer of the classifiers described in Section 5 is a vector of features extracted from the current configuration. Features are represented as randomly initialized and learnable embeddings of size 128, with the exception of words, which are represented by fastText pretrained word embeddings of size 300 (Bojanowski et al., 2017). Four embedding spaces are used: words, pos tags, letters, and actions.

The features are the following:

• pos tags and form of the words in a window of size [-2,2] centered on the current word, with the addition, for the parser and tagparser, of the pos tag of the governor, the rightmost and the leftmost dependents of the three topmost stack elements.

• The history of the 10 last actions performed.

• Prefix and suffix of size 4 for the current word.

• For backtracking machines, a binary feature indicating whether or not a back is allowed.

When the value of a feature is not available, it is replaced by a special learnable embedding representing the reason for its unavailability. The following situations are distinguished:

• Out of bounds: target word is before the first or after the last word of the sentence.

• Empty stack: target word should be taken from the stack, but the stack is empty.

• No dep / gov: target word is the dependent / governor of a word without one.

• Not seen: target word is in right context, and has not been seen yet.

• Erased: target value has been erased after a back action.
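One way to picture this mechanism is a lookup that returns either the feature value or a symbol naming the reason it is missing, each symbol then being mapped to its own embedding. The sketch below covers only window features over pos tags; the `Missing` enum, the function `pos_tag_feature`, and its arguments are hypothetical names introduced for the example.

```python
from enum import Enum


class Missing(Enum):
    """Reasons for unavailability; each one would get its own learnable embedding."""
    OUT_OF_BOUNDS = "out of bounds"
    EMPTY_STACK = "empty stack"
    NO_DEP_GOV = "no dep / gov"
    NOT_SEEN = "not seen"
    ERASED = "erased"


def pos_tag_feature(tags, current, offset, last_seen, erased=frozenset()):
    """Return the pos tag at position current + offset, or the reason it is missing.

    tags      -- pos tags predicted so far, indexed by word position
    last_seen -- index of the rightmost word already seen by the machine
    erased    -- positions whose tag was erased by a back action
    """
    target = current + offset
    if target < 0 or target >= len(tags):
        return Missing.OUT_OF_BOUNDS
    if target > last_seen:
        return Missing.NOT_SEEN
    if target in erased:
        return Missing.ERASED
    return tags[target]


tags = ["DET", "NOUN", "VERB", "ADP"]
window = [pos_tag_feature(tags, current=1, offset=o, last_seen=2, erased={0})
          for o in range(-2, 3)]
```

Keeping the reasons distinct lets the classifier tell, for instance, a word that lies beyond the sentence boundary from one whose tag was just undone.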

## Acknowledgments

We would like to thank the action editor as well as the anonymous reviewers for their detailed and thoughtful insights, which helped us improve our work substantially.

## Notes

1. It is actually a rough approximation of the sliding window because it defines a span over words and not characters.

2. Some actions can be complex and change, for example, the state of an internal stack. All details concerning the Reading Machine can be found in Dary and Nasr (2021).

3. For the sake of simplicity, we consider in this paper that the text to process has already been segmented into sentences and tokenized into words, even if the Reading Machine allows performing these two operations.

4. In order to be undone, actions reduce and left also need to store stack elements that have been popped.

5. See Appendix A for details.

6. This is actually a simplified form of the Bellman optimality equation, due to the fact that our mdp is deterministic: Applying action a in configuration c yields configuration δ(c,a) with probability 1.

7. ¬back in the case of the back state.

Chris
Alberti
,
David
Weiss
,
Greg
Coppola
, and
Slav
Petrov
.
2015
.
Improved transition-based parsing and tagging with neural networks
. In
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
, pages
1354
1359
.
Miguel
Ballesteros
and
Yaser
Al-Onaizan
.
2017
.
AMR parsing using stack-LSTMs
. In
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
, pages
1269
1275
,
Copenhagen, Denmark
.
Association for Computational Linguistics
.
Miguel
Ballesteros
,
Yoav
Goldberg
,
Chris
Dyer
, and
Noah A.
Smith
.
2016
.
Training with exploration improves a greedy stack LSTM parser
. In
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
, pages
2005
2010
.
Bernd
Bohnet
and
Joakim
Nivre
.
2012
.
A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing
. In
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
, pages
1455
1465
.
Association for Computational Linguistics
.
Piotr
Bojanowski
,
Edouard
Grave
,
Armand
Joulin
, and
Tomas
Mikolov
.
2017
.
Enriching word vectors with subword information
.
Transactions of the Association for Computational Linguistics
,
5
:
135
146
.
Franck
Dary
and
Alexis
Nasr
.
2021
.
The reading machine: A versatile framework for studying incremental parsing strategies
. In
Proceedings of the 17th International Conference on Parsing Technologies (IWPT 2021)
, pages
26
37
,
Online
.
Association for Computational Linguistics
.
Chris
Dyer
,
Miguel
Ballesteros
,
Wang
Ling
,
Austin
Matthews
, and
Noah A.
Smith
.
2015
.
Transition-based dependency parsing with stack long short-term memory
. In
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
, pages
334
343
,
Beijing, China
.
Association for Computational Linguistics
.
Ross
Girshick
.
2015
.
Fast R-CNN
. In
Proceedings of the IEEE International Conference on Computer Vision
, pages
1440
1448
.
Yoav
Goldberg
and
Joakim
Nivre
.
2012
.
A dynamic oracle for arc-eager dependency parsing
. In
Proceedings of COLING 2012
, pages
959
976
.
Liang
Huang
and
Kenji
Sagae
.
2010
.
Dynamic programming for linear-time incremental parsing
. In
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
, pages
1077
1086
.
Philipp
Koehn
.
2004
.
Statistical significance tests for machine translation evaluation
. In
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing
, pages
388
395
,
Barcelona, Spain
.
Association for Computational Linguistics
.
Shuhei
Kurita
and
Anders
Søgaard
.
2019
.
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
2420
2430
,
Florence, Italy
.
Association for Computational Linguistics
.
Minh
and
Antske
Fokkens
.
2017
.
Tackling error propagation through reinforcement learning: A case of greedy dependency parsing
. In
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
, pages
677
687
.
Alessandro
Lopopolo
,
Stefan L.
Frank
,
Antal Van Den
Bosch
, and
Roel
Willems
.
2019
.
D
. In
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
, pages
77
85
.
George W.
McConkie
and
Keith
Rayner
.
1975
.
The span of the effective stimulus during a fixation in reading
.
Perception & Psychophysics
,
17
(
6
):
578
586
.
George W.
McConkie
and
Keith
Rayner
.
1976
.
Asymmetry of the perceptual span in reading
.
Bulletin of the psychonomic society
,
8
(
5
):
365
368
.
Volodymyr
Mnih
,
Koray
Kavukcuoglu
,
David
Silver
,
Alex
Graves
,
Ioannis
Antonoglou
,
Daan
Wierstra
, and
Martin
Riedmiller
.
2013
.
Playing atari with deep reinforcement learning
.
arXiv preprint arXiv:1312.5602
.
Tahira
Naseem
,
Abhishek
Shah
,
Hui
Wan
,
Florian
,
Salim
Roukos
, and
Miguel
Ballesteros
.
2019
.
Rewarding Smatch: Transition-based AMR parsing with reinforcement learning
. In
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
, pages
4586
4592
,
Florence, Italy
.
Association for Computational Linguistics
.
Joakim
Nivre
,
Johan
Hall
, and
Jens
Nilsson
.
2004
.
Memory-based dependency parsing
. In
Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL- 2004) at HLT-NAACL 2004
, pages
49
56
.
Martin
Popel
,
Zdeněk
žabokrtskỳ
, and
Martin
Vojtek
.
2017
.
UDAPI: Universal API for universal dependencies
. In
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)
, pages
96
101
.
Keith
Rayner
and
Sara C.
Sereno
.
1994
.
Regressive eye movements and sentence parsing: On the use of regression-contingent analyses
.
Memory & Cognition
,
22
(
3
):
281
285
.
Christopher J. C.
H. Watkins
and
Peter
Dayan
.
1992
.
Q-learning
.
Machine learning
,
8
(
3–4
):
279
292
.
Christopher John Cornish Hellaby
Watkins
.
1989
.
Learning from Delayed Rewards
. Ph.D. thesis,
King’s College, Cambridge United Kingdom
.
Hiroyasu
and
Yuji
Matsumoto
.
2003
.
Statistical dependency analysis with support vector machines
. In
Proceedings of the Eighth International Conference on Parsing Technologies
, pages
195
206
.
Xiang
Yu
,
Ngoc Thang
Vu
, and
Jonas
Kuhn
.
2018
.
Approximate dynamic oracle for dependency parsing with reinforcement learning
. In
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
, pages
183
191
,
Brussels, Belgium
.
Association for Computational Linguistics
.
Daniel
Zeman
,
Joakim
Nivre
,
Mitchell
Abrams
,
Elia
Ackermann
,
Noëmi
Aepli
,
Hamid
Aghaei
,
Željko
Agić
,
Amir
,
Lars
Ahrenberg
,
Chika Kennedy
Ajede
,
Gabrielė
Aleksandravičiūtė
,
Ika
Alfina
,
Lene
Antonsen
,
Katya
Aplonova
,
Angelina
Aquino
,
Carolina
,
Aragon
,
Maria Jesus
Aranzabe
,
Bilge Nas
Arıcan
,
Pórunn
Arnardóttir
,
Gashaw
Arutie
,
Jessica Naraiswari
Arwidarasti
,
Masayuki
Asahara
,
Deniz Baran
Aslan
,
Luma
Ateyah
,
Furkan
Atmaca
,
Mohammed
Attia
,
Aitziber
Atutxa
,
Liesbeth
Augustinus
,
Elena
,
Keerthana
Balasubramani
,
Miguel
Ballesteros
,
Esha
Banerjee
,
Sebastian
Bank
,
Verginica Barbu
Mititelu
,
Starkaður
Barkarson
,
Rodolfo
Basile
,
Victoria
Basmov
,
Colin
Batchelor
,
John
Bauer
,
John
Bauer
,
Seyyit Talha
Bedir
,
Kepa
Bengoetxea
,
Gözde
Berk
,
Yevgeni
Berzak
,
Bhat
,
Bhat
,
Erica
Biagetti
,
Eckhard
Bick
,
Agnė
Bielinskienė
,
Kristín
,
Rogier
Blokland
,
Victoria
Bobicev
,
Loïc
Boizou
,
Emanuel Borges
Völker
,
Carl
Börstell
,
Cristina
Bosco
,
Gosse
Bouma
,
Sam
Bowman
,
Boyd
,
Anouck
Braggaar
,
Kristina
Brokaitė
,
Aljoscha
Burchardt
,
Marie
Candito
,
Bernard
Caron
,
Gauthier
Caron
,
Lauren
Cassidy
,
Tatiana
Cavalcanti
,
Gülşen Cebiroğlu
Eryiğit
,
Flavio Massimiliano
Cecchini
,
Giuseppe G. A.
Celano
,
Slavomír
Čéplö
,
Neslihan
Cesur
,
Savas
Cetin
,
Özlem
Çetinoğlu
,
Fabricio
Chalub
,
Shweta
Chauhan
,
Ethan
Chi
,
Taishi
Chika
,
Yongseok
Cho
,
Jinho
Choi
,
Jayeol
Chun
,
Juyeon
Chung
,
Alessandra T.
Cignarella
,
Silvie
Cinková
,
Aurélie
Collomb
,
Çağrı
Çöltekin
,
Miriam
Connor
,
Marine
Courtin
,
Mihaela
Cristescu
,
Philemon
Daniel
,
Elizabeth
Davidson
,
Marie-Catherine
de Marneffe
,
Valeria
de Paiva
,
Mehmet Oguz
Derin
,
Elvis
de Souza
,
Arantza Diaz
de Ilarraza
,
Carly
Dickerson
,
Arawinda
Dinakaramani
,
Elisa Di
Nuovo
,
Bamba
Dione
,
Peter
Dirix
,
Kaja
Dobrovoljc
,
Timothy
Dozat
,
Kira
Droganova
,
Puneet
Dwivedi
,
Hanne
Eckhoff
,
Sandra
Eich
,
Marhaba
Eli
,
Ali
Elkahky
,
Binyam
Ephrem
,
Olga
Erina
,
Tomaž
Erjavec
,
Aline
Etienne
,
Wograine
Evelyn
,
Sidney
Facundes
,
Richárd
Farkas
,
Jannatul
Ferdaousi
,
Marília
Fernanda
,
Hector Fernandez
Alcalde
,
Jennifer
Foster
,
Cláudia
Freitas
,
Kazunori
Fujita
,
Katarína
Gajdošová
,
Daniel
Galbraith
,
Marcos
Garcia
,
Moa
Gärdenfors
,
Sebastian
Garza
,
Fabrício Ferraz
Gerardi
,
Kim
Gerdes
,
Filip
Ginter
,
Gustavo
Godoy
,
Iakes
Goenaga
,
Koldo
Gojenola
,
Memduh
Gökırmak
,
Yoav
Goldberg
,
Xavier Gómez
Guinovart
,
Berta González
Saavedra
,
Griciūtė
,
Matias
Grioni
,
Loïc
Grobol
,
Normunds
Grūzı-tis
,
Bruno
Guillaume
,
Céline
Guillot-Barbance
,
Tunga
Güngör
,
Nizar
Habash
,
Hinrik
Hafsteinsson
,
Jan
Hajič
,
Jan Hajič
Jr.
,
Mika
Hämäläinen
,
Linh Há
Mỹ
,
Na-Rae
Han
,
Hanifmuti
,
Sam
Hardwick
,
Kim
Harris
,
Dag
Haug
,
Johannes
Heinecke
,
Oliver
Hellwig
,
Felix
Hennig
,
Barbora
,
Jaroslava
Hlaváčová
,
Florinel
Hociung
,
Petter
Hohle
,
Eva
Huber
,
Jena
Hwang
,
Takumi
Ikeda
,
Anton Karl
Ingason
,
Ion
,
Elena
Irimia
,
Ọlájídé
Ishola
,
Kaoru
Ito
,
Siratun
Jannat
,
Tomáš
Jelínek
,
Apoorva
Jha
,
Anders
Johannsen
,
Hildur
Jónsdóttir
,
Fredrik
Jørgensen
,
Markus
Juutinen
,
Sarveswaran
K
,
Hüner
Kaşıkara
,
Andre
Kaasen
,
Kabaeva
,
Sylvain
Kahane
,
Hiroshi
Kanayama
,
Jenna
Kanerva
,
Neslihan
Kara
,
Boris
Katz
,
Tolga
,
Jessica
Kenney
,
Václava
Kettnerová
,
Jesse
Kirchner
,
Elena
Klementieva
,
Elena
Klyachko
,
Arne
Köhn
,
Abdullatif
Köksal
,
Kamil
Kopacewicz
,
Timo
Korkiakangas
,
Mehmet
Köse
,
Natalia
Kotsyba
,
Jolanta
Kovalevskaitė
,
Simon
Krek
,
Parameswari
Krishnamurthy
,
Sandra
Kübler
,
Oğuzhan
Kuyrukçu
,
Aslı
Kuzgun
,
Sookyoung
Kwak
,
Veronika
Laippala
,
Lucia
Lam
,
Lorenzo
Lambertino
,
Tatiana
Lando
,
Septina Dian
Larasati
,
Alexei
Lavrentiev
,
John
Lee
,
Phuong Lê
Hong
,
Alessandro
Lenci
,
Saran
,
Herman
Leung
,
Maria
Levina
,
Cheuk Ying
Li
,
Josie
Li
,
Keying
Li
,
Yuan
Li
,
KyungTae
Lim
,
Bruna Lima
,
Krister
Lindén
,
Nikola
Ljubešić
,
Olga
,
Stefano
Lusito
,
Andry
Luthfi
,
Mikko
Luukko
,
Olga
Lyashevskaya
,
Teresa
Lynn
,
Vivien
Macketanz
,
Menel
Mahamdi
,
Jean
Maillard
,
Aibek
Makazhanov
,
Michael
Mandl
,
Christopher
Manning
,
Ruli
Manurung
,
Büşra
Marşan
,
Cătălina
Mărănduc
,
David
Mareček
,
Katrin
Marheinecke
,
Héctor Martínez
Alonso
,
Lorena
Martín-Rodríguez
,
André
Martins
,
Jan
Mašek
,
Hiroshi
Matsuda
,
Yuji
Matsumoto
,
Alessandro
Mazzei
,
Ryan
McDonald
,
Sarah
McGuinness
,
Gustavo
Mendonça
,
Tatiana
Merzhevich
,
Niko
Miekka
,
Karina
Mischenkova
,
Margarita
Misirpashayeva
,
Anna
Missilä
,
Cătălin
Mititelu
,
Maria
Mitrofan
,
Yusuke
Miyao
,
AmirHossein Mojiri
Foroushani
,
Judit
Molnár
,
Amirsaeid
Moloodi
,
Simonetta
Montemagni
,
Amir
More
,
Laura Moreno
Romero
,
Giovanni
Moretti
,
Keiko Sophie
Mori
,
Shinsuke
Mori
,
Tomohiko
Morioka
,
Shigeki
Moro
,
Bjartur
Mortensen
,
Bohdan
Moskalevskyi
,
Muischnek
,
Robert
Munro
,
Yugo
Murawaki
,
Kaili
Müürisep
,
Pinkey
Nainwani
,
Mariam
Nakhlé
,
Juan Ignacio Navarro
Horñiacek
,
Anna
Nedoluzhko
,
Gunta
Nešpore-Bērzkalne
,
Manuela
Nevaci
,
Lương
Nguyễn Thị
,
Huyền Nguyễn Thị
Minh
,
Yoshihiro
Nikaido
,
Vitaly
Nikolaev
,
Rattima
Nitisaroj
,
Alireza
Nourian
,
Hanna
Nurmi
,
Stina
Ojala
,
Atul Kr.
Ojha
,
Olúòkun
,
Mai
Omura
,
Emeka
Onwuegbuzia
,
Petya
Osenova
,
Robert
Östling
,
Lilja
Øvrelid
,
Şaziye Betül
Özateş
,
Merve
Özçelik
,
Arzucan
Özgür
,
Balkız Öztürk
Başaran
,
Hyunji Hayley
Park
,
Niko
Partanen
,
Elena
Pascual
,
Marco
Passarotti
,
Agnieszka
Patejuk
,
Guilherme
Paulino-Passos
,
Angelika
Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez, Natalia Perkova, Guy Perrier, Slav Petrov, Daria Petrova, Jason Phelan, Jussi Piitulainen, Tommi A. Pirinen, Emily Pitler, Thierry Poibeau, Barbara Plank, Larisa Ponomareva, Martin Popel, Lauma Pretkalnina, Sophie Prévost, Prokopis Prokopidis, Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre, Mizanur Rahoman, Taraka Rama, Loganathan Ramasamy, Carlos Ramisch, Fam Rashel, Rasooli, Vinit Ravishankar, Livy Real, Petru Rebeja, Siva Reddy, Mathilde Regnault, Georg Rehm, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Putri Rizqiyah, Luisa Rocha, Eiríkur Rögnvaldsson, Mykhailo Romanenko, Rudolf Rosa, Valentin Rosca, Davide Rovati, Olga Rudina, Jack Rueter, Kristján, Shoval, Pegah Safari, Benoît Sagot, Aleksi Sahala, Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Ezgi Sanıyar, Dage Särg, Baiba Saulīte, Yanin Sawanakunanon, Shefali Saxena, Kevin Scannell, Salvatore Scarlata, Nathan Schneider, Sebastian Schuster, Lane Schwartz, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Syeda, Mo Shen, Atsuko, Hiroyuki Shirasu, Yana Shishkina, Muh Shohibussirri, Dmitry Sichinava, Janine Siewert, Einar Freyr Sigurdsson, Aline Silveira, Natalia Silveira, Maria Simi, Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Maria Skachedubova, Aaron Smith, Isabela Soares-Bastos, Shafi Sourov, Carolyn, Rachele Sprugnoli, Steinþór Steingrímsson, Antonio Stella, Milan Straka, Emmett Strickland, Jana, Alane Suhr, Yogi Lesmana Sulestio, Umut Sulubacak, Shingo Suzuki, Zsolt Szántó, Chihiro Taguchi, Dima Taji, Yuta Takahashi, Fabio Tamburini, Mary Ann C. Tan, Takaaki Tanaka, Dipta Tanaya, Samson Tella, Isabelle Tellier, Marinella Testori, Guillaume Thomas, Liisi Torga, Marsida Toska, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Utku Türk, Francis Tyers, Sumire Uematsu, Roman Untilov, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Sowmya Vajjala, Rob van der Goot, Martine Vanhove, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Natalia Vlasova, Aya Wakasa, Joel C. Wallenberg, Lars Wallin, Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan Wendt, Paul Widmer, Sri Hartati Wijono, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, Kayo Yamashita, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Arife Betül Yenice, Olcay Taner Yıldız, Zhuoran Yu, Arlisa Yuliawati, Zdeněk Žabokrtský, Shorouq Zahra, Amir Zeldes, He Zhou, Hanzhi Zhu, Anna Zhuravleva, and Rayan Ziane. 2021. Universal dependencies 2.9. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Lidan Zhang and Kwok-Ping Chan. 2009. Dependency parsing with energy-based reinforcement learning. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09), pages 234–237.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 562–571.

Yue Zhang and Joakim Nivre. 2012. Analyzing the effect of global learning and beam-search on transition-based dependency parsing. In Proceedings of COLING 2012: Posters, pages 1391–1400.

## Author notes

Action Editor: Yusuke Miyao

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.