Dependency Parsing with Backtracking using Deep Reinforcement Learning

Abstract

Greedy algorithms for NLP such as transition-based parsing are prone to error propagation. One way to overcome this problem is to allow the algorithm to backtrack and explore an alternative solution in cases where new evidence contradicts the solution explored so far. In order to implement such a behavior, we use reinforcement learning and let the algorithm backtrack in cases where such an action gets a better reward than continuing to explore the current solution. We test this idea on both POS tagging and dependency parsing and show that backtracking is an effective means to fight against error propagation.


Introduction
Transition based parsing has become a major approach in dependency parsing since the work of Yamada and Matsumoto (2003) and Nivre et al. (2004), for it combines linear time complexity and high linguistic performance. The algorithm follows a local and greedy approach to parsing that consists in selecting, at every step of the parsing process, the action that maximizes a local score, typically computed by a classifier. The selected action is greedily applied to the current configuration of the parser and yields a new configuration.
At training time, an oracle function transforms the correct syntactic tree of a sentence into a sequence of correct (configuration, action) pairs. These pairs are used to train the classifier of the parser. The configurations that do not pertain to the set of correct configurations are never seen during training.
At inference time, if the parser predicts and executes an incorrect action, it produces a configuration that is incorrect with respect to the sentence being parsed and might never have been seen during training, yielding a poor prediction of the next action to perform. Besides, the parser follows a single hypothesis by greedily selecting the best scoring action. The solution built by the parser can therefore be sub-optimal, for there is no guarantee that the sum of the scores of the selected actions maximizes the global score.
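The greedy loop described above can be sketched as follows. This is a minimal illustration: the scoring classifier, transition system and configuration structure are simplified placeholders, not our actual implementation.

```python
def greedy_parse(actions, score, apply_action, config):
    """Repeatedly pick and apply the locally best-scoring action."""
    history = []
    while not config["terminal"]:
        # Greedy step: local argmax over the action set.
        best = max(actions, key=lambda a: score(config, a))
        # Commit to the action; there is no way to revisit this choice.
        config = apply_action(config, best)
        history.append(best)
    return history
```

Because each `max` is taken locally and committed immediately, an early mistake leads to configurations the classifier may never have seen, which is precisely the error-propagation problem discussed above.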
These are well known problems of transition based parsing and several solutions have been proposed in the literature to overcome them; they will be briefly discussed in Section 2. The solution we propose in this article consists in allowing the parser to backtrack. At every step of the parsing process, the parser has the opportunity to undo its n previous actions in order to explore alternative solutions. The decision to backtrack or not is taken each time a new word is considered, before trying to process it, by giving the current configuration to a binary classifier that assigns a score to the backtracking action. Traditional supervised learning is not suited to learning such a score, since the training data contains no occurrences of backtrack actions. In order to learn in which situations a backtrack action is worthwhile, we use reinforcement learning. During training, the parser has the opportunity to try backtracking actions and the training algorithm responds to this choice by granting it a reward. If the backtracking action is the adequate move to make in the current configuration, it receives a positive reward, and the parser learns in which situations backtracking is adequate.
The work presented here is part of a more ambitious project which aims at modeling eye movements during human reading. More precisely, we are interested in predicting regressive saccades: eye movements that bring the gaze back to a previous location in the sentence.
There is much debate in the psycholinguistic literature concerning the reasons for such eye movements (Lopopolo et al., 2019). Our position with respect to this debate is the one advocated by Rayner and Sereno (1994), for whom part of these saccades are linguistically motivated and happen in situations where the reader's incremental comprehension of the sentence is misguided by an ambiguous sentence start, until reaching a novel word whose integration proves incompatible with the current understanding and triggers a regressive saccade, as in garden path sentences. Our long-term project is to model regressive saccades with backtracking actions. Although this work is a step in that long-term project, the focus of this article is on the NLP aspects of the program, and we propose a way to implement backtracking in the framework of transition based parsing. We will just mention some preliminary studies on garden path sentences in the conclusion.
In order to move in the direction of a more cognitively plausible model, we add two constraints to our model.
The first one concerns the text window around the current word that the parser takes into account when predicting an action. This window can be seen as an approximation of the sliding window introduced by McConkie and Rayner (1975) to model the perceptual span of a human reader. Transition based parsers usually allow taking into account the right context of the current word. The words in the right context constitute a rich source of information and yield better predictions of the next action to perform.
In our model, the parser does not have access to this right context, simulating the fact that a human reader has only a limited access to the right context (a few characters to the right of the fixation point (McConkie and Rayner, 1976)). It is only after backtracking that a right context is available, for it was uncovered before the backtrack took place.
The second constraint is incrementality. When performing several tasks, such as POS tagging and parsing, as is done in the tagparser described in Section 3, these tasks are organized in an incremental fashion. At each step, a word is read, POS tagged and integrated into the syntactic structure, a more cognitively plausible behavior than a sequential approach where the whole sentence is first POS tagged and then parsed.
The structure of the paper is the following: in Section 2, we compare our work to other approaches in transition based parsing which aim at proposing solutions to the two problems mentioned above. In Section 3, we describe the model that we use to predict the linguistic annotation and introduce the notion of a BACK action. In Section 4 we show how backtracking actions can be predicted using Reinforcement Learning. Section 5 describes the experimental part of the work and discusses the results obtained. Section 6 concludes the paper and presents different directions in which this work will be extended.

Related Work
Several ways to overcome the two limits of transition based parsing mentioned in the introduction, namely training the parser with only correct examples and exploring only a single hypothesis at inference time, have been explored in the literature.

Beam Search
The standard solution to the single hypothesis search is beam search, which allows considering a fixed number of solutions in parallel during parsing. Beam search is a general technique that has been applied to transition based parsing in many works, among which Zhang and Clark (2008), Huang and Sagae (2010) and Zhang and Nivre (2012). They show that exploring a fixed number of solutions increases linguistic performance over a single-hypothesis parser. In this work, we do not use a beam search algorithm, but we do explore several solutions in a non-parallel fashion, using backtracking.
Dynamic Oracles

A first way to overcome the lack of exploration during training is the proposition by Goldberg and Nivre (2012) to replace the standard oracle of transition based parsing with a dynamic oracle that is able to determine the optimal action a to perform in an incorrect configuration c. During training, the dynamic oracle explores a larger part of the configuration space than the static oracle and produces, for an incorrect configuration c, an optimal action a. The pair (c, a) is given as a training example to the classifier, yielding a more robust classifier that is able to predict the optimal action in some incorrect configurations. Ballesteros et al. (2016) show that the principle of the dynamic oracle can be adapted to train the greedy Stack-LSTM dependency parser of Dyer et al. (2015), improving its performances. Yu et al. (2018) describe a method for training a small and efficient neural network model that approximates a dynamic oracle for any transition system. Their model is trained using reinforcement learning.
In this paper, we use dynamic oracles in two different ways. First, as baselines for models allowing some exploration during training. Second, in the definition of the immediate reward function, as explained in Section 4.
Reinforcement Learning

A second answer to the lack of exploration problem is reinforcement learning. The exploitation/exploration trade-off of reinforcement learning allows the model, at training time, to explore some incorrect configurations and to learn a policy that selects, in such cases, the action that maximizes the long-term reward. Reinforcement learning is well suited to transition based algorithms.
To the best of our knowledge, two papers directly address the idea of training a syntactic transition based parser with reinforcement learning: Zhang and Chan (2009) and Lê and Fokkens (2017). Zhang and Chan (2009) cast the problem, as we do, as a Markov Decision Process, but use a Restricted Boltzmann Machine to compute the scores of actions and the SARSA algorithm to compute an optimal policy. In our case, we use deep Q-learning based on a Multi Layer Perceptron, as described in Section 4. In addition, their immediate reward function is based on the number of arcs in the gold tree of a sentence that are absent from the tree being built. We also use an immediate reward function, but it is based on the number of arcs in the gold tree that can no longer be built given the current tree being built, an idea introduced in the dynamic oracle of Goldberg and Nivre (2012).
Two major differences distinguish our approach from the work of Lê and Fokkens (2017). The first one is their idea of pre-training a parser in a supervised way and then fine-tuning its parameters using reinforcement learning. In our case, the parser is trained from scratch using reinforcement learning. The reason for this difference is related to our desire to learn how to backtrack: it is difficult to make the parser learn to backtrack when it has been initially trained not to do it (using standard supervised learning). The second major difference is their use of a global reward function, computed after the parser has parsed a sentence. In our case, as mentioned above, we use an immediate reward. The reason for this difference is linked to pre-training. Since we do not pre-train our parser, and allow it to backtrack, granting a reward at the end of the sentence does not allow the parser to converge, since reaching the end of the sentence is almost impossible for a non pre-trained parser. Other less fundamental differences distinguish our approach, such as the use of Q-learning, in our case, to find an optimal policy, where they use a novel algorithm called Approximate Learning Gradient, based on the fact that their model's output is a probability distribution over parsing actions (as in standard supervised learning). Another minor difference is the exploration of the search space in the training phase. They sample the next action to perform using the probability distribution computed by their model, while we use an adaptation of ε-greedy that we describe in Section 4.
Reinforcement learning has also been used to train transition based models for other tasks. Naseem et al. (2019) present a method to fine-tune the Stack-LSTM transition-based AMR parser of Ballesteros and Al-Onaizan (2017) with reinforcement learning, using the Smatch score of the predicted graph as reward. For semantic dependency parsing, Kurita and Søgaard (2019) found that fine-tuning their parser with policy gradient allows it to develop an easy-first strategy, reducing error propagation.
The fundamental difference between all the papers cited above and our work is the idea of adding backtracking to a greedy transition based model. We use reinforcement learning as a means to achieve the exploration necessary for this goal. In terms of parsing performance, our method fares lower than the state of the art in transition based parsing, for our parser is constrained not to see words beyond the current word, a constraint that comes from the fact that our long-term goal is not to improve parsing performance but to find a natural way to encourage a parser to simulate the regressive saccades observed during human reading.

Backtracking Reading Machines
Our model is an extension of the Reading Machine, a general model for NLP proposed in Dary and Nasr (2021) that generalizes transition based parsing to other NLP tasks. A Reading Machine is a finite automaton whose states correspond to linguistic levels. There can be, for example, one state for POS tagging, one state for lemmatization, one state for syntactic parsing, and so on. When the machine is in a given state, an action is predicted, which generally writes on an output tape a label corresponding to the prediction just made. There are usually as many output tapes as there are levels of linguistic prediction, and at least one input tape, which usually contains the words of the sentence to process. Predictions are realized by classifiers that take as input the configuration of the machine and compute a probability distribution over the possible actions. Configurations are complex objects that describe the aspects of the machine that are useful for predicting the next action to perform. Among the elements of a configuration that are important for the rest of the paper, we can cite the current state of the machine, the word index, noted wi, which is the position of the word currently processed, and the history: a list of all actions performed so far.
The text is read word by word; a window of an arbitrary size centered on wi defines the part of the text that can be used to make predictions on the current word. Figure 1 shows the architecture of three simple machines. The two machines in the top part of the figure each realize a single task. The machine on the left realizes POS tagging. It is made of a single state and has a series of transitions that loop on the state. The machine has one input tape from which words are read and one output tape, on which predicted POS are written. Each transition is labeled with a tagging action of the form POS(p) that simply writes the POS tag p on the POS output tape at the word index position.
The machine on the right implements an arc-eager transition based parser that produces unlabeled dependency trees. It has the same simple structure as the tagging machine. Its transitions are labeled with the four standard arc-eager actions of unlabeled transition based parsing: LEFT, REDUCE, SHIFT and RIGHT. The machine has two input tapes, one for words and one for POS tags, and one output tape on which it writes the index of the governor of the current word when LEFT or RIGHT actions are predicted.
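For illustration, the four unlabeled arc-eager actions can be sketched as operations on a (stack, buffer, arcs) configuration. This is a simplified sketch: it omits preconditions, such as checking that the stack top already has a governor before REDUCE.

```python
def apply_arc_eager(action, stack, buffer, arcs):
    """Apply one unlabeled arc-eager action; arcs are (governor, dependent)."""
    if action == "SHIFT":                    # move the next word onto the stack
        stack.append(buffer.pop(0))
    elif action == "RIGHT":                  # stack top governs the next word
        arcs.append((stack[-1], buffer[0]))
        stack.append(buffer.pop(0))
    elif action == "LEFT":                   # next word governs the stack top
        arcs.append((buffer[0], stack.pop()))
    elif action == "REDUCE":                 # stack top is fully attached: pop it
        stack.pop()
    return stack, buffer, arcs
```

LEFT and RIGHT are the two actions that write a governor index on the output tape, matching the description above.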
The machine on the bottom part of the figure, which we call a tagparser, realizes the two tasks simultaneously in an incremental fashion. When in state POS, the machine tags the current word, then control jumps to the parser in order to attach the current word to the syntactic structure built so far or to store it in a stack. Once this is done, control returns to state POS to tag the next word. The reason why the transitions labeled RIGHT and SHIFT lead to state POS is that these two actions increase the word index wi and initiate the processing of a new word. The machine has one input tape for words and two output tapes, one for POS tags and one for the governor position.
We augment the machines described above in order to allow them to undo some preceding actions. This ability relies on three elements: (a) the definition of a new action, called BACK, that undoes a certain number of actions; (b) the history of the actions performed so far, in order to decide which actions to undo; and (c) the definition of what undoing an action means.
Undoing an action amounts to recovering the configuration that existed before the action was performed. This is quite straightforward for tagging and parsing actions.
Three backtracking machines, based on the machines of Figure 1, are shown in Figure 2. They all have an extra state, named BACK, and two extra transitions. When in state BACK, the machine predicts one of the two actions BACK or ¬BACK. When action ¬BACK is selected, control jumps either to state POS or SYN, depending on the machine, and the machine behaves like the simple machines of Figure 1. Action ¬BACK does not modify the configuration of the machine. This is not the only possible architecture for backtracking machines; this point will be briefly addressed in the conclusion.
If action BACK is selected, the last actions of the history are undone until a ¬BACK action is reached. This definition of action BACK allows undoing all the actions that are related to the previous word. After BACK has been applied, the configuration of the machine is almost the same as the configuration it was in before processing the current word. There is however a major difference: it now has access to the following word. Otherwise, the machine would deterministically predict the action it had predicted before. One can notice that the transition labeled BACK in the machine loops on a single state; this feature allows the machine to perform several successive BACK actions.
Figure 3 shows an example. Three tapes are represented: the input tape, which contains tokens, the POS tape, and the parsing tape, which contains the index of the governor of the current word. The figure also represents, at the bottom, the BA array, which is described below, as well as the sequence of actions predicted since the last visit to state BACK. The current word appears in boldface. The figure shows two successive occurrences of a BACK action, after the second determiner is read, leading to the re-analysis of the word old, which was tagged ADJ, and the word man, which was tagged NOUN.
A backtracking machine such as the one described above can run into infinite loops: nothing prevents it from repeating the same sequence of actions endlessly. One can hope that, during training, such a behavior leads to poor performances and is ruled out. But there is no guarantee that this will be the case. Furthermore, we would like to prevent the machine from exploring, at inference time, the whole (or a large part of the) configuration space. In order to do so, we introduce a constraint on the number of times a BACK action can be taken when parsing a sentence. A simple way to introduce such a constraint is to limit the number of authorized BACK actions per word to a given constant k. This feature is implemented by introducing an array BA of size n, where n is the number of words of the sentence to process. Array BA is initialized with zeros, and every time a BACK action is predicted in position i, the value of BA[wi] is incremented. When the machine is in state BACK and BA[wi] is equal to k, performing a BACK action is not permitted. A ¬BACK action is forced, bypassing the classifier that decides whether to backtrack or not.
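The per-word cap on BACK actions can be sketched as follows. This is an illustrative sketch: the encoding of the action history as a flat list and the name of the ¬BACK marker are assumptions made for the example.

```python
def may_backtrack(BA, wi, k):
    """A BACK action is permitted only while BA[wi] has not reached k."""
    return BA[wi] < k

def take_back(BA, wi, history):
    """Record a BACK action and undo the history back to the last ¬BACK."""
    BA[wi] += 1
    # Undo every action predicted since the last visit to state BACK.
    while history and history[-1] != "NOT_BACK":
        history.pop()
    if history:
        history.pop()  # remove the ¬BACK marker itself
    return BA, history
```

Once BA[wi] reaches k, `may_backtrack` returns False and a ¬BACK action is forced, which is what guarantees the linear upper bound on the action sequence discussed below.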
The introduction of array BA and the parameter k defines an upper bound on the size of the action sequence for a sentence of length n. This upper bound is equal to 3nk + 2n for the tagger, 4nk + 3n for the parser and 5nk + 4n for the tagparser. As one can notice, linearity is preserved. In our experiments, we chose k = 1.

Reading Machines, as introduced by Dary and Nasr (2021), are trained in a supervised learning fashion. Given data annotated at several linguistic levels, an oracle function decomposes it into a sequence of configurations and actions (c_0, a_0, c_1, a_1, ..., c_n, a_n). This sequence of configurations and actions constitutes the training data of the classifiers of the machine to train: the pairs (c_i, a_i) are presented iteratively to the classifier during the training stage. A backtracking Reading Machine cannot be trained this way since there are no occurrences of BACK actions in the data. In order to learn useful occurrences of such actions, the training process should have the ability to generate some BACK actions and be able to measure whether the action was beneficial.
In order to implement such a behavior, we use Reinforcement Learning (RL). We cast our problem as a Markov Decision Process (MDP). In an MDP, an agent (the parser) is in configuration c_t at time t. It selects an action a_t from an action set (made of the tagging actions, the parsing actions and the BACK and ¬BACK actions) and performs it. We note C the set of all configurations and A the set of actions. The environment (the annotated data) responds to action a_t by giving a reward r_t = r(c_t, a_t) and by producing the succeeding configuration c_{t+1} = δ(c_t, a_t). In our case, configuration c_{t+1} is deterministically determined by the structure of the Reading Machine. The reward function gives a high reward to actions that move the parser towards the correct parse of the sentence. The fundamental difference with supervised training is that, during training, the agent is not explicitly told which actions to take, but instead must discover which action yields the most reward through trial and error. This feature gives the parser the opportunity to try some BACK actions, provided that a reward can be computed for such actions.
Given an MDP that indicates the reward associated with applying action a in configuration c, the goal is to learn a function q*(c, a) which maximizes the total amount of reward (also called the discounted return) that can be expected after action a has been applied. Once this function (or an approximation of it) is computed, one can use it to select which action to choose in configuration c, by simply picking the action that maximizes q*. Such a behavior is called an optimal policy in the RL literature. Function q* is the solution to the Bellman optimality equation:

q*(c, a) = r(c, a) + γ max_{a'} q*(δ(c, a), a')

where γ ∈ [0, 1] is the discount factor that allows discounting the reward of future actions. The equation expresses the relationship between the value of an action a in configuration c and the value of the best action a' that can be performed in its successor configuration δ(c, a). This recursive definition is the basis for algorithms that iteratively approximate q*, among which Q-learning (Watkins, 1989), which approximates q* with a function called Q. In Q-learning, during training, each time an action a is chosen in configuration c, the value of Q(c, a) is updated:

Q(c, a) ← Q(c, a) + α [r(c, a) + γ max_{a'} Q(δ(c, a), a') − Q(c, a)]

where α is a learning rate. It has been proven (Watkins and Dayan, 1992) that such an iterative algorithm converges towards the q* function.
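The tabular Q-learning update above can be written as a few lines of code. This is a toy sketch with dictionary storage; configurations and actions are placeholders.

```python
def q_update(Q, c, a, reward, next_c, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(c,a) toward r + gamma * max_a' Q(c',a')."""
    # Value of the best action in the successor configuration.
    best_next = max(Q.get((next_c, a2), 0.0) for a2 in actions)
    old = Q.get((c, a), 0.0)
    # Temporal-difference update with learning rate alpha.
    Q[(c, a)] = old + alpha * (reward + gamma * best_next - old)
    return Q
```

The dictionary plays the role of the large (configuration, action) table; the next paragraph explains why this table is replaced by a neural approximation in our setting.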
In order to store the values of Q, the algorithm uses a large table that has an entry for every (configuration, action) pair. In our case, there are far too many configurations to allocate such a table. Instead, we use a simple form of deep Q-learning (Mnih et al., 2013) and approximate the function Q using a multi-layer perceptron C_Q.
C_Q takes as input a configuration c and outputs a vector whose dimension is the number of different actions in the action set. The component of the vector corresponding to action a is the approximation of Q(c, a). During training, every time an action a is performed by the parser in configuration c, the parameters of C_Q are updated using gradient descent on the loss function. The loss function is defined in such a way as to minimize the difference between the current value C_Q(c, a) and its updated value C'_Q(c, a) = r(c, a) + γ max_{a'} C_Q(δ(c, a), a'). This difference is computed with the smooth L1 loss (Girshick, 2015), which is quadratic for small differences and linear for large ones.

Reward Functions

In RL, the training process is guided by the reward r(c, a) granted by the environment when action a is performed in configuration c. Defining a reward function for tagging actions and parsing actions is quite straightforward.
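The target value and the smooth L1 loss can be sketched as follows, in scalar form. The threshold of 1 between the quadratic and linear regimes is the usual default and an assumption here.

```python
def smooth_l1(pred, target):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    diff = abs(pred - target)
    if diff < 1.0:
        return 0.5 * diff * diff
    return diff - 0.5

def q_target(reward, next_q_values, gamma=0.9):
    """Updated value r + gamma * max_a' C_Q(c', a') the network is trained toward."""
    return reward + gamma * max(next_q_values)
```

The linear regime keeps the gradient bounded when the temporal-difference error is large, which stabilizes training compared to a plain squared loss.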
In the case of tagging, the reward should be high when the tag chosen for the current word is correct and low when it is not. A simple reward function is one that associates, for example, the value 0 with the first case and −1 with the second. More elaborate reward functions could be defined that penalize some confusions more heavily (for example, tagging a verb as a preposition).
In the case of parsing, a straightforward reward function gives a reward of zero for a correct action and a negative reward for an incorrect one. We use a slightly more complex function inspired by the dynamic oracle of Goldberg and Nivre (2012). This function, in the case of an incorrect action a, counts the number of dependencies of the correct analysis that can no longer be predicted because of a. The reward for a is the opposite of this number. Actions that cannot be executed, such as popping an empty stack, are granted a reward of −1.5.
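As an illustration of this dynamic-oracle-inspired reward, here is a simplified sketch for one case, the REDUCE action, which pops a node s from the stack and thereby loses every gold dependency that still needed s. The exact bookkeeping in our implementation differs; this only shows the counting idea.

```python
def reduce_reward(s, buffer, gold_heads, has_head):
    """Penalty for popping s: count gold arcs that become unreachable.

    gold_heads maps each word to its gold governor; has_head says
    whether s has already been attached in the tree being built.
    """
    # Gold dependents of s that are still in the buffer can no longer attach to s.
    lost = sum(1 for d in buffer if gold_heads.get(d) == s)
    # If s is unattached and its gold governor is still in the buffer,
    # that arc is lost as well.
    if not has_head and gold_heads.get(s) in buffer:
        lost += 1
    return -lost
```

A correct REDUCE (s fully attached, no pending gold dependents) receives 0, while an eager one is penalized in proportion to the damage it causes.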
Defining the reward function for BACK actions is more challenging. When a BACK action is performed, a certain number of actions a_i, ..., a_{i+k} are undone. Each of these actions was granted a reward r_i, ..., r_{i+k}. Let E = −Σ_{t=0}^{k} r_{i+t} be the opposite of the sum of these rewards (E ≥ 0). The larger E is, the more errors have been made. Let ϕ(E) be the function that computes the reward for executing a BACK action, given E. Formally, we want ϕ to respect the following principles:

1. Don't execute a BACK action if there are no errors: ϕ(0) < 0.

2. ϕ(E) should be increasing with respect to E: the more errors, the more a BACK action should be encouraged.

3. ϕ(E) should not grow too fast with respect to E. Granting too much reward to BACK actions encourages the system to make errors in order to correct them with a highly rewarded BACK action.

We use a function that fulfills these three principles to compute the reward of a BACK action.
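For illustration, the following hypothetical choice of ϕ satisfies the three principles (ϕ(0) < 0, monotonically increasing, bounded growth). It is only an example of the family of admissible functions, not necessarily the one used in the experiments.

```python
import math

def phi(E):
    """One admissible BACK reward: negative at E=0, increasing, bounded above."""
    # tanh saturates, so the reward can never exceed 0.5 no matter how
    # many errors were made (principle 3).
    return math.tanh(E) - 0.5
```

Because the reward saturates, the machine cannot profit from deliberately accumulating errors in order to collect a large BACK reward afterwards.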
Exploring the Configuration Space

In order to learn useful BACK actions, the parser should try, in the training phase, to perform some BACK actions and update the classifier based on the reward it receives from the environment. One standard way to do this is to adopt an ε-greedy policy, in which the model selects a random action with probability ε or the most probable action as predicted by the classifier with probability 1 − ε. This setup allows the system to perform exploitation (choosing the best action so far) as well as exploration (randomly choosing an action). We adopt a variant of this policy, based on two parameters ε and β with 0 ≤ ε ≤ 1, 0 ≤ β ≤ 1 and ε + β ≤ 1. As in the standard ε-greedy policy, the agent chooses a random action with probability ε; it chooses the correct action as given by the oracle with probability β; and finally, it chooses the most probable action as predicted by the classifier with probability 1 − (ε + β). Parameter β has been introduced in order to speed up training. At the beginning of the training process, the system is encouraged to follow the oracle. Then, as training progresses, the system relies more on its predictive capacity (exploitation increases) and less on the oracle.
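The ε/β action-selection variant can be sketched as follows. This is an illustrative sketch: the classifier and oracle interfaces are placeholders.

```python
import random

def select_action(config, actions, classifier, oracle, eps, beta, rng=random):
    """epsilon/beta-greedy selection: random, oracle, or classifier argmax."""
    r = rng.random()
    if r < eps:                      # exploration: uniformly random action
        return rng.choice(actions)
    if r < eps + beta:               # imitation: follow the oracle
        return oracle(config)
    # exploitation: best action according to the classifier's scores
    return max(actions, key=lambda a: classifier(config, a))
```

Decaying β over the epochs shifts the mixture from oracle imitation towards exploitation, as described above.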

Experiments
Three machines were used in our experiments: a tagger, a parser and a tagparser, based on the architectures of Figure 2. Each of the three machines has been trained in three different learning regimes: Supervised Learning (SL) using a dynamic oracle, Reinforcement Learning without BACK actions (RL) and Reinforcement Learning with BACK actions (RLB).

Universal Dependencies Corpus
Our primary experiments were conducted on a French Universal Dependencies corpus (Zeman et al., 2021), more specifically the GSD corpus, consisting of 16,341 sentences and 400,399 words. The original split of the data was 88% train, 9% dev and 3% test. The size of the test set being too small to obtain significant results, we decided to use a k-fold strategy, where all the data was first merged, randomly shuffled and then split into ten folds, each fold becoming the test set of a new train/dev/test split of the data in the following proportions: 80%/10%/10%.
Using all ten folds was unnecessary to obtain significant results, so we limited ourselves to three folds. The test set for which results are reported has a size of 4,902 sentences and 124,560 words.
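The fold construction can be sketched as follows. This is an illustrative sketch; in particular, the choice of the development fold is an assumption, as only the 80%/10%/10% proportions are fixed above.

```python
import random

def make_splits(sentences, n_folds=10, seed=0):
    """Shuffle, cut into folds, and build 80/10/10 train/dev/test splits."""
    data = list(sentences)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    splits = []
    for i in range(n_folds):
        test = folds[i]                       # each fold is the test set once
        dev = folds[(i + 1) % n_folds]        # assumption: next fold as dev
        train = [s for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % n_folds) for s in f]
        splits.append((train, dev, test))
    return splits
```

Using only the first three splits corresponds to the three folds retained in our experiments.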

Experimental Setup
Each machine consists of a single classifier, a Multi Layer Perceptron with a single hidden layer of size 3200. When the machine realizes several tasks, as in the case of the tagparser, the classifier has one decision layer for each task. The output size of each decision layer is the number of actions of its corresponding task. A dropout of 30% is applied to the input vector and to the output of the hidden layer. The output of the hidden layer is passed through a ReLU function. The structure of the classifier is represented in Figure 5. The details of the features extracted from configurations and their encoding in the input layer of the classifier are given in Appendix B.
In the case of Supervised Learning, the machines are trained with a dynamic oracle. At the beginning of the training process, the machines are trained with a static oracle for two epochs. Then, every two epochs, the machines are used to decode the training corpus; for each configuration produced (which may be incorrect), the dynamic oracle selects the optimal action, and these (configuration, action) pairs are used to train the classifier.
Training the machines in the RL regime takes longer than training them in the SL regime: 300 epochs were needed in the former case and 200 in the latter. This difference is probably due to the larger exploration of the configuration space.

Results -Performances
The results for the three machines, under the three learning regimes, are shown in Table 1. POS tagging performance is measured with accuracy and displayed in column UPOS. Dependency parsing performance is measured with the unlabeled attachment score (the ratio of words that have been attached to the correct governor) and displayed in the UAS column. The p-value next to each score is a confidence metric indicating whether the score is significantly better than the one below it (which is why the last line is never given a p-value). This p-value has been estimated with a paired bootstrap resampling algorithm (Koehn, 2004), using the script (Popel et al., 2017) of the CoNLL 2018 shared task.
The table shows the same pattern for the three machines: the RLB regime gets higher results than the simple RL regime, which is itself better than the SL regime. Two important conclusions can be drawn from these results. The first one is that the RLB regime is consistently better than SL: backtracking machines make fewer errors than machines trained in supervised mode. At this point we do not know whether this superiority comes from reinforcement learning or from the addition of a BACK action. In fact, previous experiments in Zhang and Chan (2009) and Lê and Fokkens (2017) showed that reinforcement learning (without backtracking) can lead to better results than supervised learning. The comparison of RLB and RL shows that most of the performance boost comes from backtracking.
The results of Table 1 also show that the tagparser gets better results than the single task machines (the tagger and the parser) when trained with supervised learning. Note that this comparison is possible because the parser was not given gold PoS tags as input, but the ones predicted by the tagparser. These results are in line with the work of Bohnet and Nivre (2012) and Alberti et al. (2015), which shows that jointly predicting POS tags and the syntactic tree improves the performances of both. However, this is not true when the machines are trained with reinforcement learning: in this case the parser and the tagger get better results than the tagparser. One reason that could explain this difference is the size of the configuration space of the tagparser, which is an order of magnitude larger than those of the tagger or the parser. We will return to this point in the conclusion.

One can gain a better understanding of the effect of the BACK actions performed by the three machines with the statistics displayed in Table 2. Each column of the table concerns one machine trained in RLB mode. The first line shows the total number of actions predicted while decoding the test set, the second one the number of errors made, and the third one the number of BACK actions predicted. Lines four and five give the precision and recall of the BACK actions. The precision is the ratio of BACK actions that were predicted after an error was made, while the recall is the ratio of errors after which a BACK action was predicted. These two figures measure the error detection capabilities of the BACK action prediction mechanism. In the case of the parser, the precision is equal to 76.86%, which means that 76.86% of the BACK actions were predicted after an error was made, and 24.52% (recall) of the errors provoked a BACK action prediction. The recall constitutes an upper bound on the errors that could be corrected.

The four last lines break down the predicted BACK actions into four categories. C→C is the case where a BACK action was predicted after a correct action, but did not change the action. E→E is the case where a BACK action was predicted after an error, but the error was not corrected: either the same erroneous action was predicted again or another erroneous one was. E→C is the case where a BACK action was predicted after an error and corrected it, while C→E is the case where a correct action was replaced by an incorrect one after a BACK action.
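The Table 2 statistics can be computed mechanically from a decoding trace. The sketch below is our own illustration, not the authors' code: it assumes a hypothetical trace format in which each predicted BACK action is logged as a pair (whether the preceding action was an error, whether the re-predicted action is an error), together with the total number of erroneous actions.

```python
def back_statistics(back_events, n_errors):
    """Compute precision/recall of BACK actions and the four Table 2 categories.

    back_events: one (prev_was_error, new_is_error) pair per predicted BACK action.
    n_errors: total number of erroneous actions predicted during decoding.
    """
    n_back = len(back_events)
    # BACK actions that followed an error (the cases BACK is meant to catch).
    after_error = sum(1 for prev, _ in back_events if prev)
    precision = after_error / n_back if n_back else 0.0
    recall = after_error / n_errors if n_errors else 0.0
    # Break the BACK actions down into the C->C, E->E, E->C, C->E categories.
    cats = {"C->C": 0, "E->E": 0, "E->C": 0, "C->E": 0}
    for prev, new in back_events:
        key = ("E" if prev else "C") + "->" + ("E" if new else "C")
        cats[key] += 1
    return precision, recall, cats
```

With this convention, recall is indeed an upper bound on the fraction of errors that backtracking could repair, since only errors followed by a BACK action can be corrected.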

Results
Several conclusions can be drawn from these statistics.
First, backtracking corrects errors. For the three machines, there are many more cases where an error is corrected than introduced after a BACK action was predicted (E→C ≫ C→E). This means that the difference in scores observed between RL and RLB in Table 1 can indeed be attributed to backtracking.
Second, backtracking is conservative. The number of predicted BACK actions is quite low (around 1% of the actions for the tagger and the parser, and around 3% for the tagparser) and the precision is quite high. The machines do not backtrack very often, and when they do, it is usually because an error was made. This is the kind of behavior we were aiming for. It can be modified by changing the reward function ϕ of the BACK action.
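To make the role of ϕ concrete, here is a minimal illustration of the kind of reward shaping described above. The exact values are our assumption, not the paper's function: a correct action is rewarded, an incorrect one penalized, and BACK receives a fixed reward phi whose magnitude tunes how eagerly the machine backtracks.

```python
def reward(action, is_correct, phi=-0.5):
    """Hypothetical reward function for a backtracking machine.

    phi is the fixed reward of the BACK action: making it more negative
    discourages backtracking, making it less negative encourages it.
    """
    if action == "BACK":
        return phi
    return 1.0 if is_correct else -1.0
```

Under such a scheme, backtracking only pays off when the expected gain from replacing an erroneous action exceeds the cost phi, which is consistent with the conservative behavior observed in Table 2.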
Third, tagging errors are easier to correct than parsing errors. The comparison of columns two and three (parser and tagger) shows that the tagger has a higher recall than the parser: tagging errors are therefore easier to detect. The comparison also shows that E→C is higher for the tagger than for the parser: tagging errors are therefore easier to correct.
Finally, the poor performances of the tagparser do not come from a failure to backtrack: it actually backtracks around three times as much as the parser or the tagger. But it has a hard time correcting the errors; most of the time, it reproduces errors made before. We will return to this point in the conclusion.

In order to study the behavior of backtracking on other languages, we have trained and evaluated our system on six other languages of various typological origins: Arabic (AR), Chinese (ZH), English (EN), German (DE), Romanian (RO) and Russian (RU). The experimental setup for these languages is different from the one used for French: we used the original split into train, development and test sets, as defined in the Universal Dependencies corpora. We report in Table 3 the corpora used for each language as well as the sizes of the training, development and test sets. We did not run experiments on the tagparser, for it gave poor results in our experiments on French. Besides, as a sanity check, we reran the experiments on the French data using the original split, in order to make sure that the difference in experimental conditions did not yield important differences in the results. The results of these experiments can be found in Table 4. The table indicates the p-value of the difference between each system and the next best performing one; the system with the worst performances is therefore not associated with a p-value.

Results on Other Languages
The results obtained on French are lower than those obtained using the k-fold strategy. But the drop is moderate (0.13% for the tagger and 0.52% for the parser) and could be explained simply by the difference in the test corpora on which the systems were evaluated.
We observe more or less the same pattern for the new languages: the highest performances are reached by reinforcement learning with backtrack (RLB), for both the tagger and the parser. The second best performing systems for tagging are usually trained with reinforcement learning, but the differences are usually not significant. In the case of the parser, the second best performing systems are trained in a supervised regime but, as was the case for tagging, the differences are often not significant. The performances on Arabic are different: no significant advantage was observed when using backtracking. The reason for this is the agglutinative nature of Arabic and the tokenization conventions of UD, which tokenize agglutinated pronouns. The effect of this tokenization is to increase the distance between content words. The most common pattern that triggers a backtrack in tagging consists in going back to the previous word in order to modify its part of speech. In the case of Arabic, if the target of the backtrack has an agglutinated pronoun, the tagger has to perform two successive BACK actions to realize the correction, a pattern that is more difficult to learn.
The general conclusion that we can draw, therefore, is that reinforcement learning with backtrack yields the best performances for both the parser and the tagger (with the exception of Arabic), but there are no notable differences between supervised learning with a dynamic oracle and reinforcement learning (without backtrack).
The statistics on the situations in which BACK actions are performed are displayed for four languages in Table 5. The table reveals some striking differences for two languages: German and Russian. For these languages, the ratio of BACK actions with respect to the total number of actions predicted is equal to 8.3% for German and 8.2% for Russian, figures that are far above what is observed for other languages. A closer look at the results shows that for these two languages, the machine learns a strategy that consists in provoking errors (it tags linguistic tokens as punctuation), in order to be able to correct them using a BACK action. This behavior is not due to linguistic reasons but rather to the size of the training corpora.
As one can see in Table 3, the training corpora for German and Russian are much larger than those for the other languages. Our hypothesis is that, when trained on a large corpus, the machine has more opportunities to develop complex strategies such as provoking errors in order to correct them using a BACK action. Indeed, this phenomenon vanishes when we reduce the size of the train set.
In order to fight this behavior, one can act on the reward function and decrease the reward of the BACK action as the size of the training corpus increases. More investigation is needed to fully understand this behavior. The reason why the tagger chose to make regular mistakes (tagging as punctuation the words that are later corrected) is not clear. Our hypothesis is that such an error is easy to detect, which makes it easy to predict a BACK action afterwards.
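One way to realize the suggested corpus-size-dependent penalty, purely as an illustration (the schedule and constants below are ours, not tested in the paper), is to scale a base BACK reward by the logarithm of the training set size:

```python
import math

def scaled_back_reward(n_train_tokens, phi0=-0.2, n_ref=100_000):
    """Hypothetical schedule: make BACK costlier on larger training corpora,
    to discourage the 'intentional error' strategy observed on German and
    Russian. n_ref is an assumed reference corpus size below which the base
    reward phi0 is used unchanged."""
    scale = max(1.0, math.log(n_train_tokens / n_ref, 10) + 1.0)
    return phi0 * scale
```

With these constants, a 50k-token corpus keeps the base reward of -0.2, while a 1M-token corpus doubles the penalty to -0.4, making gratuitous backtracking less profitable exactly where the pathological strategy was observed.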
The same phenomenon (intentional errors) has been observed, to a lesser degree, on the parser.

Conclusions
We have proposed, in this article, an extension to transition based parsing that consists in allowing the parser to undo the last actions performed in order to explore alternative hypotheses. We have shown that this kind of model can be effectively trained using deep reinforcement learning. This work will be extended in several ways.
The first one concerns the disappointing results of the tagparser. As already mentioned in Section 4, many studies have shown that there is usually an advantage to jointly predicting several types of linguistic annotations over predicting them separately. We did not observe this phenomenon with our backtracking tagparser. The problem could come from the structure of the backtracking machines. The structures used, illustrated in Figure 2, are just one possible architecture; others are possible, such as dedicating different BACK states to the parser and the tagger.
The second direction concerns the integration of other NLP tasks in a single machine. Dary and Nasr (2021) showed that reading machines can take raw text as input and realize word segmentation, POS tagging, lemmatization, parsing and sentence segmentation. We would like to train such a complex machine with reinforcement learning and backtracking to study whether the machine can backtrack across several linguistic levels.
A third direction concerns the processing of garden path sentences. Such sentences offer examples in which we expect a backtracking machine to backtrack. We have built a corpus of 54 garden path sentences in French, using four different syntactic patterns of different complexity levels, organized the sentences in minimal pairs, and tested our machine on it. The results are mixed: in some cases the machine behaves as expected, in others it does not. A detailed analysis showed that the machine usually backtracks on garden path sentences but has a tendency not to reanalyse the sentence as expected. The problem seems to be of a lexical nature: some words that should be attributed new POS tags in the reanalysis phase resist this reanalysis. This is probably due to their lexical representation. More work is needed to understand this phenomenon and find ways to overcome it.
The last direction is linked to the long term project of predicting regressive saccades.The general idea is to compare back movements predicted by our model and actual eye movement data and study whether these movements are correlated.
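The undo mechanism at the heart of this work can be sketched with a few lines of code. The class below is our own minimal illustration (names and the snapshot strategy are ours, not the paper's implementation): an arc-standard-style transition machine that records a snapshot of its configuration before each action, so that a BACK action can restore an earlier configuration.

```python
from copy import deepcopy

class BacktrackingParser:
    """Minimal sketch of a transition machine whose last actions can be
    undone by a BACK action, implemented by restoring stored snapshots."""

    def __init__(self, words):
        self.stack, self.buffer, self.arcs = [], list(words), []
        self.history = []  # configuration snapshots, one per applied action

    def apply(self, action):
        # Snapshot the configuration so this action can later be undone.
        self.history.append(
            (deepcopy(self.stack), list(self.buffer), list(self.arcs)))
        if action == "SHIFT":
            self.stack.append(self.buffer.pop(0))
        elif action == "LEFT" and len(self.stack) >= 2:
            dep = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dep))  # (head, dependent)
        elif action == "RIGHT" and len(self.stack) >= 2:
            dep = self.stack.pop()
            self.arcs.append((self.stack[-1], dep))

    def back(self, k=1):
        """Undo the last k actions by restoring an earlier snapshot."""
        for _ in range(k):
            self.stack, self.buffer, self.arcs = self.history.pop()
```

For example, after shifting "the" and "old" and applying a LEFT arc, a single BACK restores the pre-arc configuration, letting the machine predict a different action in its place; stacking k snapshots likewise supports the two successive BACK actions needed in the Arabic case discussed above.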
• Empty stack: target word is in the empty stack.
• No dep / gov: target word is the dependent / governor of a word without one.
• Not seen: target word is in right context, and has not been seen yet.
• Erased: target value has been erased after a BACK action.

Figure 1: Three simple reading machines. The top left machine performs POS tagging, the top right one unlabeled dependency parsing, and the bottom one performs the two tasks simultaneously.

Figure 2: Three backtracking machines based on the machines of Figure 1.

Figure 3 shows how the tagparsing machine of Figure 2 would ideally process the sentence the old man the boat, a classical garden path sentence for which two words (old and man) should be reanalysed after the noun phrase the boat has been read. The figure describes the machine configuration each time it is in state BACK. Three tapes are

Figure 4: Probabilities of choosing the next action during RL: at random, following the oracle, or following the model.

Figure 5: Structure of the classifier.

Table 1: Performances of the tagger, parser and tagparser under three learning regimes on our French corpus.

Table 2: Behavior comparison of the three RLB machines.

Table 3: Corpora used for the experiments on the new languages, with the sizes of the training, development and test sets (in tokens).

Table 4: Results for seven languages.

Table 5: Statistics for BACK actions performed during tagging for four languages.