Abstract
Greedy algorithms for NLP such as transition-based parsing are prone to error propagation. One way to overcome this problem is to allow the algorithm to backtrack and explore an alternative solution in cases where new evidence contradicts the solution explored so far. In order to implement such a behavior, we use reinforcement learning and let the algorithm backtrack in cases where such an action gets a better reward than continuing to explore the current solution. We test this idea on both POS tagging and dependency parsing and show that backtracking is an effective means to fight against error propagation.
1 Introduction
Transition-based parsing has become a major approach in dependency parsing, since the work of Yamada and Matsumoto (2003) and Nivre et al. (2004), for it combines linear time complexity with high linguistic performance. The algorithm follows a local and greedy approach to parsing that consists of selecting, at every step of the parsing process, the action that maximizes a local score, typically computed by a classifier. The selected action is greedily applied to the current configuration of the parser and yields a new configuration.
At training time, an oracle function transforms the correct syntactic tree of a sentence into a sequence of correct (configuration, action) pairs. These pairs are used to train the classifier of the parser. The configurations that do not pertain to the set of correct configurations are never seen during training.
At inference time, if the parser predicts and executes an incorrect action, it produces an incorrect configuration, with respect to the sentence being parsed, which might have never been seen during training, yielding a poor prediction of the next action to perform. Additionally, the parser follows a single hypothesis by greedily selecting the best scoring action. The solution built by the parser can be sub-optimal for there is no guarantee that the sum of the scores of the actions selected maximizes the global score.
These are well-known problems of transition-based parsing, and several solutions have been proposed in the literature to overcome them; they will be briefly discussed in Section 2. The solution we propose in this article consists of allowing the parser to backtrack. At every step of the parsing process, the parser has the opportunity to undo its n previous actions to explore alternative solutions. The decision whether to backtrack is taken each time a new word is considered, before trying to process it, by giving the current configuration to a binary classifier that assigns a score to the backtracking action. Traditional supervised learning is not suited to learning such a score, since the training data contains no occurrences of backtrack actions. In order to learn in which situations a backtrack action is worthwhile, we use reinforcement learning. During training, the parser has the opportunity to try backtracking actions, and the training algorithm responds to this choice by granting it a reward. If the backtracking action is the adequate move to make in the current configuration, it receives a positive reward, and the parser learns in which situations backtracking is adequate.
The work presented here is part of a more ambitious project which aims at modeling eye movements during human reading. More precisely, we are interested in predicting regressive saccades: eye movements that bring the gaze back to a previous location in the sentence.
There is much debate in the psycholinguistic literature concerning the reasons for such eye movements (Lopopolo et al., 2019). Our position with respect to this debate is the one advocated by Rayner and Sereno (1994), for whom part of these saccades are linguistically motivated and happen in situations where the reader's incremental comprehension of the sentence is misguided by an ambiguous sentence start, until a novel word is reached whose integration proves incompatible with the current understanding and triggers a regressive saccade, as in garden path sentences. Our long-term project is to model regressive saccades with backtracking actions. Although this work is part of that long-term project, the focus of this article is on the nlp aspects of the program, and we propose a way to implement backtracking in the framework of transition-based parsing. We only mention some preliminary studies on garden path sentences in the Conclusion.
In order to move in the direction of a more cognitively plausible model, we add two constraints to our model.
The first one concerns the text window around the current word that the parser takes into account when predicting an action. This window can be seen as an approximation of the sliding window introduced by McConkie and Rayner (1975) to model the perceptual span of a human reader.1 Transition-based parsers usually allow taking into account the right context of the current word. The words in the right context constitute a rich source of information and yield better predictions of the next action to perform.
In our model, the parser does not have access to this right context, simulating the fact that a human reader has only limited access to it (a few characters to the right of the fixation point; McConkie and Rayner, 1976). It is only after backtracking that a right context becomes available, for it was uncovered before the backtrack took place.
The second constraint is incrementality. When performing several tasks, such as pos tagging and parsing, as will be done in the tagparser described in Section 3, these tasks are organized in an incremental fashion. At each step, a word is read, pos tagged, and integrated in the syntactic structure, a more cognitively plausible behavior than a sequential approach where the whole sentence is first pos tagged and then parsed.
The structure of the paper is the following: In Section 2, we compare our work to other approaches in transition-based parsing which aim at proposing solutions to the two problems mentioned above. In Section 3, we describe the model that we use to predict the linguistic annotation and introduce the notion of a back action. In Section 4 we show how backtracking actions can be predicted using reinforcement learning. Section 5 describes the experimental part of the work and discusses the results obtained. Section 6 concludes the paper and presents different directions in which this work will be extended.
2 Related Work
Several ways to overcome the two limits of transition-based parsing mentioned in the Introduction, namely, training the parser with only correct examples and exploring only a single hypothesis at inference time, have been explored in the literature.
Beam Search
The standard solution to the single-hypothesis problem is beam search, which allows considering a fixed number of solutions in parallel during parsing. Beam search is a general technique that has been applied to transition-based parsing in many studies, including Zhang and Clark (2008), Huang and Sagae (2010), and Zhang and Nivre (2012). These studies show that exploring a fixed number of solutions improves linguistic performance over a single-hypothesis parser. In this work, we do not use a beam search algorithm, but we do explore several solutions, in a non-parallel fashion, using backtracking.
Dynamic Oracles
A first way to overcome the lack of exploration during training is the proposition of Goldberg and Nivre (2012) to replace the standard oracle of transition-based parsing with a dynamic oracle that is able to determine the optimal action a to perform in an incorrect configuration c. During training, the dynamic oracle explores a larger part of the configuration space than the static oracle and produces, for an incorrect configuration c, an optimal action a. The pair (c,a) is given as a training example to the classifier, yielding a more robust classifier that is able to predict the optimal action in some incorrect configurations. Ballesteros et al. (2016) show that the principle of the dynamic oracle can be adapted to train the greedy Stack-LSTM dependency parser of Dyer et al. (2015), improving its performance.
Yu et al. (2018) describe a method for training a small and efficient neural network model that approximates a dynamic oracle for any transition system. Their model is trained using reinforcement learning.
In this paper, we use dynamic oracles in two different ways. First, as baselines for models allowing some exploration during training. Second, in the definition of the immediate reward function, as explained in Section 4.
Reinforcement Learning
A second answer to the lack of exploration problem is reinforcement learning. The exploitation–exploration trade-off of reinforcement learning allows the model, at training time, to explore some incorrect configurations and to learn a policy that selects, in such cases, the action that maximizes the long-term reward. Reinforcement learning is therefore well suited to transition-based algorithms.
To the best of our knowledge, two papers directly address the idea of training a syntactic transition-based parser with reinforcement learning: Zhang and Chan (2009) and Lê and Fokkens (2017).
Zhang and Chan (2009) cast the problem, as we do, as a Markov Decision Process, but use a Restricted Boltzmann Machine to compute action scores and the SARSA algorithm to compute an optimal policy. In our case, we use deep Q-learning based on a multilayer perceptron, as described in Section 4. In addition, their immediate reward function is based on the number of arcs of the gold tree of a sentence that are absent from the tree being built. We also use an immediate reward function, but it is based on the number of arcs of the gold tree that can no longer be built given the current tree being built, an idea introduced in the dynamic oracle of Goldberg and Nivre (2012).
Two major differences distinguish our approach from the work of Lê and Fokkens (2017). The first is their idea of pre-training a parser in a supervised way and then fine-tuning its parameters using reinforcement learning. In our case, the parser is trained from scratch using reinforcement learning. The reason for this difference is related to our desire to learn how to backtrack: It is difficult to make the parser learn to backtrack when it has been initially trained not to do so (using standard supervised learning). The second major difference is their use of a global reward function, which is computed after the parser has parsed a sentence. In our case, as mentioned above, we use an immediate reward. The reason for this difference is linked to pre-training. Since we do not pre-train our parser, and allow it to backtrack, granting a reward at the end of the sentence does not allow the parser to converge, since reaching the end of the sentence is almost impossible with a non-pre-trained parser. Other, less fundamental, differences distinguish the two approaches, such as our use of Q-learning to find an optimal policy, where they use a novel algorithm called Approximate Learning Gradient, based on the fact that their model's output is a probability distribution over parsing actions (as in standard supervised learning). Another minor difference is the exploration of the search space during training: They sample the next action to perform from the probability distribution computed by their model, while we use an adaptation of ε-greedy that we describe in Section 4.
Reinforcement learning has also been used to train a transition-based model for other tasks. Naseem et al. (2019) present a method to fine-tune the Stack-LSTM transition-based AMR parser of Ballesteros and Al-Onaizan (2017) with reinforcement learning, using the Smatch score of the predicted graph as reward. For semantic dependency parsing, Kurita and Søgaard (2019) found that fine-tuning their parser with a policy gradient allows it to develop an easy-first strategy, reducing error propagation.
The fundamental difference between all the papers cited above and our work is the idea of adding backtracking to a greedy transition-based model. We use reinforcement learning as a means to achieve the exploration necessary for this goal. In terms of parsing performance, our method fares lower than the state of the art in transition-based parsing because our parser is constrained not to see words beyond the current word, a constraint that comes from the fact that our long-term goal is not to improve parsing performance but to find a natural way to encourage a parser to simulate the regressive saccades observed during human reading.
3 Backtracking Reading Machines
Our model is an extension of the Reading Machine, a general model for nlp proposed in Dary and Nasr (2021) that generalizes transition-based parsing to other nlp tasks. A Reading Machine is a finite automaton whose states correspond to linguistic levels. There can be, for example, one state for pos tagging, one state for lemmatization, one state for syntactic parsing, and so on. When the machine is in a given state, an action is predicted, which generally writes on an output tape a label corresponding to the prediction just made.2 There are usually as many output tapes as there are levels of linguistic predictions and at least one input tape that usually contains the words of the sentence to process.3 Predictions are realized by classifiers that take as input the configuration of the machine and compute a probability distribution over the possible actions. Configurations are complex objects that describe the aspects of the machine that are useful in order to predict the next action to perform. Among the important elements of a configuration for the rest of the paper, we can cite the current state of the machine, the word index (noted as wi) that is the position of the word currently processed, and the history, a list of all actions performed so far.
The text is read word by word; a window of arbitrary size centered on wi defines the part of the text that can be used to make predictions about the current word.
Figure 1 shows the architecture of three simple machines. The two machines in the top part of the figure realize a single task. The machine on the left part realizes pos tagging. It is made of a single state and has a series of transitions that loop on the state. The machine has one input tape from which words are read and one output tape, on which predicted pos are written. Each transition is labeled with a tagging action of the form pos(p) that simply writes the pos tag p on the pos output tape at the word index position.
The machine on the right implements an arc-eager transition-based parser that produces unlabeled dependency trees. It has the same simple structure as the tagging machine. Its transitions are labeled with the four standard actions of unlabeled arc-eager transition-based parsing: left, reduce, shift, and right. The machine has two input tapes, one for words and one for pos tags, and one output tape on which it writes the index of the governor of the current word when a left or right action is predicted.
The machine on the bottom part of the figure, which we call a tagparser, realizes the two tasks simultaneously in an incremental fashion. When in state pos, the machine tags the current word then control jumps to the parser in order to attach the current word to the syntactic structure built so far or to store it in a stack. Once this is done, control returns to state pos to tag the next word. The reason why the transitions labeled right and shift lead to state pos is that these two actions increase the word index wi and initiate the processing of a new word. The machine has one input tape for words and two output tapes, one for pos tags and one for the governor position.
We augment the machines described above in order to allow them to undo some preceding actions. This ability relies on three elements: (a) the definition of a new action, called back, that undoes a certain number of actions; (b) the history of the actions performed so far in order to decide which actions to undo; and (c) the definition of undoing an action.
Undoing an action amounts to recovering the configuration that existed before the action was performed. This is quite straightforward for tagging and parsing actions.4
Three backtracking machines, based on the machines of Figure 1, are shown in Figure 2. They all have an extra state, named back, and two extra transitions. When in state back, the machine predicts one of the two actions back or ¬back. When action ¬back is selected, control jumps either to state pos or to state syn, depending on the machine, and the machine behaves like the simple machines of Figure 1. Action ¬back does not modify the configuration of the machine. This is not the only possible architecture for backtracking machines; this point will be briefly addressed in the Conclusion.
If action back is selected, the last actions of the history are undone until a ¬back action is reached. This definition of action back allows undoing all the actions that are related to the previous word. After back has been applied, the configuration of the machine is almost the same as the configuration it was in before processing the current word. There is, however, a major difference: It now has access to the following word. Otherwise, the machine would deterministically predict the same action it had predicted before. One can notice that the transition labeled back loops on a single state; this feature allows the machine to perform several successive back actions.
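A minimal sketch of the back action, assuming configurations are snapshotted each time ¬back is predicted; the data structures and names are illustrative, not the actual Reading Machine implementation:

```python
# Sketch of the back action: undo every action performed since the last ¬back
# by restoring the configuration that was saved at that point.

def apply_back(history, snapshots):
    """history: list of action names; snapshots: configurations saved at each 'notback'."""
    # pop actions until (and including) the most recent 'notback'
    while history and history.pop() != "notback":
        pass
    return snapshots.pop()  # configuration as it was before processing the current word

# toy usage: actions performed for one word, then a back
history = ["notback", "pos(ADJ)", "shift"]
snapshots = [{"word_index": 1, "stack": [0]}]   # saved when 'notback' was predicted
config = apply_back(history, snapshots)
print(config)   # {'word_index': 1, 'stack': [0]} -> word 1 is re-processed,
                # this time with the next word visible in the right context
```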
Figure 3 shows how the tagparsing machine of Figure 2 would ideally process the sentence the old man the boat, a classical garden path sentence for which two words (old and man) should be re-analyzed after the noun phrase the boat has been read. The figure describes the machine configuration each time it is in state back. Three tapes are represented: the input tape that contains the tokens, the pos tape, and the parsing tape that contains the index of the governor of the current word. The figure also represents, at the bottom, the ba array, which is described below, as well as the sequence of actions predicted since the last visit to state back. The current word appears in boldface. The figure shows two successive occurrences of a back action after the second determiner is read, leading to the re-analysis of the word old, which was tagged adj, and of the word man, which was tagged noun.
A backtracking machine such as the one described above can run into infinite loops: Nothing prevents it from repeating endlessly the same sequence of actions. One can hope that, during training, such behavior leads to poor performance and is ruled out, but there is no guarantee that this will be the case. Furthermore, we would like to prevent the machine from exploring, at inference time, the whole (or a large part of the) configuration space. In order to do so, we introduce a constraint on the number of times a back action can be taken when parsing a sentence. A simple way to introduce such a constraint is to limit to a given constant k the number of authorized back actions per word. This feature is implemented by introducing an array ba of size n, where n is the number of words of the sentence to process. Array ba is initialized with zeros, and every time a back action is predicted at the current word position wi, the value of ba[wi] is incremented. When the machine is in state back and ba[wi] is equal to k, performing a back action is not permitted: A ¬back action is forced, bypassing the classifier that decides whether to backtrack or not.
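This per-word budget can be sketched as follows; the function names are ours, and k = 1 is the value used in our experiments:

```python
# Per-word budget on back actions: at most k backs for each word position.
K = 1  # value used in the experiments

def back_allowed(ba, word_index, k=K):
    """ba: list of back counts, one entry per word of the sentence."""
    return ba[word_index] < k

def register_back(ba, word_index):
    ba[word_index] += 1

# usage: before querying the classifier in state 'back'
n = 5                      # sentence length
ba = [0] * n               # initialised with zeros
wi = 2                     # current word index
if back_allowed(ba, wi):
    # let the classifier choose between back and ¬back;
    # call register_back(ba, wi) only if back is actually predicted
    register_back(ba, wi)
else:
    pass                   # force ¬back, bypassing the classifier
```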
The introduction of array ba and the parameter k defines an upper bound on the size of the action sequence for a sentence of length n. This upper bound is equal to 3nk + 2n for the tagger, 4nk + 3n for the parser, and 5nk + 4n for the tagparser.5 As one can notice, linearity is preserved. In our experiments, we chose k = 1.
4 Training
Reading Machines, as introduced by Dary and Nasr (2021), are trained in a supervised learning fashion. Given data annotated at several linguistic levels, an oracle function decomposes it into a sequence of configurations and actions (c0, a0, c1, a1, …, cn, an). This sequence constitutes the training data of the classifiers of the machine: Pairs (ci, ai) are presented iteratively to the classifier during the training stage. A backtracking Reading Machine cannot be trained this way since there are no occurrences of back actions in the data. In order to learn useful occurrences of such actions, the training process should have the ability to generate some back actions and to measure whether they were beneficial.
In order to implement such a behavior, we use reinforcement learning (rl). We cast our problem as a Markov Decision Process (mdp). In an mdp, an agent (the parser) is in configuration ct at time t. It selects an action at from an action set (made of the tagging actions, the parsing actions, and the back and ¬back actions) and performs it. We note C the set of all configurations and A the set of actions. The environment (the annotated data) responds to action at by giving a reward rt = r(ct, at) and by producing the succeeding configuration ct+1 = δ(ct, at). In our case, configuration ct+1 is deterministically determined by the structure of the machine. The reward function gives a high reward to actions that move the parser towards the correct parse of the sentence. The fundamental difference with supervised training is that, during training, the agent is not explicitly told which actions to take, but instead must discover which action yields the most reward through trial and error. This feature gives the parser the opportunity to try some back actions, provided that a reward can be computed for such actions.
The optimal action-value function Q* satisfies, in our deterministic setting, Q*(c,a) = r(c,a) + γ max_a′ Q*(δ(c,a), a′),6 and Q-learning iteratively updates an estimate Q(c,a) towards this quantity. It has been proven (Watkins and Dayan, 1992) that this iterative algorithm converges towards the Q* function.
In order to store the values of Q, the basic algorithm uses a large table that has an entry for every (configuration, action) pair. In our case, there are far too many configurations to allocate such a table. Instead, we use a simple form of deep Q-learning (Mnih et al., 2013) and approximate the function Q with a multilayer perceptron CQ.
CQ takes as input a configuration c and outputs a vector whose dimension is the number of different actions in the action set. The component of the vector corresponding to action a is the approximation of Q(c,a) computed by CQ. It is noted CQ(c,a).
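As an illustration, one Q-learning update in this deterministic setting can be sketched as follows; `model` stands for CQ applied to an already encoded configuration, and the discount factor gamma and squared-error loss are standard choices not specified in the text:

```python
import torch
import torch.nn.functional as F

# Sketch of one deep Q-learning step: the target for Q(c_t, a_t) is
# r_t + gamma * max_a Q(c_{t+1}, a), since delta(c, a) is deterministic.
# `model` maps a configuration feature vector to a vector of action scores (CQ).
def q_learning_step(model, optimizer, c_t, a_t, r_t, c_next, gamma=0.99, terminal=False):
    q_pred = model(c_t)[a_t]                                   # CQ(c_t, a_t)
    with torch.no_grad():
        best_next = 0.0 if terminal else model(c_next).max().item()
        target = torch.tensor(r_t + gamma * best_next)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```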
Reward Functions
In rl, the training process is guided by the reward r(c,a) granted by the environment when action a is performed in configuration c. Defining a reward function for tagging and parsing actions is quite straightforward.
In the case of tagging, the reward should be high when the tag chosen for the current word is correct and low when it is not. A simple reward function associates, for example, value 0 with the first case and −1 with the second. More elaborate reward functions could be defined that penalize certain confusions more heavily (e.g., tagging a verb as a preposition).
In the case of parsing, a straightforward reward function gives a reward of zero for a correct action and a negative reward for an incorrect one. We use a slightly more complex function inspired by the dynamic oracle of Goldberg and Nivre (2012). In the case of an incorrect action a, this function counts the number of dependencies of the correct analysis that can no longer be predicted because of a; the reward for a is the opposite of this number. Actions that cannot be executed, such as popping an empty stack, are granted a reward of −1.5.
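As an illustration, the two immediate rewards can be sketched as follows; the helper that counts no-longer-reachable gold arcs stands for the dynamic-oracle computation and is left abstract:

```python
# Sketch of the immediate rewards for tagging and parsing actions.
# The count of gold dependencies made unreachable by an action is what the
# dynamic oracle of Goldberg and Nivre (2012) computes; it is passed in as a callable.

def tagging_reward(predicted_tag, gold_tag):
    return 0.0 if predicted_tag == gold_tag else -1.0

def parsing_reward(action, config, gold_arcs, is_executable, n_unreachable_gold_arcs):
    """n_unreachable_gold_arcs(config, action, gold_arcs) -> number of gold
    dependencies that can no longer be built once `action` is applied."""
    if not is_executable(action, config):      # e.g. popping an empty stack
        return -1.5
    return -float(n_unreachable_gold_arcs(config, action, gold_arcs))
```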
Defining the reward function for back actions is more challenging. When a back action is performed, a certain number of actions ai…ai+k are undone. Each of these actions was granted a reward ri…ri+k. Let us call E the opposite of the sum of these rewards (E ≥ 0). The larger E is, the more errors have been made. Let φ(E) be the function that computes the reward for executing a back action, given E. Formally, we want φ to respect the following principles:
Don’t execute a back action if there are no errors: φ(0) < 0.
φ(E) should be increasing with respect to E: the more errors, the more a back action should be encouraged.
φ(E) should not grow too fast with respect to E. Granting too much reward to back actions encourages the system to make errors in order to correct them with a highly rewarded back action.
The function we use to compute the reward of a back action is chosen to satisfy these three principles.
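For illustration only, here is one possible function satisfying the three constraints; it is not necessarily the exact φ used in our experiments:

```python
import math

# One possible reward for a back action: negative at E = 0, increasing in E,
# and growing slowly (sub-linearly) so that deliberately making errors in order
# to collect back rewards does not pay off. Illustrative choice only.
def back_reward(E):
    return -0.5 + math.log(1.0 + E)

print(back_reward(0))    # -0.5  -> backing off with no errors is penalised
print(back_reward(2))    # ~0.6  -> backing off after errors is moderately rewarded
```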
Exploring the Configuration Space
In order to learn useful back actions, the parser should try, during the training phase, to perform some back actions and update the classifier based on the reward it receives from the environment. One standard way to do this is to adopt an ε-greedy policy in which the model selects a random action with probability ε or the most probable action as predicted by the classifier with probability 1 − ε. This setup allows the system to perform exploitation (choosing the best action so far) as well as exploration (randomly choosing an action). We adopt a variant of this policy, based on two parameters ε and β with 0 ≤ ε ≤ 1, 0 ≤ β ≤ 1, and ε + β ≤ 1. As in the standard ε-greedy policy, the agent chooses a random action with probability ε; it chooses the correct action as given by the oracle7 with probability β; and, finally, it chooses the most probable action as predicted by the classifier with probability 1 − (ε + β). Parameter β has been introduced in order to speed up training. At the beginning of the training process, the system is encouraged to follow the oracle. Then, as training progresses, the system relies more on its own predictions (exploitation increases) and less on the oracle. Figure 4 shows the evolution of these parameters.
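A sketch of this policy (the names are ours; ε and β are decayed over training as shown in Figure 4):

```python
import random

# Exploration policy: random action with prob. epsilon, oracle action with
# prob. beta, best action according to the classifier otherwise.
def choose_action(q_values, actions, oracle_action, epsilon, beta):
    assert 0.0 <= epsilon and 0.0 <= beta and epsilon + beta <= 1.0
    r = random.random()
    if r < epsilon:
        return random.choice(actions)                      # exploration
    if r < epsilon + beta:
        return oracle_action                               # follow the oracle
    return max(actions, key=lambda a: q_values[a])         # exploitation
```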
5 Experiments
Three machines were used in our experiments: a tagger, a parser, and a tagparser, based on the architectures of Figure 2. Each of the three machines was trained in three different learning regimes: Supervised Learning (sl) using a dynamic oracle, Reinforcement Learning without back actions (rl), and Reinforcement Learning with back actions (rlb).
5.1 Universal Dependencies Corpus
Our primary experiments were conducted on a French Universal Dependencies corpus (Zeman et al., 2021), more specifically the GSD corpus, consisting of 16,341 sentences and 400,399 words. The original split of the data was 88% train, 9% dev, and 3% test. The test set being too small to obtain significant results, we decided to use a k-fold strategy: All the data was first merged, randomly shuffled, and then split into ten folds, each fold becoming the test set of a new train/dev/test split of the data in the proportions 80%/10%/10%.
Using all ten folds turned out to be unnecessary to obtain significant results; we therefore limited ourselves to three folds. The test set for which results are reported contains 4,902 sentences and 124,560 words.
5.2 Experimental Setup
Each machine consists of a single classifier, a multilayer perceptron with a single hidden layer of size 3200. When the machine realizes several tasks, as in the case of the tagparser, the classifier has one decision layer per task. The output size of each decision layer is the number of actions of its corresponding task. A dropout of 30% is applied to the input vector and to the output of the hidden layer, which is passed through a ReLU activation. The structure of the classifier is represented in Figure 5.
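A sketch of this classifier in PyTorch; the exact ordering of dropout and ReLU on the hidden layer and the naming of the task heads are our assumptions:

```python
import torch
import torch.nn as nn

# Shared hidden layer of size 3200, 30% dropout on the input and on the hidden
# representation, and one decision layer per task
# (e.g. {"tagger": n_tag_actions, "parser": n_parse_actions}).
class MachineClassifier(nn.Module):
    def __init__(self, input_size, actions_per_task, hidden_size=3200, dropout=0.3):
        super().__init__()
        self.input_dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(input_size, hidden_size)
        self.hidden_dropout = nn.Dropout(dropout)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in actions_per_task.items()}
        )

    def forward(self, features, task):
        x = self.input_dropout(features)
        x = torch.relu(self.hidden(x))
        x = self.hidden_dropout(x)
        return self.heads[task](x)     # one score per action of the given task
```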
The details of the features extracted from configurations and their encoding in the input layer of the classifier are detailed in Appendix B.
In the case of Supervised Learning, the machines are trained with a dynamic oracle. At the beginning of the training process, the machines are trained with a static oracle for two epochs. Then, every two epochs, the machines are used to decode the training corpus; for each configuration produced (which can be incorrect), the dynamic oracle selects the optimal action, and these (configuration, action) pairs are used to train the classifier.
Training the machines in the rl regime takes longer than training them in the sl regime: 200 epochs were needed in the first case and 300 in the second. This difference is probably due to the larger exploration of the configuration space.
5.3 Results - Performance
The results for the three machines, under the three learning regimes, are shown in Table 1. pos tagging performance is measured with accuracy and displayed in the UPOS column. Dependency parsing performance is measured with the unlabeled attachment score (the ratio of words attached to the correct governor) and displayed in the UAS column. The p-value next to each score indicates whether the score is significantly better than the one below it (this is why the last line is never given a p-value). This p-value has been estimated with the paired bootstrap resampling algorithm (Koehn, 2004), using the script (Popel et al., 2017) of the CoNLL 2018 shared task.
Table 1: UPOS accuracy and UAS for the three machines under the three learning regimes. In the first block, UPOS is the tagger's score and UAS the parser's; the second block reports the tagparser's scores.

| Regime | UPOS | p val. | UAS | p val. |
|---|---|---|---|---|
| TAGGER / PARSER | | | | |
| rlb | 97.65 | 0.000 | 88.21 | 0.000 |
| rl | 96.84 | 0.000 | 86.60 | 0.037 |
| sl | 96.11 | __ | 86.17 | __ |
| TAGPARSER | | | | |
| rlb | 97.06 | 0.000 | 87.85 | 0.001 |
| rl | 96.73 | 0.090 | 87.12 | 0.211 |
| sl | 96.59 | __ | 86.94 | __ |
The table shows the same pattern for the three machines: The rlb regime obtains higher results than the simple rl regime, which is itself better than the sl regime. Two important conclusions can be drawn from these results. The first is that the rlb regime is consistently better than sl: Backtracking machines make fewer errors than machines trained in supervised mode. At this point we do not know whether this superiority comes from reinforcement learning or from the addition of a back action; indeed, previous experiments by Zhang and Chan (2009) and Lê and Fokkens (2017) showed that reinforcement learning (without backtracking) can lead to better results than supervised learning. The second is that the comparison of rlb and rl shows that most of the performance boost comes from backtracking.
The results of Table 1 also show that the tagparser obtains better results than the single-task machines (the tagger and the parser) when trained with supervised learning. Note that this comparison is possible because the parser was not given gold pos tags as input, but the ones predicted by the tagparser. These results are in line with the work of Bohnet and Nivre (2012) and Alberti et al. (2015), who show that jointly predicting pos tags and the syntactic tree improves the performance of both. However, this is not true when the machines are trained with reinforcement learning: In this case the parser and the tagger obtain better results than the tagparser. One reason that could explain this difference is the size of the configuration space of the tagparser, which is an order of magnitude larger than that of the tagger or the parser. We will return to this point in the Conclusion.
5.4 Results - Statistics
One can gain a better understanding of the effect of the back actions performed by the three machines from the statistics displayed in Table 2. Each column of the table concerns one machine trained in the rlb regime. The first line shows the total number of actions predicted while decoding the test set, the second line the number of errors made, and the third line the number of back actions predicted. Lines 4 and 5 give the precision and recall of the back actions: The precision is the ratio of back actions that were predicted after an error was made, and the recall is the ratio of errors after which a back action was predicted. These two figures measure the error detection capability of the back action prediction mechanism. In the case of the parser, the precision is equal to 76.86%, which means that 76.86% of the back actions were predicted after an error was made, and 24.52% (the recall) of the errors provoked a back action prediction. The recall constitutes an upper bound on the proportion of errors that could be corrected. The four last lines break the predicted back actions down into four categories: C→C is the case where a back action was predicted after a correct action but did not change that action; E→E is the case where a back action was predicted after an error but the error was not corrected (either the same erroneous action or another erroneous one was predicted); E→C is the case where a back action was predicted after an error and corrected it; and C→E is the case where a correct action was replaced by an incorrect one after a back action.
Table 2: Statistics on back actions for the three machines trained in the rlb regime.

| | PARSER | TAGGER | TAGPARSER |
|---|---|---|---|
| #Actions | 115,588 | 79,620 | 153,764 |
| #Errs | 3,597 | 1,323 | 6,249 |
| #Backs | 1,063 | 891 | 4,491 |
| bPrec | 76.86% | 68.46% | 73.48% |
| bRec | 24.52% | 46.49% | 61.72% |
| C→C | 18.07% | 28.65% | 56.89% |
| E→E | 41.39% | 26.09% | 23.46% |
| C→E | 05.15% | 02.79% | 02.81% |
| E→C | 35.39% | 42.47% | 16.83% |
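For clarity, bPrec and bRec can be computed from a decoding log as sketched below; the log format is hypothetical, not the evaluation code used for the paper:

```python
# Each log entry is (was_error, back_predicted):
# was_error      -> the action just performed was incorrect
# back_predicted -> a back action was predicted right after it
def back_precision_recall(log):
    backs = [e for e in log if e[1]]
    errors = [e for e in log if e[0]]
    b_prec = sum(1 for was_error, _ in backs if was_error) / len(backs)
    b_rec = sum(1 for _, back in errors if back) / len(errors)
    return b_prec, b_rec

# toy usage
log = [(True, True), (True, False), (False, False), (False, True)]
print(back_precision_recall(log))   # (0.5, 0.5)
```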
Several conclusions can be drawn from these statistics.
First, backtracking corrects errors. For the three machines, there are many more cases where an error is corrected than cases where an error is introduced after a back action was predicted (E→C ≫ C→E). This means that the difference in scores observed between rl and rlb in Table 1 can indeed be attributed to backtracking.
Second, backtracking is conservative. The number of predicted back actions is quite low (around 1% of the actions for the tagger and the parser and around 3% for the tagparser) and the precision is quite high. The machines do not backtrack very often and they usually do it when errors are made. This is the kind of behavior we were aiming for. It can be modified by changing the reward function φ of the back action.
Third, tagging errors are easier to correct than parsing errors. The comparison of columns 2 and 3 (parser and tagger) shows that the tagger has a higher recall than the parser: Tagging errors are therefore easier to detect. This comparison also shows that E→C is higher for the tagger than for the parser: Tagging errors are therefore easier to correct.
Lastly, the poor performance of the tagparser does not come from a failure to backtrack: It actually backtracks around three times as often as the parser or the tagger. But it has a hard time correcting the errors; most of the time, it reproduces errors made before. We will return to this point in the Conclusion.
5.5 Results on Other Languages
In order to study the behavior of backtracking on other languages, we trained and evaluated our system on six other languages of various typological origins: Arabic (ar), Chinese (zh), English (en), German (de), Romanian (ro), and Russian (ru). The experimental setup for these languages differs from the one used for French: We used the original split into train, development, and test sets, as defined in the Universal Dependencies corpora. Table 3 reports the corpus used for each language as well as the sizes of the training, development, and test sets. We did not run experiments on the tagparser, for it gave poor results in our experiments on French. Additionally, as a sanity check, we reran the experiments on the French data using the original split, in order to make sure that the difference in experimental conditions did not yield important differences in the results. The results of these experiments can be found in Table 4. The table indicates the p-value of the difference between one system and the next best performing one; the system with the worst performance is therefore not associated with a p-value.
Table 3: Corpora used for each language and sizes of the train, development, and test sets.

| Lang. | Corpus | Train | Dev | Test |
|---|---|---|---|---|
| ar | PADT | 254,400 | 34,261 | 32,132 |
| zh | GSD | 98,616 | 12,663 | 12,012 |
| en | GUM | 81,861 | 15,598 | 15,926 |
| fr | GSD | 364,349 | 36,775 | 10,298 |
| de | HDT | 2,753,627 | 319,513 | 326,250 |
| ro | RRT | 185,113 | 17,074 | 16,324 |
| ru | SynTagRus | 871,526 | 118,692 | 117,523 |
Table 4: UPOS accuracy (tagger) and UAS (parser) per language and learning regime, using the original UD splits.

| Regime | UPOS | p val. | UAS | p val. |
|---|---|---|---|---|
| English-GUM | | | | |
| rlb | 94.99 | 0.00 | 79.96 | 0.00 |
| rl | 93.53 | 0.00 | 72.97 | |
| sl | 92.63 | | 73.12 | 0.43 |
| French-GSD | | | | |
| rlb | 97.53 | 0.00 | 87.70 | 0.17 |
| rl | 96.64 | 0.10 | 86.63 | |
| sl | 96.25 | | 86.97 | 0.33 |
| German-HDT | | | | |
| rlb | 97.88 | 0.00 | 93.00 | 0.00 |
| rl | 97.29 | 0.47 | 91.26 | |
| sl | 97.28 | | 91.31 | 0.35 |
| Romanian-RRT | | | | |
| rlb | 97.07 | 0.00 | 85.40 | 0.26 |
| rl | 96.28 | 0.03 | 84.70 | |
| sl | 95.77 | | 85.00 | 0.32 |
| Russian-SynTagRus | | | | |
| rlb | 98.41 | 0.00 | 86.59 | 0.00 |
| rl | 97.93 | 0.01 | 85.25 | |
| sl | 97.80 | | 85.27 | 0.46 |
| Chinese-GSD | | | | |
| rlb | 93.01 | 0.00 | 71.70 | 0.00 |
| rl | 91.61 | 0.23 | 64.63 | |
| sl | 91.30 | | 65.79 | 0.13 |
| Arabic-PADT | | | | |
| rlb | 96.43 | 0.16 | 83.81 | 0.47 |
| rl | 96.20 | 0.03 | 83.77 | |
| sl | 95.74 | | 83.95 | 0.38 |
The results obtained on French are lower than those obtained with the k-fold strategy. But the drop is moderate (0.3% for the tagger and 0.52% for the parser) and could be explained simply by the difference in the test corpora on which the systems were evaluated.
We observe more or less the same pattern for the new languages: The highest performance is reached by reinforcement learning with backtracking (rlb), for both the tagger and the parser. The second-best performing systems for tagging are usually trained with reinforcement learning, but the differences are usually not significant. In the case of the parser, the second-best performing systems are trained in the supervised regime but, as was the case for tagging, the differences are often not significant. Arabic behaves differently: No significant advantage was observed when using backtracking. The reason for this is the agglutinative nature of Arabic and the tokenization conventions of UD, which tokenize agglutinated pronouns as separate tokens. The effect of this tokenization is to increase the distance between content words. The most common pattern that triggers a backtrack in tagging consists of going back to the previous word in order to modify its part of speech. In the case of Arabic, if the target of the backtrack has an agglutinated pronoun, the tagger has to perform two successive back actions to realize the correction, a pattern that is more difficult to learn.
The general conclusion that we can draw, therefore, is that reinforcement learning with backtracking yields the best performance for both the parser and the tagger (with the exception of Arabic), and that there are no notable differences between supervised learning with a dynamic oracle and reinforcement learning without backtracking.
The statistics on the situations in which back actions are performed are displayed for four languages in Table 5. The table reveals striking differences for two languages, German and Russian: For these languages, the ratio of back actions to the total number of predicted actions is 8.3% for German and 8.2% for Russian, far above what is observed for the other languages. A closer look at the results shows that, for these two languages, the machine learns a strategy that consists of provoking errors (it tags linguistic tokens as punctuation) in order to be able to correct them with a back action. This behavior is not due to linguistic reasons but rather to the size of the training corpora. As one can see in Table 3, the training corpora for German and Russian are much larger than those of the other languages. Our hypothesis is that, when trained on a large corpus, the machine has more opportunities to develop complex strategies such as provoking errors in order to correct them with a back action. Indeed, this phenomenon vanishes when we reduce the size of the training set. In order to fight this behavior, one could act on the reward function and decrease the reward of back actions as the size of the training corpus increases. More investigation is needed to fully understand this behavior.
Table 5: Statistics on back actions for four languages.

| | en | de | ru | ar |
|---|---|---|---|---|
| #Act | 31,852 | 652,500 | 234,658 | 56,528 |
| #Errs | 1,097 | 54,173 | 19,374 | 1,093 |
| #Backs | 691 | 53,573 | 19,435 | 310 |
| bPrec | 72.94% | 95.88% | 95.64% | 58.06% |
| bRec | 45.94% | 94.83% | 95.94% | 16.47% |
| C→C | 24.31% | 03.77% | 04.21% | 39.35% |
| E→E | 27.06% | 07.35% | 05.40% | 29.03% |
| C→E | 02.75% | 00.35% | 00.16% | 02.58% |
| E→C | 45.88% | 88.53% | 90.23% | 29.03% |
The reason why the tagger chooses to make regular mistakes (tagging as punctuation the words that are later corrected) is not clear. Our hypothesis is that this kind of error is easy to detect when deciding whether to predict a back action.
The same phenomenon (intentional errors) has been observed, to a lesser degree, on the parser.
6 Conclusions
We have proposed, in this article, an extension to transition-based parsing that consists of allowing the parser to undo the last actions performed in order to explore alternative hypotheses. We have shown that this kind of model can be effectively trained using deep reinforcement learning.
This work will be extended in several ways.
The first concerns the disappointing results of the tagparser. As already mentioned in Section 5.3, many studies have shown that there is usually an advantage to jointly predicting several types of linguistic annotations over predicting them separately. We did not observe this phenomenon with our backtracking tagparser. The problem could come from the structure of the backtracking machines: The structures used, illustrated in Figure 2, are just one possible architecture; others are possible, such as dedicating different back states to the parser and the tagger.
The second direction concerns the integration of other nlp tasks in a single machine. Dary and Nasr (2021) showed that reading machines can take as input raw text and realize word segmentation, pos tagging, lemmatization, parsing, and sentence segmentation. We would like to train such a complex machine with reinforcement learning and backtracking to study whether the machine can backtrack across several linguistic levels.
A third direction concerns the processing of garden path sentences. Such sentences offer examples in which we expect a backtracking machine to backtrack. We have built a corpus of 54 French garden path sentences, based on four syntactic patterns of different complexity levels and organized into minimal pairs, and have tested our machine on it. The results are mixed: In some cases the machine behaves as expected, in other cases it does not. A detailed analysis showed that the machine usually backtracks on garden path sentences but has a tendency not to reanalyze the sentence as expected. The problem seems to be of a lexical nature: Some words that should be attributed new pos tags in the reanalysis phase resist this reanalysis, probably because of their lexical representation. More work is needed to understand this phenomenon and find ways to overcome it.
The last direction is linked to the long-term project of predicting regressive saccades. The general idea is to compare back movements predicted by our model and actual eye movement data and study whether these movements are correlated.
Appendix A: Time Complexity of Backtracking
In arc-eager dependency parsing, a sentence of length n is processed in 2n actions: an action that pushes a word on the stack (shift or right) and an action that removes it from the stack (reduce or left). In our backtracking tagparser, we must add one tagging action and one ¬back per word.
Therefore, without applying any back action, the number of actions needed to process a sentence of size n is 4n: for each word, a ¬back, a pos action, a push action, and a pop action.
Let si be the length of the action sequence taking place when processing the ith word of the sentence. The sum of these lengths is also the number of actions needed to process the whole sentence; therefore s1 + s2 + ⋯ + sn = 4n.
Now, in the worst case scenario where back is applied k times per word, the total number of actions is 5nk + 4n, which is the sum of:
k back actions per word: nk.
Initial application of the si sequences: 4n.
The k re-processings of the sequences: k × (s1 + ⋯ + sn) = 4nk (summing the three terms gives the bound, as shown below).
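Summing the three contributions recovers the bound stated in Section 3:

```latex
\underbrace{nk}_{\text{back actions}}
+ \underbrace{\textstyle\sum_{i=1}^{n} s_i}_{\text{initial pass}}
+ \underbrace{k \textstyle\sum_{i=1}^{n} s_i}_{\text{re-processing}}
= nk + 4n + 4nk = 5nk + 4n
```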
Appendix B: Features of the Classifiers
The input layer of the classifiers described in Section 5 is a vector of features extracted from the current configuration. Features are represented by randomly initialized, learnable embeddings of size 128, with the exception of words, which are represented by pretrained fastText word embeddings of size 300 (Bojanowski et al., 2017). Four embedding spaces are used: words, pos tags, letters, and actions.
The features are the following:
The pos tags and forms of the words in a window of size [-2,2] centered on the current word, with the addition, for the parser and the tagparser, of the pos tags of the governor and of the rightmost and leftmost dependents of the three topmost stack elements.
The history of the 10 last actions performed.
Prefix and suffix of size 4 for the current word.
For backtracking machines, a binary feature indicating whether or not a back is allowed.
When the value of a feature is not available, it is replaced by a special learnable embedding representing the reason for its unavailability. The following situations are distinguished (a sketch of the feature encoding follows the list):
Out of bounds: target word is before the first or after the last word of the sentence.
Empty stack: the feature refers to a stack element, but the stack is empty.
No dep / gov: the feature requests the dependent / governor of a word that has none.
Not seen: target word is in right context, and has not been seen yet.
Erased: target value has been erased after a back action.
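A sketch of how these features and special embeddings can be assembled into the input vector; names and structure are illustrative:

```python
# Concatenate learnable embeddings for each feature, with a dedicated
# "reason of unavailability" embedding when a feature has no value.
SPECIAL = ["OUT_OF_BOUNDS", "EMPTY_STACK", "NO_DEP_GOV", "NOT_SEEN", "ERASED"]

def encode_feature(value, embeddings, special_embeddings):
    """value is either a regular feature value or one of the SPECIAL markers."""
    if value in SPECIAL:
        return special_embeddings[value]      # learnable vector for that situation
    return embeddings[value]                  # e.g. a word, pos tag, or action embedding

def build_input(features, embeddings, special_embeddings):
    # `features` is the ordered list of feature values extracted from the
    # configuration (window of words and pos tags, last 10 actions, affixes, ...)
    return [x for f in features for x in encode_feature(f, embeddings, special_embeddings)]
```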
Acknowledgments
We would like to thank the action editor as well as the anonymous reviewers, for their detailed and thoughtful insights, which helped us improve on our work substantially.
Notes
It is actually a rough approximation of the sliding window because it defines a span over words and not characters.
Some actions can be complex and change, for example, the state of an internal stack. All details concerning the Reading Machine can be found in Dary and Nasr (2021).
For the sake of simplicity, we consider in this paper that the text to process has already been segmented into sentences and tokenized into words, even if the Reading Machine allows performing these two operations.
In order to be undone, actions reduce and left also need to store stack elements that have been popped.
See Appendix A for details.
This is actually a simplified form of the Bellman optimality equation, due to the fact that our mdp is deterministic: Applying action a in configuration c yields configuration δ(c,a) with probability 1.
¬back in the case of the back state.