Abstract
We introduce a greedy transition-based parser that learns to represent parser states using recurrent neural networks. Our primary innovation that enables us to do this efficiently is a new control structure for sequential neural networks—the stack long short-term memory unit (LSTM). Like the conventional stack data structures used in transition-based parsers, elements can be pushed to or popped from the top of the stack in constant time, but, in addition, an LSTM maintains a continuous space embedding of the stack contents. Our model captures three facets of the parser's state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of transition actions taken by the parser, and (iii) the complete contents of the stack of partially built tree fragments, including their internal structures. In addition, we compare two different word representations: (i) standard word vectors based on look-up tables and (ii) character-based models of words. Although standard word embedding models work well in all languages, the character-based models improve the handling of out-of-vocabulary words, particularly in morphologically rich languages. Finally, we discuss the use of dynamic oracles in training the parser. During training, dynamic oracles alternate between sampling parser states from the training data and from the model as it is being learned, making the model more robust to the kinds of errors that will be made at test time. Training our model with dynamic oracles yields a linear-time greedy parser with very competitive performance.
1. Introduction
Natural language parsing can be formulated as a series of decisions that read words in sequence and incrementally combine them to form syntactic structures; this formulation is known as transition-based parsing, and is often coupled with a greedy inference procedure (Yamada and Matsumoto 2003; Nivre 2003, 2004, 2008). Greedy transition-based parsing is attractive because the number of operations required to build any projective parse tree is linear in the length of the sentence, making greedy versions of transition-based parsing computationally efficient relative to graph- and grammar-based alternative formalisms, which usually require solving superlinear search problems. The challenge in transition-based parsing is modeling which action should be taken in each of the states encountered as the parsing algorithm progresses.2
Because the parser state involves the complete sentence—which is unbounded in length—the representational challenge faced by the modeler is to find a finite encoding of an infinite space. This challenge is usually dealt with by making assumptions about state equivalence by selecting features of the state believed by the model-builder to be most informative (e.g., the next three words in the sequence, the roots of the two most recently built subtrees and their left- and right-most children, and so on).3
In this article, we advocate a recursive approach, in which complex parser states are represented as the composition of simpler ones, and these are constructed incrementally as parsing proceeds. Thus, we operate in a paradigm in which the model builder designs recursive architectures that automatically discover effective views of the complete parser's evolving state rather than explicitly engineering which features of the state to focus on. This construction enables us to avoid making explicit state-equivalence assumptions. We present one such architecture based on advances in recursively defined neural networks. Our state representation integrates information from a wide range of sources:
- the complete remaining unprocessed words in the input buffer,
- the complete history of parser transition actions on this sequence, and
- the complete contents of the stack of partially constructed syntactic structures.
Our model is related to recent work that uses neural networks in dependency parsing (Chen and Manning 2014; Weiss et al. 2015; Zhou et al. 2015; Andor et al. 2016). That work can broadly be understood as replacing a conventional linear classifier with a neural network classifier but still only considering a “narrow view” of the parser state, whereas our work uses recursively defined networks to incorporate sensitivity to the complete parser state.
Using recursive/recurrent neural networks to compute representations of unboundedly large histories has been proposed previously (Henderson 2003; Titov and Henderson 2007b; Stenetorp 2013; Yazdani and Henderson 2015). Our innovation is to make the architecture of the neural network identical with the structure of stack data structures used to store the parser state. To do so, the technical challenge we must solve is being able to access, in constant time, a fixed-size representation of a stack as it evolves during the course of parsing. The technical innovation that lets us do this is a variation of recurrent neural networks with long short-term memory units (LSTMs), which we call stack LSTMs (Section 2). These can be understood as LSTMs augmented with a stack pointer that is manipulated by push and pop operations (in contrast to classic LSTMs that only ever read inputs sequentially).
Our parsing model uses three stack LSTMs to construct the parser state representation: one containing the unprocessed input tokens, one containing the stack of partial syntactic trees, and one containing the history of parse actions (Section 3). Because the stack of partial syntactic trees may contain both individual tokens and partial syntactic structures, representations of individual tree fragments are computed compositionally with recursive neural networks, similar to Paulus et al. (2014).
The parser depends on vector representations of word tokens. Such representations are most commonly learned as a dictionary lookup function (i.e., a map of each word type to a vector), but they can take a parameterized form that is, for example, sensitive to orthographic or morphological properties of a word's surface form (Botha and Blunsom 2014). For languages with rich inflective morphology, surface clues can be quite helpful for parsing (Ballesteros 2013). In Section 4, we compare the classical lookup-table approach to a recently proposed model that combines representations of individual characters (using sequential LSTMs) to obtain a representation of the word. With no explicit morphological annotation, we find that the latter representation gives a large performance increase when parsing Statistical Parsing of Morphologically Rich Languages (SPMRL) data sets in languages with agglutinative morphology or extensive case systems (Seddah et al. 2013; Seddah, Kübler, and Tsarfaty 2014), and performs as well as the table method in analytic languages. We further find that, without part-of-speech tags, this technique benefits all languages, and that in some cases this approach obviates the need for explicit part-of-speech information.
This article also explores two different approaches to training. Transition-based parsers are trained to optimize the score of the correct parsing action, given a parser state. The simplest strategy for choosing pairs of states and actions for training is one that follows a deterministic static oracle that maps each dependency tree from a treebank into a sequence of states and actions. However, this strategy only generates states that result from a history of correct parsing actions. As a result, models learned with this kind of training may not be robust to errors made at test time. We therefore also investigate a training procedure that aims to teach the parser to make good predictions in sub-optimal states as well, facilitated by the use of dynamic oracles (Goldberg and Nivre 2012). The experiments in this article demonstrate that by coupling stack LSTM-based global state representations with dynamic oracle training, parsing with greedy decoding can achieve state-of-the-art parsing accuracies, rivaling the accuracies of previous parsers that rely on exhaustive search procedures.
This article is an extension of publications by Dyer et al. (2015) and Ballesteros et al. (2015). It contains more detailed background about LSTMs and parsing, a discussion of the effect of random initialization, more extensive experiments with standard data sets, experiments including explicit morphological information that yield much higher results than the ones reported before, and a new training algorithm following dynamic oracles that yields state-of-the-art performance while maintaining fast parsing speed with a greedy model. An open-source implementation of our parser, in all its different instantiations, is available from https://github.com/clab/lstm-parser.
2. Stack LSTMs
In this section we provide a review of LSTMs (Section 2.1), the core component of our parsing algorithm, and then define stack LSTMs (Section 2.2).
Notation. We follow the convention that vectors are written with lowercase, boldface letters (e.g., v or vw); matrices are written with uppercase, boldface letters (e.g., M, Ma, or Mab), and scalars are written as lowercase letters (e.g., s or qz). Structured objects such as sequences of discrete symbols are written with lowercase, bold, italic letters (e.g., w refers to a sequence of input words). We use a semicolon to denote vector concatenation (e.g., [a;b]). Discussion of dimensionality is deferred to the experiments section (Section 6).
2.1 Long Short-Term Memories
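The recurrence that the next paragraph refers to is not reproduced in this excerpt. As a minimal reconstruction (using the parameters Wx, Wh, and d named below and the component-wise logistic sigmoid σ), a simple Elman-style RNN updates its hidden state as

$$\mathbf{h}_t = \sigma\left(\mathbf{W}_x \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1} + \mathbf{d}\right).$$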
That is, the vector xt is “read” in and a linear map is applied to it alongside the previous state vector ht−1, then passed through a nonlinear function (here, a component-wise logistic sigmoid, denoted σ). The state may then be further transformed to produce an output, either at each timestep or just at the end of the sequence.
The parameters of the RNN—here, Wx, Wh, and d—are estimated to minimize a loss function on the training data. The gradient for this loss tends to “vanish” (go to zero) as we trace it back to earlier iterates in a long sequence. This means that long-range dependencies along the input sequence are hard to capture in a learned RNN (Bengio, Simard, and Frasconi 1994).
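LSTMs address this by augmenting the hidden state with a memory cell ct whose contents are regulated by input, forget, and output gates. One widely used formulation (the exact variant used in this article may differ in details such as peephole connections) is

$$\begin{aligned}
\mathbf{i}_t &= \sigma(\mathbf{W}_{ix}\mathbf{x}_t + \mathbf{W}_{ih}\mathbf{h}_{t-1} + \mathbf{b}_i), &
\mathbf{f}_t &= \sigma(\mathbf{W}_{fx}\mathbf{x}_t + \mathbf{W}_{fh}\mathbf{h}_{t-1} + \mathbf{b}_f), \\
\mathbf{o}_t &= \sigma(\mathbf{W}_{ox}\mathbf{x}_t + \mathbf{W}_{oh}\mathbf{h}_{t-1} + \mathbf{b}_o), &
\tilde{\mathbf{c}}_t &= \tanh(\mathbf{W}_{cx}\mathbf{x}_t + \mathbf{W}_{ch}\mathbf{h}_{t-1} + \mathbf{b}_c), \\
\mathbf{c}_t &= \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t, &
\mathbf{h}_t &= \mathbf{o}_t \odot \tanh(\mathbf{c}_t),
\end{aligned}$$

where ⊙ denotes element-wise multiplication. Because ct is updated additively, gradients propagated through it decay far more slowly, which is what makes long-range dependencies easier to learn.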
To improve the representational capacity of LSTMs (and RNNs generally), LSTMs can be stacked in “layers” (Pascanu et al. 2013). In these architectures, the input to the LSTM at a higher layer at time t is the value of ht computed by the layer below (and xt is the input at the lowest layer).
2.2 Stack Long Short-Term Memories
Conventional LSTMs model sequences in a left-to-right order.4 Our innovation is to augment the LSTM with a “stack pointer.” Like a conventional LSTM, new inputs are always added in the right-most position, but in stack LSTMs, the current location of the stack pointer determines which cell in the LSTM provides ct−1 and ht−1 when computing the new memory cell contents.
In addition to adding elements to the end of the sequence, the stack LSTM provides a pop operation that moves the stack pointer to the previous element (i.e., the previous element that was extended, not necessarily the next-right-most element). Thus, the LSTM can be understood as a stack implemented so that contents are never overwritten; that is, push always adds a new entry at the end of the list that contains a back-pointer to the previous top, and pop only updates the stack pointer.5 This control structure is schematized in Figure 1.
By querying the output vector to which the stack pointer points (i.e., hTOP), a continuous-space “summary” of the contents of the current stack configuration is available. We refer to this value as the “stack summary.” Note that this structure allows a given sequence history to have various different “continuations” that are independent from each other but are all linked to the same history. Signals from each continuation can then be backpropagated to the same shared history. Although elements near the top of the stack are likely to influence the stack summary the most, the LSTM has additional flexibility allowing it to learn to extract information from arbitrary points in the stack (Hochreiter and Schmidhuber 1997).
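To make the control structure concrete, the following is a minimal NumPy sketch of a stack LSTM (forward pass only, with no training; all names are our own and do not correspond to the released C++ implementation). The key points are that push computes a new (c, h) pair from whatever state the stack pointer currently designates, and pop only moves the pointer along back-pointers, so nothing is overwritten and hTOP is available in constant time.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StackLSTM:
    """A sketch of a stack LSTM: an LSTM whose 'previous' state is selected
    by a stack pointer, so states are never overwritten."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d, h = input_dim, hidden_dim
        # One weight matrix and bias per gate, applied to [x; h_prev].
        self.W = {g: rng.normal(0.0, 0.1, (h, d + h)) for g in "ifoc"}
        self.b = {g: np.zeros(h) for g in "ifoc"}
        self.states = [(np.zeros(h), np.zeros(h))]  # (c, h); entry 0 = empty stack
        self.back = [None]                          # back-pointers to previous tops
        self.top = 0                                # the stack pointer

    def push(self, x):
        c_prev, h_prev = self.states[self.top]
        z = np.concatenate([x, h_prev])
        i = sigmoid(self.W["i"] @ z + self.b["i"])
        f = sigmoid(self.W["f"] @ z + self.b["f"])
        o = sigmoid(self.W["o"] @ z + self.b["o"])
        g = np.tanh(self.W["c"] @ z + self.b["c"])
        c = f * c_prev + i * g
        h = o * np.tanh(c)
        self.states.append((c, h))       # never overwrite earlier entries
        self.back.append(self.top)
        self.top = len(self.states) - 1  # new top

    def pop(self):
        # Only the pointer moves; the popped state remains for backpropagation.
        assert self.back[self.top] is not None, "cannot pop the empty stack"
        self.top = self.back[self.top]

    def summary(self):
        # h_TOP: the continuous-space summary of the current stack contents.
        return self.states[self.top][1]

if __name__ == "__main__":
    s = StackLSTM(input_dim=4, hidden_dim=8)
    s.push(np.ones(4))
    s.push(np.zeros(4))
    s.pop()                    # move the pointer back; nothing is erased
    print(s.summary().shape)   # (8,)
```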
Although this architecture is novel, it is closely related to the recurrent neural network pushdown automaton of Das et al. (1992), which added an external stack memory to an RNN, to the continuous stack transducers of Grefenstette et al. (2015), and to the model proposed by Joulin and Mikolov (2015). Our formulation preserves the discrete push and pop operations of conventional stack data structures, whereas prior work uses continuous-valued relaxations of these.
3. Dependency Parser
We now turn to the problem of learning representations of dependency parser states. We preserve the standard data structures of a transition-based dependency parser, namely, a buffer of words to be processed (B) and a stack (S) of partially constructed syntactic elements. Each stack element is augmented with a continuous-space vector embedding representing a word and, in the case of S, any of its syntactic dependents. Additionally, we introduce a third stack (A) to represent the history of transition actions taken by the parser.6 Each of these stacks is associated with a stack LSTM that provides an encoding of its current contents. The full architecture is illustrated in Figure 2, and we will review each of the components in turn.
3.1 Parser Operation
The dependency parser is initialized by pushing the words and their representations (we discuss word representations in Section 4) of the input sentence in reverse order onto B such that the first word is at the top of B and the root symbol is at the bottom, and S and A each contain an empty-stack token. At each timestep, the parser computes a composite representation of the stack states (as determined by the current configurations of B, S, and A) and uses that to predict an action to take, which updates the stacks. Processing completes when B is empty (except for the empty-stack symbol), S contains two elements, one representing the full parse tree headed by the root symbol and the other the empty-stack symbol,7 and A is the history of transition operations taken by the parser.
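The state representation and the action distribution are not spelled out in this excerpt. Roughly, following Dyer et al. (2015), the three stack summaries st, bt, and at (for S, B, and A) are concatenated and passed through a learned affine map and nonlinearity to form the parser state vector pt, and the next action is drawn from a softmax restricted to the set A(S, B) of actions that are valid in the current configuration:

$$\mathbf{p}_t = \max\left(\mathbf{0},\, \mathbf{W}[\mathbf{s}_t; \mathbf{b}_t; \mathbf{a}_t] + \mathbf{d}\right), \qquad
p(z_t = z \mid \mathbf{p}_t) = \frac{\exp\left(\mathbf{g}_z^{\top} \mathbf{p}_t + q_z\right)}{\sum_{z' \in \mathcal{A}(S,B)} \exp\left(\mathbf{g}_{z'}^{\top} \mathbf{p}_t + q_{z'}\right)},$$

where gz and qz are an embedding and a bias for action z.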
3.2 Composition Functions
Recursive neural network models enable complex phrases to be represented compositionally in terms of their parts and the relations that link them (Hermann and Blunsom 2013; Socher et al. 2011, 2013a, 2013b). We follow previous work in embedding dependency tree fragments that are present in the stack S in the same vector space as the token embeddings discussed in Section 4.
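The composition function itself is omitted from this excerpt. Roughly, following Dyer et al. (2015), when an attachment is made, the vector h of the head, the vector d of the dependent, and an embedding r of the relation label are combined by a single recursive layer,

$$\mathbf{c} = \tanh\left(\mathbf{U}[\mathbf{h}; \mathbf{d}; \mathbf{r}] + \mathbf{e}\right),$$

and c replaces the head's entry on the stack, so repeated attachments yield representations of arbitrarily deep tree fragments.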
3.3 Transition Systems
Our parser implements various transition systems. The original version (Dyer et al. 2015) implements arc-standard (Nivre 2004) in its most basic set-up. In Ballesteros et al. (2015), we augmented the arc-standard system with the swap transition of Nivre (2009), allowing for nonprojective parsing. For the dynamic oracle training strategy, we moved to an arc-hybrid system (Kuhlmann, Gómez-Rodríguez, and Satta 2011) for which an efficient dynamic oracle is available. It is worth noting that the stack LSTM parameterization is mostly orthogonal to the transition system being used (provided that the system can be expressed as operating on stacks), making it easy to substitute transition systems.
3.3.1 Arc-Standard
The arc-standard transition inventory (Nivre 2004) is given in Figure 4. The shift transition moves a word from the buffer to the stack, and the reduce-left and reduce-right transitions pop the two topmost items from the stack, make one of them the parent of the other, and push the resulting subtree back onto the stack. The arc-standard system allows building all and only projective trees. To parse nonprojective trees, the system can either be combined with the pseudo-projective approach (Nivre and Nilsson 2005) or extended with the swap transition presented in Section 3.3.2.
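Because Figure 4 is not reproduced here, the following simplified Python sketch (our own; it operates on plain lists and an arc set rather than on stack LSTMs) shows what the three arc-standard transitions do to a configuration:

```python
def shift(stack, buffer, arcs):
    # Move the front of the buffer onto the stack.
    stack.append(buffer.pop(0))

def reduce_left(stack, buffer, arcs, label):
    # Pop the two topmost items; the top becomes the head of the second.
    top, second = stack.pop(), stack.pop()
    arcs.add((top, label, second))   # (head, relation, dependent)
    stack.append(top)

def reduce_right(stack, buffer, arcs, label):
    # Pop the two topmost items; the second becomes the head of the top.
    top, second = stack.pop(), stack.pop()
    arcs.add((second, label, top))
    stack.append(second)
```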
3.3.2 Arc-Standard with Swap
In order to deal with nonprojectivity, the arc-standard system can be augmented with a swap transition (Nivre 2009). The swap transition removes the second-to-top item from the stack and pushes it back to the buffer, allowing for the creation of nonprojective trees. We only use this transition when the training data set contains nonprojective trees. The inclusion of the swap transition requires breaking the linearity of the stack by removing tokens that are not at the top of the stack; however, this is easily handled with the stack LSTM. Figure 4 shows how the parser is capable of moving words from the stack (S) to the buffer (B), breaking the linear order of words. Because a node that is swapped may have already been assigned as the head of a dependent, the buffer (B) can now also contain tree fragments.
3.3.3 Arc-Hybrid
For the dynamic oracle training scenario, described in Section 5.2, we switch to the arc-hybrid transition system, which is amenable to an efficient dynamic oracle (Goldberg and Nivre 2013). The arc-hybrid system is summarized in Figure 5. The shift and reduce-right transitions are the same as in arc-standard. However, the reduce-left transition pops the top of the stack and attaches it as a child of the first item in the buffer. Although it is possible to extend the arc-hybrid system to support nonprojectivity by adding a swap transition, this extension would invalidate an important guarantee enabling efficient dynamic oracles.10 We therefore restrict the dynamic-oracle experiments to the fully projective English and Chinese treebanks. To parse nonprojective trees, the arc-hybrid system can instead be combined with the pseudo-projective approach (Nivre and Nilsson 2005).
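For contrast with the arc-standard sketch above, arc-hybrid's reduce-left attaches the top of the stack to the front of the buffer rather than to the second stack item (again our own simplified Python):

```python
def hybrid_reduce_left(stack, buffer, arcs, label):
    # Pop the top of the stack; the front of the buffer becomes its head.
    dependent = stack.pop()
    arcs.add((buffer[0], label, dependent))   # (head, relation, dependent)
```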
4. Word Representations
Our parser's architecture makes heavy use of word representations. We describe two variants. The first is a “standard” approach in which each word is represented as a vector, which is pretrained on unannotated data (Section 4.1). The second is a character-based representation, in which the word's representation is highly dependent on its orthography. These character-based representations are trained without additional resources and work well for implicitly capturing the morphology of words (Section 4.2).
4.1 Standard Word Representations
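The token representation that this subsection builds on is not reproduced in this excerpt. Roughly, following Dyer et al. (2015), each input token is represented by concatenating a learned embedding w of the word type, a fixed pretrained embedding w̃LM, and a learned embedding t of its POS tag, followed by a learned affine map and nonlinearity:

$$\mathbf{x} = \max\left(\mathbf{0},\, \mathbf{V}[\mathbf{w}; \tilde{\mathbf{w}}_{\mathrm{LM}}; \mathbf{t}] + \mathbf{b}\right).$$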
This representation lets us deal flexibly with out-of-vocabulary (OOV) words—those that are OOV in the very limited parsing data but present in the pretraining language model (LM), and words that are OOV in both. To ensure we have estimates of the OOVs in the parsing training data, we stochastically replace (with probability 0.5) each singleton word type in the parsing training data with the UNK token in each training iteration.
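A minimal sketch of the stochastic singleton replacement described above (the helper name is ours, not part of the released implementation):

```python
import random
from collections import Counter

def unk_singletons(training_sentences, p=0.5, unk="UNK"):
    """Replace each singleton word type with UNK with probability p,
    independently in every training iteration."""
    counts = Counter(w for sent in training_sentences for w in sent)
    return [[unk if counts[w] == 1 and random.random() < p else w
             for w in sent]
            for sent in training_sentences]
```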
Pretrained Word Embeddings. There are several options for creating word embeddings, meaning numerous pretraining options are available. However, for syntax modeling problems, embedding approaches that discard word order perform less well (Bansal, Gimpel, and Livescu 2014); therefore, we used a variant of the skip n-gram model introduced by Ling et al. (2015a), named “structured skip n-gram,” where a different set of parameters is used to predict each context word depending on its position relative to the target word.
4.2 Modeling Characters Instead of Words
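The representation that the next paragraph refers to is not reproduced in this excerpt. Roughly, following Ballesteros et al. (2015), a word consisting of characters c1, …, cn is represented by the concatenation of the final hidden states of a forward and a backward character LSTM read over its spelling,

$$\mathbf{w} = \left[\overrightarrow{\mathbf{h}}_{n};\; \overleftarrow{\mathbf{h}}_{1}\right],$$

optionally concatenated with a POS-tag embedding in the +POS configurations of Section 6.4.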
Note that under this representation, OOV words are still represented by bidirectional LSTM encodings of their character sequences, and thus they will be “close” to other words that the parser has seen during training, ideally close to their more frequent morphological relatives. We conjecture that this gives a clear advantage over using a single “UNK” token for all the words that the parser does not see during training, as done in Section 4.1 and in other parsers without additional resources. In Section 6 we confirm this hypothesis.
Learned Word Representations. Figure 8 visualizes a sample of the representations learned by the character-based bidirectional LSTMs (using the model referred to as Chars in Section 6.4). Clear clusters of past tense verbs, gerunds, and other syntactic classes are visible. The colors in the figure represent the most common POS tag for each word.
5. Training Procedure
The parser is trained to maximize the conditional probability of taking a “correct” action at each parsing state. (The definition of what constitutes a “correct” action is the major difference between static oracle and dynamic oracle training.) We begin with the simpler strategy of training with a static oracle (Section 5.1), then turn to dynamic oracles (Section 5.2).
Regardless of the oracle, our training implementation constructs a computation graph (nodes that represent values, linked by directed edges from each function's inputs to its outputs) for the negative log probability for the oracle transition sequence as a function of the current model parameters and uses forward- and backpropagation to obtain the gradients with respect to the model parameters (LeCun et al., 1998, section 4). The computations for a single parsing model were run on a single thread on a CPU. Using the dimensions discussed in the next section, we required between 8 and 12 hours to reach convergence on a held-out development set.11
5.1 Static Oracle Training
With a static oracle, the training procedure computes a canonical reference series of transitions for each gold parse tree. It then runs the parser through this canonical sequence of transitions, while keeping track of the state representation pt at each step t, as well as the distribution over transitions p(zt∣pt) that is predicted by the current classifier for that state representation. Once the end of the sentence is reached, the parameters are updated to minimize the cross-entropy of the reference transition sequence (Equation (2)), that is, to maximize the log-likelihood of the correct transition p(zt∣pt) at each state along the path.
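Equation (2) is not reproduced in this excerpt; for a sentence whose oracle transition sequence is z1, …, zT, the quantity it denotes is the sequence-level negative log-likelihood

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p\left(z_t \mid \mathbf{p}_t\right),$$

summed over all training sentences.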
5.2 Training with Exploration using Dynamic Oracles
In the static oracle case, the parser is trained to predict the best transition to take at each parsing step, assuming all previous transitions were correct. Because the parser is likely to make mistakes at test time and encounter states it has not seen during training, this training criterion is problematic (Daumé III, Langford, and Marcu, 2009; Ross, Gordon, and Bagnell, 2011; Goldberg and Nivre, 2012, 2013, inter alia). Instead, we would prefer to train the parser to behave optimally also after making a mistake (under the constraints that it cannot backtrack or fix any previous decision). We thus need to include in the training examples states that result from wrong parsing decisions, together with the optimal transitions to take in these states. To this end we reconsider which training examples to show, and what it means to behave optimally on these training examples. The dynamic oracles framework for training with exploration, suggested by Goldberg and Nivre (2012, 2013), provides answers to these questions.
Although the application of dynamic oracle training is relatively straightforward, some adaptations were needed to accommodate the probabilistic training objective. These adaptations mostly follow Goldberg (2013).
5.2.1 Dynamic Oracles
A dynamic oracle is the component that, given a gold parse tree, provides the optimal set of possible actions to take under a given parser state. In contrast to static oracles that derive a canonical sequence for each gold parse tree and say nothing about parsing states that do not stem from this canonical path, the dynamic oracle is well-defined for states that result from parsing mistakes, and may produce more than a single gold action for a given state. Under the dynamic oracle framework, a parsing action is said to be optimal in a given state if the best tree that can be reached after taking the action is no worse (in terms of accuracy with respect to the gold tree) than the best tree that could be reached prior to taking that action.
Goldberg and Nivre (2013) define the arc-decomposition property of transition systems, and show how to derive efficient dynamic oracles for transition systems that are arc-decomposable (see footnote 10). Unfortunately, the arc-standard transition system does not have this property. Although it is possible to compute dynamic oracles for the arc-standard system (Goldberg, Sartorio, and Satta 2014), the computation relies on a dynamic programming algorithm that is polynomial in the length of the stack. As the dynamic oracle has to be queried for each parser state seen during training, using this oracle would make training several times slower. We chose instead to switch to the arc-hybrid transition system (Kuhlmann, Gómez-Rodríguez, and Satta 2011), which is very similar to the arc-standard system but is arc-decomposable and hence admits an efficient O(1) dynamic oracle, resulting in only a negligible burden on training time. We implemented the dynamic oracle for the arc-hybrid system as described by Goldberg (2013).
5.2.2 Training with Exploration
In order to expose the parser to configurations that are likely to result from incorrect parsing decisions, we make use of the probabilistic nature of the classifier. During training, instead of following the gold action, we sample the next transition according to the output distribution the classifier assigns to the current configuration. Another option, taken by Goldberg and Nivre, is to follow the one-best action predicted by the classifier. However, initial experiments showed that the one-best approach did not work well. Because the neural-network classifier becomes accurate early on in the training process, the one-best action is likely to be correct, and the parser is then exposed to very few error states in its training process. By sampling from the predicted distribution, we are effectively increasing the chance of straying from the gold path during training, while still focusing on mistakes that receive relatively high parser scores. We believe further formal analysis of this method will reveal connections to reinforcement learning and, perhaps, other methods for learning complex policies.
Taking this idea further, we could increase the number of error-states observed in the training process by changing the sampling distribution so as to bias it toward more low-probability states. We do this by raising each probability to the power of α (0 < α ≤ 1) and re-normalizing. This transformation keeps the relative ordering of the events, while shifting probability mass towards less frequent events. As we show subsequently, this turns out to be very beneficial for the configurations that make use of external embeddings. Indeed, these configurations achieve high accuracies and sharp class distributions early on in the training process.
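Concretely, if p(z ∣ pt) is the classifier's distribution over transitions, training samples the next transition from the rescaled distribution

$$\tilde{p}(z \mid \mathbf{p}_t) = \frac{p(z \mid \mathbf{p}_t)^{\alpha}}{\sum_{z'} p(z' \mid \mathbf{p}_t)^{\alpha}}, \qquad 0 < \alpha \le 1,$$

which leaves the distribution unchanged at α = 1 and becomes increasingly uniform as α decreases.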
6. Experiments
After describing the implementation details of our optimization procedure (Section 6.1) and the data used in our experiments (Section 6.2), we turn to four sets of experiments:
1. First, we assess the quality of our greedy, global-state stack LSTM parser on a wide range of data sets, showing that it is highly competitive with the state of the art (Section 6.3).
2. We compare the performance of the two different word representations (Section 4) across different languages, finding that they are quite beneficial in some cases and harmless in most others (Section 6.4).
3. We compare dynamic oracles to static oracles, showing that training with dynamic oracles attains even better performance (Section 6.5).
4. We compare our parser with other existing methods on the CoNLL-2009 data sets (Section 6.6).
6.1 Optimization Procedure Details
Parameter optimization was performed using stochastic gradient descent with an initial learning rate of η0 = 0.1; the learning rate was updated on each pass through the training data as ηt = η0/(1 + ρt), with ρ = 0.1 and where t is the number of epochs completed. No momentum was used. To mitigate the effects of “exploding” gradients, we clipped the ℓ2 norm of the gradient to 5 before applying the weight update rule (Graves 2013; Sutskever, Vinyals, and Le 2014). An ℓ2 penalty of 1 × 10−6 was applied to all weights.
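A minimal NumPy sketch of the update just described (variable names are ours; the ℓ2 penalty is folded into the update as weight decay, which is one reasonable reading of the description above):

```python
import numpy as np

def sgd_step(params, grads, epoch, eta0=0.1, rho=0.1, clip=5.0, l2=1e-6):
    """One SGD update with the learning-rate schedule, gradient clipping,
    and l2 penalty described in Section 6.1."""
    eta = eta0 / (1.0 + rho * epoch)                    # eta_t = eta_0 / (1 + rho * t)
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))  # global l2 norm of the gradient
    scale = clip / norm if norm > clip else 1.0         # clip the norm to 5
    for p, g in zip(params, grads):
        p -= eta * (scale * g + l2 * p)                 # penalty applied to all weights
    return params
```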
Matrix and vector parameters were initialized with uniform samples in [−√(6/(r + c)), +√(6/(r + c))], where r and c were the number of rows and columns in the structure (Glorot and Bengio 2010). We use mini-batches of 1,000 sentences each and update every 25 iterations.
Dimensionality. The full version of our parsing model sets dimensionalities as follows. LSTM hidden states (dout) are of size 100, and we use two layers of LSTMs for each stack. Embeddings of the parser actions used in the composition functions (dstate) have 16 dimensions, and the output embedding (dout) size is 20 dimensions. Pretrained word embeddings (ddlm) have 100 dimensions (English) and 80 dimensions (Chinese), and the learned word embeddings (dword) have 32 dimensions. Part of speech and morphological tag embeddings (dpos and dmorph) have 12 dimensions, when included. The character-based embeddings are of size 100, when included. The combined representations of the inputs have 100 dimensions.
These dimensions were chosen based on intuitively reasonable values (words should have higher dimensionality than parsing actions, POS tags, and relations; LSTM states should be relatively large), and it was confirmed on development data that they performed well.12 Future work might more carefully optimize the parser's hyperparameters. A variety of automated techniques have been developed for hyperparameter optimization, including ones that enable evidence-sharing across related tasks (Yogatama and Mann 2014; Yogatama, Kong, and Smith 2015). Our architecture strikes a balance between minimizing computational expense and finding solutions that work.
Pretrained Word Vectors. The hyperparameters of the structured skip n-gram model are the same as in the skip n-gram model defined in the widely used word2vec implementation (Mikolov et al. 2013). We set the window size to 5, set the negative sampling rate to 10, and ran 5 epochs through the unannotated corpora described in Section 6.2.
6.2 Data
We use different data depending on the targeted experiment; in this section we describe the data sets that we used to test the basic model, the character-based representations that are geared towards morphology, and the dynamic oracles. Finally, we also report results with the CoNLL 2009 data sets (Hajič et al. 2009) to make a complete comparison with similar systems.
6.2.1 Data for the Simplest Model that Uses Word Representations and Static Oracle for Training
We used the same data set-up as Chen and Manning (2014), namely, an English task and a Chinese task. This baseline configuration was chosen because they likewise used a neural parameterization to predict actions in an arc-standard transition-based parser.
- For English, we used the Stanford Dependency (SD) treebank (Marneffe, MacCartney, and Manning 2006) used by Chen and Manning (2014), the most closely comparable published model, with the same splits.13 Part-of-speech tags are predicted using the Stanford POS tagger (Toutanova et al. 2003) with an accuracy of 97.3%. This treebank contains a negligible number of nonprojective arcs (Chen and Manning 2014).
- For Chinese, we used the Penn Chinese Treebank 5.1 (CTB5) following Zhang and Clark (2008),14 with gold part-of-speech tags, which is also the same set-up as in Chen and Manning (2014).
6.2.2 Data to Test the Character-Based Representations and Static Oracle for Training
For the character-based representations we applied our model to the treebanks of the SPMRL Shared Task (Seddah et al. 2013; Seddah, Kübler, and Tsarfaty 2014): Arabic (Maamouri et al. 2004), Basque (Aduriz et al. 2003), French (Abeillé, Clément, and Toussenel 2003), German (Seeker and Kuhn 2012), Hebrew (Sima'an et al. 2001), Hungarian (Vincze et al. 2010), Korean (Choi 2013), Polish (Świdziński and Woliński 2010), and Swedish (Nivre, Nilsson, and Hall 2006). For all the corpora of the SPMRL Shared Task, we used predicted POS tags as provided by the shared task organizers.15 We also ran the experiment with the Turkish dependency treebank16 (Oflazer et al. 2003) of the CoNLL-X Shared Task (Buchholz and Marsi 2006), using gold POS tags, as is common with the CoNLL-X data sets. In addition to these, we include the English and Chinese data sets described in Section 6.2.1.
6.2.3 Data for the Dynamic Oracle
Because the arc-hybrid transition-based parsing algorithm is limited to fully projective trees, we use the same data as in Section 6.2.1, which makes it comparable with the basic model that uses standard word representations and a static oracle arc-standard algorithm.
6.2.4 CoNLL-2009 Data
We also report results with all the CoNLL 2009 data sets (Hajič et al. 2009) to make a complete comparison with similar systems both for static and dynamic oracles.
6.3 Experiments with Static Oracle and Standard Word Representations
We report results on five experimental configurations per language, as well as the Chen and Manning (2014) baseline. These are: the full stack LSTM parsing model (S-LSTM), the stack LSTM parsing model without POS tags (−POS), the stack LSTM parsing model without pretrained language model embeddings (−pretraining), the stack LSTM parsing model that uses just head words on the stack instead of composed representations (−composition), and the full parsing model in which the LSTMs are replaced by simple recurrent neural networks with sigmoidal hidden units, still augmented with a stack pointer (S-RNN).
6.3.1 Results
Following Chen and Manning (2014), we exclude punctuation symbols for evaluation. Table 1 shows parsing scores directly comparable to those of Chen and Manning (2014),17 and our model outperforms theirs on both the development set and the test set.
Overall, our parser substantially outperforms the baseline neural network parser of Chen and Manning (2014), both in the full configuration and in the various ablated conditions we report. The one exception to this is the −POS condition for the Chinese parsing task, in which we underperform their baseline (which used gold POS tags), although we do still obtain reasonable parsing performance in this limited case. We note that predicted POS tags in English add very little value—suggesting that we can think of parsing sentences directly without first tagging them. We also find that using composed representations of dependency tree fragments outperforms using representations of head words alone. Finally, we find that whereas LSTMs outperform classical RNNs, the latter are still quite capable of learning good representations.
Training and Testing Times. Training is performed with mini-batches and requires around 20 hours to reach convergence on the English data set, although this depends on initialization; the parser achieves an end-to-end runtime of 40,539.8 milliseconds to parse the entire English test data (2.4k sentences) on a single CPU core. Other data sets require similar runtimes per sentence.
Effect of Initialization. The parser uses random seeds to create the mini-batches that it uses to train the model, which has an effect on the parsing results. In order to see the effect of initialization, we trained the whole S-LSTM parser with POS tags, pretrained word embeddings, and composition functions on the English SD data set for the same amount of time (around 20 hours) using different random seeds. Figure 9 shows the variation due to initialization for 100 different random seeds and their results on the English test set. We observe that even though the results are very similar, there are differences: scores range from a minimum of 92.3 to a maximum of 93.1, with a modal result of 92.8.18
Effect of Beam Size. Beam search was determined to have minimal impact on scores (absolute improvements of ≤ 0.3% were observed with small beams). Therefore, all results we report used greedy decoding—Chen and Manning (2014) likewise only report results with greedy decoding. This finding is in line with previous work that generates sequences from recurrent networks (Grefenstette et al. 2014), although Vinyals et al. (2015),19 Zhou et al. (2015), and Weiss et al. (2015) did report much more substantial improvements with beam search on their parsers.
6.4 Experiments with Character-Based Word Representations
In order to isolate the improvements provided by the LSTM encodings of characters described in Section 4.2, we run the stack LSTM parser in the following configurations:
- Words: words only, as in Section 4.1 (but without POS tags)
- Chars: character-based representations of words with bidirectional LSTMs, as in Section 4.2 (but without POS tags)
- Words + POS: words and POS tags (Section 4.1)
- Chars + POS: character-based representations of words with bidirectional LSTMs plus POS tags (Section 4.2)
- Words + POS + Morph: words, POS tags, and morphological tags (Section 4.1)
- Chars + POS + Morph: character-based representations of words with bidirectional LSTMs plus POS tags and morphological tags (Section 4.2)
None of the experimental configurations include pretrained word embeddings or any additional data resources. All experiments include the swap transition, meaning that nonprojective trees can be produced in any language. Finally, in addition to the dimensionality described in Section 6.1, the character-based representations of words have 100 dimensions.
6.4.1 Results and Discussion
Tables 2 and 3 show the results of the parsers for the development sets and the final test sets, respectively. Most notable are improvements for agglutinative languages—Basque, Hungarian, Korean, and Turkish—both when POS tags are included and when they are not. Consistently, across all languages, Chars outperforms Words, suggesting that the character-level LSTMs are learning representations that capture similar information to parts of speech. On average, Chars is on par with Words + POS, and the best average of labeled attachment scores is achieved with Chars + POS.
Table 2 (development sets):

UAS:

| Language | Words | Chars | Words + POS | Chars + POS | Words + POS + Morph | Chars + POS + Morph |
|---|---|---|---|---|---|---|
| Arabic | 86.14 | 87.20 | 87.44 | 87.07 | 87.35 | 86.78 |
| Basque | 78.42 | 84.97 | 83.49 | 85.58 | 87.20 | 86.90 |
| French | 84.84 | 86.21 | 87.00 | 86.33 | 87.19 | 86.57 |
| German | 88.14 | 90.94 | 91.16 | 91.23 | 91.44 | 91.45 |
| Hebrew | 79.73 | 79.92 | 81.99 | 80.76 | 82.10 | 80.69 |
| Hungarian | 72.38 | 80.16 | 78.47 | 80.85 | 80.29 | 80.55 |
| Korean | 78.98 | 88.98 | 87.36 | 89.14 | 87.48 | 89.04 |
| Polish | 73.29 | 85.69 | 89.32 | 88.54 | 90.34 | 89.36 |
| Swedish | 73.44 | 75.03 | 80.02 | 78.85 | 80.14 | 79.79 |
| Turkish | 71.10 | 74.91 | 77.13 | 77.96 | 79.28 | 79.18 |
| Chinese | 79.43 | 80.36 | 85.98 | 85.81 | – | – |
| English | 91.64 | 91.98 | 92.94 | 92.49 | – | – |
| Average | 79.79 | 83.86 | 85.19 | 85.38 | – | – |

LAS:

| Language | Words | Chars | Words + POS | Chars + POS | Words + POS + Morph | Chars + POS + Morph |
|---|---|---|---|---|---|---|
| Arabic | 82.73 | 84.34 | 84.81 | 84.36 | 84.71 | 83.95 |
| Basque | 67.08 | 78.22 | 74.31 | 79.52 | 81.50 | 80.85 |
| French | 80.32 | 81.70 | 82.71 | 81.51 | 82.68 | 82.12 |
| German | 85.36 | 88.68 | 89.04 | 88.83 | 89.42 | 89.31 |
| Hebrew | 69.42 | 70.58 | 74.11 | 72.18 | 73.98 | 72.37 |
| Hungarian | 62.14 | 75.61 | 69.50 | 76.16 | 75.55 | 75.85 |
| Korean | 67.48 | 86.80 | 83.80 | 86.88 | 83.78 | 86.92 |
| Polish | 65.13 | 78.23 | 81.84 | 80.97 | 84.62 | 82.89 |
| Swedish | 64.77 | 66.74 | 72.09 | 69.88 | 73.13 | 70.87 |
| Turkish | 53.98 | 62.91 | 62.30 | 62.87 | 71.42 | 71.05 |
| Chinese | 75.64 | 77.06 | 84.36 | 84.10 | – | – |
| English | 88.60 | 89.58 | 90.63 | 90.08 | – | – |
| Average | 71.89 | 78.37 | 79.13 | 79.78 | – | – |
Table 3 (test sets):

UAS:

| Language | Words | Chars | Words + POS | Chars + POS | Words + POS + Morph | Chars + POS + Morph |
|---|---|---|---|---|---|---|
| Arabic | 85.21 | 86.08 | 86.05 | 86.07 | 86.35 | 85.79 |
| Basque | 77.06 | 85.19 | 82.92 | 85.22 | 86.87 | 86.02 |
| French | 83.74 | 85.34 | 86.15 | 85.78 | 86.81 | 86.21 |
| German | 82.75 | 86.80 | 87.33 | 87.26 | 87.40 | 87.48 |
| Hebrew | 77.62 | 79.93 | 80.68 | 80.17 | 81.36 | 79.83 |
| Hungarian | 72.78 | 80.35 | 78.64 | 80.92 | 81.07 | 81.06 |
| Korean | 78.70 | 88.39 | 86.85 | 88.30 | 86.84 | 88.24 |
| Polish | 72.01 | 83.44 | 87.06 | 85.97 | 88.93 | 88.44 |
| Swedish | 76.39 | 79.18 | 83.43 | 83.24 | 83.18 | 82.04 |
| Turkish | 71.70 | 76.32 | 75.32 | 76.34 | 79.52 | 78.28 |
| Chinese | 79.01 | 79.94 | 85.96 | 85.30 | – | – |
| English | 91.16 | 91.47 | 92.57 | 91.63 | – | – |
| Average | 79.01 | 83.54 | 84.41 | 84.68 | – | – |

LAS:

| Language | Words | Chars | Words + POS | Chars + POS | Words + POS + Morph | Chars + POS + Morph |
|---|---|---|---|---|---|---|
| Arabic | 82.05 | 83.41 | 83.46 | 83.40 | 83.83 | 83.11 |
| Basque | 66.61 | 79.09 | 73.56 | 78.61 | 80.83 | 79.92 |
| French | 79.22 | 80.92 | 82.03 | 81.08 | 82.65 | 81.97 |
| German | 79.15 | 84.04 | 84.62 | 84.49 | 84.79 | 84.86 |
| Hebrew | 68.71 | 71.26 | 72.70 | 72.26 | 73.67 | 72.06 |
| Hungarian | 61.93 | 75.19 | 69.31 | 76.34 | 76.66 | 76.45 |
| Korean | 67.50 | 86.27 | 83.37 | 86.21 | 83.41 | 86.25 |
| Polish | 63.96 | 76.84 | 79.83 | 78.24 | 82.67 | 82.21 |
| Swedish | 67.69 | 71.19 | 76.40 | 74.47 | 76.42 | 74.41 |
| Turkish | 54.55 | 64.34 | 61.22 | 62.28 | 70.93 | 68.28 |
| Chinese | 74.79 | 76.29 | 84.40 | 83.72 | – | – |
| English | 88.42 | 88.94 | 90.31 | 89.44 | – | – |
| Average | 71.22 | 78.15 | 78.43 | 79.21 | – | – |
This suggests that, even though our character-based representations are capable of encoding the same kind of information, existing POS tags suffice for high accuracy. However, the POS tags in treebanks for morphologically rich languages do not seem to be enough. This is evidenced by the results including explicit morphological features, which outperform the ones without them in most cases, and also outperform the results using character-based representations alone. It is interesting to see that for some languages, such as German, Hungarian, Korean, and Swedish, the character-based representations seem to be enough to approach or even surpass the results obtained by using rich morphological features. In Korean, Chars is the best model, which suggests that the character-based word representations encode as much useful information for parsing in that particular language as is currently available.
Swedish, English, and French use suffixes for the verb tenses and number,20 and Hebrew uses a root-and-pattern system. Even for Chinese, which is not morphologically rich, Chars shows a benefit over Words, perhaps by capturing regularities in syllable structure within words.
6.4.2 Out-of-Vocabulary Words
The character-based representation for words is notably beneficial for OOV words. We tested this specifically by comparing Chars to a model in which all OOVs are replaced by the string “UNK” during parsing. This always has a negative effect on LAS (average − 4.5 points, − 2.8 UAS). Figure 10 shows how this drop varies with the development OOV rate across treebanks; the most extreme is Korean, which drops 15.5 LAS. A similar, but less pronounced, pattern was observed for models that include POS.
Interestingly, this artificially impoverished model is still consistently better than Words for all languages (e.g., for Korean, by 4 LAS). This implies that not all of the improvement is due to OOV words; statistical sharing across orthographically close words is beneficial as well.
Finally, Table 4 shows the results obtained when combining the pretrained word embeddings of Section 6.3 with the character-based representations of this section for the English and Chinese treebanks. We also ran these experiments both including and excluding part-of-speech tags to see the effect of the character-based representations when they are combined with pretrained word embeddings. The main conclusion is that when character-based representations are combined with both pretrained word embeddings and part-of-speech tags, they do not have the same effect as when character-based embeddings are applied alone; the model learns better from them when the part-of-speech tags or the pretrained word embeddings are not included. Finally, the model that incorporates character-based embeddings and pretraining is slightly better than the model that uses pretrained embeddings and part-of-speech tags.
6.4.3 Comparison with State-of-the-Art Parsers
Table 5 shows a comparison with state-of-the-art parsers. We include greedy transition-based parsers that, like ours, do not apply beam search. For all the SPMRL languages we show the results of Ballesteros (2013), who reported results after carrying out a careful automatic morphological feature selection experiment. For Turkish, we show the results of Nivre et al. (2006), who also carried out a careful manual morphological feature selection. Our parser outperforms these in most cases. For English and Chinese, we report the results of our best parser on the development set in Section 6.3—which is Words + POS but with pretrained word embeddings.
We also show the best reported results on these data sets. For the SPMRL data sets, the best performing system of the shared task is either Björkelund et al. (2013) or Björkelund et al. (2014), both of which are better than our system. Note that this comparison is harsh on our system, which uses neither unlabeled data nor any combination of different parsers. For Turkish, our system provides the best results ever reported on this data set. For English, we report Weiss et al. (2015), and for Chinese, we report our best result, Words + POS with pretrained word embeddings and dynamic oracles (see the next section).
6.5 Experiments with Dynamic Oracles
Table 6 shows the results of the parser with arc-hybrid and dynamic oracles for the English data set (with Stanford dependencies) and the Chinese CTB data set, with and without pretrained word embeddings. The table also shows the best result reported in Section 6.3 for the sake of comparison between static and dynamic training strategies.
The result achieved by the dynamic oracle for English, 93.56 unlabeled attachment score, is actually one of the best ever reported on this data set. This is remarkable given that the parser uses a completely greedy inference procedure. Moreover, the Chinese numbers establish the state of the art under the same settings as Chen and Manning (2014).
The error-exploring dynamic-oracle training always improves over static oracle training using the same transition system, but the arc-hybrid system slightly under-performs the arc-standard system when trained with a static oracle. Transforming the sampling distribution towards preferring less probable events (α = 0.75) is especially beneficial when training with pretrained word embeddings.
6.6 Experiments with CoNLL 2009 Data Sets
In order to be able to compare with similar greedy parsers (Yazdani and Henderson 2015; Andor et al. 2016),21 we report the performance of the parser on the multilingual treebanks of the CoNLL 2009 Shared Task (Hajič et al. 2009). Because some of the treebanks contain nonprojective sentences and arc-hybrid does not allow nonprojective trees, we use the pseudo-projective approach (Nivre and Nilsson 2005). For all the experiments presented in this section we used the predicted part-of-speech tags provided by the CoNLL 2009 shared task organizers. In order to see whether the pretrained word embeddings are useful for other languages, we also include results with pretrained word embeddings for English, Chinese, German, and Spanish, following the same training set-up as in Section 4; for English and Chinese we used the same pretrained word embeddings as in previous experiments, for German we pretrained embeddings using the monolingual training data from the WMT 2015 data set,22 and for Spanish we used the Spanish Gigaword version 3.
The results for the parser with character-based representations on these data sets (last line of the table) were published by Andor et al. (2016). In Zhang and Weiss (2016), it is also possible to find results of the same version of the parser on the Universal Dependency treebanks (Nivre et al. 2015).
Our parsers outperform the model by Yazdani and Henderson (2015). When comparing with Andor et al. (2016), it is worth noting that they use pretrained word embeddings, plus predicted morphological features, and the top-K POS tags from a CRF tagger instead of simple predicted tags, so the results that are actually comparable are the ones in which we include pretrained word embeddings; our model is better than that of Andor et al. (2016), especially in the case of the dynamic oracles.
The results with Czech and the dynamic oracle model suggest that dynamic-oracle training is a much better strategy when the number of nonprojective trees is high; this suggests that training with exploration allows the parser to handle more complicated syntactic structures. Finally, the arc-standard parser enriched with the swap action performs similarly to the pseudo-projective approach for all languages, which demonstrates the effectiveness of both approaches when parsing nonprojective structures.
7. Related Work
Our approach ties together several strands of previous work. First, several kinds of stack memories have been proposed to augment neural architectures. Das, Giles, and Sun (1992) proposed a neural network with an external stack memory based on recurrent neural networks. In contrast to our model, in which the entire contents of the stack are summarized in a single value, in their model the network could only see the contents of the top of the stack. Miikkulainen (1996) proposed an architecture with a stack that had a summary feature, although the stack control was learned as a latent variable.
Several authors have used neural networks to predict parser actions in shift-reduce parsers. The earliest attempt we are aware of is Mayberry and Miikkulainen (1999). The recent resurgence of interest in neural networks has included several applications to transition-based dependency parsers (Andor et al. 2016; Yazdani and Henderson 2015; Weiss et al. 2015; Zhou et al. 2015; Chen and Manning 2014; Stenetorp 2013; Titov and Henderson 2007b). In most of these works, with the exception of Titov and Henderson (2007b), the conditioning structure is manually crafted and sensitive to only certain properties of the state, whereas we condition on the global state. Like us, Stenetorp (2013) used recursively composed representations of the tree fragments (a head and its dependents). Titov and Henderson (2007b) used a generative latent variable model based on incremental sigmoid belief networks to likewise condition on the global state. Neural networks have also been used to learn representations for use in phrase-structure parsing (Henderson 2003, 2004; Titov and Henderson 2007a; Socher et al. 2013; Le and Zuidema 2014). The work by Watanabe et al. (2015) is also similar to the work presented in this article but for phrase-structure parsing.
Our model's position relative to existing neural network parsers can be understood by analogy to neural language models. The context representation is either constructed subject to an n-gram window (Bengio et al. 2003; Mnih and Hinton 2007), or it may be constructed incrementally using recurrent neural networks to obtain potentially unbounded dependencies (Mikolov et al. 2010). Likewise, during parsing, the particular algorithm state (represented by the stack, buffer, and history of actions taken) is the context used to predict the parsing action. This may be represented as a finite vector by picking out certain features (Weiss et al. 2015; Chen and Manning 2014) or by recursively composing the representation of simpler elements as parsing proceeds (Henderson 2003, 2004; Titov and Henderson 2007b; this work).
LSTMs have recently been demonstrated as a mechanism for generating phrase structure parses. Vinyals et al. (2015) used a sequence-to-sequence architecture with attention to predict a sequence of tree building actions, conditional on the sentence being parsed. Other works have used LSTMs for better semantic representations (Tai, Socher, and Manning 2015), and others to solve similar tasks such as phrase-structure parsing (Dyer et al. 2016).
In terms of token representations, it is worth noting that character-based representations have been explored in other NLP tasks; for instance, dos Santos and Zadrozny (2014) and dos Santos and Guimarães (2015) learned character-level neural representations for POS tagging and named entity recognition, obtaining an error reduction in both tasks. Their approach is similar to ours. Many approaches have used character-based models as additional features to improve existing models. For instance, Chrupala (2014) used character-based recurrent neural networks to normalize tweets. Similarly, Botha and Blunsom (2014) show that stems, prefixes, and suffixes can be used to learn useful word representations. Our approach learns similar representations using bidirectional LSTMs. Chen et al. (2015) propose joint learning of character and word embeddings for Chinese, claiming that characters contain rich internal information. Constructing word representations from characters using convolutional neural networks has also been proposed (Kim et al. 2016).
Tsarfaty (2006), Cohen and Smith (2007), Goldberg and Tsarfaty (2008), and Goldberg and Elhadad (2011) presented methods for joint segmentation and phrase-structure parsing, by segmenting the words in useful morphologically oriented units. Bohnet et al. (2013) presented an arc-standard transition-based parser that performs competitively for joint morphological tagging and dependency parsing for richly inflected languages, such as Czech, Finnish, German, Hungarian, and Russian.
Training greedy parsers on non-gold outcomes, facilitated by dynamic oracles, has been explored by several researchers in different ways (Goldberg and Nivre 2012, 2013; Goldberg, Sartorio, and Satta 2014; Honnibal, Goldberg, and Johnson 2013; Honnibal and Johnson 2014; Gómez-Rodríguez, Sartorio, and Satta 2014; Björkelund and Nivre 2015; Tokgöz and Eryiğit 2015; Gómez-Rodríguez and Fernández-González 2015; Vaswani and Sagae 2016). More generally, training greedy inference systems by paying attention to the expected classifier behavior at test time has been explored under the imitation learning and Learning to Search (SEARN) frameworks (Abbeel and Ng 2004; Daumé III and Marcu 2005; Vlachos 2012; He, Eisner, and Daumé 2012; Daumé III, Langford, and Marcu 2009; Ross, Gordon, and Bagnell 2011). In the context of recurrent neural networks, the discrepancy between training-time and test-time prediction was recently explored by Bengio et al. (2015) for language modeling. However, to the best of our knowledge, this is the first time such a training procedure has been explored together with neural networks for parsing. The results show that the consistent accuracy gains from dynamic-oracle training observed in previous work persist for the very strong stack LSTM parser, although some care had to be taken to adapt the training procedure to the probabilistic nature of the objective and to the high accuracy of the underlying model.
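To make the error-exploration idea concrete, the sketch below shows one way such a training loop can be organized. It is a minimal illustration under our own naming: `parser` and `oracle` are hypothetical objects standing in for the model and the dynamic oracle, and the exploration schedule is a simple fixed probability rather than the schedule used in any particular published system.

```python
import random

def train_with_exploration(sentences, parser, oracle, p_explore=0.1):
    """Sketch of error-exploration training with a dynamic oracle.
    `parser` and `oracle` are hypothetical objects; this is not the
    released implementation of any parser discussed in the text."""
    for sentence, gold_arcs in sentences:
        state = parser.initial_state(sentence)
        while not state.is_terminal():
            valid = state.allowed_actions()
            # Zero-cost actions: those from which the gold tree is still reachable.
            zero_cost = oracle.zero_cost_actions(state, gold_arcs)
            parser.update(state, valid, zero_cost)   # loss against the zero-cost set
            predicted = parser.best_action(state, valid)
            # With probability p_explore, follow the model's own prediction,
            # exposing training to configurations it will reach at test time.
            if random.random() < p_explore:
                action = predicted
            else:
                action = random.choice(list(zero_cost))
            state = state.apply(action)
    return parser
```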
We conclude by remarking that stack memory offers intriguing possibilities for learning to solve general information processing problems (Miikkulainen 1996). Here, we learned from observable stack manipulation operations (i.e., supervision from a treebank), and the computed embeddings of final parser states were not used for any further prediction.
Extending the work in this direction would make it an alternative to architectures with an explicit external memory, such as neural Turing machines (Graves, Wayne, and Danihelka 2014) and memory networks (Weston, Chopra, and Bordes 2014; Kumar et al. 2015). However, as with those models, without supervision of the stack operations, formidable computational challenges must be solved (e.g., marginalizing over all latent stack operations); sampling techniques and techniques from reinforcement learning show promise here (Zaremba and Sutskever 2015), making this an intriguing avenue for future work.
8. Conclusions
We presented stack LSTMs, recurrent neural networks over sequences that support push and pop operations, and used them to implement a state-of-the-art transition-based dependency parser. The parser uses a greedy inference strategy, which provides very high parsing speed while still achieving state-of-the-art results.
We also demonstrated that word representations “composed” from representations of characters are sufficient for successful transition-based dependency parsing. The results are stronger for agglutinative languages, such as Basque, Hungarian, Korean, and Turkish. This outcome is important because annotating morphological information is expensive; our model suggests that only dependency annotations are necessary and that relevant morphological features can be learned from strings. Moreover, character-based representations are also a way of overcoming the out-of-vocabulary problem, even in morphologically simpler languages such as English. Additionally, inspired by Bohnet et al. (2013), an experiment orthogonal to those presented in this article would be to apply character-based representations to joint tagging and dependency parsing; we expect character-based representations to be powerful enough to capture morphosyntactic information in a joint model. Similar improvements may be achieved in an out-of-domain scenario.
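To make the character-to-word composition concrete, the sketch below (our own illustration in PyTorch; the module name, dimensionalities, and details are assumptions rather than the released parser's configuration) builds a word vector by concatenating the final states of a forward and a backward character LSTM.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch: compose a word representation from its characters with a BiLSTM."""
    def __init__(self, n_chars, char_dim=32, hidden_dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):              # char_ids: (1, word_length)
        chars = self.char_emb(char_ids)       # (1, word_length, char_dim)
        _, (h, _) = self.bilstm(chars)        # h: (2, 1, hidden_dim)
        # Concatenate the final forward and backward states as the word vector.
        return torch.cat([h[0], h[1]], dim=-1)  # (1, 2 * hidden_dim)
```

Because the same character parameters are shared across the vocabulary, such an encoder can produce a representation for any word, including ones never seen in training.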
Even though the parser is greedy, it provides very consistent results, comparable with the best state-of-the-art parsers. We obtained further improvements in the experiments with dynamic oracles, which provide a boost over the baseline while maintaining very fast (and greedy) parsing.
Finally, it is worth noting that a modified version of the same parser has been used to create a polyglot dependency parser (Ammar et al. 2016) with the universal dependency treebanks (Nivre et al. 2015), and the stack LSTMs have also been applied to solve other core natural language processing tasks, such as named entity recognition (Lample et al. 2016), phrase-structure parsing and language modeling (Dyer et al. 2016), and joint syntactic and semantic parsing (Swayamdipta et al. 2016).
Acknowledgments
We would like to thank Lingpeng Kong and Jacob Eisenstein for comments on an earlier version of this article and Danqi Chen for assistance with the parsing data sets. We would also like to thank Bernd Bohnet and Joakim Nivre for their help with the parsing algorithms. This work was sponsored in part by the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533, and in part by NSF CAREER grant IIS-1054319. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA). Yoav Goldberg is supported by The Israeli Science Foundation (grant number 1555/15) and a Google Research Award.
Notes
The term “state” refers to the collection of previous decisions (sometimes called the history), the resulting partial structures (which are stored in a stack data structure), and the words remaining to be processed.
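As a concrete illustration of these three components, a minimal sketch (our own; the field names are not taken from the parser's implementation) might bundle them as follows:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ParserState:
    """Minimal sketch of a transition-based parser configuration."""
    stack: List[Any] = field(default_factory=list)    # partially built tree fragments
    buffer: List[str] = field(default_factory=list)   # words remaining to be processed
    history: List[str] = field(default_factory=list)  # transition actions taken so far
```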
These state equivalence assumptions are similar to the Markov assumptions made in modeling stochastic processes, according to which a potentially unbounded state history can be represented only in terms of a finite number of the most recent states.
Goldberg et al. (2013) propose a similar stack construction to prevent stack operations from invalidating existing references to the stack in a beam-search parser that must (efficiently) maintain a priority queue of stacks.
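A minimal sketch of such a construction, assuming an immutable linked-list stack so that push and pop return new versions and never invalidate references held elsewhere (our own illustration, not the code of Goldberg et al. 2013):

```python
from typing import NamedTuple, Optional

class Stack(NamedTuple):
    """Persistent (immutable) stack: push and pop return new stacks,
    so references to earlier versions remain valid."""
    head: object = None
    tail: Optional["Stack"] = None

    def push(self, item):
        return Stack(item, self)

    def pop(self):
        return self.head, self.tail  # the original stack object is untouched

# Two beam items can safely share the same underlying tail:
empty = Stack()
s1 = empty.push("NP")
s2 = s1.push("VP")   # s1 still refers to the one-element stack
```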
The A stack is only ever pushed to; our use of a stack here is purely for implementational and expository convenience.
Following Ballesteros and Nivre (2013), the root is the last token included in the buffer.
In preliminary experiments, we tried several nonlinearities and found ReLU to work slightly better than the others.
In general, the set of allowed actions is the complete set of parser actions discussed in Section 3.3, but in some cases not all actions are available. For example, when S is empty and words remain in B, a shift transition is obligatory (Nivre 2008).
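For illustration, a hypothetical helper that enumerates the permitted actions of an arc-standard-style configuration might look as follows; the action names and constraints are a sketch of the conventions described in the text, not the released code.

```python
def allowed_actions(stack, buffer):
    """Return the transitions permitted in the current configuration (sketch)."""
    actions = []
    if buffer:                       # SHIFT needs at least one word in the buffer
        actions.append("SHIFT")
    if len(stack) >= 2:              # REDUCE actions need two items on the stack
        actions.extend(["REDUCE-LEFT", "REDUCE-RIGHT"])
    # When the stack holds fewer than two items and words remain in the buffer,
    # only SHIFT survives, so it is obligatory (Nivre 2008).
    return actions
```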
Specifically: For every possible parser configuration p and group of arcs A, if each arc in A can be derived from p, then a valid tree structure containing all of the arcs in A can also be derived from p. This is a sufficient condition, but whether it is a necessary one is unknown; hence, whether an efficient, O(1) dynamic oracle exists for the augmented system remains an open research problem.
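In symbols, writing $p \vdash a$ for “arc $a$ is derivable from configuration $p$” (our notation, introduced only for this restatement), the condition can be written as:
\[
\forall p\;\forall A:\;\Bigl(\forall a\in A:\; p \vdash a\Bigr)\;\Longrightarrow\;\exists\, T \supseteq A \text{ such that } T \text{ is a valid tree and } p \vdash T.
\]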
Software for replicating the experiments is available from https://github.com/clab/lstm-parser.
We performed preliminary experiments with LSTM state dimensions of 32, 50, and 80, but the other dimensions were our initial guesses.
Training: 02-21. Development: 22. Test: 23.
Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
The POS tags were predicted with the MarMoT tagger (Mueller, Schmid, and Schütze 2013) by the best-performing system of the SPMRL Shared Task (Björkelund et al. 2013). Tagging accuracies: Arabic: 97.38. Basque: 97.02. French: 97.61. German: 98.10. Hebrew: 97.09. Hungarian: 98.72. Korean: 94.03. Polish: 98.12. Swedish: 97.27.
Because the Turkish dependency treebank does not have a development set, we extracted the last 150 sentences from the 4,996 sentences of the training set as a development set.
The results reported by Dyer et al. (2015) were obtained with an earlier version of the neural network machinery and are similar, though not exactly the same, as those reported in this article.
Tense and number features provide little improvement in a transition-based parser, compared with other features such as case, when the POS tags are included (Ballesteros 2013).
We report the performance of these parsers in the most comparable set-up, that is, with beam = 1 or greedy.
References
Author notes
IBM T. J. Watson Research Center, 1101 Kitchawan Road, Route 134, Yorktown Heights, NY 10598. E-mail: [email protected].
Language Technologies Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA. E-mail: [email protected]. Now affiliated with DeepMind, London, UK.
Computer Science Department, Bar-Ilan University, Ramat Gan, 5290002 Israel. E-mail: [email protected].
Computer Science & Engineering, University of Washington, Box 352350, Seattle, WA 98195, USA. E-mail: [email protected].
This work was done while Miguel was at Universitat Pompeu Fabra and Carnegie Mellon University.