Surface Statistics of an Unknown Language Indicate How to Parse It

We introduce a novel framework for delexicalized dependency parsing in a new language. We show that useful features of the target language can be extracted automatically from an unparsed corpus, which consists only of gold part-of-speech (POS) sequences. Providing these features to our neural parser enables it to parse sequences like those in the corpus. Strikingly, our system has no supervision in the target language. Rather, it is a multilingual system that is trained end-to-end on a variety of other languages, so it learns a feature extractor that works well. We show experimentally across multiple languages: (1) Features computed from the unparsed corpus improve parsing accuracy. (2) Including thousands of synthetic languages in the training yields further improvement. (3) Despite being computed from unparsed corpora, our learned task-specific features beat previous work’s interpretable typological features that require parsed corpora or expert categorization of the language. Our best method improved attachment scores on held-out test languages by an average of 5.6 percentage points over past work that does not inspect the unparsed data (McDonald et al., 2011), and by 20.7 points over past “grammar induction” work that does not use training languages (Naseem et al., 2010).


Introduction
Dependency parsing is one of the core natural language processing tasks. It aims to parse a given sentence into its dependency tree: a directed graph of labeled syntactic relations between words. Supervised dependency parsers, which are trained using a "treebank" of known parses in the target language, have been very successful (McDonald, 2006; Nivre, 2008; Kiperwasser and Goldberg, 2016). By contrast, the progress of unsupervised dependency parsers has been slow, and they have apparently not been used in any downstream NLP systems (Mareček, 2016). An unsupervised parser does not have access to a treebank, but only to a corpus of unparsed sentences in the target language.
Unsupervised parsing has been studied for decades. The most common approach is grammar induction (Lari and Young, 1990; Carroll and Charniak, 1992;. Grammar induction induces an explicit grammar from the unparsed corpus, such as a probabilistic context-free grammar (PCFG), and uses that to parse sentences of the language. This approach has encountered two major difficulties:
• Search error: Most formulations of grammar induction involve optimizing a highly non-convex objective function such as likelihood. The optimization is typically NP-hard (Cohen and , and approximate local search methods tend to get stuck in local optima.
• Model error: Likelihood does not correlate well with parsing accuracy anyway (Smith, 2006, Figure 3.2). Likelihood optimization seeks latent trees that help to predict the observed sentences, but these unsupervised trees may use a non-standard syntactic analysis or even be optimized to predict nonsyntactic properties such as topic.
We seek a standard syntactic analysis, what  calls the MATCHLINGUIST task.
We address both difficulties by using a supervised learning framework, one whose objective function is easier to optimize and explicitly tries to match linguists' standard syntactic analyses.
Our approach is inspired by Wang and Eisner (2017), who use an unparsed but tagged corpus to predict the fine-grained syntactic typology of a language. For example, they may predict that about 70% of the direct objects fall to the right of the verb. Their system is trained on a large number of (unparsed corpus, true typology) pairs, each representing a different language. With this training, it can generalize to predict typology from the unparsed corpus of a new language. Our approach is similar except that we predict parses rather than just a typology. In both cases, the system is trained to optimize a task-specific quality measure. The system's parameterization can be chosen to simplify optimization (strikingly, the training objective could even be made convex by using a conditional random field architecture) and/or to incorporate linguistically motivated features.
The positive results of Wang and Eisner (2017) demonstrate that there are indeed surface clues to syntactic structure in the input corpus, at least if it is POS-tagged (as in their work and ours). However, their method only found global typological information: it did not establish which 70% of the direct objects fell to the right of their verbs, let alone identify which nouns were in fact direct objects of which verbs. That requires a token-level analysis of each sentence, which we undertake in this paper. Again, the basic idea is that instead of predicting interpretable typological properties of a language as Wang and Eisner (2017) did, we will predict a language-specific version of the scoring function that a parser uses to choose among various actions or substructures.

Unsupervised Parsing with Supervised Tuning
Our fundamental question is whether gold part-of-speech (POS) sequences carry useful information about the syntax of a language. As we will show, the answer is yes, and the information can be extracted and used to obtain actual parses. This is the same question that has been implicitly asked by previous papers in the unsupervised parsing tradition (see §5). Unsupervised parsing of gold POS sequences is an artificial task, to be sure. Nonetheless, it is a starting point for more ambitious settings that would learn from words and real-world grounding (with or without the POS tags). Even this starting point has proved surprisingly difficult over decades of research, so it has not been clear whether the POS sequences even contain the necessary information.
Yet this task, like others that engineers, linguists, or human learners might face, might be solvable with general knowledge about the distribution of human languages. An experienced linguist can sometimes puzzle out the structure of a new language. The reader may be willing to guess a parse for the gold POS sequence VERB DET NOUN ADJ DET NOUN. After all, adjectives usually attach to nouns (Naseem et al., 2010), and the adjective in this example seems to attach to the first noun, not to the second, since determiners usually fall at the edge of a noun phrase. Meanwhile, the sequence's sole verb is apparently followed by two noun phrases, which suggests either VSO (verb-subject-object) or VOS order, and VSO is a good guess as it is more common (Dryer and Haspelmath, 2013). Observing a corpus of additional POS sequences might help resolve the question of whether this language is primarily VSO or VOS, for example, by guessing that short noun phrases in the corpus (for example, unmodified pronouns) are more often subjects.
Thus, we propose to solve the task by training a kind of "artificial linguist" that can do such analysis on corpora of new languages. This is a general approach to developing an unsupervised method for a specific type of dataset: tune its structure and hyperparameters so that it works well on actual datasets of that sort, and then apply it to new datasets. For example, consider clustering, the canonical unsupervised problem. What constitutes a useful cluster depends on the type of data and the application. Basu et al. (2013) develop a text clustering system specifically to aid teachers. Their "Powergrading" system can group all the student-written answers to a novel question, having been trained on human judgments of answer similarity for other questions. Their novel questions are analogous to our novel languages: their unsupervised system is specifically tailored to match teachers' semantic similarity judgments within any corpus of student answers, just as ours is tailored to match linguists' syntactic judgments within any corpus of human-language POS sequences. Other NLP work on supervised tuning of unsupervised learners includes strapping (Eisner and Karakos, 2005; Karakos et al., 2007), which tunes with the help of both real and synthetic datasets, just as we will (§3).
Are such systems really "unsupervised"? Yes, in the sense that they are able to discover desirable structure in a new dataset. Unsupervised learners are normally crafted using assumptions about the data domain. Their structure and hyperparameters may have been manually tuned to produce pleasing results for typical datasets in that domain. In the domain of POS corpora, we simply scale up this practice to automatically tune a large set of parameters, which later guide our system's search for linguist-approved structure on each new human-language dataset. Our system should be regarded as "supervised" if the examples are taken to be entire languages: after all, we train it to map unlabeled corpora to usefully labeled corpora. But once trained, it is "unsupervised" if the examples are taken to be the sentences within a given corpus: by analyzing the corpus, our system figures out how to map sentences of that language to parses, without any labeled examples in that language.

Data
We use two datasets in our experiments:
UD: Universal Dependencies version 1.2 (Nivre et al., 2015). A collection of 37 dependency treebanks of 33 languages, tokenized and annotated with a common set of POS tags and dependency relations. In principle, our trained system could be applied to predict UD-style dependency relations in any tokenized natural-language corpus with UD-style POS tags.
GD: Galactic Dependencies version 1.0 (Wang and Eisner, 2016). A collection of dependency treebanks for 53,428 synthetic languages (of which we will use a subset). A GD treebank is generated by starting with some UD treebank and stochastically permuting the child subtrees of nouns and/or verbs to match their orders in other UD treebanks. For example, one of the GD treebanks reflects what the English UD treebank might have looked like if English had been both VSO (like Irish) and postpositional (like Japanese). This typologically diverse collection of resource-rich synthetic languages aims to propel the development of NLP systems that can handle diverse natural languages, such as multilingual parsers and taggers.

Why Synthetic Training Languages?
We hope for our system to do well, on average, at matching real linguist-parsed corpora of real human languages. We therefore tune its parameters Θ on such treebanks. UD provides training examples actually drawn from that distribution D over treebanks, but alas, rather few. Thus to better estimate the expected performance of Θ under D, we follow Wang and Eisner (2017) and augment our training data with GD's synthetic treebanks.
Ideally we would have sampled these synthetic treebanks from a careful estimate D̂ of D: for example, the mean of a Bayesian posterior for D, derived from prior assumptions and UD evidence. However, such adventurous "extrapolation" of unseen languages would have required actually constructing such an estimate D̂, which would embody a distribution over semantic content and a full theory of universal grammar! The GD treebanks were derived more simply and more conservatively by "interpolation" among the actual UD corpora. They combine observed parse trees (which provide attested semantic content) with stochastic word order models trained on observed languages (which attempt to mimic attested patterns for presenting that content). GD's sampling distribution still offers moderately varied synthetic datasets, which remain moderately realistic, as they are limited to phenomena observed in UD.
As Wang and Eisner (2016) pointed out, synthetic examples have been used in many other supervised machine learning settings. A common technique is to exploit invariance: if real image z should be classified as a cat, then so should a rotated version of image z. Our technique is the same! We assume that if real corpus u should be parsed as having certain dependencies among the word tokens, then so should a version of corpus u in which those tokens have been systematically permuted in a linguistically plausible way. This is analogous to how rotation systematically transforms the image (rotating all pixels through the same angle) in a physically plausible way (as real objects do rotate relative to the camera). The systematicity is needed to ensure that the task on synthetic data is feasible. In our case, the synthetic corpus then provides many sentences that have been similarly permuted, which may jointly provide enough clues to guess the word order of this synthetic language (for example, VSO vs. VOS in §2) and thus recover the dependencies. See Wang and Eisner (2018, §2) for related discussion.
With enough good synthetic languages to use for training, even nearest-neighbor could be an effective method. That is, one could obtain the parser for a test corpus simply by copying the trained parser for the most similar training corpus (under some metric). Wang and Eisner (2016) explored this approach of "single-source transfer" from synthetic languages. Yet with only thousands of synthetic languages, perhaps no single training corpus is sufficiently similar. To draw on patterns in many training corpora to figure out how to parse the test corpus, we will train a single parser that can handle all of the training corpora (Ammar et al., 2016), much as we trained our typological classifier in earlier work (Wang and Eisner, 2017).

Task Formulation
An unsupervised parser for language ℓ is built without any gold parse trees for ℓ. However, we assume a corpus u of unparsed but POS-tagged sentences of ℓ is available. From u, we will extract statistics T(u) that are informative about the syntactic structure of ℓ, to guide us in parsing POS-tagged sentences of ℓ.
Overall, our approach is to train a "language-agnostic" parser, one that does not know what language it is parsing in. It produces a parse tree ŷ = Parse_Θ(x; u) from a sentence x, constructing T(u) as an intermediate quantity that carries (for example) typological information about ℓ. The parameters Θ are shared by all languages, and determine how to construct and use T. To learn them, we will allow ℓ to range over training languages, and then test our ability to parse when ℓ ranges over novel test languages. Our Parse_Θ(x; u) system has two stages. First it uses a neural network to compute T(u) ∈ ℝ^m, a vector that represents the typological properties of ℓ and resembles the language embedding of Ammar et al. (2016). Then it parses sentence x while taking T(u) as an additional input. We will give details of these two components in §6 and §7.
We assume in this paper that the input sentence x is given as a POS sequence: that is, our parser is delexicalized. This spares us from also needing language-specific lexical parameters associated with the specific vocabulary of each language, a problem that we leave to future work.
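We will choose our universal parameter values by minimizing an estimate of their expected loss,
$$\hat{\Theta} = \operatorname*{argmin}_{\Theta} \frac{1}{|L_{\text{train}}|} \sum_{\ell \in L_{\text{train}}} \mathrm{Loss}(\Theta; x^{(\ell)}, y^{(\ell)}, u^{(\ell)}) \qquad (1)$$
where L_train is a collection of training languages (ideally drawn IID from the distribution D of possible human languages) for which some syntactic information is available. Specifically, each training language ℓ has a treebank (x^(ℓ), y^(ℓ)), where x^(ℓ) is a collection of POS-tagged sentences whose correct dependency trees are given by y^(ℓ). Each ℓ also has an unparsed corpus u^(ℓ) (possibly equal to x^(ℓ) or containing x^(ℓ)). We can therefore define the parser's loss on training language ℓ as
$$\mathrm{Loss}(\Theta; x^{(\ell)}, y^{(\ell)}, u^{(\ell)}) = \operatorname*{mean}_{(x, y) \in (x^{(\ell)}, y^{(\ell)})} \mathrm{loss}(\mathrm{Parse}_{\Theta}(x; u^{(\ell)}), y) \qquad (2)$$
where loss(. . .) is a task-specific per-sentence loss (defined in §8.1) that evaluates the parser's output ŷ on sentence x against x's correct tree y.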

Per-Language Learning
Many papers rely on some universal learning procedure to determine T(u) (see §4) for a target language. For example, T(·) may be the Expectation-Maximization (EM) algorithm, yielding a PCFG T(u) that fully determines a CKY parser (Carroll and Charniak, 1992;. Since EM and CKY are fixed algorithms, this approach has no trainable parameters.
Grammar induction tries to turn an unsupervised corpus into a generative grammar. The approach of the previous paragraph is often modified to reduce model error or search error (§1). To reduce model error, many papers have used dependency grammar, with training objectives that incorporate notions like lexical attraction (Yuret, 1998) and grammatical bigrams (Paskin, 2001, 2002). The dependency model with valence (DMV)  was the first method to beat a simple right-branching heuristic. Headden III et al. (2009) and Spitkovsky et al. (2012) made the DMV more expressive by considering higher-order valency or punctuation. To reduce search error, strategies for eliminating or escaping local optima have included convexified objectives (Wang et al., 2008; Gimpel and Smith, 2012), smart initialization Mareček and Straka, 2013), search bias Eisner, 2005, 2006; Naseem et al., 2010; Gillenwater et al., 2010), branch-and-bound search (Gormley and Eisner, 2013), and switching objectives (Spitkovsky et al., 2013).
Unsupervised parsing (which is also our task) tries to turn the same corpus directly into a treebank, without necessarily finding a grammar. We discuss some recent milestones here. Grave and Elhadad (2015) propose a transductive learning objective for unsupervised parsing, and a convex relaxation of it. (Jiang et al. (2017) combined that work with grammar induction.) Martínez Alonso et al. (2017) create an unsupervised dependency parser that is formally similar to ours in that it uses cross-linguistic knowledge as well as statistics computed from a corpus of POS sequences in the target language. However, its cross-linguistic knowledge is hand-coded: namely, the set of POS-to-POS dependencies that are allowed by the UD annotation scheme, and the typical directions for some of these dependencies. The only corpus statistic extracted from u is whether ADP-NOMINAL or NOMINAL-ADP bigrams are more frequent (see footnote 6), which distinguishes prepositional from postpositional languages. The actual parser starts by identifying the head word as the most "central" word according to a PageRank (Page et al., 1999) analysis of the graph of candidate edges, and proceeds by greedily attaching words of decreasing PageRank at lower depths in the tree.

Multi-Language Learning
This approach parses a "target" language using the treebanks of other resource-rich languages as "source" languages. There are two main variants.
Memory-based. This method trains a supervised parsing model on each source treebank. It uses these (delexicalized) source-language models to help parse the target sentence, favoring sources that are similar to the target language. A common similarity measure (Rosa and Žabokrtský, 2015a) considers the probability of the target language's POS-corpus u under a trigram language model of source-language POS sequences. Single-source transfer (SST) (Rosa and Žabokrtský, 2015a; Wang and Eisner, 2016) simply uses the parser for the most similar source treebank. Multi-source transfer (MST) (Rosa and Žabokrtský, 2015a) parses the target POS sequence with each of the source parsers, and then combines these parses into a consensus tree using the Chu-Liu-Edmonds algorithm (Chu, 1965; Edmonds, 1967). As a faster variant, model interpolation (Rosa and Žabokrtský, 2015b) builds a consensus model for the target language (via a weighted average of source models' parameters), rather than a consensus parse for each target sentence separately.
Footnote 6: In our notation of §6.1, below, this asks whether Σ_{t ∈ {NOUN, PRON, PROPN}} π^w_{t|ADP} is greater for w = 1 or w = −1.
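To make the single-source selection step concrete, here is a minimal sketch that ranks source treebanks by how probable the target POS corpus is under a trigram model of each source. It follows the description in the text rather than the exact measure of Rosa and Žabokrtský (2015a); the add-one smoothing, the per-token averaging, and all names (`TrigramLM`, `most_similar_source`) are assumptions made for illustration.

```python
import math
from collections import Counter


def trigrams(sentence):
    """Yield (context, tag) pairs for a POS sequence padded with boundary symbols."""
    padded = ["<s>", "<s>"] + list(sentence) + ["</s>"]
    for i in range(2, len(padded)):
        yield (padded[i - 2], padded[i - 1]), padded[i]


class TrigramLM:
    """Trigram model over POS tags with add-one smoothing (an assumed choice)."""

    def __init__(self, corpus, tagset):
        self.vocab_size = len(tagset) + 1  # +1 for the end-of-sentence symbol
        self.tri = Counter()
        self.ctx = Counter()
        for sent in corpus:
            for context, tag in trigrams(sent):
                self.tri[context, tag] += 1
                self.ctx[context] += 1

    def avg_logprob(self, corpus):
        """Average per-token log-probability of a POS corpus under this model."""
        total, n = 0.0, 0
        for sent in corpus:
            for context, tag in trigrams(sent):
                p = (self.tri[context, tag] + 1) / (self.ctx[context] + self.vocab_size)
                total += math.log(p)
                n += 1
        return total / n


def most_similar_source(target_corpus, source_corpora, tagset):
    """Return the name of the source whose trigram model best predicts the target."""
    scores = {
        name: TrigramLM(corpus, tagset).avg_logprob(target_corpus)
        for name, corpus in source_corpora.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    tagset = {"VERB", "DET", "NOUN"}
    sources = {
        "verb_initial": [["VERB", "DET", "NOUN"], ["VERB", "NOUN"]],
        "verb_final": [["DET", "NOUN", "VERB"], ["NOUN", "VERB"]],
    }
    best, _ = most_similar_source([["VERB", "DET", "NOUN"]], sources, tagset)
    print(best)
```

In single-source transfer one would then parse the target with the parser trained on the selected source; multi-source transfer would instead use the scores as weights.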
Memory-based methods require storing models for all source treebanks, which is expensive when we include thousands of GD treebanks ( §3).
Model-based. This method trains a single language-agnostic model. McDonald et al. (2011) train a delexicalized parser on the concatenation of all source treebanks, achieving a large gain over grammar induction. This parser can learn universals such as the preference for determiners to attach to nouns (which was hard-coded by Naseem et al. (2010)). However, it is expected to parse a sentence x without being told the language or even a corpus u, possibly by guessing properties of the language from the configurations it encounters in the single sentence x alone.
Further gains were achieved (Naseem et al., 2012; Täckström et al., 2013b; Zhang and Barzilay, 2015; Ammar et al., 2016) by providing the parser with about 10 typological properties of x's language (for example, whether direct objects generally fall to the right of the verb), as listed in the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013).
However, relying on WALS raises some issues. (1) The unknown language might not be in WALS. (2) Some typological features are missing for some languages. (3) All the WALS features are categorical values, which loses useful information about tendencies (for example, how often the canonical word order is violated). (4) Not all WALS features are useful: only 56 of them pertain to word order, and only 8 of those have been used in past work. (5) With a richer parser (a stack LSTM dependency parser), WALS features do not appear to help at all on unknown languages (Ammar et al., 2016, footnote 30).

Exploiting Parallel Data
Some other work on generalizing from source to target languages assumes the availability of source-target parallel data, or bitext. Two uses:
Induction of multilingual word embeddings. Similar to universal POS tags, multilingual word embeddings serve as a universal representation that bridges the lexical differences among languages. Guo et al. (2016) proposed two approaches: (1) Training a variant of the skip-gram model (Mikolov et al., 2013) by using bilingual sets of context words. (2) Generating the embedding of each target word by averaging the embeddings of the source words to which it is aligned.
Annotation projection. Given aligned bitext, one can generate an approximate parse for a target sentence by "projecting" the parse tree of the corresponding source sentence. A target-language parser can then be trained from these approximate parses. The idea was originally proposed by Yarowsky et al. (2001), and then applied to dependency parsing on low-resource languages (Hwa et al., 2005;Ganchev et al., 2009;Smith and Eisner, 2009;Tiedemann, 2014, inter alia). McDonald et al. (2011) extend this approach to multiple source languages by projected transfer. Later work in this vein mainly tries to improve the approximate parses, including translating the source treebanks into the target language with an off-the-shelf machine translation system , augmenting the trees with weights (Agić et al., 2016), and using only partial trees with high-confidence alignments Collins, 2015, 2017;Lacroix et al., 2016).

Situating Our Work
Our own approach can be categorized as model-based multi-language learning with no parallel text or target-side supervision. However, we also analyze an unparsed corpus u of the target language, as the per-language systems of §5.1 do. Our analysis of u does not produce a specialized target grammar or parser, but only extracts a target vector T(u) to be fed to the language-agnostic parser. The analyzer is trained jointly with the parser, over many languages.

The Typology Component
Wang and Eisner (2017) extract typological properties of a language from its POS-tagged corpus u, in effect predicting syntactic structure from superficial features. Like them, we compute a hidden layer T(u) from a vector π(u) of surface features of u, using a standard multilayer perceptron architecture.
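The displayed formula for this hidden layer did not survive in this copy of the text. As an illustration only, a single-hidden-layer perceptron consistent with the description would take the following form, where the weight matrix W, the bias vector b_W, and the element-wise activation ψ are assumed names rather than symbols confirmed by the source, and m is the dimension of T(u) given in the task formulation:

```latex
T(u) = \psi\bigl(W\,\pi(u) + b_W\bigr) \in \mathbb{R}^{m}
```

Under this reading, W and b_W would be part of the shared parameters Θ and would be trained jointly with the parser.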

Design of the Surface Features π(u)
To define π(u), we used development data to select the following fast but effective subset of the features proposed by Wang and Eisner (2017).
Hand-engineered features. Given a token j in a sentence, let its right window R_j be the sequence of POS tags p_{j+1}, ..., p_{j+w} (padding the sentence as needed with # symbols), where w is the window size. Define g_w(t | j) ∈ [0, 1] to be the fraction of words in R_j tagged with t. Now, given a corpus u, define
$$\pi^{w}_{t} = \operatorname*{mean}_{j}\, g_w(t \mid j), \qquad \pi^{w}_{t|s} = \operatorname*{mean}_{j \,:\, p_j = s}\, g_w(t \mid j)$$
where j ranges over tokens of u. The unigram prevalence π^w_t measures the frequency of t overall, while the bigram prevalence π^w_{t|s} measures the frequency with which t can be found to the right of an average s tag (in a window of size w). For each of these quantities, we have a corresponding mirror-image quantity (denoted by negating w) by computing it on a reversed version of the corpus.
The final hand-engineered π(u) includes:
• π^w_t, for each tag type t and each w ∈ {1, 3, 8, 100}. This quantity measures how frequently t appears in u.
• π^w_{t|s} // π^w_t and π^{−w}_{t|s} // π^{−w}_t, for each tag type pair s, t and each w ∈ {1, 3, 8, 100}. We define x // y = min(x/y, 1) to bound the feature values for better generalization. Notice that if w = 1, the log of π^w_{t|s} / π^w_t is the bigram pointwise mutual information. Each matched pair of these quantities is intuitively related to the word order typology: for example, if ADPs are more likely to have closely following than closely preceding NOUNs (π^w_{NOUN|ADP} // π^w_NOUN > π^{−w}_{NOUN|ADP} // π^{−w}_NOUN), the language is more likely to be prepositional than postpositional.
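The hand-engineered prevalence features can be sketched in a few lines of Python. This is an illustrative reading of the definitions above, not the authors' code: the function names, the representation of a corpus as a list of tag lists, and the choice to return 0 when the denominator of x // y is zero are all assumptions.

```python
from collections import defaultdict

PAD = "#"  # padding symbol for windows that run past the end of a sentence


def window_fraction(tags, j, w, t):
    """g_w(t | j): fraction of the w positions after token j that carry tag t."""
    window = [tags[k] if k < len(tags) else PAD for k in range(j + 1, j + 1 + w)]
    return window.count(t) / w


def prevalences(corpus, w, tagset):
    """Return (unigram, bigram) prevalence dicts for window size w.

    unigram[t] averages g_w(t | j) over all tokens j;
    bigram[(t, s)] averages it over tokens j tagged s.
    A negative w gives the mirror-image quantity on the reversed corpus.
    """
    if w < 0:
        corpus = [list(reversed(sent)) for sent in corpus]
        w = -w
    uni_sum = defaultdict(float)
    bi_sum = defaultdict(float)
    n_by_tag = defaultdict(int)
    n_tokens = 0
    for tags in corpus:
        for j, s in enumerate(tags):
            n_tokens += 1
            n_by_tag[s] += 1
            for t in tagset:
                g = window_fraction(tags, j, w, t)
                uni_sum[t] += g
                bi_sum[t, s] += g
    uni = {t: uni_sum[t] / n_tokens for t in tagset}
    bi = {(t, s): total / n_by_tag[s] for (t, s), total in bi_sum.items()}
    return uni, bi


def bounded_ratio(x, y):
    """x // y in the text's notation: min(x / y, 1); 0 if y == 0 (an assumption)."""
    return min(x / y, 1.0) if y > 0 else 0.0


if __name__ == "__main__":
    tagset = ["ADP", "NOUN", "VERB", "DET"]
    corpus = [["ADP", "NOUN", "VERB"], ["ADP", "DET", "NOUN"]]
    uni_f, bi_f = prevalences(corpus, 1, tagset)
    uni_b, bi_b = prevalences(corpus, -1, tagset)
    forward = bounded_ratio(bi_f["NOUN", "ADP"], uni_f["NOUN"])
    backward = bounded_ratio(bi_b["NOUN", "ADP"], uni_b["NOUN"])
    print("prepositional" if forward > backward else "postpositional")
```

On this toy corpus, NOUNs follow ADPs more readily than they precede them, so the matched pair of features points toward a prepositional language, as in the example in the text.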
Neural features. In contrast, our neural features automatically learn to extract arbitrary predictive configurations. As Figure 2 shows, we encode each POS-tagged sentence u i ∈ u using a recurrent neural network, which reads one-hot POS embeddings from left to right, then outputs its final hidden state vector f i as the encoding. The final neural π(u) is the average encoding of all sentences (average-pooling): that is, the average of all sentence-level configurations. We specifically use a gated recurrent unit (GRU) network (Cho et al., 2014). The GRU is jointly trained with all other parameters in the system so that it focuses on detecting word-order properties of u that are useful for parsing.

The Parsing Architecture
To construct Parse(x; u), we can extend any statistical parsing architecture Parse(x) to be sensitive to T(u). For our experiments, we extend the delexicalized graph-based implementation of the BIST parser (Kiperwasser and Goldberg, 2016), an arc-factored dependency model with neural context features extracted by a bidirectional LSTM. This recent parser was the state of the art when it was published.
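Given a POS-sentence x and a corpus u, our parser first selects an unlabeled tree
$$\hat{y} = \operatorname*{argmax}_{y}\, \mathrm{score}(x, y; u) \qquad (4)$$
where, letting a range over the arcs in tree y,
$$\mathrm{score}(x, y; u) = \sum_{a \in y} s(\phi(a; x, u)) \qquad (5)$$
With this definition, the argmax in (4) is computed efficiently by the algorithm of Eisner (1996). s(·) is a neural scoring function on vectors,
$$s(\phi) = v^{\top} \tanh(V \phi + b_V) \qquad (6)$$
where V is a matrix, b_V is a bias vector, and v is a vector, all being parameters in Θ.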
The function φ(a; x, u) extracts the feature vector of arc a given x and u. BIST scores unlabeled arcs, so a denotes a pair (i, j): the indices of the parent and child, respectively. We define
$$\phi(a; x, u) = [B(x, i; T(u));\; B(x, j; T(u))] \qquad (7)$$
which concatenates contextual representations of tokens i and j. B(x, i; T(u)) is itself a concatenation of the hidden states of a left-to-right LSTM and a right-to-left LSTM (Graves, 2012) when each has read sentence x up through word i (really POS tag i). These LSTM parameters are included in Θ.
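The arc-factored scoring can be illustrated with a small pure-Python sketch. It assumes a one-hidden-layer scorer with a tanh nonlinearity (the nonlinearity is an assumption here), and it takes the contextual vectors B as given, standing in for the BiLSTM outputs; the function names and the `heads` encoding of a tree are hypothetical.

```python
import math


def arc_score(phi, V, b_V, v):
    """s(phi) = v . tanh(V phi + b_V), with V as a list of rows."""
    hidden = [
        math.tanh(sum(weight * x for weight, x in zip(row, phi)) + bias)
        for row, bias in zip(V, b_V)
    ]
    return sum(v_k * h for v_k, h in zip(v, hidden))


def arc_features(B, i, j):
    """phi(a) for arc a = (i, j): concatenate the vectors of parent i and child j."""
    return B[i] + B[j]


def tree_score(heads, B, V, b_V, v):
    """Sum of arc scores for a tree encoded as heads[j] = parent of token j.

    Position 0 is the distinguished root symbol, so heads[0] is unused.
    """
    return sum(
        arc_score(arc_features(B, heads[j], j), V, b_V, v)
        for j in range(1, len(heads))
    )


if __name__ == "__main__":
    B = [[0.0], [1.0], [2.0]]          # stand-in contextual vectors for root, token 1, token 2
    V = [[1.0, 0.0], [0.0, 1.0]]       # toy parameters, not trained values
    b_V = [0.0, 0.0]
    v = [1.0, 1.0]
    print(tree_score([None, 0, 1], B, V, b_V, v))
```

Because the tree score decomposes over arcs, a decoder only needs the table of `arc_score` values for all parent-child pairs, which is what makes dynamic-programming inference possible.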
The POS tags in x are provided to the LSTMs as one-hot vectors. Crucially, T(u) is also provided to the LSTM at each step, as shown in Figure 3.
After selecting the best tree via equation (4), we use each arc's φ vector again to predict its label. This yields the labeled tree ŷ = Parse_Θ(x; u).
The only extension that this makes to BIST is to supply T(u) to the BiLSTM (see footnote 8). This extension is not a significant slowdown at test time, since T(u) only needs to be computed once per test language, not once per test sentence. Since T(u) can be computed for any novel language at test time, this differs from the "many languages, one parser" architecture (Ammar et al., 2016), in which a test-time language must have been seen at training time or at least must have known WALS features.
[Figure 3 caption, partially recovered: ... where s_{i,j} in each cell is the arc score s(φ(a; x, T(u))) from equation (6). The root of the tree is always position 0, where x_0 is a distinguished "root" symbol that is prepended to the input sentence.]
Product of experts. We also consider a variant of the function (6) for scoring arc a, namely
$$s(a) = \lambda\, s_H(a) + (1 - \lambda)\, s_N(a)$$
where s_H(a) and s_N(a) are the scores produced by separately trained systems using, respectively, the hand-engineered and neural features from §6.1. Hyperparameter λ ∈ [0, 1] is tuned on dev data.

Training Objective
We follow exactly the training method of Kiperwasser and Goldberg (2016), who minimize a structured max-margin hinge loss (Taskar et al., 2004; McDonald et al., 2005; LeCun et al., 2007).
Footnote 8: An alternative would be to concatenate T(u) with the representation computed by the BiLSTM. This gets empirically worse results, probably because the BiLSTM does not have advance knowledge of language-specific word order as it reads the sentence. We also tried an architecture that does both, with no notable improvement.
We want the correct tree y to beat each tree y′ by a margin equal to the number of errors in y′ (we count spurious edges). Formally, loss(x, y; u) is given by
$$\mathrm{loss}(x, y; u) = \max\Bigl(0,\; \max_{y'} \sum_{a \in y'} \bigl[s(\phi(a; x, u)) + \mathbb{1}_{a \notin y}\bigr] - \sum_{a \in y} s(\phi(a; x, u))\Bigr)$$
where a ranges over the arcs of a tree, and 1_{a∉y} is an indicator that is 1 if a ∉ y. Thus, this loss function is high if there exists a tree y′ that has a high score relative to y yet low precision. The training algorithm makes use of loss-augmented inference (Taskar et al., 2005), a variant on the ordinary inference of (4). The most violating tree y′ (in the max over y′ above) is computed again by an arc-factored dependency algorithm (Eisner, 1996), where the score of any candidate arc a is s(φ(a; x, u)) + 1_{a∉y}. Actually, the above method would only train the score function to predict the correct unlabeled tree as above (since a ranges over unlabeled arcs as before). In practice, we also jointly train the labeler to predict the correct labels on the gold arcs, using a separate hinge-loss objective. Because these two components share parameters through φ(a; x, u), this is a multi-task learning problem.
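The margin loss with loss-augmented inference can be checked by hand on very short sentences. The sketch below is illustrative only: it finds the most violating tree by brute-force enumeration of all head assignments that form a tree rooted at position 0, rather than by the dynamic-programming algorithm of Eisner (1996), so it is exponential in sentence length and does not enforce projectivity. The table `arc_scores[h][j]` is a hypothetical stand-in for s(φ(a; x, u)) on arc (h, j).

```python
from itertools import product


def is_tree(heads):
    """True if every token 1..n reaches the root (position 0) without a cycle."""
    for j in range(1, len(heads)):
        seen, k = set(), j
        while k != 0:
            if k in seen:
                return False
            seen.add(k)
            k = heads[k]
    return True


def all_trees(n):
    """Enumerate every dependency tree over n tokens as a heads tuple (brute force)."""
    for combo in product(range(n + 1), repeat=n):
        heads = (None,) + combo
        if is_tree(heads):
            yield heads


def hinge_loss(arc_scores, gold_heads):
    """Structured hinge loss: each wrong arc in a candidate tree adds 1 to its score."""
    n = len(gold_heads) - 1

    def score(heads, augment):
        total = 0.0
        for j in range(1, n + 1):
            total += arc_scores[heads[j]][j]
            if augment and heads[j] != gold_heads[j]:
                total += 1.0  # cost of a spurious arc
        return total

    most_violating = max(score(heads, True) for heads in all_trees(n))
    return max(0.0, most_violating - score(gold_heads, False))


if __name__ == "__main__":
    gold = (None, 0, 1)                      # root -> token 1 -> token 2
    confident = [[0, 5, 0], [0, 0, 5], [0, 0, 0]]
    uninformed = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    print(hinge_loss(confident, gold), hinge_loss(uninformed, gold))
```

With confident scores on the gold arcs the loss vanishes, while an uninformed scorer is penalized by the largest number of wrong arcs that any competing tree can contain.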

Training Algorithm
Augment training data. Unlike ordinary NLP problems whose training examples are sentences, each training example in equation (1) is an entire language. Unfortunately, UD (§3) only provides a few dozen languages, presumably not enough to generalize well to novel languages. We therefore augment our training dataset L_train with thousands of synthetic languages from the GD dataset (§3), as already discussed in §3.1.
Stochastic gradient descent (SGD). Treating each language as a single large example during training would lead to slow SGD steps. Instead, we take our SGD examples to be individual sentences, by regarding equations (1)-(2) together as an objective averaged over sentences. Each example (x, y, u) is sampled hierarchically, by first drawing a language ℓ from L_train and setting u = u^(ℓ), then drawing the sentence (x, y) uniformly from (x^(ℓ), y^(ℓ)). We train using mini-batches of 100 sentences; each mini-batch can mix many languages.
Encourage real languages. To sample ℓ from L_train, we first flip a coin with weight β ∈ [0, 1] to choose "real" vs. "synthetic," and then sample uniformly within that set. Why? The test sentences will come from real languages, so the synthetic languages are out-of-domain. Including them reduces variance but increases bias. We raise β to keep them from overwhelming the real languages.
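The hierarchical sampling scheme of the last two paragraphs can be sketched as follows. The data layout (a dict from language name to a list of (x, y) pairs) and the function names are assumptions for illustration; in the full system the sampled language would also determine the unparsed corpus u used to compute T(u).

```python
import random


def sample_language(real_langs, synthetic_langs, beta, rng):
    """Flip a coin with weight beta for "real", then pick uniformly within that set."""
    pool = real_langs if rng.random() < beta else synthetic_langs
    return rng.choice(pool)


def sample_minibatch(treebanks, real_langs, synthetic_langs, beta, batch_size, rng):
    """Draw (x, y, language) triples; one mini-batch may mix many languages."""
    batch = []
    for _ in range(batch_size):
        lang = sample_language(real_langs, synthetic_langs, beta, rng)
        x, y = rng.choice(treebanks[lang])  # sentence drawn uniformly within the language
        batch.append((x, y, lang))
    return batch


if __name__ == "__main__":
    rng = random.Random(0)
    treebanks = {
        "real_a": [(["NOUN", "VERB"], [2, 0])],
        "synth_a": [(["VERB", "NOUN"], [0, 1])],
    }
    batch = sample_minibatch(treebanks, ["real_a"], ["synth_a"], 0.2, 100, rng)
    n_real = sum(1 for _, _, lang in batch if lang == "real_a")
    print(n_real, "of", len(batch), "sentences came from real languages")
```

Setting `beta` to 1 recovers training on real languages only, and setting it to 0 trains on synthetic languages only.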
Sample efficiently. The sentences (x, y) are stored in different files by language. To reduce disk accesses, we do not visit a file on each sample. Rather, for each language ℓ, we maintain in memory a subset of (x^(ℓ), y^(ℓ)), obtained by reservoir sampling. Samples from (x^(ℓ), y^(ℓ)) are drawn sequentially from this "chunk," and when it is used up we fetch a new chunk. We also maintain u^(ℓ) and the hand-engineered features from π(u^(ℓ)) in memory.
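The chunking strategy can be illustrated with the classic reservoir sampling procedure (often called Algorithm R), which draws a uniform sample of fixed size in one pass over a file of unknown length. The `ChunkedSampler` class and its `load_sentences` callback are hypothetical names; the source does not specify the chunk size or the exact variant of reservoir sampling used.

```python
import random


def reservoir_sample(stream, k, rng):
    """Uniformly sample up to k items from an iterable in a single pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


class ChunkedSampler:
    """Serve sentences of one language from an in-memory chunk, refilled on demand."""

    def __init__(self, load_sentences, chunk_size, rng):
        self.load_sentences = load_sentences  # callable returning an iterator over the file
        self.chunk_size = chunk_size
        self.rng = rng
        self.chunk = []

    def next(self):
        if not self.chunk:
            # One pass over the file per chunk, instead of one disk access per sample.
            self.chunk = reservoir_sample(self.load_sentences(), self.chunk_size, self.rng)
        return self.chunk.pop()


if __name__ == "__main__":
    rng = random.Random(0)
    sampler = ChunkedSampler(lambda: iter(range(1000)), 50, rng)
    print([sampler.next() for _ in range(5)])
```

Each chunk costs one sequential read of the language's file, so the number of file visits drops by roughly a factor of the chunk size compared with sampling one sentence per visit.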

Basic Setup
Our data split follows that of Wang and Eisner (2017), as shown in Table 2 (see footnote 11), which has 18 training languages (20 treebanks) and 17 test languages. All hyperparameters are tuned via 5-fold cross-validation on the 20 training treebanks: that is, we evaluate each fold (4 treebanks) using the model trained on the remaining folds (16 treebanks). However, a model trained on a treebank of language ℓ is never evaluated on another treebank of language ℓ. We selected the hyperparameters that maximized the average unlabeled attachment score (UAS) (Kübler et al., 2009), which is the evaluation metric that is reported by most previous work on unsupervised parsing. We also report labeled attachment score (LAS). When augmenting the data, the 16 training treebanks are "mixed and matched" to get GD treebanks for 16×17×17 = 4624 additional synthetic training languages (Wang and Eisner, 2016, §5).
Footnote 11: However, as we are interested in transfer to unseen languages, our Table 2 follows the principle of Eisner and Wang (n.d.) and does not test on the Finnish ftb or Latin treebanks because other treebanks of those languages appeared in training data. Specifically, Latin itt and Latin proiel fall in the same training folds as French and Italian, respectively. For the same reason, Table 2 does not show cross-validation development results on these Latin treebanks, nor on the Ancient Greek grc and Ancient Greek grc_proiel treebanks, which fall in the same training folds as Czech and Danish, respectively.
The next sections analyze these cross-validation results. Finally, §9.8 will evaluate on 15 previously unseen languages (excluding Latin and Finnish ftb) with our model trained on all 18 training languages (20 treebanks for UD, plus 20×21×21 = 8820 when adding GD) with the hyperparameters that achieved the best average unlabeled attachment score during cross-validation.
The UD and GD corpora provide a train/dev/test split of each treebank, denoted as (x_train, y_train), (x_dev, y_dev) and (x_test, y_test). Throughout this paper, for both training and testing languages, we take (x^(ℓ), y^(ℓ)) = (x_train, y_train). We take u^(ℓ) to consist of all x_train sentences with ≤ 40 tokens. Table 1 shows the cross-validation parsing results over different systems discussed so far. For each architecture, we show the best average unlabeled attachment score (the UAS column) chosen by cross-validation, and the corresponding labeled attachment score (the LAS column). In brief, the main sources of improvement are twofold:

Comparison Among Architectures
Synthetic languages. We observe that +GD consistently outperforms UD across all architectures. It even helps with the baseline system that we tried, which simply ignores the target corpus u^(ℓ). In that system (similar to McDonald et al. (2011)), the BiLSTM may still manage to extract ℓ-specific information from the single sentence x ∈ x^(ℓ) that it is parsing. 13 The additional GD training languages apparently help it learn to do so in a way that generalizes to new languages. To better understand the trend, we study how the performance varies when more synthetic languages are used. As shown in Figure 4, when β = 1, all the training languages are sampled from real languages. By gradually increasing the proportion of GD languages (reducing β from §8.2), the baseline UAS increases dramatically from 63.95 to 67.97. However, if all languages are uniformly sampled (β = 16/(4624+16) ≈ 0.003) or only synthetic languages are used (β = 0), the UAS falls back slightly to 67.42 or 67.36. The best β value is 0.2, which treats each real language as (0.2/16)/(0.8/4624) ≈ 72 times more helpful than each synthetic language, yet 80% of the training data is contributed by synthetic languages. β = 0.2 was also optimal for the non-baseline systems in Table 1.
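The two quantities quoted above follow directly from the sampling scheme: a real language is drawn with probability β/16 and a synthetic one with probability (1−β)/4624. The short sketch below reproduces the arithmetic; the function names are illustrative.

```python
def per_language_weight_ratio(beta, n_real, n_synth):
    """How many times more often one real language is sampled than one synthetic language."""
    return (beta / n_real) / ((1.0 - beta) / n_synth)

def uniform_beta(n_real, n_synth):
    """The value of beta at which every language, real or synthetic, is equally likely."""
    return n_real / (n_real + n_synth)

print(round(per_language_weight_ratio(0.2, 16, 4624), 2))  # 72.25
print(round(uniform_beta(16, 4624), 5))                    # 0.00345
```

At the uniform value of β the ratio is 1 by construction, which is a quick sanity check on both formulas.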
Unparsed corpora. The systems that exploit unparsed corpora consistently outperform the baseline system in both the UD and +GD conditions. To investigate, we examine the impact of reducing u^(ℓ) when parsing a held-out language ℓ. We used the system in row N and column +GD of Table 1, which was trained on full-sized u corpora. When testing on a held-out language ℓ, we compute T(u^(ℓ)) using only a random size-t subset of u^(ℓ). As shown in Figure 5, the system does not need a very large unparsed corpus; most of the benefit is obtained by t = 256. Nonetheless, a larger corpus always achieves a better and more stable performance.

Comparison to SST
Besides Baseline, another directly comparable approach is SST ( §5.2). As shown in Table 1, SST gives a stronger baseline on the UD column-as good as H+N. However, this advantage does not carry over to the +GD column, meaning that SST cannot exploit the extra training data. Wang and Eisner (2016, Figure 5) already found that GD languages provide diminishing benefit to SST as more UD languages get involved. 14 For H+N, however, the extra GD languages do help to identify the truly useful surface patterns in u.
We also considered trying model interpolation (Rosa and Žabokrtský, 2015b). Unfortunately, as mentioned in §5.2, this method is impractical with GD languages, because it requires storing 4624 (§9.1) additional local models. Nonetheless, we can estimate an "upper bound" on how well the interpolation might do. Our upper bound is SST where an oracle is used to choose the source language; Rosa and Žabokrtský (2015b) found that in practice, this does better than interpolation. This approximate upper bound is 68.03 UAS and 52.10 LAS, neither of which is significantly better than H+N on UD, but both of which are significantly outperformed by H+N on +GD.

Oracle Typology vs. Our Learned T(u)
The results in Table 1 demonstrate that we learned to extract features T(u), from the unparsed target corpus u, that improve the baseline parser. We consider replacing T(u) by an oracle that has access to the true syntax of the target language. We consider two different oracles, T_D and T_W.
T_D is the directionalities typology that was studied by Liu (2010) and used as a training target by Wang and Eisner (2017). Specifically, T_D ∈ [0, 1]^57 is a vector of the directionalities of each type of dependency relation; it specifies what fraction of direct objects fall to the right of the verb, and so on. 15 In principle, this should be very helpful for parsing, but it must be extracted from a treebank, which is presumably unavailable for unknown languages.
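To make the directionality features concrete, the sketch below computes, for each relation type, the fraction of arcs whose dependent falls to the right of its head. It assumes 1-based token positions with head 0 marking the root. Skipping root arcs is an assumption made here, since a root arc has no direction, and the function name is illustrative.

```python
from collections import Counter

def directionalities(treebank):
    """Map each relation label to the fraction of its arcs that point left to right.

    treebank: iterable of sentences, each a list of (head, label) pairs,
    one per token, where the token's own position is its 1-based index.
    """
    rightward = Counter()
    total = Counter()
    for sentence in treebank:
        for dep, (head, label) in enumerate(sentence, start=1):
            if head == 0:
                continue  # assumption: the root arc is not counted
            total[label] += 1
            if head < dep:  # head precedes dependent: arc points rightward
                rightward[label] += 1
    return {label: rightward[label] / total[label] for label in total}
```

For a strictly verb-final language the `dobj` entry would be near 0, and for a strictly verb-initial language it would be near 1.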
We also consider T_W, the WALS features, as the typological classification given by linguists. This resembles the previous multi-language learning approaches (Naseem et al., 2012; Täckström et al., 2013b; Zhang and Barzilay, 2015; Ammar et al., 2016) that exploited the WALS features. In particular, we use 81A, 82A, 83A, 85A, 86A, 87A, 88A and 89A, a union of WALS features used by those works. In order to derive the WALS features for a synthetic GD language, we first copy the features from its substrate language (Wang and Eisner, 2016). We then replace the 81A, 82A, 83A features, which concern the order between verbs and their dependents, by those of its V-superstrate language 16 (if any). We replace 85A, 86A, 87A, 88A and 89A, which concern the order between nouns and their dependents, by those of its N-superstrate language (if any).

14 The number of real treebanks in our cross-validation setting is 16, greater than the 10 in Wang and Eisner (2016).

15 The directionality of a relation a in language ℓ is given by count_ℓ(a→) / count_ℓ(a), where count_ℓ(a→) is the count of a-relations that point from left to right, and count_ℓ(a) is the count of all a-relations.
As a pleasant surprise, we find that our best system (H+N) is competitive with both oracle methods. It outperforms both of them on both UAS and LAS, and the improvements are significant and substantial in 3 of these 4 cases. Our parser has learned to extract information T(u) that is not only cheap (no treebank needed), but also at least as useful as "gold" typology for parsing.

Selected Hyperparameter Settings
For the rest of the experiments, we use the H+N system, as it wins under cross-validation on both UD and +GD (Table 1). This is a combination via (8) of the best H system and the best N system under cross-validation, with the mixture hyperparameter λ also chosen by cross-validation.
For both UD and +GD, cross-validation selected 125 as the sizes of the LSTM hidden states and 100 as the sizes of the hidden layers for scoring arcs (the length of v in equation (6)).
Hyperparameters for UD. The H system computes T(u) with a 1-layer network (as in equation (3)), with hidden size h = 128 and ψ = tanh as the activation function. For the N system, T(u) is a 1-layer network with hidden size h = 64 and ψ = sigmoid as the activation function. The size of the hidden state of GRU as shown in Figure 2 is 128. The mixture weight for the final H+N system is λ = 0.5.
Hyperparameters for +GD. The H system computes T(u) with a 2-layer network (as shown in Figure 1), with h = 128 and ψ = sigmoid for both hidden layers. For N, T(u) is a 1-layer network with hidden size h = 64 and ψ = sigmoid. The size of the hidden state of GRU is 256. Both H and N set β = 0.2 (see §8.2). The mixture weight for the final H+N system is λ = 0.4.

Performance on Noisy Tag Sequences
We test our trained system in a more realistic scenario where both u and x for held-out languages consist of noisy POS tags rather than gold POS tags. Following Wang and Eisner (2016, Appendix B), at test time, the gold POS tags in a corpus are replaced by a noisy version produced by the RDRPOSTagger (Nguyen et al., 2014) trained on a subset of the original gold-tagged corpus. 17 Figure 6 shows a linear relationship between the performance of our best model (H+N with +GD) and the noisiness of the POS tags, which is controlled by altering the amount of training data. With only 100 training sentences, the performance suffers greatly: the UAS drops from 70.65 to 51.57. Nonetheless, even this is comparable to Naseem et al. (2010) on gold POS tags, which yields a UAS of 50.00. That system was the first grammar induction approach to exploit knowledge of the distribution of natural languages, and remained state-of-the-art (Noji et al., 2016).

Figure 6: Performance on noisy input over 16 training languages. Each dot is an experiment annotated by the number of sentences used to train the tagger. (The rightmost "∞" point uses gold tags instead of a tagger, which is the result from Table 1.) The x-axis gives the average accuracy of the trained RDRPOSTagger. The y-axis gives the average parsing performance.

Analysis by Dependency Relation Type
Figure 7 breaks down the results by dependency relation type-showing that using u and synthetic data improves results almost across the board.
We also notice large differences between labeled and unlabeled F1 scores for some relations, especially rarer ones. In other words, the system mislabels the arcs that it correctly recovers. (Remember from §9.2 that the hyperparameters were selected to maximize unlabeled scores (UAS) rather than labeled (LAS).) Figure 8 gives the label confusion matrix. While the dark NONE column shows that arcs of each type are often missed altogether (recall errors), the dark diagonal shows that they are usually labeled correctly if found. That said, it is relatively common to confuse the different labels for nominal dependents of verbs (nsubj, dobj, nmod). We suspect that lexical information could help sort out these roles via distributional semantics. Some other mistakes arise from discrepancies in the annotation scheme. For example, neg can be easily confused with advmod, as some languages (for example, Spanish) use ADV instead of PART for negations.
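A confusion matrix of this kind can be tallied as in the sketch below. It assumes gold and predicted parses are given as parallel lists of `(head, label)` pairs, and it assigns a gold arc to the NONE column whenever the predicted head is wrong. The function name is illustrative.

```python
from collections import Counter, defaultdict

def label_confusion(gold, pred):
    """Row-normalized confusion matrix {gold_label: {predicted_label_or_NONE: fraction}}.

    A gold arc whose head was not recovered goes to the "NONE" column;
    otherwise it goes to the column of the predicted label.
    """
    counts = defaultdict(Counter)
    for (g_head, g_label), (p_head, p_label) in zip(gold, pred):
        column = p_label if p_head == g_head else "NONE"
        counts[g_label][column] += 1
    matrix = {}
    for g_label, row in counts.items():
        row_total = sum(row.values())
        matrix[g_label] = {col: n / row_total for col, n in row.items()}
    return matrix
```

In this representation the diagonal entry of a row is the labeled recall of that relation, and one minus the NONE entry is its unlabeled recall.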

Final Evaluation on Test Data
In all previous sections, we evaluated on the 16 languages in the training set by cross-validation. For the final test, we combine all the 20 treebanks and train the system with the hyperparameters given in §9.5, then test on the 15 unseen test languages. Table 2 displays results on these 15 test languages (top) as well as the cross-validation results on the 16 languages (bottom).
We see that we improve significantly over baseline on almost every language. Indeed, on the test languages, +T(u) improves both UAS and LAS by > 3.5 percentage points on average. The improvement grows to > 5.6 if we augment the training data as well (+GD, meaning +T(u)+GD).
One disappointment concerns the added benefit on the LAS of +GD over just +T(u): while this data augmentation helped significantly on nearly every one of the 16 development languages, it produced less consistent improvements on the test languages and hurt some of them. We suspect that this is because we tuned the hyperparameters to maximize UAS, not LAS (§9.2). As a result, while the average benefit across our 15 test languages was fairly large, this sample was not large enough to establish that it was significantly greater than 0, that is, that future test languages would also see an improvement from data augmentation.
We also notice that there seems to be a small difference between the pattern of results on development versus test languages. This may simply reflect overfitting to the development languages, but we also note that the test languages (chosen by  Wang and Eisner (2016)) tended to have considerably smaller unparsed corpora u, so there may be a domain mismatch problem. To ameliorate this problem, one could include training examples with versions of u that are truncated to lengths seen in test data (cf. Figure 5). One could also include the size |u| explicitly in T(u).

Conclusion and Future Work
We showed how to build a "language-agnostic" delexicalized dependency parser that can better parse sentences of an unknown language by exploiting an unparsed (but POS-tagged) corpus of that language. Unlike grammar induction, which estimates a PCFG from the unparsed corpus, we train a neural network to extract a feature vector from the unparsed corpus that helps a subsequent neural parser. By end-to-end training on the treebanks of many languages (optionally including synthetic languages), our neural network can extract linguistic information that helps neural dependency parsing. Variants of our architecture are possible. In future work, the neural parser could use attention to look at individual relevant sentences of u, which are posited to be triggers in some theories of child grammar acquisition (Gibson and Wexler, 1994; Frank and Kapur, 1996). We could also try injecting T(u) into the neural parser by means other than concatenating it with the input POS embeddings. We might also consider parsing architectures other than BIST, such as the LSTM-Minus architecture for scoring spans (Cross and Huang, 2016), or the recent attention-based arc-factored model (Dozat and Manning, 2017). Finally, our approach is applicable to tasks other than dependency parsing, such as constituent parsing or semantic parsing, if suitable treebanks are available for many training languages.
For applied uses, it would be interesting to combine the unsupervised techniques of this paper with low-resource techniques that make use of some annotated or parallel data in the target language. It would also be interesting to include further synthetic languages that have been modified to better resemble the actual target languages, using the method of Wang and Eisner (2018).
It is important to relax the delexicalized assumption. As shown in §9.6, the performance of our system relies heavily on the gold POS tags, which are presumably not available for unknown languages. What is available is lexical information, which has proved to be very important for supervised parsing, and should help unsupervised parsers as well. As discussed in §9.7, some errors seem easily fixable by considering word distributions. In the future, we will explore ways to extend our cross-linguistic parser to work with word sequences rather than POS sequences, perhaps by learning a cross-language word representation that is shared among training and test languages (Ruder et al., 2017).

Figure 8: Each row is normalized to sum to 1 and represents a frequent gold relation. For example, the nsubj row shows how well we recovered the gold nsubj arcs; the (nsubj, dobj) entry shows p(predicted = dobj | gold = nsubj), which measures the fraction of nsubj relations that are recovered but mislabeled as dobj. The diagonal represents correct arcs: where dark, it indicates high labeled recall for that relation. The final column represents gold arcs that were not recovered with any label: where dark, it indicates low unlabeled recall for that relation. We show the top 20 relations sorted by gold frequency.
One takeaway message from this work is contained in our title. Surface statistics of a language-mined from the surface part-of-speech order-provide clues about how to find the underlying syntactic dependencies. Chomsky (1965) imagined that such clues might be exploited by a Language Acquisition Device, so it is interesting to know that they do exist.
Another takeaway message is that synthetic training languages are useful for NLP. Using synthetic examples in training is a way to encourage a system to be invariant to superficial variation. We created synthetic languages by varying the surface structure in a way that "should" preserve the deep structure. This allows our trained system to be invariant to variation in surface structure, just as object recognition wants to be invariant to an image's angle or lighting conditions ( §3.1).
Our final takeaway goes beyond language: one can treat unsupervised structure discovery as a supervised learning problem. As §§1-2 discussed, this approach inherits the advantages of supervised learning. Training may face an easier optimization landscape, and we can train the system to find the specific kind of structure that we desire, using any features that we think may be discriminative.

Table 2: Data splits and final evaluation on the 15 test languages (top), along with cross-validation results on the 16 development languages (bottom) grouped by 5 folds (separated by dashed lines). For languages with multiple treebanks, we identify them by subscripts. We use "Slavonic" for Old Church Slavonic. Column B is the baseline that doesn't use T(u) (McDonald et al., 2011). +T(u) is our H+N system, and +GD is that system when the training data is augmented with synthetic languages. In comparing among these three systems, we boldface the highest score as well as all scores that are not significantly worse (paired permutation test, p < 0.05). If a row is an average over many sentences of a single language, then each paired datapoint is a sentence, so a significant improvement should generalize to new sentences. But if a row is an average over languages, then each paired datapoint is a language (as in Table 1), so a significant improvement should generalize to new languages.
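The paired permutation test mentioned in the Table 2 caption can be approximated as in the sketch below. The source does not specify the exact variant, so this is one common form, stated here as an assumption: a two-sided Monte Carlo test that randomly flips the sign of each paired difference. The function name and the default number of trials are illustrative.

```python
import random

def paired_permutation_test(a, b, trials=10000, rng=random):
    """Approximate two-sided p-value for the mean paired difference between a and b.

    a, b: parallel lists of scores, one pair per datapoint
    (a sentence or a language, depending on the row being tested).
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(trials):
        # Under the null hypothesis the two systems are exchangeable,
        # so each difference is equally likely to have either sign.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (trials + 1)
```

With only 15 paired datapoints, as for the test languages, the smallest two-sided p-value that exact enumeration could yield is 2/2^15, so a consistent but noisy gain can easily fail to reach p < 0.05.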