Improving Low-Resource Cross-lingual Parsing with Expected Statistic Regularization

We present Expected Statistic Regularization (ESR), a novel regularization technique that utilizes low-order multi-task structural statistics to shape model distributions for semi-supervised learning on low-resource datasets. We study ESR in the context of cross-lingual transfer for syntactic analysis (POS tagging and labeled dependency parsing) and present several classes of low-order statistic functions that bear on model behavior. Experimentally, we evaluate the proposed statistics with ESR for unsupervised transfer on 5 diverse target languages and show that all statistics, when estimated accurately, yield improvements to both POS and LAS, with the best statistic improving POS by +7.0 and LAS by +8.5 on average. We also present semi-supervised transfer and learning curve experiments that show ESR provides significant gains over strong cross-lingual-transfer-plus-fine-tuning baselines for modest amounts of labeled data. These results indicate that ESR is a promising and complementary approach to model-transfer approaches for cross-lingual parsing.

Combined, these approaches have proven particularly effective for cross-lingual syntactic analysis, as demonstrated by UDify (Kondratyuk, 2019).
However, even with the improvements brought about by these techniques, transferred models still make syntactically implausible predictions on low-resource languages, and these error rates increase dramatically as the target languages become more distant from the source languages (He et al., 2019; Meng et al., 2019). In particular, transferred models often fail to match many low-order statistics concerning the typology of the task structures. We hypothesize that enforcing regularity with respect to estimates of these structural statistics, effectively using them as weak supervision, is complementary to current transfer approaches for low-resource cross-lingual parsing.
To this end, we introduce Expected Statistic Regularization (ESR), a novel differentiable loss that regularizes models on unlabeled target datasets by minimizing the deviation of descriptive statistics of model behavior from target values. The class of descriptive statistics usable by ESR is expressive and powerful. For example, they may describe cross-task interactions, encouraging the model to obey structural patterns that are not explicitly tractable in the model factorization. Additionally, the statistics may be derived from constraints dictated by the task formalism itself (such as ruling out invalid substructures) or by numerical parameters that are specific to the target dataset distribution (such as relative substructure frequencies). In the latter case, we also contribute a method for selecting those parameters using small amounts of labeled data, based on the bootstrap (Efron, 1979).
Although ESR is applicable to a variety of problems, we study it using modern cross-lingual syntactic analysis on the Universal Dependencies data, building off of the strong model-transfer framework of UDify (Kondratyuk, 2019). We show that ESR is complementary to transfer-based approaches for building parsers on low-resource languages. We present several interesting classes of statistics for the tasks and perform extensive experiments in both oracle unsupervised and realistic semi-supervised cross-lingual multi-task parsing scenarios, with particularly encouraging results that significantly outperform state-of-the-art approaches for semi-supervised scenarios. We also present ablations that justify key design choices.

Expected Statistic Regularization
We consider structured prediction in an abstract setting where we have inputs x ∈ X, output structures y ∈ Y, and a conditional model p_θ(y|x) ∈ P with parameters θ ∈ Θ, where P is the distribution space and Θ is the parameter space. In this section we assume that the setting is semi-supervised, with a small labeled dataset D_L and a large unlabeled dataset D_U; let D_L = {(x_i, y_i)}_{i=1}^{m} be the labeled dataset of size m and similarly define D_U = {x_i}_{i=m+1}^{m+n} as the unlabeled dataset. Our approach centers around a vectorized statistic function f that maps unlabeled samples and models to real vectors of dimension d_f:

f : D × P → R^{d_f},

where D is the set of unlabeled datasets of any size (i.e., D_U ∈ D). The purpose of f is to summarize various properties of the model using the sample. For example, if the task is part-of-speech tagging, one possible component of f could be the expected proportion of NOUN tags in the unlabeled data D_U. In addition to f, we assume that we are given a vector of target statistics t ∈ R^{d_f} and margins of uncertainty σ ∈ R^{d_f} as the supervision signal. We will discuss the details of f, t, and σ shortly, but first describe the overall objective.
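To make the NOUN-proportion example concrete, the sketch below computes such a statistic component from per-token tag marginals rather than hard predictions (in a real implementation these marginals keep the statistic differentiable in θ). The list-based encoding of logits and the tag index are illustrative assumptions, not the paper's implementation.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def expected_noun_proportion(tag_logits, noun_idx=0):
    """One possible component of f: the expected proportion of NOUN tags
    in a sample, computed from the model's per-token tag marginals.

    tag_logits: a list of sentences, each a list of per-token score
                vectors over the tag set (a hypothetical toy encoding).
    """
    noun_mass, total = 0.0, 0
    for sentence in tag_logits:
        for token_scores in sentence:
            noun_mass += softmax(token_scores)[noun_idx]  # marginal P(tag = NOUN)
            total += 1
    return noun_mass / total

# uniform scores over 3 tags: every token contributes P(NOUN) = 1/3
batch = [[[0.0, 0.0, 0.0]] * 4] * 2
print(expected_noun_proportion(batch))  # 0.3333...
```

During training, this expected proportion would be compared against a target value t with some uncertainty margin σ, as described next.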

Semi-Supervised Objective
Given labeled and unlabeled data D_L and D_U, we propose the following semi-supervised objective O, which breaks down into a sum of supervised and unsupervised terms L and C:

O(θ) = L(θ; D_L) + α C(θ; D_U),    (3)

where α > 0 is a balancing coefficient. The supervised objective L can be any suitable supervised loss; here we use the negative log-likelihood of the data under the model. Our contribution is the unsupervised objective C.
For C, we propose to minimize a distance function ℓ between the target statistics t and the value of the statistics f calculated using unlabeled data and the model p_θ. (ℓ will also take into account the uncertainty margins σ.) A simple objective would be:

C(θ) = ℓ(t, σ, f(D_U, p_θ)).

This is a dataset-level loss penalizing divergences from the target-level statistics. The problem with this approach is that it is not amenable to modern hardware constraints, which require SGD over mini-batches. Instead, we propose to optimize this loss in expectation over unlabeled mini-batch samples D_U^k, where k is the mini-batch size and D_U^k is sampled uniformly with replacement from D_U. Then, C is given by:

C(θ) = E_{D_U^k} [ ℓ(t, σ, f(D_U^k, p_θ)) ].    (4)

This objective penalizes the model if the statistic f, when applied to samples of unlabeled data D_U^k, deviates from the targets t, and thus pushes the model toward satisfying these target statistics.
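A single SGD-step estimate of this expectation can be sketched as below. The `statistic` callable stands in for f(D_U^k, p_θ), and a simple margin-aware absolute deviation stands in for the distance function that the text describes later; both names are assumptions for illustration.

```python
import random

def esr_minibatch_loss(unlabeled, statistic, targets, margins, k=8, rng=random):
    """One SGD-step estimate of C (Eq. 4): draw a minibatch of k unlabeled
    inputs uniformly with replacement, evaluate the statistic vector on it,
    and penalize deviation from the targets beyond the margins."""
    minibatch = [rng.choice(unlabeled) for _ in range(k)]
    stats = statistic(minibatch)  # plays the role of f(D_k, p_theta)
    return sum(max(0.0, abs(v - t) - s)
               for t, s, v in zip(targets, margins, stats))
```

In training, this value would be added to the supervised loss with weight α and backpropagated through the (differentiable) statistic.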
Importantly, the objective in Eq. 4 is more general than typical objectives in that the outer loss function does not necessarily break down into a sum over individual input examples; the aggregation over examples is done inside f:

ℓ(t, σ, f(D_U^k, p_θ))  vs.  (1/k) Σ_{x ∈ D_U^k} ℓ(t, σ, f({x}, p_θ)).    (5)

This generality is useful because components of f may describe statistics that aggregate over inputs, estimating expected quantities concerning sample-level regularities of the structures. In contrast, the right-hand side of Eq. 5 is more stringent, imposing that the statistic be the same for all instances of x. In practice, the batch-level loss reduces noise compared to a per-sentence loss, as is shown in Section 5.3.1.

The Statistic Function f
In principle, the vectorized statistic function f could be almost any function of the unlabeled data and model, provided it is possible to obtain its gradients w.r.t. the model parameters θ. However, in this work we assume f has the following three-layer structure.
First, let g be another vectorized function of "sub-statistics" that may have a different dimensionality than f and takes individual (x, y) pairs as input:

g : X × Y → R^{d_g}.

Then let ḡ be the expected value of g under the model p_θ, summed over the sample D_U:

ḡ(D_U, p_θ) = Σ_{x ∈ D_U} E_{p_θ(y|x)}[ g(x, y) ].    (7)

Given ḡ, let f's j'th component be the result of an aggregating function h_j : R^{d_g} → R applied to ḡ:

f_j(D_U, p_θ) = h_j( ḡ(D_U, p_θ) ).

The individual components g_i will mostly be counting functions that tally various substructures in the data. The ḡ_i's are then expected substructure counts in the sample, and the h_j's aggregate small subsets of these intermediate counts in different ways to compute various marginal probabilities. Again, in general f does not need to follow this structure, and any suitable statistic function can be incorporated into the regularization term proposed in Eq. 4.
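The three-layer structure above can be sketched generically as follows. Here `expected_g(x)` stands in for the vector E_{p_θ(y|x)}[g(x, y)] of expected substructure counts for one input, and `aggregators` is the list of functions h_j; the toy sub-statistics at the bottom are purely illustrative.

```python
def statistic_f(batch, expected_g, aggregators):
    """Three-layer statistic: per-input expected counts (layer 1-2), summed
    over the sample into g_bar, then aggregated into components f_j."""
    d_g = len(expected_g(batch[0]))
    g_bar = [0.0] * d_g
    for x in batch:
        for i, count in enumerate(expected_g(x)):
            g_bar[i] += count                  # expected counts summed over the sample
    return [h(g_bar) for h in aggregators]     # f_j = h_j(g_bar)

# toy sub-statistics: [token count, sentence count]; one aggregator turns the
# intermediate counts into a ratio, the way h_j computes marginal probabilities
g = lambda x: [float(len(x)), 1.0]
f = statistic_f(["abc", "de"], g, [lambda gb: gb[1] / gb[0]])
print(f)  # [0.4]
```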
In some cases, when the structure of g does not follow the model factorization either additively or multiplicatively, computation of the model expectation E_{p_θ(y|x)}[g(x, y)] in Eq. 7 is intractable. In these situations, standard Monte Carlo approximation breaks differentiability of the objective w.r.t. the model parameters θ and cannot be used. To remedy this, we propose to use the "Stochastic Softmax" differentiable sampling approximation from Paulus et al. (2020) to allow optimization of these functions. We propose several such statistics in the application (see Section 4.3).

The Distance Function
For the distance function ℓ, we propose to use a smoothed hinge loss (Girshick, 2015) that adapts with the margins σ. Letting f̂ = f(D_U^k, p_θ), the i'th component of ℓ is a smoothed penalty on the deviation |f̂_i − t_i|: quadratic within the margin σ_i and linear beyond it,

ℓ_i(t_i, σ_i, f̂_i) = (f̂_i − t_i)² / (2σ_i)  if |f̂_i − t_i| ≤ σ_i,  and  |f̂_i − t_i| − σ_i/2  otherwise.

The total loss is then the sum of its components:

ℓ(t, σ, f̂) = Σ_i ℓ_i(t_i, σ_i, f̂_i).

We choose this function because it is robust to outliers, adapts its width to the margin parameter σ_i, and expresses a preference for f̂_i = t_i (as opposed to max-margin losses). We give an ablation study in Section 5.3.2 justifying its use.
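A minimal sketch of such a component, reconstructed from the description (minimized exactly at f = t, quadratic within the margin, linear beyond it for robustness): the exact form in the paper may differ, and the σ = 0 fallback to a plain absolute error is an assumption for the degenerate-margin case.

```python
def smooth_hinge_component(t, sigma, f):
    """Margin-adaptive smoothed loss component (a Huber-style sketch of the
    smoothed hinge described in the text; the paper's exact form may differ)."""
    r = abs(f - t)
    if sigma == 0.0:
        return r                      # degenerate margin: plain absolute error
    if r <= sigma:
        return 0.5 * r * r / sigma    # smooth quadratic region of width sigma
    return r - 0.5 * sigma            # linear tail, robust to outliers

def distance(targets, margins, stats):
    # the total loss is the sum over components
    return sum(smooth_hinge_component(t, s, v)
               for t, s, v in zip(targets, margins, stats))
```

Unlike a max-margin hinge, this component is nonzero for any f ≠ t, so the model is always gently pulled toward the target rather than merely into the margin.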

Choosing the Targets and Margins
There are several possible approaches to choosing the targets t and margins σ, and in general they can differ based on the individual statistics. For some statistics it may be possible to specify the targets and margins using prior knowledge or formal constraints from the task. In other cases, estimating the targets and margins may be more difficult. Depending on the problem context, one may be able to estimate them from related tasks or domains (such as neighboring languages for cross-lingual parsing). Here, we propose a general method that estimates the statistics using labeled data and is applicable to semi-supervised scenarios where at least a small amount of labeled data is available.
The ideal targets are the expected statistics under the "true" model p*:

t* = E_{D^k} [ f(D^k, p*) ],

where k is the batch size. We can estimate this expectation using labeled data D_L and bootstrap sampling (Efron, 1979). Utilizing D_L as a set of point estimates for p*, we sample B total minibatches of k labeled examples uniformly with replacement from D_L and calculate the statistic f for each of these minibatch datasets. We then compute the target statistic as the sample mean:

t = (1/B) Σ_{b=1}^{B} f(D_L^{(b)}),    (11)

where we have slightly abused notation by writing f(D_L^{(b)}) to mean f computed using the inputs {x : (x, y) ∈ D_L^{(b)}} and the point estimates p*(y|x) = 1 for all (x, y) ∈ D_L^{(b)}.
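The bootstrap procedure for t (and for the standard-deviation margins σ described next) can be sketched as below; the toy NOUN-proportion statistic and the function names are illustrative assumptions.

```python
import random
import statistics

def bootstrap_targets(labeled, statistic, k=8, B=1000, seed=0):
    """Estimate targets t and margins sigma via the bootstrap (Efron, 1979):
    draw B minibatches of k labeled examples uniformly with replacement,
    compute the statistic on each (treating gold labels as point estimates
    of the true model), and return per-component means and std. deviations."""
    rng = random.Random(seed)
    samples = []
    for _ in range(B):
        minibatch = [rng.choice(labeled) for _ in range(k)]
        samples.append(statistic(minibatch))
    d = len(samples[0])
    t = [statistics.fmean(s[j] for s in samples) for j in range(d)]
    sigma = [statistics.pstdev(s[j] for s in samples) for j in range(d)]
    return t, sigma

# toy statistic: proportion of NOUN tags in gold-annotated sentences
noun_prop = lambda batch: [sum(s.count("NOUN") for s in batch)
                           / sum(len(s) for s in batch)]
t, sigma = bootstrap_targets([["NOUN", "VERB"], ["NOUN"]], noun_prop, k=4, B=200)
```

Statistics whose bootstrap distribution is wide receive a large σ, so in the loss they contribute weakly, exactly the adaptive behavior described below.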
In addition to estimating the target statistics for small batch sizes, the bootstrap gives us a way to estimate the natural variation of the statistics for small sample sizes. To this end, we propose to utilize the standard deviations of the bootstrap samples as our margins of uncertainty σ:

σ = sqrt( (1/B) Σ_{b=1}^{B} ( f(D_L^{(b)}) − t )² ),    (12)

with the square and square root taken elementwise. This allows our loss function to adapt to more or less certain statistics. If some statistics are naturally too variable to serve as effective supervision, they will automatically have a weak contribution to ℓ and little impact on model training.

Now that we have described our general approach, in this section we lay out a proposal for applying it to cross-lingual joint POS tagging and dependency parsing. We choose this problem because it is an ideal testbed for controlled experiments in semi-supervised structured prediction. By their nature, the parsing tasks admit many types of interesting statistics that capture cross-task, universal, and language-specific facts about the target test distributions.
We evaluate in two different transfer settings: oracle unsupervised and realistic semi-supervised. In the oracle unsupervised setting, there is no supervised training data available for the target languages (and the L term is dropped from Eq. 3), but we use target values and margins calculated from the held-out supervised data. This setting allows us to understand the impact of our regularizer in isolation, without the confounding effects of direct supervision or inaccurate targets. In the semi-supervised experiments, we vary the amounts of supervised data and calculate the targets from the small supervised data samples. This is a realistic application of our approach that may be applied to low-resource learning scenarios.

Problem Setup and Data
We use the Universal Dependencies (Nivre, 2020) v2.8 (UD) corpus as data. In UD, syntactic annotation is formulated as a labeled bilexical dependency tree connecting the words in a sentence, with an additional part-of-speech (POS) tag annotated for each word. The labeled tree can be broken down into two parts: the arcs that connect head words to child words, forming a tree, and the dependency labels assigned to each of those arcs. By the definition of UD syntax, each word is the child of exactly one arc, and so both the attachments and labels can be written as sequences that align with the words in the sentence.
More formally, for each labeled sentence x_{1:n} of length n, the full structure y is given by the three sequences y = (t_{1:n}, e_{1:n}, r_{1:n}), where t_i ∈ T are the POS tags, e_i ∈ {1, ..., n} are the head attachments, and r_i ∈ R are the dependency labels.
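The three aligned sequences can be pictured with a minimal sketch; the class name and the 0-for-root head convention are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Parse:
    """The structure y = (t_{1:n}, e_{1:n}, r_{1:n}) for a sentence of
    length n, as three sequences aligned with the words: POS tags, head
    attachments, and dependency labels. (0 marks the root's head here,
    a common convention; the exact encoding may differ.)"""
    tags: List[str]     # t_i in T
    heads: List[int]    # e_i: index of the head word
    labels: List[str]   # r_i in R

# "dogs bark": word 1 attaches to word 2 as nsubj; word 2 is the root
y = Parse(tags=["NOUN", "VERB"], heads=[2, 0], labels=["nsubj", "root"])
assert len(y.tags) == len(y.heads) == len(y.labels)  # one entry per word
```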

The Model and Training
We now turn to the parsing model that is used as the basis for our approach. Though the general ideas of our approach are adaptable to other models, we choose the UDify architecture because it is one of the state-of-the-art multilingual parsers for UD.

The UDify Model
The UDify model is based on trends in state-of-the-art parsing, combining a multilingual pretrained transformer language model encoder (mBERT) with a deep biaffine arc-factored parsing decoder, following Dozat and Manning (2017). These encodings are additionally used to predict POS tags with a separate decoder. The full details are given in Kondratyuk (2019), but here it suffices to characterize the parser by its top-level probabilistic factorization:

p_θ(y|x) = Π_{i=1}^{n} p_θ(t_i | x) p_θ(e_i | x) p_θ(r_i | e_i, x).

This model is scant on explicit joint factors, following recent trends in structured prediction that forgo higher-arity factors, instead opting for shared underlying contextual representations produced by an mBERT encoder that implicitly contain information about the sentence and structure as a whole. This factorization will prove useful in Section 4.3, where it will allow us to compute many of the supervision statistics under the model exactly.

Training
The UDify approach to training is simple: it begins with a multilingual PLM, mBERT, then fine-tunes the parsing architecture on the concatenation of the source languages. With vanilla UDify, transfer to target languages is zero-shot.
Our approach begins with these two training steps from UDify, then adds a third: adapting to the target language using the target statistics and possibly small amounts of supervised data (Eq. 3).

Typological Statistics as Supervision
We now discuss a series of statistics that we will use as weak supervision. Most of the proposed statistics describe various probabilities for different (but related) grammatical substructures and can ultimately be broken down into ratios of "count" functions (sums of indicators), which tally various types of events in the data. We propose statistics that cover surface-level (POS-only), single-arc, two-arc, and single-head substructures, as well as conditional variants. Due to space constraints, we omit their mathematical descriptions.
Surface Level: One simple set of descriptive statistics are the unigram and bigram distributions over POS tags. POS unigrams can capture some basic relative frequencies, such as our expectation that nouns and verbs are common to all languages. POS bigrams allow us to capture simple word-order preferences.

Single-Arc: This next set of statistical families captures information about various choices in single-arc substructures. A single-arc substructure carries up to 5 pieces of information: the arc's direction, label, and distance, as well as the tags for the head and child words. Various subsets of these capture differing forms of regularity, such as "the probability of seeing tag t_h head an arc with label r in direction d."

Universally Impossible Arcs: In addition to many single-arc variants, we also consider the specific subset of (head tag, label, child tag) single-arc triples that are never seen in any UD data. These combinations correspond to impossible arrangements that do not "type-check" within the UD formalism and are interesting in that they could in principle be specified by a linguist without any labeled data whatsoever. As such, they represent a particularly attractive use case of our approach, where a domain expert could rule out all invalid substructures dictated by the task formalism without the model having to learn them implicitly from training data. With complex structures, this can be a large proportion of the possibilities: in UD we can rule out 93.2% (9,966/10,693) of the combinations.

Two-Arc: We also consider substructures spanning two connected arcs in the tree. They may be useful because they cover many important typological phenomena, such as subject-object-verb ordering. They have also been known to be strong features in higher-order parsing models, such as the parser of Carreras (2007), but are known to be intractable in non-projective parsers (McDonald and Pereira, 2006).
Following McDonald and Pereira (2006), we distinguish between two different patterns of neighboring arcs: siblings and grandchildren. Sibling arc pairs consist of two arcs that share a single head word, while grandchild arc pairs share an intermediate word that is the child of one arc and the head of the other.

Head-Valency: One interesting statistic that does not fall into the other categories is the valency of a particular head tag, i.e., the count of outgoing arcs headed by some tag. We convert this into a probability by using a binning function that allows us to quantify the "probability that some tag heads between a and b children". Like the two-arc statistics, expected valency statistics are intractable under the model and we must approximate their computation.

Conditional Variants: Further, each of these statistics can be described in conditional terms, as opposed to their full joint realizations. To do this, we simply divide the joint counts by the counts of the conditioned-upon sub-events. Conditional variants may be useful because they do not express preferences for the probabilities of the sub-events on the right side of the conditioning bar, which may be hard to estimate.

Average Entropy: In addition to the relative frequency statistics proposed above, we also include average per-token, per-edge, and MST tree entropies as additional regularization statistics that are always used. Though we do not show it here, each of these functions may be formulated as a statistic within our approach. The inclusion of these statistics amounts to a form of entropy regularization (Grandvalet and Bengio, 2004) that keeps the model from optimizing the other ESR constraints with degenerate constant predictions (Mann and McCallum, 2010).
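As a concrete illustration of the Universally Impossible Arcs statistic above: since these triples have targets t = 0 with margins σ = 0, the regularizer reduces to directly penalizing any expected mass the model assigns to them. The dictionary encoding and the example triples below are hypothetical, not drawn from the paper's actual list of 9,966 banned combinations.

```python
def impossible_arc_mass(expected_arc_counts, impossible_triples):
    """Expected total count of universally impossible (head tag, label,
    child tag) arcs under the model; with t = 0 and sigma = 0, this mass
    is penalized directly by the loss.

    expected_arc_counts: dict mapping triples to expected counts under p_theta
    impossible_triples:  set of triples never observed in any UD treebank
    """
    return sum(count for triple, count in expected_arc_counts.items()
               if triple in impossible_triples)

marginals = {("PUNCT", "nsubj", "PUNCT"): 0.3,   # hypothetical ill-typed arc
             ("VERB", "nsubj", "NOUN"): 1.7}     # a well-formed arc
banned = {("PUNCT", "nsubj", "PUNCT")}           # stand-in for the banned set
print(impossible_arc_mass(marginals, banned))    # 0.3
```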

Oracle Unsupervised Experiments
We begin with oracle unsupervised transfer experiments that evaluate the potential of many types of statistics and some ablations. In this setting, we do not assume any labeled data in the target language, but do assume accurate target statistics and margins, calculated from held-out training data using the method of Section 3. This allows us to study the potential of our proposed ESR regularization term C on its own, without the confounds of supervised data or inaccurate targets.

Experimental Setup
Next we describe setup details for the experiments. These settings additionally apply to the rest of the experiments unless otherwise stated.

Datasets
In all experiments, the models are first initialized from mBERT, then trained using the UDify code (Kondratyuk, 2019) on 13 diverse treebanks, following Kulmizev et al. (2019) and Ustun et al. (2020). This model, further referred to as UDPRE, is used as the foundation for all approaches.
As discussed in Kulmizev et al. (2019), these 13 training treebanks were selected to give a diverse sample of languages, taking into account factors such as language families, scripts, morphological complexity, and annotation quality.
We evaluate all proposed methods on 5 held-out languages, similarly selected for diversity in language typologies, but with the additional factor of the transfer performance of the UDPRE baseline. A summary table of these training and evaluation treebanks is given in Table 1.

Approaches
We compare our approach to two strong baselines in all experiments, based on recent advances in the literature for cross-lingual parsing.These baselines are implemented in our code so that we may fairly compare them in all of our experiments.
• UDPRE: The first baseline is the UDify (Kondratyuk, 2019) model-transfer approach. Multilingual model-transfer alone is currently one of the state-of-the-art approaches to cross-lingual parsing and is a strong baseline in its own right.
• UDPRE-PPT: We also apply the Parsimonious Parser Transfer (PPT) approach from Kurniawan et al. (2021). PPT is a nuanced self-training approach, extending Täckström et al. (2013), that encourages the model to concentrate its mass on its most likely predicted parses for the target treebank. We use their loss implementation, but apply it to our UDPRE base model (instead of their weaker base model) for a fair comparison, so this approach combines UDify with PPT.
In individual experiments we will specify the statistics used for regularization.

Training and Evaluation Details
For metrics, we report accuracy for POS tagging, coarse-grained labeled attachment score (LAS) for dependency trees, and their average as a single summary score. The metrics are computed using the official CoNLL-18 evaluation script. For all scenarios, we use early stopping for model selection, measuring the POS-LAS average on the specified development sets. We tune learning rates and α for each proposed loss variant at the beginning of the first experiment with a low-budget grid search, using the settings that achieve the best validation metric on average across the 5 language validation sets for all remaining experiments with that variant. We find generally that a base learning rate of 2 × 10^-5 and α = 0.01 work well for all variants of our method. We train all models using AdamW (Loshchilov and Hutter, 2019) on a slanted triangular learning rate schedule (Devlin et al., 2019) with 500 warmup steps. Also, since the datasets vary in size, we normalize the training schedule to 25 epochs at 1000 steps per epoch. We use a batch size of 8 sentences for training and for estimating statistic targets. When bootstrapping estimates for t and σ we use B = 1000 samples.

Assessing the Proposed Statistics
In this experiment we evaluate 32 types of statistics from Section 4.3 for transfer of the UDPRE model (pretrained on 13 languages) to the target languages. The purpose of this experiment is to get a sense of the effectiveness of each statistic for improving model-based cross-lingual transfer. To prevent overfitting to the test sets for later experiments, all metrics for this experiment are calculated on the development sets.

Results: The results of the experiment are presented in Table 2, ranked from best to worst. Due to space constraints, we only show the top 10 statistics in addition to the Universal-Arc statistic. Generally we find that all of the 32 proposed statistics improve upon the UDPRE and UDPRE-PPT models on average, with many exhibiting large boosts. The best performing statistic concerns (Child Tag, Label, Direction) substructures, yielding an average improvement of +7.0 POS. Another interesting result is that several of the intractable two-arc statistics were among the best statistics overall, indicating that the use of the differentiable SST approximation does not preclude the applicability of intractable statistics. For example, the directed grandchild statistic of co-occurrences of incoming and outgoing edges for certain tags was the second highest performing, with an average improvement of +7.0 POS accuracy and +8.5 LAS (21.3% average error rate reduction).

Results for the conditional variants (not shown) were less positive. Generally, conditional variants were worse than their full joint counterparts (e.g., "Child | Label" and "Label | Child" are worse than "Child, Label"), performing worse in 15/16 cases. This makes sense, as we are using accurate statistics and full joints are strictly more expressive.
This experiment gives a broad but shallow view into the effectiveness of the various proposed statistics. In the rest of the experiments, we evaluate the following two variants in more depth:

1. ESR-CLD, which supervises target proportions for (Child Tag, Label, Direction) triples. This is the "Child, Label" row in Table 2.
2. ESR-UNIARC, which supervises the 9,966 universally impossible (Head Tag, Child Tag, Label) arcs that do not require labeled data to estimate. All of these combinations have target values of t = 0 and margins σ = 0. This is the "Universal Arc" row in Table 2.
We choose these two because ESR-CLD is the best performing statistic overall and ESR-UNIARC is unique in that it does not require labeled data to estimate; we do not evaluate the others because of cost considerations.

Ablation Studies
Next, we perform two ablation experiments to evaluate key design choices of the proposed approach. First, we evaluate the use of batch-level aggregation in the statistics before the loss, versus the more standard approach of loss-per-sentence. In the second, we evaluate the proposed form of ℓ. We compare the two aggregation variants using the CLD (Child Tag, Label, Direction) statistic (ESR-CLD). We report test set results averaged over all 5 languages. We use the same hyperparameters selected in Section 5.2.

Batch-level Loss Ablation
In this ablation, we evaluate a key feature of our proposal: the aggregation of the statistic over the batch before loss computation (Eq. 5) versus the more standard approach, which is to apply the loss per sentence. The former, "Loss per batch", has the form:

ℓ(t, σ, f(D_U^k, p_θ)),

while the latter, "Loss per sentence", has the form:

(1/k) Σ_{x ∈ D_U^k} ℓ(t, σ, f({x}, p_θ)).

The significance of this difference is that "Loss per batch" allows the variation in individual sentences to somewhat average out and hence is less noisy, while "Loss per sentence" requires that each sentence individually satisfy the targets.
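The two variants can be contrasted with a toy sketch. The gold-tagged toy sentences, the proportion statistic, and the margin-aware absolute-deviation distance are illustrative assumptions; the point is that a batch whose aggregate matches the target can still incur a large per-sentence loss.

```python
def loss_per_batch(batch, statistic, dist, t, sigma):
    # aggregate the statistic over the whole minibatch, then apply the loss once
    return dist(t, sigma, statistic(batch))

def loss_per_sentence(batch, statistic, dist, t, sigma):
    # apply the loss to each sentence's own statistic and average: every
    # sentence must individually match the targets, which is noisier
    return sum(dist(t, sigma, statistic([x])) for x in batch) / len(batch)

# toy statistic: proportion of NOUN tokens in gold-tagged sentences
stat = lambda b: sum(s.count("NOUN") for s in b) / sum(len(s) for s in b)
dist = lambda t, s, v: max(0.0, abs(v - t) - s)
batch = [["NOUN"], ["VERB"]]  # per-sentence proportions are 1.0 and 0.0

print(loss_per_batch(batch, stat, dist, 0.5, 0.0))     # 0.0: the batch matches
print(loss_per_sentence(batch, stat, dist, 0.5, 0.0))  # 0.5: each sentence deviates
```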

Results:
The results are presented in Table 3.
From the table we can see that "Loss per batch" achieves an average POS of 79.9 and an average LAS of 60.4, compared to "Loss per sentence" with an average POS of 77.1 and LAS of 58.5, amounting to +2.8 POS and +1.9 LAS improvements. This indicates that applying the loss at the batch level confers an advantage over applying it per sentence.

Smooth Hinge-Loss Ablation
Next, we evaluate the efficacy of the proposed smoothed hinge-loss distance function ℓ. We compare to using just L1 or L2, uninterpolated and with no margin parameters (σ = 0). We also compare to "Hard L1", the max-margin hinge ℓ(t, σ, x) = max{0, |t − x| − σ}. We use the same experimental setup as the previous ablation.

Results:
The results are presented in Table 4.
From the table we can see that the Smooth L1 loss outperforms the other variants.

Realistic Semi-Supervised Experiments
The previous experiments considered an unsupervised transfer scenario without labeled data. In these next experiments we turn to a realistic semi-supervised application of our approach, where we have access to limited labeled data for the target treebank.

Learning Curves
In this experiment we present learning curves for the approaches, varying the amount of labeled data |D_L^train| ∈ {…, 100, 500, 1000}. To make the experiments realistic, we calculate the target statistics t and margins σ from the small subsampled labeled training datasets using Eqs. 11 and 12.
We study two distinct settings. First, we study the multi-source domain-adaptation transfer setting, UDPRE. Second, we study our approach in a more standard semi-supervised scenario where we cannot utilize intermediate on-task pretraining and domain-adaptation, instead learning on the target dataset starting "from scratch" from the pretrained PLM (MBERT).
We use the same baselines as before, but augment each with a supervised fine-tuning loss on the supervised data, in addition to any unsupervised losses. We refer to these models as UDPRE-FT, UDPRE-FT-PPT, and UDPRE-FT-ESR. That is, models with FT in the name have some supervised fine-tuning in the target language.
In these experiments, we subsample the labeled training data 3 times for each setting. We report averages over all 5 languages, with 3 supervised subsample runs each, for a total of 15 runs per method and dataset size. We also use subsampled development sets so that model selection is more realistic. We use the same hyperparameters as before, except we use 40 epochs with 200 steps per epoch as the training schedule, mixing supervised and unsupervised data at a rate of 1:4.

UDPRE Transfer
In this experiment, we evaluate in the multilingual transfer scenario by initializing from UDPRE. In addition to the two chosen realistic ESR variants, we also experiment with an "oracle" version of ESR-CLD, called ESR-CLD*, that uses target statistics estimated from the full training data. This allows us to see if small-sample estimates cause a degradation in performance compared to accurate large-sample estimates.

Results: Learning curves for the different approaches, averaged over all 3 runs for all 5 languages (15 total), are given in Figure 1.

ESR-UNIARC is much more effective in conjunction with fine-tuning. Compared to the unsupervised experiment in Section 5.2, where it ranked 25/32, the ESR-UNIARC statistic is much more competitive with the more detailed ESR-CLD statistic. One potential explanation is that without labeled data (as in Section 5.2) the ESR-UNIARC statistic is under-specified (the 727 allowed arcs are all free to take any value), whereas the inclusion of some labeled data in this experiment fills this gap by implicitly indicating target proportions for the allowed arcs. This suggests that an approach combining UniArc constraints with elements of self-training (like PPT) that supervise the "free" non-zero combinations could be a useful approach to zero-shot transfer. However, we leave this to future work.
Small-data estimates for ESR-CLD are as good as accurate estimates. Comparing ESR-CLD to the unrealistic ESR-CLD*, we find no significant difference between the two, indicating that, at least for the CLD statistic, using target estimates from small samples is as good as using large-sample estimates. This may be due in part to the margin estimates σ, which are wider for the small samples and somewhat mitigate their inaccuracies.
PPT adds little benefit to fine-tuning. Relative to UDPRE-FT, the UDPRE-FT-PPT baseline does not yield much gain, with a maximum average improvement of +0.3 POS and +0.7 LAS over all dataset sizes. This indicates that fine-tuning and PPT-style self-training may be redundant.

MBERT Transfer
In this experiment, we consider a counterfactual setting: what if the UD data were not a massively multilingual dataset where we can utilize multilingual model-transfer, and instead were an isolated dataset with no related data to transfer from? This situation reflects the more standard semi-supervised learning setting, where we are given a new task, some labeled and unlabeled data, and must build a model "from scratch" on that data.
For this experiment, we repeat the learning curve setting from Section 6.1.1, but initialize our model directly with MBERT, skipping the intermediate UDPRE training.

Results: Learning curves for the different approaches, averaged over all 3 runs for all 5 languages (15 total), are given in Figure 2. The results from this experiment are encouraging: ESR has even greater benefits when fine-tuning directly from MBERT than in the previous experiment, indicating that our general approach may be even more useful outside of domain-adaptation conditions.

Low-Resource Transfer
In previous experiments, we limited the number of evaluation treebanks to 5 to allow for variation in other dimensions (i.e., constraint types, loss types, differing amounts of labeled data). In this experiment, we expand the number of treebanks and evaluate transfer performance in a low-resource setting with only |D train L| = 50 labeled sentences in the target treebank, comparing UDPRE, UDPRE-FT, and UDPRE-FT-ESR-CLD. As before, we subsample 3 small datasets per treebank and calculate the target statistics t and margins σ from these to make transfer results realistic.
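The recipe above for estimating a target t and margin σ from a small labeled sample follows the bootstrap (Efron, 1979) mentioned in the introduction. A minimal sketch of one way to do this (the function name and the choice of the bootstrap standard deviation as σ are our assumptions, not necessarily the authors' exact procedure):

```python
import random
import statistics

def bootstrap_target(values, n_boot=1000, seed=0):
    """Estimate a target statistic t and margin sigma from a small
    labeled sample via the bootstrap.

    `values` holds the per-sentence statistic computed on the ~50
    labeled sentences; t is the sample mean and sigma the standard
    deviation of bootstrap-resampled means, which naturally widens
    for smaller or noisier samples.
    """
    rng = random.Random(seed)
    t = statistics.mean(values)
    boot_means = [
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]
    sigma = statistics.stdev(boot_means)
    return t, sigma
```

Because σ scales with sampling variability, targets estimated from tiny samples get correspondingly looser margins, which is consistent with the earlier observation that small-sample estimates performed as well as large-sample ones.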
We select evaluation treebanks according to the following criteria: for each unique language in UD v2.8 that is not one of the 13 training languages, we select the largest treebank and keep it if it has at least 250 training sentences and a development set, so that we can get reasonable variability in the subsamples. This process yields 44 diverse evaluation treebanks.

Results: The results of this experiment are given in Table 5. From the table we can see that our approach (UDPRE-FT-ESR-CLD) outperformed supervised fine-tuning (UDPRE-FT) in many cases, often by a large margin. On average, UDPRE-FT-ESR-CLD outperformed UDPRE-FT by +2.6 POS and +2.3 LAS across the 44 languages. Further, UDPRE-FT-ESR-CLD outperformed zero-shot transfer (UDPRE) by +10.0 POS and +14.7 LAS on average. Interestingly, we found several cases of large performance gains and no cases of large performance declines. For example, ESR improved LAS scores by +17.3 for Wolof, +16.8 for Maltese, and +12.5 for Scottish Gaelic, and 9/44 languages saw LAS improvements ≥ +5.0, while the largest decline was only −2.5. Additionally, ESR improved POS scores by +20.9 for Naija and +11.2 for Welsh, and 9/44 languages saw POS improvements ≥ +5.0.
The cases of performance decline for LAS merit further analysis. Of the 20 languages with negative ∆ LAS, 18 are modern languages spoken in continental Europe (mostly Slavic and Romance), while only 5 of the 24 languages with positive ∆ LAS meet this criterion. We hypothesize that this tendency is due to the training data used for pretraining MBERT, which was heavily skewed towards this category (Devlin et al., 2019). This suggests that ESR is particularly helpful when transferring to domains that are underrepresented in pretraining.

Related Work
Related work generally falls into two categories: weak supervision and cross-lingual transfer.
Our work can be seen as an extension of GEC to more expressive expectations and to modern minibatch SGD training. Two more recent works touch on these ideas, but both have significant downsides compared to our approach. Meng et al. (2019) use a PR approach inspired by Ganchev and Das (2013) for cross-lingual parsing, but must use very simple constraints and require a slow inference procedure that can only be used at test time. Noach and Goldberg (2019) utilize GEC with minibatch training, but focus on using related tasks to compute simpler constraints and do not adapt their targets to small batch sizes.
Cross-Lingual Transfer: Earlier trends in cross-lingual transfer for parsing used delexicalization (Zeman and Resnik, 2008; McDonald et al., 2011; Täckström et al., 2013) and then aligned multilingual word-vector-based approaches (Guo et al., 2015; Ammar et al., 2016; Rasooli and Collins, 2017; Ahmad et al., 2019). With the rapid rise of language-model pretraining (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019b), recent research has focused on multilingual PLMs and multitask fine-tuning to achieve generalization in transfer. Wu and Dredze (2019) showed that a multilingual PLM afforded surprisingly effective cross-lingual transfer using only English as the fine-tuning language. Kondratyuk (2019) extended this approach by fine-tuning a PLM on the concatenation of all treebanks. Tran and Bisazza (2019), however, show that transfer to distant languages benefits less.

Conclusion
We have presented Expected Statistic Regularization, a general approach to weak supervision for structured prediction, and studied it in the context of modern cross-lingual multi-task syntactic parsing. We evaluated a wide range of expressive structural statistics in idealized and realistic transfer scenarios and have shown that the proposed approach is effective and complementary to the state-of-the-art model-transfer approaches.

Table 5: Low-Resource Semi-Supervised Transfer Results. Transfer results for 44 unseen test languages using 50 labeled sentences in the target language, averaged over 3 subsampled datasets. "FT" refers to the UDPRE-FT fine-tuning baseline, "ESR" refers to our UDPRE-ESR-CLD approach, and ∆ refers to the absolute difference of ESR minus FT. Best-performing methods are bolded. Results are ordered from best to worst ∆ LAS.
For development sets we subsample the data to a size of |D dev L| = min(100, |D train L|), which reflects a 50/50 train/dev split until |D L| ≥ 200, at which point we maximize training data and only hold out 100 sentences for validation.
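The split rule above can be written as a one-line function (a sketch; the function and variable names are ours):

```python
def dev_size(n_train_labeled):
    """Development set size rule: |D_dev| = min(100, |D_train|).

    Gives a 50/50 train/dev split while the labeled pool is small,
    then caps the held-out set at 100 sentences.
    """
    return min(100, n_train_labeled)
```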
From the figure we can discern several encouraging results.

ESR-CLD and ESR-UNIARC add significant benefit to fine-tuning for small data. Both variants significantly outperform the baselines at 50 and 100 labeled examples. For example, relative to UDPRE-FT, the ESR-CLD model yielded gains of +2.0 POS and +3.6 LAS at 50 examples, and +1.8 POS and +3.2 LAS at 100 labeled examples. At 500 and 1000 examples, however, we begin to see diminishing benefits of ESR on top of fine-tuning.

Figure 1: Multi-Source UDPRE Transfer Learning Curves. Baseline approaches are dotted, while ESR variants are solid. All curves show the average of 15 runs across 5 different languages, with 3 randomly sampled labeled datasets per language. The plots indicate a significant advantage of ESR over the baselines in low-data regions.

Table 1: Training and evaluation treebank details. The final column shows UDPRE test set performance after UDify training (evaluation treebank performance is zero-shot). (*): downsampled to the same 15k sentences as Ustun et al. (2020) to reduce training time and balance the data.

Table 2: Unsupervised Oracle Statistic Variant Results. (Top): Baseline methods that do not use ESR. (Bottom): Various statistics used by ESR as an unsupervised loss on top of UDPRE. Scores are measured on target treebank development sets. Bold names mark statistics used in later experiments. (*): All statistics with * are intractable and utilize the SST relaxation of Paulus et al. (2020). (†): All statistics with † also include directional information.

Table 3: Loss Aggregation Ablation Results. Loss per batch outperforms loss per sentence for both POS and LAS on average.

Table 4: Loss Function Ablation Results. The Smooth L1 loss outperforms the other, simpler loss variants for both POS and LAS, averaged over 5 languages.