Joint Universal Syntactic and Semantic Parsing

While numerous attempts have been made to jointly parse syntax and semantics, high performance in one domain typically comes at the price of performance in the other. This trade-off contradicts the large body of research focusing on the rich interactions at the syntax-semantics interface. We explore multiple model architectures which allow us to exploit the rich syntactic and semantic annotations contained in the Universal Decompositional Semantics (UDS) dataset, jointly parsing Universal Dependencies and UDS to obtain state-of-the-art results in both formalisms. We analyze the behaviour of a joint model of syntax and semantics, finding patterns supported by linguistic theory at the syntax-semantics interface. We then investigate to what degree joint modeling generalizes to a multilingual setting, where we find similar trends across 8 languages.


Introduction
Given their natural expression in terms of hierarchical structures and their well-studied interactions, syntax and semantics have long been treated as parsing tasks, both independently and jointly. One would expect joint models to outperform separate or pipelined ones; however, many previous attempts have yielded mixed results, finding that while one level can be used as an additional signal to benefit the other (Swayamdipta et al., 2017, 2018; Johansson and Nugues, 2008), obtaining high performance in both syntax and semantics simultaneously is difficult (Krishnamurthy and Mitchell, 2014; Hajič et al., 2009).
A variety of tree- and graph-based representations have been devised for representing syntactic structure (e.g. varieties of constituency and dependency parse trees) as well as semantic structure, e.g. Abstract Meaning Representation (AMR; Banarescu et al., 2013), Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013), and Semantic Dependency Parsing formalisms (SDP; Oepen et al., 2014a,b, 2016). These semantic representations have varying degrees of abstraction from the input and syntax, ranging from being directly tied to the input tokens (e.g. SDP formalisms) to being heavily abstracted away from it (e.g. AMR, UCCA). Universal Decompositional Semantics (UDS; White et al., 2020) falls between these extremes, with a semantic graph that is closely tied to the syntax while not being constrained to match the input tokens. Crucially, UDS graphs not only represent the predicate-argument relationships in the input, but also host scalar-valued crowdsourced annotations encoding a variety of semantic inferences, described in §3. These provide another level for linguistic analysis and make UDS unique among meaning representations. Furthermore, as UDS graphs build on Universal Dependency (UD) parses, UDS is naturally positioned to take full advantage of the extensive and linguistically diverse set of UD annotations.
We extend the transductive sequence-to-graph UDS parser proposed by Stengel-Eskin et al. (2020) to simultaneously perform Universal Dependencies (UD) and UDS parsing, finding that joint modeling offers concomitant benefits to both tasks. In particular, after exploring several multitask learning objectives and model architectures for integrating syntactic and semantic information, we obtain our best overall results with an encoder-decoder semantic parsing model, where the encoder is shared with a biaffine syntactic parser (Dozat and Manning, 2017). Because the UDS dataset is annotated on an existing UD corpus, our experiments isolate the effect of adding a semantic signal without the confound of additional data: all our monolingual systems are trained on the same set of sentences.
In contrast to previous work on joint syntax-semantics parsing, we are able to achieve high performance in both domains with a single, unified model; this is particularly salient for UDS parsing, as the UD parse is a central part of a complete UDS analysis. Our best joint model's UDS performance beats the previous best by a large margin, while also yielding SOTA scores on a semantically valid subset of the English Web Treebank (EWT). We also introduce a model optimized for UD, which obtains competitive UDS performance while matching the current SOTA UD parser on the whole of EWT. 1 We analyze these objectives and architectures with LSTM encoders and decoders as well as novel Transformer-based sequence-to-graph architectures, which outperform the LSTM variants. While previous work suggests that contextualized encoders largely obviate the need for an explicit syntactic signal in semantic tasks (Swayamdipta et al., 2019; Glavaš and Vulić, 2020), we find that syntactic (and semantic) annotations provide consistent performance gains even when such encoders are used. This suggests that the UD and UDS signals are complementary to the signal encoded by a pretrained encoder; accordingly, we tune the encoder at various depths, further improving performance.
Building on this result, we leverage the shared multilingual representation space of XLM-R (Conneau et al., 2020) to examine UD parsing in 8 languages across 5 families and varying typological settings, where we demonstrate a cross-lingual benefit of UDS parsing on UD parsing.

Background and Related Work
In both language production and comprehension, syntax and semantics play complementary roles. Their close relationship has also been noted in language acquisition research, with the "semantic bootstrapping" hypothesis proposing that infants use semantic role information as an inductive bias for acquiring syntax (Pinker, 1979, 1984), 2 while Landau and Gleitman (1985), Gleitman (1990), and Naigles (1990) present evidence that infants use syntactic information when acquiring novel word meanings. Their connection was codified in Montague (1970)'s seminal work formalizing the link between syntactic structures and formal semantics. Broadly, their interactions can be split into "bottom-up" constraints on the semantics of an utterance from its syntax, and "top-down" constraints on the syntax from the semantics. Despite their close empirical and theoretical ties, work on predicting syntactic and semantic structures jointly has often struggled to attain high performance in one domain without compromising performance in the other.
CCG-based parsing Following in the Montagovian tradition, several computational formalisms and models have focused on the syntax-semantics interface, including Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994) and Combinatory Categorial Grammar (CCG; Steedman, 2000). In particular, CCG syntactic types can be paired with functional semantic types (e.g. λ-calculus strings) to compositionally construct logical forms. Krishnamurthy and Mitchell (2014) model this process with a linear model over both syntactic derivations and logical forms, trained with a discriminative objective that combines direct syntactic supervision with distant supervision from the logical form, finding that while joint modeling is feasible, it slightly lowers syntactic performance. By way of contrast, Lewis et al. (2015) find that a joint CCG and Semantic Role Labelling (SRL) dependency parser outperforms a pipeline baseline. The semantic signal can also be used to induce syntax without syntactic supervision.
AMR parsing CCG approaches have also been applied to semantics-only AMR parsing (Artzi et al., 2015; Misra and Artzi, 2016; Beschke, 2019). Jointly modeling AMR and syntax with a latent variable model that induces a soft syntactic structure has been shown to yield slight improvements over semantics-only models in low-resource settings.
SRL parsing SRL dependency parsing, the task of labeling an utterance's predicates and their respective arguments in a possibly non-projective directed acyclic graph (DAG), is more akin to the UDS parsing task than CCG parsing and has an equally robust foundation of empirical results, having been the focus of several CoNLL shared tasks, most relevantly the 2008 and 2009 shared tasks on joint syntactic and SRL dependency parsing (Surdeanu et al., 2008; Hajič et al., 2009). The upshot of these tasks was that a joint syntactic and semantic analysis could provide benefits over a separated system (Johansson and Nugues, 2008), but that in a multilingual setting, SRL-only systems slightly outperformed joint systems on average (Hajič et al., 2009). While the systems presented in these challenges used hand-crafted features, Swayamdipta et al. (2016) replicated many of their results in a neural setting. For SRL tagging, Strubell et al. (2018) introduce an end-to-end neural model that also uses UD parsing as an intermediate multitask objective, akin to our intermediate model.
Syntactic scaffolds Like our models, work on syntactic scaffolds introduces a multitask learning (Caruana, 1997) framework, where a syntactic auxiliary task is introduced for the benefit of a semantic task; in contrast to the systems presented here, the syntactic task is treated as a purely auxiliary signal, with model evaluation coming solely from the semantic task. Swayamdipta et al. (2017) first introduce the notion of a syntactic scaffold for frame-semantic parsing, where a lightweight syntactic task (constituent labeling) is used as an auxiliary signal in a multitask learning setup to the benefit of the semantic task. Swayamdipta et al. (2018) introduce a similar syntactic scaffolding objective for three semantic tasks. However, Swayamdipta et al. (2019) find that the benefits of shallow syntactic objectives are largely eclipsed by the implicit information captured in contextualized encoders.

Data
A number of factors make the Universal Decompositional Semantics representation (UDS; White et al., 2020) particularly well-suited to our purposes, especially the existence of parallel manually-annotated syntactic and semantic data. In UDS, a semantic graph is built on top of existing English Web Treebank (EWT; Bies et al., 2012) UD parses, which are mapped to nodes and edges in a semantic DAG via a set of deterministic rules (White et al., 2016; Zhang et al., 2017). This semantic graph, which represents the predicate-argument relationships in the text, is then annotated with crowdsourced scalar-valued attributes falling into the following categories: factuality (how likely a predicate is to have occurred), genericity (how general or specific a predicate/argument is), time (how long an event took), word sense (which word senses apply to a predicate/argument), and semantic proto-roles, which break the traditional SRL ontologies into simpler "proto-agent" properties (e.g. volition, awareness, sentience) and "proto-patient" properties (e.g. change of state, change of location, being used). Note that while the semantic graph structure is tied to the syntax, the attribute values, encoding fine-grained, abstracted semantic inferences, are not. These attributes are unique among graph-based semantic representations. All of these properties are annotated on a scale of −3 to 3; for more details on the dataset, we refer the reader to White et al. (2016) and White et al. (2020), as well as Fig. 1. We train and evaluate on a semantically valid subset of EWT. 3 We similarly limit our baselines to these examples for our UD analysis, and release our cleaned UD dataset, for which we report state-of-the-art parsing performance.

Arborescence Following Zhang et al. (2019a) and Stengel-Eskin et al. (2020), we convert the UDS graph into an arborescence, or tree. Reentrant semantic nodes are copied and co-indexed, so that a DAG can be recovered via a deterministic post-processing step.
Each node is assigned a token label, taken from the token corresponding to the syntactic head of the semantic node (instance edges in Fig. 1). All syntactic nodes not assigned as labels to semantic nodes are included as nodes dominated by their corresponding semantic node (pink edges in Fig. 2). 4 An example conversion of the UDS graph in Fig. 1 can be found in Fig. 2. In the "semantics only" setting, only the semantic nodes are included in the graph. A pre-order traversal linearizes the arborescence into a sequence of nodes, indices, edge heads and labels, and the corresponding node and edge attributes.
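The linearization above can be sketched as a simple pre-order traversal. The tree encoding (a dict of node id to label and children) and the field names below are illustrative stand-ins, not the exact data structures used in the UDS toolkit:

```python
# Sketch of pre-order linearization of an arborescence into a sequence of
# (node label, index, head index, edge label) tuples, as described above.

def linearize(tree, root):
    """Pre-order traversal yielding (label, index, head_index, edge_label)."""
    seq = []

    def visit(node, head_idx, edge_label):
        idx = len(seq)
        seq.append((tree[node]["label"], idx, head_idx, edge_label))
        for child, child_edge in tree[node]["children"]:
            visit(child, idx, child_edge)

    visit(root, -1, "root")
    return seq

# A toy single-predicate graph for "cats chased dogs" (labels hypothetical).
toy = {
    "pred": {"label": "chased", "children": [("arg0", "ARG0"), ("arg1", "ARG1")]},
    "arg0": {"label": "cats", "children": []},
    "arg1": {"label": "dogs", "children": []},
}
seq = linearize(toy, "pred")
```

Because the traversal is deterministic, the node, index, head, and label sequences can be read off column-wise from `seq` for sequence-to-graph training.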

Models
We build on the transductive parsing model presented by Stengel-Eskin et al. (2020), which itself builds on the broad-coverage semantic parsing model of Zhang et al. (2019b) and relies heavily on AllenNLP (Gardner et al., 2018). The transductive parsing paradigm recasts graph-based parsing as a sequence-to-graph problem, using an attentional sequence-to-sequence model to transduce the input sentence into a set of nodes while incrementally predicting edges and edge labels for those nodes. The UDS semantic parser consists of 7 modules: The encoder module embeds the input features (type-level GloVe and contextualized word embeddings, POS tags, charCNN features) into a latent space, producing one vector per input token. BERT representations are pooled over subword units. 5 The decoder embedding module embeds the categorical information of the previous timestep (e.g. the token identity and index, the head token identity and index, the edge type) into a real space. The target node module builds node representations from the decoder embedding module's output. The target label module extends the Pointer-Generator network (See et al., 2017), which supports both generating new token labels from a vocabulary and copying tokens from the input, with a "target-copy" operation, additionally allowing the model to predict a token label by copying a previously predicted node, conditioned on a target node. This three-operation approach (i.e. generate, source-copy, target-copy) enables the parser to seamlessly handle lexicalized and non-lexicalized formalisms, while also natively supporting re-entrancy through the target-copy operation. The relation module is a graph-based dependency parser based on the parser presented by Dozat and Manning (2017), which uses separate head and dependency MLPs followed by a biaffine transformation to predict the dependency scores and labels between each node in a fully connected graph.
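The biaffine scoring idea behind the relation module can be sketched in a few lines of numpy. The single-layer tanh "MLPs", the absence of a label scorer, and all dimensions here are simplifications for illustration, not the paper's exact configuration:

```python
import numpy as np

# Minimal sketch of biaffine head scoring in the style of Dozat and Manning
# (2017): project encoder states into separate head/dependent spaces, then
# combine them with a biaffine form to score every (dependent, head) pair.

rng = np.random.default_rng(0)
T, d, h = 5, 8, 4                      # tokens, encoder dim, projection dim

s = rng.normal(size=(T, d))            # one encoder state per token
W_head = rng.normal(size=(d, h))       # "head" projection (stand-in for an MLP)
W_dep = rng.normal(size=(d, h))        # "dependent" projection
U = rng.normal(size=(h, h))            # biaffine weight
b = rng.normal(size=(h,))              # head-prior bias

H = np.tanh(s @ W_head)                # head representations, (T, h)
D = np.tanh(s @ W_dep)                 # dependent representations, (T, h)

# scores[i, j] = score of token j being the head of token i
scores = D @ U @ H.T + (H @ b)[None, :]
pred_heads = scores.argmax(axis=1)     # greedy head choice per token
```

In the full model these scores feed either a greedy decoder (training) or a maximum-spanning-tree decoder (test), and a parallel biaffine classifier predicts the label of each chosen arc.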
For semantic parsing, we follow a greedy decoding strategy since the linearization of the arborescence implicitly enforces a well-formed output; this allows for single-step online decoding.
The node attribute module uses the node representations to predict whether each attribute applies to each node, and what its value should be. The decision of whether an attribute applies and the prediction of its value are performed by separate MLPs.
The edge attribute module is similar to the node attribute module, but passes a bilinear transformation of two node representations to the MLPs, which predict edge-level properties and masks.
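As a rough illustration of the two attribute modules, the sketch below uses single-layer stand-ins for the MLPs, an elementwise bilinear combination for edges, and hypothetical sizes; it is not the paper's exact parameterization:

```python
import numpy as np

# Illustrative sketch: one projection predicts a scalar attribute value,
# another predicts whether the attribute applies at all (a sigmoid mask).
# Edge attributes first combine the two endpoint node representations.

rng = np.random.default_rng(1)
d = 16                                  # node representation size (hypothetical)

W_val = rng.normal(size=(d,))           # value head
W_mask = rng.normal(size=(d,))          # "does this attribute apply" head
W_bi = rng.normal(size=(d, d))          # combines head/dependent node reps

def node_attribute(z):
    value = float(z @ W_val)            # scalar; trained to lie in [-3, 3]
    applies = float(1 / (1 + np.exp(-(z @ W_mask))))  # mask probability
    return value, applies

def edge_attribute(z_head, z_dep):
    pair = np.tanh((z_head @ W_bi) * z_dep)   # bilinear-style combination
    return node_attribute(pair)

z1, z2 = rng.normal(size=d), rng.normal(size=d)
val, mask = edge_attribute(z1, z2)
```

The mask lets the model abstain on attributes that do not apply to a given node or edge, rather than forcing a scalar prediction everywhere.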
In this work, we make several modifications to this model. Firstly, we replace the encoder and target node modules with Transformer-based architectures (Vaswani et al., 2017). Next, we introduce a second biaffine parser on top of the output of the encoder module, which is tasked with performing dependency parsing on the UD data. During training, we use greedy decoding, while at test time the Chu-Liu-Edmonds Maximum Spanning Tree algorithm (Chu, 1965;Edmonds, 1967) is used.
More specifically, the encoded input representations s t in Stengel-Eskin et al. (2020) were obtained from a stacked bidirectional LSTM. While the input to this module remains the same in our Transformer-based implementation, the representation s t is now given by the final layer of a multi-head, multi-layer Transformer model. Crucially, following Nguyen and Salazar (2019), we replace the layer normalization layer with a ScaleNorm layer and place it before the feed-forward network. The syntactic biaffine parser uses the encoder representations s to predict a head and head label for each token. This task is different from the semantic parsing task in that the graph is lexicalized (i.e. there is a bijection between input tokens and graph nodes).
Similarly, the node representations z i are computed by a Transformer decoder with both self-attention (as in the encoder) and source-side attention. The first layer of node representations for the decoder is given by learned continuous embeddings of the head token and current token representation, their respective indices, and the relationship between them. During training, the gold nodes and heads are used (i.e. teacher forcing), and the attention is computed with an autoregressive mask, so that each token is only able to attend to tokens in its left context. We take z i = ScaleNorm(x L i), where x L i is the output of the final layer in a stack of L Transformer layers. We also follow Nguyen and Salazar (2019) in scaling the attention head weight initialization by a factor of k.
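ScaleNorm itself is a very small operation: rather than LayerNorm's per-feature mean/variance normalization with learned gain and bias, it rescales the whole activation vector to a single learned length g. A minimal sketch:

```python
import numpy as np

# ScaleNorm (Nguyen and Salazar, 2019): L2-normalize the last axis, then
# scale by a single learned scalar g. The eps guards against zero vectors.

def scale_norm(x, g, eps=1e-5):
    """x: array of shape (..., d); g: learned scalar."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return g * x / (norm + eps)

out = scale_norm(np.array([[3.0, 4.0]]), g=2.0)   # vector of norm 5 -> norm ~2
```

With only one parameter per normalization site (versus 2d for LayerNorm), ScaleNorm is a natural fit for the relatively small UDS training set.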

Experiment 1: Joint English Parsing
To determine the effect of jointly parsing the syntax and semantics, we consider a number of baselines and experimental settings. First, we contrast our re-implementation of Stengel-Eskin et al. (2020)'s LSTM-based model with their results. We then report the results of our Transformer-based model, described in §4. After establishing these baselines, we consider different methods of incorporating the syntactic signal into the model:

Concat-before (CB): Here, we linearize the syntactic UD parse, which is a tree, via a pre-order traversal, yielding a sequence of nodes, edge heads, and edge labels, which we prepend to the corresponding sequences obtained by linearizing the semantic parse, separated by a special separation token. At inference time, we use this token to split the output into syntactic and semantic parses.

Concat-after (CA): This setting is identical to the concat-before setting, except that the syntactic graph is appended at the end of the semantic sequence. These two settings incorporate some syntactic signal into the semantic parse, but do not exploit UD parsing's lexicalization assumption; thus we expect them to yield subpar UD results.

Encoder-side (EN): Here we do make use of this assumption, adding a biaffine parser to the encoder states s 1:T. We introduce an additional syntactic parsing objective, which allows us to take advantage of the strong lexicalized bias. However, the syntactic signal only enters the model implicitly via backprop, i.e. during the forward pass, the model has no access to syntactic information.

Intermediate (IN): We incorporate the syntactic information by re-encoding the predicted syntactic parse and passing it to the decoder. Given the close syntactic correspondence of the UDS semantic graph, we would expect that allowing the decoder to access the predicted dependency parse would benefit both the semantic parse and the syntactic parse.
We enable this by concatenating edge information to s 1:T and linearly projecting it. Specifically, given edge head scores E ∈ R T×T, where each row i is a distribution over possible heads for token i, the output of the parser's head MLP H ∈ R T×d_h, and the output of the parser's edge type MLP T ∈ R T×d_t, we compute the new encoder representations as s̃ 1:T = W [s 1:T ; EH; ET], where [·;·] denotes concatenation along the feature dimension and W is a learned linear projection.

Transformer Hyperparameters Unlike the LSTM-based model, which is fairly robust to hyperparameter changes, the Transformer-based architecture was found to be sensitive to such changes. We use a random search strategy (Bergstra and Bengio, 2012) with 40 replicants, tuning the number of layers l ∈ [6, 8, 12], the initialization scaling factor k ∈ [4, 32, 128, 512], the number of heads H ∈ [4, 8], the dropout factor d ∈ [0.20, 0.33], and the number of warmup steps for the optimizer w ∈ [1000, 4000, 8000]. The search was performed with the base model, and the best hyperparameters were used in all other models.
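The random search over these grids can be sketched as follows. The grids are copied from the text; the sampling loop itself (and the absence of an actual training call) is illustrative:

```python
import random

# Sketch of random hyperparameter search with 40 replicants over the
# discrete grids described above. Each replicant would train one model.

SEARCH_SPACE = {
    "layers": [6, 8, 12],
    "init_scale_k": [4, 32, 128, 512],
    "heads": [4, 8],
    "dropout": [0.20, 0.33],
    "warmup_steps": [1000, 4000, 8000],
}

def sample_config(rng):
    return {name: rng.choice(grid) for name, grid in SEARCH_SPACE.items()}

rng = random.Random(0)                      # fixed seed for reproducibility
replicants = [sample_config(rng) for _ in range(40)]
```

Random search samples the joint space uniformly rather than enumerating it (the full grid here has 3 × 4 × 2 × 2 × 3 = 144 cells), which is why 40 replicants suffice to cover it reasonably well.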

Evaluation Metrics
UAS/LAS: Unlabeled Attachment Score (UAS) computes the fraction of tokens with correctly assigned heads in a dependency parse. Labeled Attachment Score (LAS) computes the fraction with correct heads and arc labels. Both are standard metrics for UD parsing.

Pearson's ρ: For UDS attributes, we compute the Pearson correlation between the predicted attributes at each node and the gold annotations in the UDS corpus. This is obtained under an "oracle" setting, where the gold graph structure is provided.

Attribute F1: Following the original description of semantic proto-roles as binary attributes (Dowty, 1991), we also measure whether the direction of the attributes matches that of the gold annotations, e.g. whether a predicate is likely factual (factuality-factual > θ) or not (factuality-factual < θ). 6 We tune θ per attribute type on validation data. This metric is abbreviated as F1 (attr), and along with ρ, measures performance on the attribute prediction task.

S-score: Following the Smatch metric, which uses a hill-climbing approach to find an approximate graph matching between a reference and predicted graph, S-score (Zhang et al., 2017) provides precision, recall, and F1 score for nodes, edges, and attributes. Note that while S-score enables us to match scalar attributes jointly with nodes and edges, for the sake of clarity we have chosen to bifurcate the evaluations: S-score for nodes and edges only (functionally equivalent to Smatch), and ρ and F1 (attr) for attributes. We use two variants of S-score: one evaluates against full UDS arborescences with linearized syntactic subtrees included as children of semantic heads (abbreviated as F1 (syn)), while the semantics-only setting evaluates only on semantic nodes (F1 (sem)). This metric measures performance on the semantic graph structure prediction task.
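UAS and LAS are simple enough to state directly in code. The (head, label) pair encoding below, with heads 1-indexed and 0 denoting the root, is one common convention rather than a fixed standard:

```python
# Minimal UAS/LAS computation over token-level (head, label) annotations.
# Heads are 1-indexed token positions; 0 marks the root.

def attachment_scores(gold, pred):
    """gold, pred: equal-length lists of (head, label). Returns (UAS, LAS)."""
    assert len(gold) == len(pred) and gold
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / len(gold)
    las = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return uas, las

# "cats chased dogs": cats <- chased (nsubj), chased = root, dogs <- chased (obj)
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "obl"), (0, "root"), (2, "obj")]   # right head, wrong label on token 1
uas, las = attachment_scores(gold, pred)
```

Here every head is correct (UAS = 1.0) but one label is wrong, so LAS is lower; LAS can never exceed UAS.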

Experiment 1: Results and Analysis
The Transformer outperforms the LSTM on UDS parsing. We first observe that, with modifications and tuning, the Transformer architecture strictly outperforms the LSTM despite the relatively low number of training examples (12.5K). Fig. 3, which corresponds to Table 1 rows 2 and 3, shows that the Transformer outperforms the LSTM on the S-score metric (with syntactic nodes included, following Stengel-Eskin et al. (2020)) as well as attribute F1 and Pearson's ρ. Note that in this figure, as well as the others in this section, the vertical axis is scaled to highlight relevant contrasts. Turning to UD parsing, Fig. 4, which corresponds to Table 1, shows that an LSTM encoder with a biaffine parser and no semantic decoder (LSTM + BI) outperforms both baselines (Chen and Manning, 2014; Dozat and Manning, 2017; C+M and D+M, respectively). Note that this model has no semantic signal, and is trained only on UD parsing. In the LSTM case, the addition of the UDS semantic signal via the encoder-side model described in §5 slightly lowers performance. However, this is not the case for the Transformer; the syntax-only Transformer (TFMR + BI) model outperforms the LSTM model, and is slightly outperformed by the joint syntax-semantics Transformer model.
Joint training has little impact on attribute metrics for non-baseline models. Fig. 5 shows Pearson's ρ and binarized attribute F1 (with θ tuned on the development set); this corresponds to the 2nd two rows and final 10 rows of Table 1. In contrast, the addition of syntactic information through concatenation (concat-before, concat-after) seems to diminish the performance on these metrics.

Figure 4: LSTM and TFMR performance on English EWT UD parsing, contrasted with Chen and Manning (2014) and Dozat and Manning (2017) baselines. Models with semantic information (+ EN) outperform their syntax-only baselines (+ BI); Transformer models outperform their LSTM counterparts.

Joint parsing slightly improves semantic structural performance. Fig. 6 shows the structural F1 computed by S-score, where we observe that the LSTM's performance, which is lower than the Transformer's in the baseline setting, benefits most from the concatenation settings, while suffering under the encoder and intermediate settings.
By way of contrast, the Transformer, whose baseline performance is higher, benefits most from the encoder-side biaffine parsing setting, which also boasts the best UD performance (cf. Fig. 4).

Figure 6: LSTM and TFMR S-score F1 (with syntax nodes included). While the concat-after setting offers S-score improvements for both encoder/decoder types, the syntactic performance in this setting is very poor (< 60 UAS).

The Transformer encoder-side multitask model is able to improve structural performance while simultaneously boosting UAS and LAS (see Fig. 4). These results demonstrate that explicitly incorporating a syntactic signal into a transductive semantic parsing model can be done without damaging semantic performance, both for UDS attributes and structure. Perhaps more surprisingly, the semantic signal coming from the UDS attributes and structure improves the syntactic performance of the model when the syntactic model is able to take advantage of the lexicalized nature of UD. Note that due to the parallel nature of the UD and UDS data, we can conclude that the improvements here result from the additional structural signal, and not merely from the addition of more sentences. We see that for the concatenation settings, while the semantic structural performance may increase, the syntactic parsing results are dismal. This is true whether we concatenate the syntactic graph before or after the semantic one. This may be explained by the fact that, by using a transductive model for a lexicalized parsing task like UD parsing, we are complicating what the model needs to learn. Rather than simply labelling existing nodes, the model must reproduce these nodes via source-side copying.
While prima facie, we would expect the intermediate model to outperform the multitask encoderside model, as its decoder has explicit access to the syntactic parse, we see that this is not the case; it shows lower structural and attribute performance. This represents a direction for future work.
The Role of the Encoder The results in Fig. 5 show that the Transformer-based model has a heavy advantage over the LSTM-based model in terms of attribute prediction. This might be due to an improved ability of the Transformer encoder to incorporate signals from across the input, since the self-attention mechanism has equal access to all positions, while the BiLSTM has only sequential access, which may become corrupted or washed out over longer distances. Given the highly contextual nature of UDS inferences, we would expect a model which better captures context to have a distinct advantage, as the crucial tokens for correctly inferring an attribute value may be found in a distant part of the input sentence: for example, if we wish to infer the factuality of "left" in the sentence "Bill eventually confessed to the officers that, contrary to his previous statements, Joan had left the party early," most of the signal would come from the token "confessed," producing a high score. The construction of UDS and its attributes' scalar nature allow us to test this hypothesis by examining the Pearson correlation between predicted and true attributes at different positions in the input. 7 In order to compare the correlations across sentences, we group the predicted and reference attributes into 10 percentile ranges, based on the ratio of the node position to the sentence length, i.e. what percentage into the sentence the node occurs. We then average the Pearson ρ values across all attribute types in each bin, obtaining average ρ values by sentence completion percentile. With this data, we can compare two models at a very fine-grained level, asking questions such as, "how much better does the Transformer model do than the BiLSTM on nodes that are between 0% and 10% into the sentence?" Fig. 7 shows such comparisons between a unidirectional left-to-right LSTM encoder and the bidirectional LSTM, and between the bidirectional LSTM and the Transformer.
While the unidirectional LSTM actually outperforms the BiLSTM in the central percentiles, it struggles near the edges. This could be explained on the left edge by a lack of right context, and on the right edge by difficulties with long-range dependencies. Furthermore, the Transformer outperforms the BiLSTM at all positional percentiles, but particularly in the central regions. This suggests that, while the BiLSTM incorporates contextual information well at the edges of a sentence, that information is diluted in the central region; the Transformer's self-attention mechanism, by contrast, can draw from arbitrary positions at all timesteps.
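The positional-binning analysis above can be sketched as follows, with synthetic data standing in for the models' attribute predictions; binning by decile and the two-sample `corrcoef` call mirror the procedure described in the text:

```python
import numpy as np

# Bin node-level (predicted, gold) attribute pairs by relative position in
# the sentence, then compute a Pearson correlation per decile.

def binned_pearson(positions, lengths, pred, gold, n_bins=10):
    """Returns {bin_index: Pearson rho} over position/length ratio deciles."""
    ratios = np.asarray(positions) / np.asarray(lengths)
    bins = np.minimum((ratios * n_bins).astype(int), n_bins - 1)
    out = {}
    for b in range(n_bins):
        mask = bins == b
        if mask.sum() >= 2:             # correlation needs at least 2 points
            out[b] = np.corrcoef(np.asarray(pred)[mask],
                                 np.asarray(gold)[mask])[0, 1]
    return out

# Synthetic stand-in: noisy predictions of gold attributes in [-3, 3].
rng = np.random.default_rng(0)
gold = rng.uniform(-3, 3, size=200)
pred = gold + rng.normal(scale=0.5, size=200)
pos = rng.integers(0, 20, size=200)          # node positions in 20-token sentences
rho = binned_pearson(pos, np.full(200, 20), pred, gold)
```

Comparing `rho` dictionaries from two models then gives the per-decile contrasts plotted in Fig. 7.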

Experiment 2: Tuning BERT
The results in Fig. 4 and Fig. 6 not only demonstrate that the addition of one structural modality (i.e. syntax, semantics) can benefit the other, but also suggest that these signals are complementary to the signal already given by the input features, which include contextualized features obtained from BERT. This stands in contrast to the previous finding of Swayamdipta et al. (2019) that the benefits of multitask learning with shallow syntactic objectives are largely eclipsed by contextualized encoders. However, we note that our models require full UD parses in the multitask settings, rather than a light scaffolding.
If indeed the combination of UDS and UD signals provides information not yet encoded in BERT, then fine-tuning BERT with these signals should yield additional benefits. Following observations that syntax and semantics are encoded to varying degrees at different depths in contextualized encoders (Hewitt and Liang, 2019; Tenney et al., 2019; Jawahar et al., 2019), with syntactic information typically lower in the network, we explore the trade-off between freezing and tuning various layers of the BERT encoder. Specifically, we tune the top n layers, starting from a completely frozen encoder and moving to tuning all 12 layers. 8

Figure 8: Test metrics when freezing/tuning different levels of BERT. X-axis represents the number of layers tuned (from the top layer). Tuning the encoder provides significant benefits over a frozen encoder (0 layers tuned), but the optimal number of tuned layers is not the full 12.

Intuitively, one might expect to see a monotonic increase as the number of tuned layers increases, since each additional unfrozen layer provides the model with more capacity. However, the results presented in Fig. 8 show a more nuanced trend: while performance across syntactic and semantic metrics increases up to a point, it begins to decrease again when additional layers are unfrozen. This may be due to data sparsity; given the relatively small size of the UDS corpus, the addition of too many parameters may encourage overfitting, resulting in decreased test performance. Note that the encoder-side model in all three panels of Fig. 8 is the same model, i.e. the best UAS, LAS, and S-score performance is obtained by the same model, and that the performance of the joint model at any given tuning depth typically falls above that of the baseline.
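The freeze/tune scheme amounts to toggling gradient flow per parameter. The sketch below uses the Hugging Face BERT parameter-naming convention ("encoder.layer.<i>.") but a stand-in parameter container rather than a real checkpoint, so the structure is purely illustrative:

```python
# Sketch: freeze all encoder parameters, then unfreeze only the top n
# of num_layers Transformer layers.

class Param:
    """Stand-in for a framework tensor with a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

def fake_bert(num_layers=12):
    params = {"embeddings.word_embeddings.weight": Param()}
    for i in range(num_layers):
        params[f"encoder.layer.{i}.attention.self.query.weight"] = Param()
    return params

def tune_top_n(params, n, num_layers=12):
    # The trailing dot keeps "encoder.layer.1." from matching layer 10, 11, ...
    tuned = {f"encoder.layer.{i}." for i in range(num_layers - n, num_layers)}
    for name, p in params.items():
        p.requires_grad = any(name.startswith(prefix) for prefix in tuned)

bert = fake_bert()
tune_top_n(bert, n=4)                      # tune layers 8-11, freeze the rest
trainable = [name for name, p in bert.items() if p.requires_grad]
```

With n = 0 this recovers the fully frozen encoder; with n = 12, full fine-tuning, i.e. the two endpoints of the sweep in Fig. 8.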
In contrast to the findings of Glavaš and Vulić (2020), who conclude that the benefits of UD pretraining for semantic language understanding tasks are limited when using contextualized encoders, our results in §6 show a small but consistent positive effect of syntactic information on semantic parsing, as well as improved syntactic performance from a semantic signal. Furthermore, our results here show that the UD signal can actually be used to fine-tune a contextualized encoder, which benefits not only the UD parsing performance but also the UDS performance. In fact, after training and evaluating their model (which, to our knowledge, has the highest performance to date on EWT) on our cleaned subset of EWT, we find that our best performing UAS/LAS values,93.42 and 91.22,outperform their values of 92.83 and 90.11. These values also slightly outperform the syntax-only version of the same model, with the same amount of tuning. The tuned encoder-side model also provides the best semantic performance, with a max score of 91.82, compared to 90.04 in the TFMR + EN setting (cf. Table 1). Prepositional Phrase Attachment Ambiguity Looking at the results in Figs. 4 and 6, which are mirrored in the tuned results visualized in Fig. 8, a natural question to ask is where the syntactic performance gains are coming from in the encoderside model. One hypothesis, in line with literature on semantic bootstrapping, is that the semantic signal helps the model to discriminate between ambiguous parses. Consider, for example, the sentence "I shot an elephant in my pyjamas." Syntactically, there are two valid heads for the prepositional phrase (PP) "in my pyjamas", but the semantics of the phrase indicates to us that it is less likely to be attached to "elephant". Perhaps adding an explicit semantic signal, like that of UDS, would improve a syntactic parser's ability to disambiguate sentences like this. In order to test this hypothesis, we use a dataset introduced by Gardner et al. 
(2020), consisting of 300 sentences with UD annotations. 150 sentences with potential PP attachment ambiguities were chosen from a combination of English UD treebanks: 75 with nominal PP heads and 75 with verbal heads. Minimal semantic changes were then made to each sentence to switch the head (i.e. nominal heads were switched to verbal heads, and vice versa). For example, the sentence "They demanded talks with local US commanders" becomes "They demanded talks with great urgency" (noun to verb).
A model's performance on this dataset is measured not only by its raw performance on the unaltered sentences but, crucially, by its performance on the altered ones. As the altered sentences are constructed to be different from those seen in the training set, we expect a drop if the model has learned simple heuristics (e.g. always attach to a noun) rather than robust rules based on semantic understanding. An ideal model would have high performance and no drop. Fig. 9 compares the tuned syntax-only UD baseline (right column) and the tuned UDS parser with encoder-side parsing (left column) on this task, both for noun-to-verb and verb-to-noun alterations.

Figure 9: Encoder-side and syntax-only performance on sentences with PP attachment ambiguities. A joint syntax-semantics model is slightly more robust to manual adjustment of the prepositional head than a syntax-only model.

In all cases, we see a significant drop in performance on the altered examples. In the noun-to-verb case, while the syntax-only baseline's initial performance is higher, it experiences a larger drop than the encoder-side model, with its performance on the altered dataset being lower on both UAS and LAS. In the verb-to-noun case, while both models undergo roughly the same major performance loss in the altered context, the initial performance of the encoder-side model is higher. Taken together, these results suggest that the addition of the UDS signal may provide a small benefit when disambiguating PP attachment ambiguities. However, such ambiguities are fairly rare in UD corpora, and are thus unlikely to explain the whole difference between the models. To this end, we examine the difference in UAS performance between systems on the 10 most frequent relation types in Fig. 10. Comparing the joint UD-UDS parser to the syntax-only baseline, we see that small gains are realized for the most frequent relations, but some relations suffer minor losses as well.
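The evaluation on this dataset reduces to standard attachment scores plus the drop between unaltered and altered sentences. A minimal sketch (function names are ours, not from the paper):

```python
def attachment_scores(gold, pred):
    """UAS/LAS over one sentence: `gold` and `pred` are lists of
    (head_index, relation_label) pairs, one per token. UAS counts
    correct heads; LAS requires both head and label to match."""
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(gh == ph for (gh, _), (ph, _) in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las

def robustness_drop(score_unaltered, score_altered):
    """The quantity compared in Fig. 9: how much performance is lost
    when the PP head is switched. An ideal model scores high on the
    unaltered set and shows no drop."""
    return score_unaltered - score_altered
```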
In contrast, when comparing the tuned and untuned systems, nearly all of the most frequent relations see fairly large improvements.

UDS attributes and UD relations The close link between UD parses and the UDS annotations in the dataset allows us not only to train multitask models for joint syntactic and semantic parsing, but also to inspect the interactions between syntactic relations and semantic attributes more closely. Each cell in Fig. 12 shows the Pearson ρ between true and predicted attributes for a variety of UDS annotations, conditioned on UD dependency relations. The node attributes (annotated on semantic nodes in the UDS graph) are paired with the UD relation of the corresponding syntactic head node. Predictions are obtained from the best tuned model with encoder-side UD parsing (TFMR + EN), under an oracle decode of the graph structure.
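The per-cell statistic in Fig. 12 can be reproduced by grouping (true, predicted) attribute pairs by the head's UD relation and computing Pearson's ρ within each group. A self-contained sketch (our own illustrative code, not the paper's):

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation; assumes each input has nonzero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def rho_by_relation(rows):
    """rows: (ud_relation, true_attr, pred_attr) triples.
    Returns one ρ per UD relation, i.e. one cell of the heatmap."""
    groups = defaultdict(lambda: ([], []))
    for rel, true_val, pred_val in rows:
        groups[rel][0].append(true_val)
        groups[rel][1].append(pred_val)
    return {rel: pearson(ts, ps) for rel, (ts, ps) in groups.items()}
```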
We see variation within a given attribute across dependency relations. For example, factuality annotations display a high ρ for root and conj relations, but a lower correlation for xcomp. These correlations are visualized in Fig. 11, where we plot true vs. predicted values, with the line defined by ρ overlaid. The close correspondence between UD and UDS lets us observe this type of discrepancy, which echoes findings by White et al. (2018), who used factuality prediction to probe neural models' ability to make inferences based on lexical and syntactic factuality triggers. Furthermore, it is in keeping with semantic theories, as the xcomp relation is used for open clausal complements, i.e. non-finite embedded clauses with an overt control subject in the main clause (e.g. object or subject control). In English, xcomp relations correspond to infinitival embedded clauses, e.g. "I remembered to turn off the stove." As pointed out by White (2020), factuality inferences are particularly hard in these contexts, as they are sensitive not only to the lexical category of the embedding predicate (i.e. "remembered" vs. "forgot") but also to its polarity (i.e. "remembered" vs. "didn't remember"). This separates them from finite clausal complements, where a matrix negation still results in the same factuality inference; e.g. in both "I remembered that I turned off the stove" and "I didn't remember that I turned off the stove" we infer that the stove was turned off. Furthermore, xcomp relations are present in object and subject control cases, which may be difficult even for human speakers to acquire (Chomsky, 1969; Cromer, 1970). Beyond comparing our model predictions to theoretical predictions at the syntax-semantics interface, we can also use this analysis to examine the data on which the model was trained.
For instance, homing in on the genericity-arg-kind annotations (reflecting the degree to which an argument refers to a kind of thing) for direct objects (dobj), we see that for some examples, while the model prediction differs from the annotation, it is not wrong per se. One example is "Take a look at this spreadsheet", where "look" is annotated as high for kind (1.41) but predicted as low (-1.09). In another example, "...I could not find one place in Tampa Bay that sells...", the argument "place" has a high predicted kind value (0.72) but is annotated otherwise (-0.87). In both of these cases, one could argue that the model's prediction is not entirely incorrect.

Experiment 3: Multilingual Parsing
The results in §6 showing that English UD parsing and UDS parsing are mutually beneficial naturally give rise to a follow-up question: does this relationship extend to a multilingual setting? As in the monolingual case, we explore both the impact of UD parsing on UDS, and vice versa. UD is by design highly multilingual, spanning scores of languages from a diverse typological range, often with multiple treebanks per language. This has led to interest in evaluating the performance of UD parsing models not just on English but across a range of languages and language families; both the 2017 and 2018 CoNLL shared tasks focused on multilingual UD parsing (Zeman et al., 2017, 2018). The introduction of multilingual contextualized encoders, such as mBERT (Devlin et al., 2019; Devlin, 2018) and XLM-R (Conneau et al., 2020), has enabled models to perform UD parsing in multiple languages simultaneously using features obtained from a single multilingual encoder (Schuster et al., 2019; Kondratyuk and Straka, 2019). By initializing the weights for one task (syntactic or semantic) with weights learned on the other, and leveraging the shared input representation space of XLM-R, we examine bottom-up effects of representations learned with a syntactic objective on semantic parsing performance, and top-down effects of the semantic objective on syntactic performance. Note that unlike in §6, we do not have parallel data in these settings, leading to the use of pre-training rather than simultaneous multitask learning. Note also that we are examining the relationship between English semantic parsing and multilingual, non-English syntactic parsing. We do not make use of pre-trained type-level word embeddings in these experiments, leading us to expect slightly lower absolute performance on the UDS parsing task as compared to Table 1. Based on the syntactic results in §6, we explore only the Transformer models in our multilingual experiments.
We tune the XLM-R encoder through layer 5, based on our observations on the development set in §6. (We also re-tuned the Transformer hyperparameters, as the input space changed from BERT and GloVe to XLM-R.)

Languages 8 languages in 5 families from the 2018 CoNLL Shared Task (Zeman et al., 2018) were chosen, across both higher- and lower-resource settings. Table 2 gives further details and highlights the range of resource settings examined. (Note that despite its relatively large test set size, the Kazakh train set is very small (n = 31); of these sentences, 3 were used for validation, leaving 28 train sentences.)

Bottom-up Effects Examining bottom-up effects of syntax on semantics, we pre-train a multilingual UD model on all 8 languages simultaneously, using alternating batches from each language and capping each epoch at 20,000 examples. We then initialize the encoder and biaffine parser in the UDS parser with the weights learned on UD parsing. Our syntactic results are competitive with those reported by Glavaš and Vulić (2020): 93.1 UAS, 90.5 LAS. Since our model is additionally capable of performing UDS parsing at a level competitive with the best system presented in Table 1, we encourage others to make use of it in the future.
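The multilingual pre-training scheme described above (alternating batches across languages, each epoch capped at a fixed number of examples) can be sketched as a simple round-robin sampler; the batch representation and cap value here are illustrative assumptions, not the paper's implementation:

```python
from itertools import cycle

def alternating_batches(batches_by_lang, cap=20_000):
    """Yield (language, batch) pairs round-robin across languages,
    cycling through each language's batches, until drawing another
    batch would exceed `cap` total examples (one epoch)."""
    iters = {lang: cycle(bs) for lang, bs in batches_by_lang.items()}
    drawn = 0
    for lang in cycle(sorted(iters)):
        batch = next(iters[lang])
        if drawn + len(batch) > cap:
            return
        drawn += len(batch)
        yield lang, batch
```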
The trends here suggest that a multilingual syntactic signal, when incorporated well into a UDS model, can benefit syntactic performance without necessarily reducing semantic performance. Note that unlike in §6, the syntactic data used to pretrain the syntactic encoder and biaffine parser is neither parallel to the UDS dataset nor in English. Thus, it is surprising that this syntactic data can act as a signal for semantic parsing, albeit with small effects.

Top-down Effects
In the top-down direction (semantics to syntax), we train the encoder-side and intermediate variants of the joint UDS and syntactic parsing model and subsequently load the weights from their encoders and biaffine parsers into separate UD models for all 8 languages. These are compared against a baseline model with weights initialized from an English UD parsing model. Thus any improvement obtained by the encoder-side and intermediate models comes strictly from the semantic signal, since the syntactic signal is shared with the baseline model. Table 4 shows the LAS and UAS performance of these models, with arrows indicating the direction of change from the baseline model. We see that almost all of the languages benefit from the addition of the semantic signal in at least one model variant, with the exception of Finnish, which performs worse across all variants and metrics when the semantic signal is included. For Galician, Hungarian, and Armenian, we see a sizeable improvement between the models. With the exception of Kazakh, whose train set is minuscule, these are among the lowest-resource languages considered. While, given the typical pipeline view of syntax as a substrate for semantics, we might expect the bottom-up results to be stronger than the top-down results, here we find that the syntactic benefits of pretraining on a semantic task are more consistent and stronger than those in the other direction.
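The weight-transfer step used in both directions can be sketched as a state-dict merge. The parameter prefixes below (`encoder.`, `biaffine_parser.`) are hypothetical names of our own choosing, and in practice the values would be parameter tensors rather than numbers:

```python
def transfer_weights(source_state, target_state,
                     prefixes=("encoder.", "biaffine_parser.")):
    """Initialize a target UD model from a joint-model checkpoint:
    copy every parameter under the given prefixes from `source_state`,
    keeping the target's remaining parameters (e.g. output heads) as-is."""
    merged = dict(target_state)
    merged.update({k: v for k, v in source_state.items()
                   if k.startswith(prefixes)})
    return merged
```

Only the shared components are overwritten, so any downstream difference from the baseline is attributable to the transferred representations.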
Discussion On the whole, the multilingual top-down and bottom-up effects seem to mimic the monolingual results, albeit with smaller relative improvements. In §6, we saw a mutually beneficial relationship between UD and UDS parsing in a number of settings; here, we see that in several cases, this pattern generalizes to a case where we pretrain on data that is not only in a different domain than the evaluation (i.e. syntax vs. semantics) but also in a different language.

Table 4: LAS/UAS for models with weights transferred from EWT UD parsing compared with those from encoder-side and intermediate UDS models.

These effects hint
at useful commonalities not only between syntactic parses across multiple languages, but also between multilingual syntax and the UDS representation.

Conclusion
In §5, we introduced a number of multitask architectures for joint syntactic and semantic parsing, which we demonstrated in §6 can perform UD and UDS parsing simultaneously without sacrificing performance, as evaluated across a number of syntactic and semantic metrics. In particular, we observed a top-down benefit to syntactic parsing from the semantic signal as well as a bottom-up benefit to semantic performance from syntactic parsing. We contrasted both LSTM and Transformer-based variants of these architectures, finding the Transformer to be better on all metrics. Finding the syntactic and semantic information present in the data to be complementary to that encoded in a frozen contextualized encoder, we experimented in §7 with tuning the encoder to varying depths, finding that tuning the top-most layers provides the greatest benefit. We analyzed the models resulting from this tuning step on their ability to resolve attachment ambiguities, as well as examining interactions between UDS annotations and UD dependency relations. Furthermore, in §8, we extended our experiments beyond English, using a transfer-learning experimental paradigm to investigate effects between multilingual syntactic parsing in 8 languages and English semantic parsing, where we found similar trends to the English-only setting. Based on these multilingual results, we believe that expanding the UDS data paradigm (i.e. UD-based graph structure, continuous attributes) beyond English and building robust multilingual parsing models is a particularly promising direction for future work. Other directions include improving the robustness of the Transformer model in low-resource settings and improvements to the attribute modules.