Exploring Neural Methods for Parsing Discourse Representation Structures

Neural methods have had several recent successes in semantic parsing, though they have yet to face the challenge of producing meaning representations based on formal semantics. We present a sequence-to-sequence neural semantic parser that is able to produce Discourse Representation Structures (DRSs) for English sentences with high accuracy, outperforming traditional DRS parsers. To facilitate the learning of the output, we represent DRSs as a sequence of flat clauses and introduce a method to verify that produced DRSs are well-formed and interpretable. We compare models using characters and words as input and see (somewhat surprisingly) that the former performs better than the latter. We show that eliminating variable names from the output using De Bruijn-indices increases parser performance. Adding silver training data boosts performance even further.


Introduction
Semantic parsing is the task of mapping a natural language expression to an interpretable meaning representation. Semantic parsing used to be the domain of symbolic and statistical approaches (Pereira and Shieber, 1987; Zelle and Mooney, 1996; Blackburn and Bos, 2005). Recently, however, neural methods, and in particular sequence-to-sequence models, have been successfully applied to a wide range of semantic parsing tasks. These include code generation (Ling et al., 2016), question answering (Dong and Lapata, 2016; He and Golub, 2016) and Abstract Meaning Representation parsing (Konstas et al., 2017). Since these models have no intrinsic knowledge of the structure (tree, graph, set) they have to produce, recent work has also focused on structured decoding methods, creating neural architectures that always output a graph or a tree (Buys and Blunsom, 2017; Alvarez-Melis and Jaakkola, 2017). These methods often outperform the more general sequence-to-sequence models but are tailored to specific meaning representations. This paper focuses on parsing Discourse Representation Structures (DRSs) proposed in Discourse Representation Theory (DRT), a well-studied formalism developed in formal semantics (Kamp, 1984; Van der Sandt, 1992; Kamp and Reyle, 1993; Asher, 1993; Muskens, 1996; van Eijck and Kamp, 1997; Kadmon, 2001; Asher and Lascarides, 2003), dealing with many semantic phenomena: quantifiers, negation, scope ambiguities, pronouns, presuppositions, and discourse structure (see Figure 1). DRSs are recursive structures and therefore form a challenge for sequence-to-sequence models, because the models need to generate a well-formed structure and not something that looks like one but is not interpretable.
The problem that we try to tackle bears similarities with the recently introduced task of mapping sentences to an Abstract Meaning Representation (AMR, Banarescu et al. 2013). But there are notable differences between DRS and AMR. Firstly, DRSs contain scope, which results in a more linguistically motivated treatment of modals, quantification, and negation. And secondly, DRSs contain a substantially higher number of variable bindings (reentrant nodes in AMR terminology), which are challenging for learning (Damonte et al., 2017).
DRS parsing was already attempted in the 1980s for small fragments of English (Johnson and Klein, 1986; Wada and Asher, 1986). Wide-coverage DRS parsers based on supervised machine learning emerged later (Bos, 2008b; Le and Zuidema, 2012; Bos, 2015; Liu et al., 2018). The objective of this paper is to apply neural methods to DRS parsing. In particular, we are interested in answers to the following questions: 1. Are sequence-to-sequence models able to produce formal meaning representations (DRSs)? 2. What is better for input: sequences of characters or sequences of words; does tokenization help; and what kind of casing is best used? 3. What is the best way of dealing with variables that occur in DRSs? 4. Does adding silver data increase the performance of the neural parser? 5. What parts of semantics are learned and what parts of semantics are still challenging?
We make the following contributions to semantic parsing: 1 (a) The output of our parser consists of interpretable scoped meaning representations, guaranteed by a specially designed checking tool (Section 3); (b) We compare different methods of representing input and output in Section 4; (c) We show in Section 5 that employing additional, non-gold standard data can improve performance; (d) We perform a thorough analysis of the produced output and compare our methods to symbolic/statistical approaches (Section 6).

The Structure of DRS
DRSs are meaning representations introduced by DRT (Kamp and Reyle, 1993). In general, a DRS can be seen as an ordered pair ⟨A, l : B⟩, where A is a set of presuppositional DRSs, and B a DRS with a label l. The presuppositional DRSs A can be viewed as propositions that need to be anchored in the context in order to make the main DRS B true, where presuppositions comprise anaphoric phenomena too (Van der Sandt, 1992; Geurts, 1999; Beaver, 2002). DRSs are either elementary DRSs or segmented DRSs. An elementary DRS is an ordered pair of a set of discourse referents and a set of conditions. There are basic conditions and complex conditions. A basic condition is a predicate applied to constants or discourse referents, while a complex condition can introduce boolean operators ranging over DRSs (negation, conditionals, disjunction). Segmented DRSs capture discourse structure by connecting two units of discourse by a discourse relation (Asher and Lascarides, 2003).

Footnote 1: The code is available here: https://github.com/RikVN/Neural_DRS.

Raw input:
Tom isn't afraid of anything.

System output of a DRS in a clausal form:
b1 Name x1 "tom"
b2 REF t1
b2 EQU t1 "now"
b2 time "n.08" t1
b0 NOT b3
b3 Time s1 t1
b3 Experiencer s1 x1
b3 afraid "a.01" s1
b3 Stimulus s1 x2
b3 REF x2
b3 entity "n.01" x2

The same DRS in a box format: [box rendering not reproduced]

Figure 1: DRS parsing in a nutshell: given a raw text, a system has to generate a DRS in the clause format, a flat version of the standard box notation. The semantic representation formats are made more readable by using various letters for variables: the letters x, e, s, and t are used for discourse referents denoting individuals, events, states and time, respectively, while b is used for variables denoting DRS boxes.

Annotated Corpora
Despite a long tradition of formal interest in DRT, textual corpora annotated with DRSs have only recently been made available. The Groningen Meaning Bank (GMB) is a large corpus with DRS annotation for mostly short English newspaper texts (Basile et al., 2012). The DRSs in this corpus are produced by an existing semantic parser and then partially corrected. The DRSs in the GMB are therefore not gold standard. A similar corpus is the Parallel Meaning Bank (PMB), which provides DRSs for English, German, Dutch and Italian sentences based on a parallel corpus. The PMB, too, is constructed using an existing semantic parser, but a part of it is completely manually checked and corrected (i.e., gold standard). In contrast to the GMB, the PMB involves two major additions: (a) its semantics are refined by modelling tense and employing semantic tagging (Bjerva et al., 2016), and (b) the non-logical symbols of the DRSs corresponding to concepts and semantic roles are grounded in WordNet (Fellbaum, 1998) and VerbNet (Bonial et al., 2011), respectively.
These additions make the DRSs of the PMB more fine-grained meaning representations. For this reason we choose the PMB (over the GMB) as our corpus for evaluating our semantic parser. Even though the sentences in the current release of the PMB are relatively short, they contain many hard semantic phenomena that a semantic parser has to deal with: pronoun resolution, quantifiers, scope of modals and negation, multi-word expressions, word senses, semantic roles, presupposition, tense, and discourse relations. As far as we know, we are the first to employ the PMB corpus for semantic parsing.

Formatting DRSs with Boxes and Clauses
The usual way to represent DRSs is the well-known box-format. In order to facilitate reading a DRS with unresolved presuppositions, it can be depicted as a network of boxes, where a non-presuppositional (i.e., main) DRS l : B is connected to the presuppositional DRSs A with arrows. Each box comes with a unique label and has two rows. In case of elementary DRSs these rows contain discourse referents in the top row and conditions in the bottom row (Figure 1). A segmented DRS has a row with labelled DRSs and a row with discourse relations (Figure 2).
The DRS in Figure 1 consists of a main box b0 and two presuppositional boxes, b1 and b2. Note that b0 has no discourse referents but introduces negation via a single condition ¬b3 with a nested box b3. The conditions of b3 represent unary and binary relations over discourse referents that are introduced either by b3 or the presuppositional DRSs.
A clausal form is another way of formatting DRSs. It represents a DRS as a set of clauses (see Figures 1 and 2). This format is better suited for machine learning than the box-format, as it has a simple, flat structure and facilitates partial matching of DRSs, which is useful for evaluation (van Noord et al., 2018). Conversion from the box-notation to the clausal form and vice versa is transparent: discourse referents, conditions, and discourse relations in the clausal form are preceded by the label of the box they occur in. Notice that the variable letters in the semantic representations are automatically set and simply serve readability purposes. Throughout the experiments described in this paper, we employ clausal form DRSs.

Figure 2 example, 00/3008: He played the piano and she sang.
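To make the correspondence between the two formats concrete, the following minimal Python sketch groups flat clauses by the box label that precedes them. The function name `clauses_to_boxes` and the dictionary layout are illustrative assumptions, not part of the PMB tooling; the clauses are those of the running example "Tom isn't afraid of anything."

```python
from collections import defaultdict


def clauses_to_boxes(clauses):
    """Group flat clauses by the label of the box they occur in.

    REF clauses contribute discourse referents (top row of a box);
    every other clause is stored as a condition (bottom row).
    """
    boxes = defaultdict(lambda: {"referents": [], "conditions": []})
    for clause in clauses:
        label, operator, *args = clause.split()
        if operator == "REF":
            boxes[label]["referents"].extend(args)
        else:
            boxes[label]["conditions"].append((operator, *args))
    return dict(boxes)


# Clausal form of the running example.
EXAMPLE = [
    'b1 REF x1', 'b1 male "n.02" x1', 'b1 Name x1 "tom"',
    'b2 REF t1', 'b2 EQU t1 "now"', 'b2 time "n.08" t1',
    'b0 NOT b3',
    'b3 REF s1', 'b3 Time s1 t1', 'b3 Experiencer s1 x1',
    'b3 afraid "a.01" s1', 'b3 Stimulus s1 x2', 'b3 REF x2',
    'b3 entity "n.01" x2',
]

boxes = clauses_to_boxes(EXAMPLE)
for label in sorted(boxes):
    print(label, boxes[label])
```

The sketch only recovers the content of each box; the nesting of boxes (here, b3 inside the negation of b0) would have to be read off the conditions that take box variables as arguments.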

Annotated Data
We use the English DRSs from release 2.1.0 of the PMB. The release suggests using parts 00, 10, 20 and 30 as the development set, resulting in 3,998 training and 557 development instances. Basic statistics are shown in Table 1, while the numbers of occurrences of some of the semantic phenomena mentioned in Section 2.2 are given in Table 2.
Since this is a rather small training set, we tune our model using 10-fold cross-validation (CV) on the training set, instead of tuning on a separate development set. This means that we will use the suggested development set as a test set (and refer to it as such). When testing on this set, we train a model on all available training data. The employed PMB release also comes with "silver" data, namely, 71,308 DRSs that are only partially manually corrected. In addition, we employ the DRSs from the silver data but without the manual corrections, which makes them "bronze" DRSs following the PMB terminology. Our experiments will initially use only the gold standard data, after which we will employ the silver or bronze data to further push the score of our best systems.

Clausal Form Checker
The clausal form of a DRS needs to satisfy a set of constraints in order to correspond to a semantically interpretable DRS, i.e., one that is translatable into a first-order logic formula without free occurrences of a variable (Kamp and Reyle, 1993). For example, all discourse referents need to be explicitly introduced with a REF clause to avoid free occurrences of variables. We implemented a clausal form checker that validates the clausal form if and only if it represents a semantically interpretable DRS. Distinguishing box variables from entity variables is crucial for the validity checking, but automatically learned clausal forms are not expected to differentiate variable types. First, the checker separately parses each clause in the form to induce variable types based on the fixed set of comparison and DRS operators. After typing all the variables, the checker verifies whether the clauses collectively correspond to a DRS with well-formed semantics. For each box variable in a discourse relation, the existence of the corresponding box inside the same segmented DRS is checked. For each entity variable in a condition, an introduction of the binder (i.e., accessible) discourse variable is found. The goal of these two steps is to prevent free occurrences of variables in DRSs. While binding the entity variables, the necessary accessibility relations between the boxes are induced. In the end, the checker checks the transitive closure of the induced accessibility relation for loops and verifies the existence of a unique main box of the DRS.
The checker is applied to every automatically obtained clausal form. If a clausal form fails the test, it is considered ill-formed and will not have a single clause matched with the gold standard when calculating the F-score.
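As an illustration of one of these constraints, the sketch below checks only that every entity variable used in a condition is introduced by some REF clause. It is a deliberately simplified stand-in, not the checker described above: it ignores accessibility between boxes, loops, and the unique-main-box requirement. The set `BOX_OPERATORS` is an assumed, incomplete list of operators whose arguments are boxes, and the function names are hypothetical.

```python
# Assumed, incomplete set of DRS operators whose arguments are box variables.
BOX_OPERATORS = {"NOT", "POS"}


def free_variables(clauses):
    """Return entity variables that are used but never introduced by REF."""
    introduced, used = set(), set()
    for clause in clauses:
        _box, operator, *args = clause.split()
        if operator == "REF":
            introduced.update(args)
        elif operator not in BOX_OPERATORS:
            # Quoted arguments are constants or word senses, not variables.
            used.update(arg for arg in args if not arg.startswith('"'))
    return used - introduced


def passes_ref_check(clauses):
    """True if no entity variable occurs free (necessary, not sufficient)."""
    return not free_variables(clauses)


well_formed = [
    'b1 REF x1', 'b1 Name x1 "tom"',
    'b0 NOT b3', 'b3 REF s1', 'b3 Experiencer s1 x1',
]
# Dropping the REF clause for s1 leaves s1 free.
ill_formed = [c for c in well_formed if c != 'b3 REF s1']

print(passes_ref_check(well_formed), free_variables(ill_formed))
```

Note that variable types are inferred here from the operator of each clause rather than from the variable letters, in the spirit of the typing step described above.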

Evaluation
A DRS parser is evaluated by comparing its output DRS to a gold standard DRS using the Counter tool (van Noord et al., 2018). Counter calculates an F-score over matching clauses. Since variable names are meaningless, obtaining the matching clauses essentially is a search for the best variable mapping between two DRSs. Counter tries to find this mapping by performing a hill-climbing search with a predefined number of restarts to avoid getting stuck in a local optimum, which is similar to the evaluation system SMATCH for AMR parsing. Counter generalises over WordNet synsets, i.e., a system is not penalised for predicting a word sense that is in the same synset as the gold standard word sense.
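The idea of scoring under the best variable mapping can be illustrated with a small Python sketch. Unlike Counter, which uses hill-climbing with restarts, this sketch enumerates all injective mappings by brute force, so it is only usable for very small DRSs; it also does not distinguish variable types or generalise over synsets. The function names and the variable pattern are assumptions made for the example.

```python
import re
from itertools import permutations

# Assumed naming pattern for variables (b = boxes; x, e, s, t = referents).
VARIABLE = re.compile(r"^[bxest]\d+$")


def variables(clauses):
    """Sorted list of variable names occurring in a set of clause tuples."""
    return sorted({tok for clause in clauses for tok in clause if VARIABLE.match(tok)})


def clause_f_score(produced, gold):
    """Clause-level F-score under the best injective variable mapping."""
    produced, gold = set(produced), set(gold)
    if not produced or not gold:
        return 0.0
    p_vars, g_vars = variables(produced), variables(gold)
    # Pad with None so that surplus produced variables can stay unmapped.
    targets = g_vars + [None] * max(0, len(p_vars) - len(g_vars))
    best = 0
    for perm in permutations(targets, len(p_vars)):
        mapping = dict(zip(p_vars, perm))
        renamed = {tuple(mapping.get(tok, tok) for tok in clause) for clause in produced}
        best = max(best, len(renamed & gold))
    if best == 0:
        return 0.0
    precision, recall = best / len(produced), best / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [("b1", "REF", "x1"), ("b1", "Name", "x1", '"tom"')]
same_up_to_renaming = [("b5", "REF", "x9"), ("b5", "Name", "x9", '"tom"')]
one_wrong_clause = [("b5", "REF", "x9"), ("b5", "male", '"n.02"', "x9")]

print(clause_f_score(same_up_to_renaming, gold))
print(clause_f_score(one_wrong_clause, gold))
```

Because the number of mappings grows factorially with the number of variables, an approximate search such as the hill-climbing used by Counter is needed for realistic DRSs.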
To calculate whether there is a significant difference between two systems, we perform approximate randomization (Noreen, 1989) with α = 0.05, R = 1000 and F(model_1) > F(model_2) as test statistic for each individual DRS pair.
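A generic approximate randomization test can be sketched as follows. This is a simplified variant under stated assumptions: it takes per-DRS F-scores for the two systems and uses the absolute difference in mean score as test statistic, which is not necessarily the exact statistic used in our experiments. The function name, the `seed` parameter and the toy scores are hypothetical.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=1000, seed=0):
    """Estimate a p-value for the difference between two paired score lists.

    In each trial the two scores of every pair are swapped with probability
    0.5; the p-value is the share of trials whose difference in mean score
    is at least as large as the observed one (with add-one smoothing).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy per-DRS scores: the first system is clearly better on every item.
p_different = approximate_randomization([0.9] * 50, [0.1] * 50)
# Identical systems: every shuffle is as extreme as the observation.
p_identical = approximate_randomization([0.5] * 50, [0.5] * 50)
print(p_different, p_identical)
```

With α = 0.05, a difference would be called significant when the returned value is below 0.05.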

Neural Architecture
We employ a recurrent sequence-to-sequence neural network (henceforth seq2seq) with two bidirectional LSTM layers and 300 nodes, implemented in OpenNMT (Klein et al., 2017). The network encodes a sequence representation of the natural language utterance, while the decoder produces the sequences of the meaning representation. We apply dropout (Srivastava et al., 2014) between both the recurrent encoding and decoding layers to prevent overfitting, and use general attention (Luong et al., 2015) to selectively give more weight to certain parts of the input sentence. An overview of the general framework of the seq2seq model is shown in Figure 3.

Figure 3: The sequence-to-sequence model with word-representation input. SEP is used as a special character to separate clauses in the output.
During decoding we perform beam search with length normalization, which in neural machine translation (NMT) is crucial to obtaining good results (Britz et al., 2017). We experimented with a wide range of parameter settings, of which the final settings can be found in Table 3.
We opted against trying to find the best parameter settings for each individual experiment (next to impossible in terms of computing time necessary as a single 10-fold CV experiment takes 12 hours on GPU), but selected parameter settings that showed good performance for both the initial character and word-level representations (see Section 4 for details). The parameter search was performed using 10-fold CV on the training set. Training is stopped when there is no more improvement in perplexity on the validation set, which in our case occurred after 13-15 epochs.
A powerful, well-known technique in the field of NMT is to use an ensemble of models during decoding (Sennrich et al., 2016a). The resulting model averages over the predictions of the individual models, which can balance out some of the errors. In our experiments, we apply this method when decoding on the test set, but not for our experiments of 10-fold CV (this would take too much computation time).
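The averaging step of an ensemble can be illustrated in a few lines of Python. This is a sketch of the general idea for a single decoding step, assuming each model exposes its next-token distribution as a dictionary; it is not the OpenNMT implementation, and the function name and toy distributions are hypothetical.

```python
def ensemble_step(distributions):
    """Average the next-token distributions of several models.

    Each distribution maps an output token to its probability; tokens
    missing from a model's distribution count as probability 0.
    """
    tokens = set().union(*distributions)
    n = len(distributions)
    return {tok: sum(d.get(tok, 0.0) for d in distributions) / n for tok in tokens}


# Two toy models that disagree on the next output token.
model_1 = {"REF": 0.6, "NOT": 0.4}
model_2 = {"REF": 0.2, "NOT": 0.8}

averaged = ensemble_step([model_1, model_2])
prediction = max(averaged, key=averaged.get)
print(averaged, prediction)
```

In the toy example the first model alone would predict REF, but the averaged distribution prefers NOT, which shows how a confident model can outweigh an uncertain one.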

Experiments with Data Representations
This section describes the experiments we conduct regarding the data representations of the input (English sentences) and output (a DRS) during training.

Between Characters and Words
We first try two (default) representations: character-level and word-level. Most semantic parsers use word-level representations for the input, but as a result are often dependent on pre-trained word embeddings or anonymization of the input to obtain good results. Character-level models avoid this issue but might be at a higher risk of producing ill-formed output.
Character-based model In the character-level model, the input (an English sentence) is represented as a sequence of individual characters. The output (a DRS in clause format) is linearized, with special characters indicating spaces and clause separators. The semantic roles (e.g. Agent, Theme), DRS operators (e.g. REF, NOT, POS) and deictic constants (e.g. "now", "speaker", "hearer") are not represented as character sequences, but treated as compound characters, meaning that REF is not treated as a sequence of R, E and F, but directly as REF.
All proper names, WordNet senses, time/date expressions, and numerals are represented as character sequences.
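The character-level output format can be sketched as follows. The placeholder symbols `SPACE` and `SEP` and the contents of `COMPOUND` are assumptions chosen for the example (the set lists only items named in the text above and is far from complete); the actual symbols used in our implementation may differ.

```python
# Illustrative subset of tokens that are kept whole ("compound characters").
COMPOUND = {"REF", "NOT", "POS", "Agent", "Theme", '"now"', '"speaker"', '"hearer"'}
SPACE = "|||"  # assumed symbol for a space inside a clause
SEP = "***"    # assumed symbol for a clause separator


def linearize(clauses):
    """Turn clauses into one sequence of characters and compound tokens."""
    sequence = []
    for i, clause in enumerate(clauses):
        if i > 0:
            sequence.append(SEP)
        for j, token in enumerate(clause.split()):
            if j > 0:
                sequence.append(SPACE)
            if token in COMPOUND:
                sequence.append(token)   # kept as a single symbol
            else:
                sequence.extend(token)   # spelled out character by character
    return sequence


linearized = linearize(["b0 NOT b3", "b3 REF x2"])
print(" ".join(linearized))
```

Senses such as "n.01" and names such as "tom" are not in the compound set and are therefore spelled out character by character, including their quotes.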
Word-based model In the word-level model, the input is represented as a sequence of words, using spaces as a separator (i.e., the original words are kept). The output is the same as for the character-based model, except that the character sequences are represented as words. We use pre-trained GloVe embeddings (Pennington et al., 2014) to initialise the encoder and decoder representations. In the DRS representation, there are semantic roles and DRS operators that might look like English words, but should not be interpreted as such (e.g. Agent, NOT). These entities are removed from the set of pre-trained embeddings, so that the model will learn them from scratch (starting from a random initialization).
Hybrid representations: BPE We do not necessarily have to restrict ourselves to using only characters or words as input representation. In NMT, byte-pair encoding (BPE, Sennrich et al. 2016b) is currently the de facto standard (Bojar et al., 2017). This is a frequency-based method that automatically finds a representation that is in between character and word-level. It starts out with the character-level format and then does a predefined number of merges of frequently co-occurring characters. Tuning this number of merges determines if the resulting representation is closer to character or word-level. We explore a large range of merges (1k-100k), while applying a corresponding set of pre-trained BPE embeddings (Heinzerling and Strube, 2018). However, none of the BPE experiments improved on the character-level or word-level score (F-scores between 57 and 68), only coming close when using a small number of merges (which is very close to character-level anyway). Therefore this technique was disregarded for further experiments.
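The merge procedure behind BPE can be sketched in a few lines of Python. This is a textbook-style illustration of the general algorithm on a toy word list, not the implementation or the pre-trained BPE embeddings used in our experiments; the function name `learn_bpe` and the toy data are assumptions.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn BPE merges: repeatedly join the most frequent adjacent pair."""
    vocab = Counter(tuple(word) for word in words)  # word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges, dict(vocab))
```

With zero merges the representation is purely character-level; with enough merges every frequent word becomes a single symbol, which is why the number of merges controls how close the representation is to either extreme.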
Combined char and word There is also a fourth possible representation of the input: concatenating the character and word-level representations. This is uncommon in NMT due to the large size of the embedding space (hence their preference for BPE), but possible here since the PMB data contains relatively short sentences. We simply add the word embedding vector after the sequence of character-embeddings for each word in the input and still initialise these embeddings using the pre-trained GloVe embeddings.

Representation results
The results of the experiments (10-fold CV) for finding the best representation are shown in Table 4. Character representations are clearly better than word representations, though the word-level representation produces fewer ill-formed DRSs. Both representations are maintained for our further experiments. Although the combination of characters and words did lead to a small increase in performance over characters only (Table 4), this difference is not significant. Hence, this representation is discarded in further experiments described in this paper.

Table 4 (column headers): Model, Prec, Rec, F-score, % ill

Tokenization
An interesting aspect of the PMB data is the way the input sentences are tokenized. In the data set, multi-word expressions are tokenized as single words, for example, "New York" is tokenized to "New∼York". Unfortunately, most off-the-shelf tokenizers (e.g. the Moses tokenizer) are not equipped to deal with this. We experiment with using Elephant (Evang et al., 2013), a tokenizer that can be (re-)trained on individual data sets, using the tokenized sentences of the published silver and gold PMB data set. Simultaneously, we are interested in whether character-level models need tokenization at all, which would be a possible advantage of this type of representing the input text. Results of the experiment are shown in Table 5. Neither of the two tokenization methods yielded a significant advantage for the character-level models, so they will not be employed further. The word-level models, however, did benefit from tokenization, but Elephant did not give us an advantage over the Moses tokenizer. Therefore, for word-level models, we will use Moses in our next experiments.

(a) Standard naming
b1 REF x1
b1 male "n.02" x1
b1 Name x1 "tom"
b2 REF t1
b2 EQU t1 "now"
b2 time "n.08" t1
b0 NOT b3
b3 REF s1
b3 Time s1 t1
b3 Experiencer s1 x1
b3 afraid "a.01" s1
b3 Stimulus s1 x2
b3 REF x2
b3 entity "n.01" x2

(b) Absolute naming
$1 REF @1
$1 male "n.02" @1
$1 Name @1 "tom"
$2 REF @2
$2 EQU @2 "now"
$2 time "n.08" @2
$0 NOT $3
$3 REF @3
$3 Time @3 @2
$3 Experiencer @3 @1
$3 afraid "a.01" @3
$3 Stimulus @3 @4
$3 REF @4
$3 entity "n.01" @4

Figure 4 (panel (c) and the start of the caption are not recoverable): ... Figure 1. For (c), positive numbers refer to introductions that have yet to occur, while negative numbers refer to known introductions. A zero refers to the previous introduction for that variable type.

Representing Variables
So far we did not attempt to do anything special with the variables that occur in DRSs, as we simply tried to learn them as supplied in the PMB data set. Obviously, DRSs constitute a challenge for seq2seq models because of the high number of multiple occurrences of the same variables, in particular compared to AMR. AMR parsers do not deal well with this, since the reentrancy metric (Damonte et al., 2017) is among the lowest metrics for all AMR parsers that reported them or are publicly available (van Noord and Bos, 2017b). Moreover, for AMR, only 50% of the representations contain at least one reentrant node, and only 20% of the triples in AMR contain a reentrant node (van Noord and Bos, 2017a), but for DRSs these are both virtually 100%. While seq2seq AMR parsers could get away with ignoring variables during training and reinstating them in a post-processing step, for DRSs this is unfeasible.
However, since variable names are chosen arbitrarily, they will be hard for a seq2seq model to learn. We will therefore experiment with two methods of rewriting the variables to a more general representation, distinguishing between box variables and discourse variables. Our first method (absolute) traverses down the list of clauses, rewriting each new variable to a unique representation, taking the order into account. The second method (relative) is more sophisticated; it rewrites variables based on when they were introduced, inspired by De Bruijn indices (de Bruijn, 1972). We view box variables as introduced when they are first mentioned, while we take the REF clause of a discourse referent as its introduction. The two rewriting methods are illustrated in Figure 4.
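The relative method can be sketched in Python as follows. This is one possible reading of the description, not the released implementation: the `$NEW` and `@NEW` markers for introductions, the offset convention (0 for the most recent introduction of that variable type, negative numbers for earlier ones, positive numbers for introductions still to come) and the variable pattern are assumptions made for the example.

```python
import re

BOX = re.compile(r"^b\d+$")
REFERENT = re.compile(r"^[xest]\d+$")


def relative_naming(clauses):
    """Rewrite variables relative to the most recent introduction of their type."""
    # Order in which discourse referents are introduced by REF clauses.
    intro = {}
    for clause in clauses:
        tokens = clause.split()
        if len(tokens) == 3 and tokens[1] == "REF":
            intro.setdefault(tokens[2], len(intro))
    boxes = {}     # box variable -> order of first mention
    refs_seen = 0  # number of REF clauses processed so far
    rewritten = []
    for clause in clauses:
        tokens = clause.split()
        is_ref = len(tokens) == 3 and tokens[1] == "REF"
        if is_ref:
            refs_seen += 1
        out = []
        for position, tok in enumerate(tokens):
            if BOX.match(tok):
                if tok not in boxes:
                    boxes[tok] = len(boxes)
                    out.append("$NEW")
                else:
                    out.append("$%d" % (boxes[tok] - (len(boxes) - 1)))
            elif REFERENT.match(tok):
                if is_ref and position == 2:
                    out.append("@NEW")
                elif tok not in intro:
                    raise ValueError("no REF clause introduces " + tok)
                else:
                    out.append("@%d" % (intro[tok] - (refs_seen - 1)))
            else:
                out.append(tok)
        rewritten.append(" ".join(out))
    return rewritten


EXAMPLE = [
    'b1 REF x1', 'b1 male "n.02" x1', 'b1 Name x1 "tom"',
    'b2 REF t1', 'b2 EQU t1 "now"', 'b2 time "n.08" t1',
    'b0 NOT b3',
    'b3 REF s1', 'b3 Time s1 t1', 'b3 Experiencer s1 x1',
    'b3 afraid "a.01" s1', 'b3 Stimulus s1 x2', 'b3 REF x2',
    'b3 entity "n.01" x2',
]

relative = relative_naming(EXAMPLE)
print("\n".join(relative))
```

Under this reading, the same local pattern (for example, a condition on the referent that was just introduced) always receives the same output symbols regardless of how many variables precede it, which is the intended benefit over arbitrary or absolute names.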
The results are shown in Table 5. For both characters and words, the relative rewriting method significantly outperforms the absolute method and the baseline, though the absolute method produces fewer ill-formed DRSs. Interestingly, the character-level model still obtains a higher F1-score compared to the word-level model, even though it produces more ill-formed DRSs.

Casing
Casing is a writing device mostly used for punctuation purposes. On the one hand, it increases the set of characters (hence adding more redundant variation to the input). On the other hand, case can be a useful feature to recognise proper names, as names of individuals are semantically analysed as presuppositions. Explicitly encoding uppercase with a feature could therefore prevent us from having to include a named-entity recogniser, often used in other semantic parsers. Although we do not expect dealing with case to be a major challenge, we try out different techniques to find an optimal balance between abstracting over input characters and parsing performance. The results, in Table 5, show that the feature works well for the character-level model, but for the word-level model, it does not outperform lowercasing. These settings are used in further experiments.

Experiments with Additional Data
Since semantic annotation is a difficult and time-consuming task, gold standard data sets are usually relatively small. This means that semantic parsers (and data-hungry neural methods in particular) can often benefit from more training data. Some examples in semantic parsing are data recombination (Jia and Liang, 2016), paraphrasing (Berant and Liang, 2014) or exploiting machine-generated output (Konstas et al., 2017). However, before we do any experiments using extra training data, we want to be sure that we can still benefit from more gold training data. For both the character-level and word-level models we plot the learning curve, adding 500 training instances at a time, in Figure 5. For both models the F-score clearly still improves when using more training instances, which shows that there is at least the potential for additional data to improve the score. For DRSs, the PMB-2.1.0 release already contains a large set of silver standard data (71,308 instances), containing DRSs that are only partially manually corrected. We then train a model on both the gold and silver standard data, making no distinction between them during training. After training we take the last model and restart the training on only the gold data, in a similar process as described in Konstas et al. (2017) and van Noord and Bos (2017b). In general, restarting the training to fine-tune the weights of the model is a common technique in NMT (Denkowski and Neubig, 2017).
We are aware that there are many methods to obtain and employ additional data. However, our main aim is not to find the optimal method for DRS parsing, but to demonstrate that using additional data is indeed beneficial for neural DRS parsing. Since we are not further fine-tuning our model, we will show results on the test set in this section. Table 6 shows the results of adding the silver data. This results in a large increase in performance, for both the character and word-level models. We are still reliant on manually annotated data, however, since without the gold data (so training on only the silver data), we score even lower than our baseline model (68.4 and 68.1 for the char and word parser). Similarly, we are reliant on the fine-tuning procedure, as we also score below our baseline models without it (71.6 and 71.0 for the char and word parsers, respectively).
We believe there are two possible factors that could explain why the addition of silver data results in such a large improvement: (i) the fact that the data is silver instead of bronze or (ii) the fact that a different DRS parser (Boxer, see Section 6) is used to create the silver data instead of our own parser.

Table 7: Test set results of the experiments that analyse the impact of the silver data. Only two rows are recoverable: a row without a label (77.9, 2.7, 74.5, 2.2) and the row "without ill-formed DRSs" (78.6, 1.6, 74.9, 0.9).
We conduct an experiment to find out the impact on performance of silver vs bronze and Boxer vs our parser. The results are shown in Table 7. Note that these experiments are performed to analyse the impact of the silver data, not to further push the score, meaning Silver (Boxer-generated) is our final model that will be compared to other approaches in Section 6.
For (i), we compare the performance of the model trained on silver and bronze versions of the exact same documents (so leaving out the manual corrections). Interestingly, we score slightly higher for the character-level model with bronze than with silver (though the difference is not statistically significant), meaning that the extra manual corrections are not beneficial (in their current format). This suggests that the silver data is closer to bronze than to gold standard.
For (ii), we use our own best parser (without silver data) to parse the sentences in the PMB silver data release and use that as additional training data (note that we cannot apply the manual corrections, so in PMB terminology, this data is bronze instead of silver). Since the silver data contains longer and more complicated sentences than the gold data, our best parser produces more ill-formed DRSs (13.7% for char and 15.6% for word). We can either discard those instances or still maintain them for the model to learn from. For Boxer this is not an issue since only 0.3% of DRSs produced were ill-formed. We observe that a full self-training pipeline results in lower performance compared to using Boxer-produced DRSs. In fact, this does not seem to be beneficial over only using the gold standard data. Most likely, since Boxer combines symbolic and statistical methods, it learns very different things than our neural parsers, which in turn provides more valuable information to the model. A more detailed analysis on the difference in (semantic) output is performed in Sections 6.2 and 6.3. Removing ill-formed DRSs before training leads to higher F-scores for both the char and word parser, as well as a lower number of ill-formed DRSs.

Table 8: Test set results of our best neural models compared to two baseline models and two parsers.

Comparison
In this section, we compare our best neural models (with and without silver data, see Table 6) to two baseline systems and to two DRS parsers: AMR2DRS and Boxer. AMR2DRS is a parser that obtains DRSs from AMRs by applying a set of rules (van Noord et al., 2018), in our case using AMRs produced by the AMR parser of van Noord and Bos (2017b). Boxer is an existing DRS parser using a statistical CCG parser for syntactic analysis and a compositional semantics based on λ-calculus, followed by pronoun and presupposition resolution (Curran et al., 2007; Bos, 2008b). SPAR is a baseline parser that outputs the same (fixed) default DRS for each input sentence. We implemented a second baseline model, SIM-SPAR, which outputs, for each sentence in the test set, the DRS of the most similar sentence in the training set. This similarity is calculated by taking the cosine similarity of the average word embedding vector (with removed stopwords) based on the GloVe embeddings (Pennington et al., 2014). Table 8 shows the result of the comparison. The neural models comfortably outperform the baselines. We see that both our neural models outperform Boxer by a large margin when using the Boxer-labelled silver data. However, even without this dependence, the neural models perform significantly better than Boxer. It is worth noting that the character-level model significantly outperforms the word-level model, even though it cannot benefit from pre-trained word embeddings and from a tokenizer.

Table 9: F-scores of fine-grained evaluation on the test set of the three semantic parsers.

Concurrently with our work, a neural DRS parser has been developed by Liu et al. (2018). They use a customised neural seq2seq model, which produces the DRS in three stages. It first predicts the general (deep) structure of the DRSs, after which the conditions and referents are filled in. Unfortunately, they train and evaluate their parser on annotated data from the GMB rather than from the PMB (see Section 2). This, combined with the fact that their work is contemporaneous to the current paper, makes it difficult to compare the approaches. However, we see no apparent reason why their method should not work on the PMB data.
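The SIM-SPAR baseline described in this section can be sketched in plain Python as follows. The toy embeddings, the stopword list and the DRS placeholders are invented for illustration; in the actual baseline the vectors come from pre-trained GloVe embeddings, and details such as tokenization and the handling of sentences without known words are assumptions here.

```python
import math


def average_vector(sentence, embeddings, stopwords):
    """Average the embeddings of all known non-stopwords in a sentence."""
    words = [w for w in sentence.lower().split()
             if w not in stopwords and w in embeddings]
    if not words:
        return None
    dim = len(embeddings[words[0]])
    return [sum(embeddings[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim_spar(sentence, train_pairs, embeddings, stopwords=frozenset()):
    """Return the DRS of the training sentence most similar to `sentence`."""
    query = average_vector(sentence, embeddings, stopwords)
    best_drs, best_sim = None, -2.0
    for train_sentence, drs in train_pairs:
        vec = average_vector(train_sentence, embeddings, stopwords)
        sim = cosine(query, vec) if query and vec else -1.0
        if sim > best_sim:
            best_drs, best_sim = drs, sim
    return best_drs


# Toy 3-dimensional embeddings (hypothetical values).
toy_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "tom": [0.0, 0.0, 1.0],
    "sings": [0.0, 0.5, 0.5],
}
train = [("the cat sleeps", "DRS_cat"), ("tom sings", "DRS_tom")]

retrieved = sim_spar("a cat naps", train, toy_embeddings, stopwords={"the", "a"})
print(retrieved)
```

Since the baseline copies an existing gold DRS, its output is always well-formed, but it can only be correct to the extent that a near-identical sentence exists in the training set.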

Analysis
An intriguing question is what our models actually learn, and what parts of meaning are still challenging for neural methods. We do this in two ways, by performing an automatic analysis and by doing a manual inspection on a variety of semantic phenomena. Table 9 shows an overview of the different automatic evaluation metrics we implemented with corresponding scores of the three models.
The character- and word-level systems perform comparably in all categories except for VerbNet roles, where the character-based parser shows a clear advantage (1.6% absolute). The score for WordNet synsets is similar, but the word-level model has more difficulty predicting synsets that are introduced by verbs than those introduced by nouns. It is clear that the neural models outperform Boxer consistently on each of these metrics (partly because Boxer picks the first sense by default). What also stands out is the impact of the word senses: with a perfect word sense disambiguation module (oracle senses) large improvements can be gained for all three parsers. It is interesting to look at what errors the model makes in terms of producing ill-formed output. For both the neural parsers, only about 2% of the ill-formed DRSs are ill-formed because of a syntactic error in an individual clause (e.g. b1 Agent x1, where a fourth argument is missing), while all the other errors are due to a violated semantic constraint (see Section 3.2). In other words, the produced output is a syntactically well-formed DRS but is not interpretable.
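To make the distinction between the two error types concrete, the following sketch checks only the syntactic shape of a single clause. It is a simplified illustration and not the well-formedness checker of Section 3.2: the function name `is_syntactically_valid` and the set `UNARY_OPERATORS` are hypothetical, and the assumed arities (three tokens for REF, NOT, POS and NEC, four tokens for every other clause) are inferred from the clause examples in this paper rather than taken from a specification.

```python
# Operators assumed to take a single argument, so their clauses have three
# tokens (e.g. "b0 NOT b3"). This set is an assumption of the sketch.
UNARY_OPERATORS = {"REF", "NOT", "POS", "NEC"}


def is_syntactically_valid(clause):
    """Check that a clause starts with a box variable and has the expected
    number of tokens. Semantic constraints are deliberately not checked."""
    tokens = clause.split()
    if len(tokens) < 2 or not tokens[0].startswith("b"):
        return False
    expected = 3 if tokens[1] in UNARY_OPERATORS else 4
    return len(tokens) == expected
```

Under these assumptions the clause b1 Agent x1 is rejected because its fourth element is missing. A DRS whose clauses all pass this check can still be uninterpretable, which corresponds to the large majority of ill-formed outputs reported above.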
To find out how sentence length affects performance, we plot in Figure 6 the mean F-score obtained by each parser on input sentences of different lengths, from 3 to 10 words. We observe that all the parsers degrade with sentence length. To find out whether any of the parsers degrades significantly more than any other, we build a regression model, in which we predict the F-score using as predictors the parser (char, word and Boxer), the sentence length and the number of clauses produced. According to the regression model, (i) the performance of all three systems decreases with sentence length, thus corroborating the trends shown in Figure 6, and (ii) the interaction between parser and sentence length is not significant, i.e., none of the parsers degrades significantly more than any other with sentence length. The fact that the performance of the neural parsers degrades with sentence length is not surprising, since they are based on the seq2seq architecture, and models built on this architecture for other tasks, such as machine translation, have been shown to have the same issue (Toral and Sánchez-Cartagena, 2017).
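The kind of analysis described above can be approximated with a much simpler sketch: average the F-scores per parser and sentence length (the quantity plotted in Figure 6) and fit one least-squares slope per parser. This is not the regression model used in the paper, which also includes the number of produced clauses and tests the significance of the parser-by-length interaction; the function names and the `(parser, sentence_length, f_score)` record layout are assumptions made for illustration.

```python
from collections import defaultdict


def mean_f_by_length(records):
    """Mean F-score per (parser, sentence length) bucket.

    records is an iterable of (parser, sentence_length, f_score) tuples.
    """
    buckets = defaultdict(list)
    for parser, length, f_score in records:
        buckets[(parser, length)].append(f_score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need at least two distinct x values to fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def slopes_per_parser(records):
    """Fit a separate F-score-versus-length slope for every parser."""
    by_parser = defaultdict(list)
    for parser, length, f_score in records:
        by_parser[parser].append((length, f_score))
    return {parser: slope(points) for parser, points in by_parser.items()}
```

Negative slopes of similar size for all parsers would be in line with findings (i) and (ii), although comparing raw slopes is only an informal stand-in for a significance test on the interaction term.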

Manual Inspection
The automatic evaluation metrics provide overall scores but do not capture how the models perform on certain semantic phenomena present in the DRSs. Therefore, we manually inspected the test set output of the three parsers for the semantic phenomena listed in Table 2. Below we describe each phenomenon and explain how the parser output is evaluated on them.
The negation & modals phenomenon covers possibility (POS), necessity (NEC), and negation (NOT). The phenomenon is considered successfully captured if an automatically produced clausal form has the clause with the modal operator and the main concept is correctly put under the scope of the modal operator. For example, to capture the negation in Figure 1, the presence of b0 NOT b3 and b3 afraid "a.01" s1 is sufficient. Scope ambiguity counts nested pairs of scopal operators such as possibility (POS), necessity (NEC), negation (NOT), and implication (IMP). Pronoun resolution checks if an anaphoric pronoun and its antecedent are represented by the same discourse referent. Discourse relation & implication involves determining a discourse relation or an implication with a main concept in each of their scopes (i.e., boxes). For instance, to get the discourse relation in Figure 2 correctly, a clausal form needs to include b0 CONTINUATION b1 b5, b1 play "v.03" e1, and b5 sing "v.01" e2. Finally, the embedded clauses phenomenon verifies whether the main verb concept of an embedded clause is placed inside the propositional box (PRP). This phenomenon also covers control verbs: it checks if a controlled argument of a subordinate verb is correctly identified as an argument of a control verb.
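Although the inspection reported here was carried out manually, two of the criteria above can be sketched as simple checks over a clausal form. The code below is a hypothetical illustration, not the procedure used for Table 10: the helper names (`parse_clauses`, `has_concept`, `captures_scoped_concept`, `captures_relation`) and the representation of clauses as tuples of whitespace-separated tokens are assumptions. Matching is done on structure rather than on variable names, since a parser may name its boxes differently from the gold standard.

```python
def parse_clauses(text):
    """Turn a clausal form (one clause per line) into a list of token tuples."""
    return [tuple(line.split()) for line in text.strip().splitlines()
            if line.strip()]


def has_concept(clauses, box, concept, sense):
    """True if a concept clause `box concept sense referent` is present."""
    return any(len(c) == 4 and c[:3] == (box, concept, sense) for c in clauses)


def captures_scoped_concept(clauses, operator, concept, sense):
    """True if some clause `b OPERATOR b2` exists and the given concept is
    introduced in the scoped box b2 (used for NOT, POS and NEC)."""
    scoped_boxes = {c[2] for c in clauses if len(c) == 3 and c[1] == operator}
    return any(has_concept(clauses, box, concept, sense)
               for box in scoped_boxes)


def captures_relation(clauses, relation, first, second):
    """True if some clause `b RELATION b1 b2` exists with the `first` concept
    in b1 and the `second` concept in b2; both are (concept, sense) pairs."""
    for c in clauses:
        if len(c) == 4 and c[1] == relation:
            if (has_concept(clauses, c[2], *first)
                    and has_concept(clauses, c[3], *second)):
                return True
    return False
```

For the negation example, `captures_scoped_concept(parse_clauses('b0 NOT b3\nb3 afraid "a.01" s1'), "NOT", "afraid", '"a.01"')` succeeds, whereas it would fail if the concept were placed in a box outside the scope of the negation.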
The results of the semantic evaluation of the parsers on the test set are given in Table 10. The character-level parser performs better than the word-level parser on all the phenomena except one. Even though both our neural parsers clearly outperformed Boxer in terms of F-score, they perform worse than Boxer on the selected semantic phenomena. Although the differences are not big, Boxer obtained the highest score for four out of five phenomena. This suggests that the F-score alone is perhaps not good enough as an evaluation metric, or that the final F-score should perhaps be weighted towards certain clauses. For example, it is arguably more important to capture a negation correctly than tense. Our current metric only gives a rough indication about the contents, but not about the inferential capabilities of the meaning representation.
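One way to picture the suggested weighting is a clause-level F-score in which each clause contributes according to an importance weight. The sketch below is purely hypothetical and is not the evaluation metric used in this paper: it assumes that the variables of the produced DRS have already been mapped onto those of the gold DRS, so that clauses can be compared as plain tuples, and the weighting function passed as `weight` is an arbitrary choice supplied by the user.

```python
def weighted_f_score(produced, gold, weight):
    """Weighted F-score over two collections of clauses.

    produced and gold are iterables of clause tuples whose variables are
    assumed to be aligned already; weight maps a clause to a positive number.
    """
    produced, gold = set(produced), set(gold)
    matched = sum(weight(c) for c in produced & gold)
    if matched == 0:
        return 0.0
    precision = matched / sum(weight(c) for c in produced)
    recall = matched / sum(weight(c) for c in gold)
    return 2 * precision * recall / (precision + recall)
```

With a uniform weight this reduces to the ordinary clause-level F-score, whereas a weight that favours clauses such as `("b0", "NOT", "b3")` penalises a parser more heavily for missing a negation than for missing a less central clause.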

Conclusions and Future Work
We implemented a general, end-to-end neural seq2seq model that is able to produce well-formed DRSs with high accuracy (RQ1). Character-level models can outperform word-level models, even though they are not dependent on tokenization and pre-trained word embeddings (RQ2). It is beneficial to rewrite DRS variables to a more general representation (RQ3). Obtaining and employing additional data can benefit performance as well, though it might be better to use an external parser instead of doing a full self-training pipeline (RQ4). F-score is only a rough measure for semantic accuracy: Boxer still outperformed our best neural models on a subset of specific semantic phenomena (RQ5).
We think there are a lot of opportunities for future work. Since the sentences in the PMB data set are relatively short, it makes sense to investigate seq2seq models performing well for longer texts. There are a few promising directions here that could combat the degrading performance on longer sentences. First, the Transformer model (Vaswani et al., 2017) is an interesting candidate for exploration, a state-of-the-art neural model developed for MT that does not have worse performance for longer sentences. Second, a seq2seq model that is able to first predict the general structure of the DRS, after which it can fill in the details, similar to Liu et al. (2018), is something that could be explored. A third possibility is a neural parser that tries to build the DRS incrementally, producing clauses for different parts of the sentence individually, and then combining them into a final DRS.
Concerning the evaluation of DRS parsers, we feel there are a couple of issues that could be addressed in future work. One idea is to facilitate computing F-scores tailored to specific semantic phenomena that are deemed important, so that the evaluation we performed manually in this paper could be carried out automatically. Another idea is to evaluate the application of DRSs to improve performance on other linguistic or semantic tasks, in which DRSs that capture the full semantics will, presumably, have an advantage. A combination of glass-box and black-box evaluation seems a promising direction here (Bos, 2008a;van Noord et al., 2018).