AMR Similarity Metrics from Principles

Different metrics have been proposed to compare Abstract Meaning Representation (AMR) graphs. The canonical Smatch metric (Cai and Knight, 2013) aligns variables from one graph to another and compares the matching triples. The recently released SemBleu metric (Song and Gildea, 2019) is based on the machine-translation metric Bleu (Papineni et al., 2002); it increases computational efficiency by ablating the variable-alignment step and aims at capturing more global graph properties. Our aims are threefold: i) we establish criteria that allow us to perform a principled comparison between metrics of symbolic meaning representations like AMR; ii) we undertake a thorough analysis of Smatch and SemBleu and show that the latter exhibits some undesirable properties: for example, it violates the identity of indiscernibles rule and introduces biases that are hard to control; iii) we propose a novel metric, S2match, that is more benevolent to very slight meaning deviations and targets the fulfilment of all established criteria. We assess its suitability and show its advantages over Smatch and SemBleu.


Introduction
Proposed in 2013, the aim of Abstract Meaning Representation (AMR) is to represent a sentence's meaning in a machine-readable graph format (Banarescu et al., 2013). AMR graphs are rooted, acyclic, directed and edge-labeled. Entities, events, properties and states are represented as variables that are linked to corresponding concepts (encoded as leaf nodes) via is-instance relations (cf. Figure 1, left). This structure allows us to capture complex linguistic phenomena such as coreference, semantic roles or factuality.
When comparing two AMR graphs A and B, e.g., for the purpose of AMR parse quality evaluation, the metric of choice is usually SMATCH. Its backbone is an alignment search between the variables of the two graphs. Recently, the SEMBLEU metric (Song and Gildea, 2019) has been proposed, which operates on a variable-free AMR graph (cf. Figure 1, right). Circumventing a variable alignment search reduces computational cost and ensures full determinacy. Also, grounding the metric in BLEU (Papineni et al., 2002) has a certain appeal, since BLEU is quite popular in machine translation. However, we are lacking a principled in-depth comparison of the properties of different AMR metrics, which would help researchers answer questions such as: Which metric should I use to assess the similarity of two AMR graphs, e.g., in AMR parser evaluation? What are the trade-offs of using one metric over the other?
Besides providing such a principled comparison, we discuss a property that none of the existing AMR metrics currently satisfies: they do not measure graded meaning differences. Such differences may emerge due to near-synonyms such as ruin - annihilate; skinny - thin - slim; enemy - foe (Inkpen and Hirst, 2006; Edmonds and Hirst, 2002) or paraphrases such as be able to - can; unclear - not clear. In a classical syntactic parsing task, metrics do not need to address this issue since input tokens are typically projected to lexical concepts by lemmatization, hence two graphs for the same sentence tend not to disagree on the concepts projected from the input. This is different in semantic parsing, where the projected concepts are often more abstract.
The paper is structured as follows: We first establish seven principles that one may expect a metric for comparing meaning representations to satisfy, in order to obtain meaningful and appropriate scores for the given purpose (§2). Based on these principles we provide an in-depth analysis of the properties of SMATCH and SEMBLEU when used for comparing AMR graphs (§3). We then develop S2MATCH, an extension of SMATCH that abstracts away from a purely symbolic level, allowing for a graded semantic comparison of atomic graph elements (§4). By this move, we enable SMATCH to take into account fine-grained meaning differences. We show that our proposed metric retains valuable benefits of SMATCH but at the same time is more benevolent to slight meaning deviations. We will make our code publicly available: https://github.com/Heidelberg-NLP/amr-metric-suite.

From principles to AMR metrics
The problem of comparing AMR graphs A, B ∈ D with respect to their meaning occurs in several scenarios, for example, parser evaluation or inter-annotator agreement (IAA) calculation. To measure the extent to which A and B agree in their meaning, we need a function metric : D × D → R that returns a score expressing meaning distance or meaning similarity (for convenience, we use similarity). Below we establish seven principles that seem desirable for this metric.
Principles The first four metric principles are mathematically motivated:
I. continuity, non-negativity and upper-bound A similarity function should be continuous, with two natural edge cases: A and B are equivalent (maximum similarity) or unrelated (minimum similarity). This translates to the following constraint: metric : D × D → [0, 1].
II. identity of indiscernibles This focal principle is formalized by metric(A, B) = 1 ⇔ A = B. It is violated if a metric assigns a value indicating equivalence to inputs that are not equivalent or if it considers equivalent inputs as different.
III. symmetry In many cases, we want a metric to be symmetric: metric(A, B) = metric(B, A). A metric violates this principle if it assigns a pair of objects different scores when the argument order is inverted. Together with principles I and II, symmetry extends the metric beyond parser evaluation, because it also enables sound IAA calculation, clustering of AMR graphs and classification of AMR graphs when we use the metric as a kernel (e.g., in an SVM). In parser evaluation, one may dispense with any (strong) requirements for symmetry; however, the metric then has to be applied in a standardized way (with a fixed order of arguments). In all other situations, where no graph is pre-defined as the reference (gold), we may potentially handle the asymmetry by computing an aggregate value based on metric(A, B) and metric(B, A), for instance, the arithmetic mean. However, it is then unclear which aggregation is best suited and how to interpret the result (e.g., if metric(A, B) = 0.9 and metric(B, A) = 0.1, the mean reflects neither the first nor the second judgment).
IV. determinacy Repeated calculation over the same inputs yields the same score. This principle is clearly desirable as it ensures reproducibility (a very small deviation may be tolerable).
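Principles I to IV can be probed empirically for any candidate metric. The following is a minimal sketch under stated assumptions, not part of any metric's reference implementation: graphs are assumed to be variable-free sets of triples (so equality is plain set equality), and `check_principles`, `triple_f1` and `triple_precision` are hypothetical names chosen for illustration. Continuity itself cannot be tested this way; the sketch only checks the [0, 1] range.

```python
from itertools import product

def triple_f1(a, b):
    """Toy symmetric metric: F1 overlap of two triple sets (illustrative only)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def triple_precision(a, b):
    """Toy asymmetric metric: share of A's triples that also occur in B."""
    return len(a & b) / len(a) if a else 1.0

def check_principles(metric, graphs, repeats=3, tol=1e-12):
    """Probe range (I), identity (II), symmetry (III), determinacy (IV) on all pairs."""
    report = {"bounded": True, "identity": True, "symmetric": True, "determinate": True}
    for a, b in product(graphs, repeat=2):
        score = metric(a, b)
        if not 0.0 <= score <= 1.0:
            report["bounded"] = False
        # Maximum score must be assigned if and only if the inputs are equal.
        if (abs(score - 1.0) <= tol) != (a == b):
            report["identity"] = False
        if abs(score - metric(b, a)) > tol:
            report["symmetric"] = False
        # Repeated calls on the same inputs must reproduce the score.
        if any(abs(metric(a, b) - score) > tol for _ in range(repeats)):
            report["determinate"] = False
    return report

graphs = [
    frozenset({("arg0", "drink", "cat")}),
    frozenset({("arg0", "drink", "cat"), ("arg1", "drink", "water")}),
    frozenset(),
]
print(check_principles(triple_f1, graphs))
print(check_principles(triple_precision, graphs))
```

A passing report on a handful of graphs is only evidence, not a proof; a single failing pair, however, is enough to show that a principle is violated.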
The next three principles we believe to be desirable specifically when comparing meaning representation graphs such as AMR (Banarescu et al., 2013). The first two of the following principles are motivated by computer science and linguistics, whereas the last one is motivated from a linguistic and an engineering perspective.
V. no bias: Meaning representations are composed of objects of different types. A sound metric should not favor correctness or penalize errors for substructures of certain types in unjustified or unintended ways. In cases where a metric does favor or penalize certain substructures more than others, this should, in the interest of transparency, be made clear and explicit, and should be easily verifiable and consistent. E.g., if we wished to give negation of the main predicate of a sentence twice the weight of negation in an embedded sentence, we would want this to be made clear and agreed on in the community, or else we would need a mapping function to make the metric comparable to established metrics in the community.
We now turn to properties that focus on the nature of the objects we desire to compare: graph-based compositional meaning representations. These graphs consist of atomic conditions that determine the circumstances under which a sentence is true. Hence, our metric should increase with increasing overlap of A and B, which we denote by f(A, B), the number of matching conditions. This overlap can be viewed from a symbolic and/or a graded perspective (cf., e.g., Schenker et al. (2005), who denote these perspectives as 'syntactic' vs. 'semantic'). From the symbolic perspective, we compare the nodes and edges of two graphs on a symbolic level, while from the graded perspective, we take into account the degree to which nodes and edges differ. Both types of matching involve an important precondition for successfully computing the overlap f(A, B) of two graphs: if A and B contain variables, a function is needed that maps variables in A to variables in B in order to match conditions from A and B.²
VI. matching (graph-based) meaning representations - symbolic match A natural symbolic overlap objective can be found in the Jaccard index J: Let t(G) be the set of triples of graph G. Then
$$J(A, B) = \frac{|t(A) \cap t(B)|}{|t(A) \cup t(B)|}.$$
The greater this index is, the higher the metric score should be. An allowed exception to this monotonic relationship can occur if we want to take into account the graded semantic match of atomic graph elements or sub-structures, which we will elaborate on in the following.
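As a small worked example of the symbolic overlap objective, the sketch below computes the Jaccard index over two triple sets. It assumes that the variables of both graphs have already been aligned and therefore share names; the function name `jaccard` and the second graph (with the concept milk) are made up for illustration.

```python
def jaccard(triples_a, triples_b):
    """Jaccard index |t(A) ∩ t(B)| / |t(A) ∪ t(B)| over two collections of triples."""
    a, b = set(triples_a), set(triples_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty graphs count as identical
    return len(a & b) / len(a | b)

# Triples written as (relation, source, target); variables are assumed pre-aligned.
t_a = {
    ("instance", "x1", "drink"),
    ("instance", "x2", "cat"),
    ("instance", "x3", "water"),
    ("arg0", "x1", "x2"),
    ("arg1", "x1", "x3"),
}
# Hypothetical second graph that differs in a single concept.
t_b = (t_a - {("instance", "x3", "water")}) | {("instance", "x3", "milk")}

print(jaccard(t_a, t_b))  # 4 shared triples out of 6 distinct triples
```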
VII. matching (graph-based) meaning representations - graded semantic match: One motivation for this principle can be found in engineering, e.g., when assessing the quality of produced parts. Here, small deviations from a reference may be tolerable within certain limits. Similarly, two AMR graphs may match almost perfectly, except for two small divergent components. The extent of divergence can be measured by the degree of similarity of the two divergent components. In our case, we need linguistic knowledge to judge whether some divergence is tolerable and to what degree. For example, consider that graph A contains a triple ⟨x, is-instance, conceptA⟩ and graph B contains ⟨y, is-instance, conceptB⟩, while otherwise the graphs are equivalent and the alignment sets x = y. Then f(A, B) should be higher when conceptA is similar to conceptB than when conceptA is dissimilar to conceptB. In AMR, concepts are often abstract, so near-synonyms may even be fully admissible (enemy - foe). Therefore, we want the metric to reflect that matches of meaning representation graphs do not need to be binary, but can be graded. By defining metric to map to the range [0, 1] we already defined it to be globally graded. We here further motivate that graded similarity can hold for minimal units of AMR graphs, such as atomic concepts, or even larger units, such as sub-graphs. Finally, when adopting principle VII we can envisage modeling graded similarity of compositional meaning. E.g., injustice(x) could be represented alternatively as justice(x) ∧ polarity(x, −).
² E.g., consider graph A in Fig. 1 and its set of triples t(A): {instance(x1, drink), instance(x2, cat), arg0(x1, x2), arg1(x1, x3), instance(x3, water)}. When comparing A against graph B we need to judge whether a triple t_i ∈ t(A) is also contained in B: t_i ∈ t(B).

AMR metrics: SMATCH and SEMBLEU
With our seven principles for AMR similarity metrics in place, we now introduce SMATCH and SEMBLEU, two metrics that strongly differ in their design and assumptions. We describe each of them in detail and summarize their differences. This prepares an in-depth metric comparison (§3).
Align and match - SMATCH The SMATCH metric operates in two steps. First, (i) we align the variables in A and B in the best possible way, by finding a mapping map* : dom(A) → dom(B) that yields a maximal set of matching triples between A and B. E.g., if ⟨x_i, rel, x_j⟩ ∈ t(A) and ⟨map*(x_i), rel, map*(x_j)⟩ = ⟨y_k, rel, y_m⟩ ∈ t(B), we obtain one triple match. Second, (ii) we compute Precision, Recall and F1 score based on the set of matching triples returned by the alignment search. The alignment search problem of step (i) is solved with a greedy hill-climber: Let f_map(A, B) be the number of matching triples between A and B under any mapping function map. Then,
$$map^{\star} = \operatorname*{argmax}_{map} f_{map}(A, B). \qquad (1)$$
In practical implementations, multiple restarts with different random seeds increase the likelihood of finding better optima for this function.
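The two steps can be illustrated on toy graphs. The sketch below is not the SMATCH implementation: instead of the greedy hill-climber it substitutes an exhaustive search over all one-to-one variable mappings, which is only feasible for very small graphs. The helper names, the triple layout (relation, source, target) and the second graph are assumptions made for illustration; variable names are assumed not to collide with concept labels.

```python
from itertools import permutations

def count_matches(triples_a, triples_b, mapping):
    """f_map(A, B): number of triples of A that occur in B after renaming A's variables."""
    renamed = {(rel, mapping.get(src, src), mapping.get(tgt, tgt))
               for rel, src, tgt in triples_a}
    return len(renamed & set(triples_b))

def best_alignment(triples_a, vars_a, triples_b, vars_b):
    """Exhaustive stand-in for the hill-climber: try every one-to-one variable mapping."""
    # Pad with None so that surplus variables of A can remain unaligned.
    targets = list(vars_b) + [None] * max(0, len(vars_a) - len(vars_b))
    best_map, best_count = {}, -1
    for perm in permutations(targets, len(vars_a)):
        mapping = dict(zip(vars_a, perm))
        count = count_matches(triples_a, triples_b, mapping)
        if count > best_count:
            best_map, best_count = mapping, count
    return best_map, best_count

def smatch_like_f1(triples_a, vars_a, triples_b, vars_b):
    """Step (ii): precision, recall and F1 over the triples matched by the best mapping."""
    mapping, matched = best_alignment(triples_a, vars_a, triples_b, vars_b)
    if matched == 0:
        return 0.0, mapping
    precision = matched / len(triples_a)
    recall = matched / len(triples_b)
    return 2 * precision * recall / (precision + recall), mapping

graph_a = {("instance", "x1", "drink"), ("instance", "x2", "cat"),
           ("instance", "x3", "water"), ("arg0", "x1", "x2"), ("arg1", "x1", "x3")}
# Hypothetical second graph with different variable names and one different concept.
graph_b = {("instance", "a", "drink"), ("instance", "b", "kitten"),
           ("instance", "c", "water"), ("arg0", "a", "b"), ("arg1", "a", "c")}

f1, mapping = smatch_like_f1(graph_a, ["x1", "x2", "x3"], graph_b, ["a", "b", "c"])
print(f1, mapping)
```

Exhaustive search grows factorially with the number of variables, which is why a heuristic such as hill-climbing with restarts is used in practice.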
Simplify and match - SEMBLEU The SEMBLEU metric of Song and Gildea (2019) can also be described as a two-step procedure. But unlike SMATCH it operates on a variable-free reduction of an AMR graph G, which we denote by G^vf (vf: variable-free) (Figure 1, right side). By skipping the alignment search of SMATCH, SEMBLEU is more time-efficient and avoids the indeterminacy incurred by the hill-climbing search in SMATCH.
In the first step, (i) SEMBLEU performs k-gram extraction from A^vf and B^vf in a breadth-first traversal (path extraction). Second, (ii) it adopts the BLEU score from machine translation (Papineni et al., 2002) to calculate an overlap score based on the extracted k-grams from A^vf and B^vf:
$$\mathrm{SEMBLEU} = BP \cdot \exp\Big(\sum_{k=1}^{n} w_k \log p_k\Big), \qquad (2)$$
where p_k is BLEU's modified k-gram precision that measures the k-gram overlap of a candidate string against a reference:
$$p_k = \frac{|kgram(A^{vf}) \cap kgram(B^{vf})|}{|kgram(A^{vf})|}, \qquad (3)$$
and w_k is the (typically uniform) weight over the chosen k-gram sizes. SEMBLEU uses NIST geometric probability smoothing (Chen and Cherry, 2014). The 'brevity penalty' BP returns a value smaller than 1 when the candidate length |A^vf| is smaller than the reference length |B^vf|.
The graph traversal performed in SEMBLEU starts at the root node. During this traversal it simplifies the graph by replacing variables with their corresponding concepts (see Figure 1: the node c becomes DRINK-01) and collects visited nodes and edges in uni-, bi- and tri-grams (k=3 is the recommended default). In SEMBLEU, a source node together with a relation and its target node counts as a bi-gram. For the graph in Figure 1, the extracted unigrams are {cat, water, drink-01}; the extracted bi-grams are {drink-01 arg1 cat, drink-01 arg2 water}.
SMATCH vs. SEMBLEU in a nutshell SEMBLEU differs significantly from SMATCH. A key difference is that SEMBLEU operates on reduced variable-free AMR graphs (G^vf) instead of full-fledged AMR graphs. By eliminating variables, SEMBLEU bypasses an alignment search. This makes the calculation faster and alleviates a weakness of SMATCH: the hill-climbing search is slightly imprecise. However, SEMBLEU is not guided by aligned variables as anchors. Instead, SEMBLEU uses an n-gram statistic (BLEU) to compute an overlap score for graphs, based on n-hop paths extracted from G^vf, using the root node as the start for the extraction process. SMATCH, by contrast, acts directly on variable-bound graphs, matching triples based on a selected alignment. If the user so wishes, both metrics allow the capturing of more 'global' graph properties: SEMBLEU can increase its k-parameter and SMATCH may match conjunctions of (interconnected) triples. In the following analysis, however, we will adhere to their default configurations since this is how they are used in most applications.
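The k-gram extraction described above can be sketched as follows. This is a simplified reading of the description, not the SEMBLEU reference implementation: the graph encoding (node id mapped to a concept and a list of outgoing edges), the function names, and the omission of smoothing and brevity penalty are all assumptions made for illustration.

```python
from collections import Counter, deque

def extract_kgrams(graph, root, k):
    """Collect all downward paths with k nodes; start nodes are visited breadth-first."""
    kgrams = Counter()
    queue, seen = deque([root]), {root}
    while queue:
        node = queue.popleft()
        # Each path is stored as (label sequence, id of the last node on the path).
        paths = [((graph[node][0],), node)]
        for _ in range(k - 1):
            paths = [(labels + (rel, graph[child][0]), child)
                     for labels, last in paths
                     for rel, child in graph[last][1]]
        kgrams.update(labels for labels, _ in paths)
        for _, child in graph[node][1]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return kgrams

def modified_precision(candidate, reference):
    """Clipped k-gram precision of a candidate k-gram multiset against a reference."""
    total = sum(candidate.values())
    if total == 0:
        return 0.0
    return sum((candidate & reference).values()) / total

# Variable-free version of the example graph: node id -> (concept, outgoing edges).
graph = {
    "d": ("drink-01", [("arg1", "c"), ("arg2", "w")]),
    "c": ("cat", []),
    "w": ("water", []),
}

for k in (1, 2, 3):
    print(k, dict(extract_kgrams(graph, "d", k)))
```

Node ids are used only to walk the structure; the extracted k-grams contain concepts and relation labels, so two distinct nodes carrying the same concept become indistinguishable in the output.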

Assessing AMR metrics with Principles
This section evaluates SMATCH and SEMBLEU against the seven principles we established above by asking: Why does a metric satisfy or violate a given principle? and What does this imply? We start with principles from mathematics.

II. Identity of indiscernibles
This principle is fundamental: An AMR metric must return maximum score if and only if the graphs are equivalent in meaning. However, there are cases where SEMBLEU, in contrast to SMATCH, violates this principle. Examples are given in Figure 2.
In both cases, SEMBLEU yields a perfect score for a pair of AMRs that differ in a single but fundamental aspect: two of their ARGx roles are filled with arguments that are meant to refer to distinct individuals that share the same concept: being a man or a woman. In Figure 2a, the graph on the left is an abstraction of, e.g., The man₁ sees the other man₂ in the other man₂, while the graph on the right is an abstraction of The man₁ sees himself₁ in the other man₂. Figure 2b shows another case with two AMRs representing different meanings: some women helping each other (left) vs. one woman helping herself and another woman (right).
In both cases SEMBLEU cannot recognize the difference in meaning between a reflexive relation ((wo)man seeing/helping h*self) and a non-reflexive relation ((wo)man seeing/helping another (wo)man). Hence, it assigns the respective AMRs the maximum similarity score, whereas SMATCH reflects such meaning differences appropriately.
SEMBLEU's failure to satisfy Principle II is a corollary of the fact that it operates on a variable-free AMR (G^vf). One could address this problem by reverting to canonical AMR graphs and adopting variable alignment in SEMBLEU. But this would of course adversely affect the advertised efficiency advantages over SMATCH. Re-integrating the alignment step would make SEMBLEU less efficient than SMATCH since it would add the complexity of breadth-first traversal, yielding a total complexity of O(SMATCH) plus O(V + E).
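The effect of dropping variables can be reproduced on a toy pair of graphs. The sketch below is an illustration under assumptions: the role labels ARG0, ARG1 and ARG2 and the helper names are made up, and only uni- and bi-grams are compared. It shows two graphs whose variable-free views coincide although no renaming of variables makes the graphs themselves identical.

```python
from collections import Counter
from itertools import permutations

def variable_free_view(instances, relations):
    """Uni- and bi-gram multisets after replacing every variable by its concept."""
    unigrams = Counter(instances.values())
    bigrams = Counter((instances[s], rel, instances[t]) for s, rel, t in relations)
    return unigrams, bigrams

def isomorphic(inst_a, rel_a, inst_b, rel_b):
    """True if some one-to-one variable renaming turns graph A into graph B."""
    vars_a, vars_b = sorted(inst_a), sorted(inst_b)
    if len(vars_a) != len(vars_b):
        return False
    for perm in permutations(vars_b):
        m = dict(zip(vars_a, perm))
        same_concepts = all(inst_a[v] == inst_b[m[v]] for v in vars_a)
        same_edges = {(m[s], r, m[t]) for s, r, t in rel_a} == set(rel_b)
        if same_concepts and same_edges:
            return True
    return False

instances = {"s": "see-01", "m1": "man", "m2": "man"}
# Non-reflexive reading: the first man sees the second man (role labels are hypothetical).
left = [("s", "ARG0", "m1"), ("s", "ARG1", "m2"), ("s", "ARG2", "m2")]
# Reflexive reading: the first man sees himself.
right = [("s", "ARG0", "m1"), ("s", "ARG1", "m1"), ("s", "ARG2", "m2")]

print(variable_free_view(instances, left) == variable_free_view(instances, right))
print(isomorphic(instances, left, instances, right))
```

Any score computed purely from the variable-free views cannot separate the two readings, whereas a comparison that keeps variables can.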

III. Symmetry
This principle is fulfilled if a metric assigns the same similarity score when we compare A to B and B to A. Figure 3 shows that SEMBLEU can violate this principle to a significant extent: when comparing AMR graph A against B, SEMBLEU yields a score of more than 0.8, yet when comparing B to A the score is less than 0.5. We perform an experiment that quantifies this effect on a larger scale by assessing the frequency and the extent of such divergences.
[Figure 3: pair of AMR graphs A and B illustrating the asymmetry.]
To this end, we parse 1386 development sentences from an AMR corpus (LDC2017T10) with an AMR parser (obtaining graph bank A) and evaluate it against another graph bank B (gold graphs or another parser's output). We quantify the symmetry violation by the symmetry violation ratio (Eq. 4) and the mean symmetry violation (Eq. 5), given some metric m:
$$svr = \frac{1}{|A|} \sum_{i=1}^{|A|} I\big(m(A_i, B_i) \neq m(B_i, A_i)\big) \qquad (4)$$
$$msv = \frac{1}{|A|} \sum_{i=1}^{|A|} \big|m(A_i, B_i) - m(B_i, A_i)\big| \qquad (5)$$
We perform the experiment with parses from three systems, CAMR, GPLA (Lyu and Titov, 2018) and JAMR (Flanigan et al., 2014), and the gold graphs, and compare the symmetry violation ratio and the mean symmetry violation of SMATCH and SEMBLEU (cf. Table 1). Moreover, to provide a baseline that allows us to better put the results into perspective, we additionally estimate the symmetry violation of BLEU (SEMBLEU's MT ancestor) in an MT setting (Table 2). Specifically, we fetch 16 system outputs of the WMT 2018 en-de metrics task (Ma et al., 2018) and calculate BLEU(A, B) and BLEU(B, A) based on the MT system's output and the reference sentences (using the same smoothing method as SEMBLEU). As worst case and average case, we use the outputs from the teams where BLEU exhibits the maximum and the median msv, respectively.
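The two symmetry statistics can be computed for any metric and any list of graph pairs. The sketch below reads the symmetry violation ratio as the fraction of pairs with m(A, B) ≠ m(B, A) and the mean symmetry violation as the average absolute difference; that reading, the function names and the two toy metrics over plain sets are assumptions made for illustration.

```python
def symmetry_violation(metric, pairs, tol=0.0):
    """Return (ratio of pairs scored asymmetrically, mean absolute score difference)."""
    if not pairs:
        raise ValueError("need at least one pair")
    diffs = [abs(metric(a, b) - metric(b, a)) for a, b in pairs]
    svr = sum(d > tol for d in diffs) / len(diffs)
    msv = sum(diffs) / len(diffs)
    return svr, msv

def precision(a, b):
    """Toy asymmetric metric: share of A's elements found in B."""
    return len(a & b) / len(a) if a else 1.0

def f1(a, b):
    """Toy symmetric metric: F1 overlap of two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

pairs = [
    ({"a", "b", "c"}, {"a"}),
    ({"a", "b"}, {"a", "b"}),
]
print(symmetry_violation(precision, pairs))
print(symmetry_violation(f1, pairs))
```

A small positive `tol` can be passed when the metric itself is slightly noisy, so that tiny numerical differences are not counted as violations.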
Table 1 shows that more than 80% of the evaluated AMR graph pairs lead to a symmetry violation with SEMBLEU (as opposed to less than 10% for SMATCH). The average mean symmetry violation of SMATCH is considerably smaller than that of SEMBLEU. The SEMBLEU plots show that the effect is widespread: some cases are extreme, many others are less extreme but still considerable. This stands in contrast to SMATCH but also to its ancestor BLEU: in the evaluation of MT systems, BLEU appears well calibrated and does not suffer from any major asymmetry (Figure 4, right-hand side).
In sum, symmetry violations with SMATCH are significantly fewer and less pronounced than those observed with SEMBLEU. In theory, SMATCH is fully symmetric. Symmetry violations can occur due to alignment errors from the greedy variable-alignment search. Common practice aims to reduce such effects by performing multiple restarts.
By contrast, the symmetry violation of SEMBLEU is intrinsic to the method since the underlying overlap measure BLEU is inherently asymmetric. Our experiments show that this asymmetry is amplified in SEMBLEU compared to BLEU and, as we will show in detail in §3.1 below, this is due to the way in which k-grams are extracted from variable-free AMR graphs during graph traversal.

IV. Determinacy
SMATCH aligns the variables of A and B by means of greedy hill-climbing. However, (i) the stochastic nature of hill-climbing allows averaging multiple results. Together with the small set of AMR variables this implies that the deviation will be ≤ ε (a very small number close to 0). In Table 3 we measure ε in terms of the standard deviation of results over different runs, on a corpus level and on a graph-pair level.⁵ As is evident from the table, the expected ε is already tiny when only one random start is performed (corpus level: ε = 0.0003, graph-pair level: ε = 0.0013). Put differently, the hill-climbing in SMATCH is highly unlikely to have any significant effect on the final score, even when only one random start is performed. Also, (ii) if we want to have ε = 0 guaranteed, the solution can be found with an Integer Linear Programming calculation.

Principles for meaning representations
We now turn to metric principles that are specifically aimed at comparing meaning representations.
V. No bias A metric for measuring similarity of meaning representations should not unjustifiably or unintentionally favor the correctness or penalize errors pertaining to any (sub-)structures of a given meaning representation. However, we find that SEMBLEU is affected by a bias that strongly affects leaf nodes attached to high-degree nodes. The bias arises from two related factors: (i) when transforming G to G^vf, SEMBLEU replaces variable nodes with concept nodes. In other words, nodes which were once leaf nodes in G (concept nodes) can be raised to highly connected nodes in G^vf (the former variable nodes in the standard AMR graph). (ii) Breadth-first k-gram extraction from G^vf starts from the root node. The traversal leads to the issue that concept leaves, now occupying the position of (former) variable nodes with a high number of outgoing (and incoming) edges, will be visited and extracted much more frequently than others. The two factors in combination make SEMBLEU penalize a wrong concept node overly harshly when it is attached to a high-degree variable node (it is raised to high degree when transforming G to G^vf). Conversely, SEMBLEU only weakly considers correctly or wrongly assigned concepts attached to nodes with low degree.
Consider as an example Figure 5, which focuses on errors occurring in leaf concept nodes attached to the root as opposed to leaf concept nodes which remain leaf nodes after transforming G to G^vf. SEMBLEU considers two graphs that express very different meanings (left and right) to be more similar than graphs that are almost equivalent in meaning (left, variant A vs. B). This is because the concept node that is attached to the root is raised to a highly connected node in G^vf and thus is overly frequently contained in the extracted k-grams.
⁵ As data we used the development gold standard of LDC2017T10 and automatic parses by GPLA.
Quantifying bias We define the root-leaf bias ratio R as the ratio between the SEMBLEU scores of two limiting cases: (i) the case where the concept of the root node is incorrect and (ii) the case where m leaf nodes are incorrect.
Following Song and Gildea (2019) we assume the AMR graph to be in its simplified form G^vf. For simplicity, we also assume that every non-leaf node in the AMR graph has the same out-degree d. Let h be the height of the AMR G^vf. We further assume that the two graphs have the same structure, i.e., BP = 1. Using the NIST geometric probability smoothing as in Song and Gildea (2019), the ratio R takes the form given in Eq. 6, which depends on the number of correct k-grams and their uniform weight w. Plotting Eq. 6 in Figure 6 for a typical AMR graph with h = 2, choosing w = 1/3 (Song and Gildea, 2019) for various node degrees, we observe that 3 wrong leaves are equivalent to a wrong root when the node degree is 2. The bias gets higher with growing node degree. For example, for a graph with out-degree 8, the impact of 5 wrong leaf concepts is worth ca. 20% of the impact of a single wrong root concept. This is not an issue for SMATCH, where the ratio between m ≥ 1 wrong leaves and a wrong root is always at least one. So far, we have analyzed the bias towards nodes with high out-degree. Yet, this bias can be further aggravated for general high-degree nodes (in- and out-degree), since they will be visited overly frequently in the graph traversal (Figure 7).
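The source of the bias can be made tangible by counting how many extracted k-grams contain a given node. The sketch below does not reproduce Eq. 6 or the smoothing; it only counts k-gram memberships in a complete tree with out-degree d and height h, under a simplified path extraction where every downward path with k nodes counts as one k-gram. All function names are hypothetical.

```python
def build_tree(d, h):
    """Complete tree of out-degree d and height h; node ids are tuples, the root is ()."""
    children = {}
    frontier = [()]
    for _ in range(h):
        next_frontier = []
        for node in frontier:
            children[node] = [node + (i,) for i in range(d)]
            next_frontier.extend(children[node])
        frontier = next_frontier
    for leaf in frontier:
        children[leaf] = []
    return children

def paths_with_k_nodes(children, k):
    """All downward paths that consist of exactly k nodes."""
    paths = [(n,) for n in children]
    for _ in range(k - 1):
        paths = [p + (c,) for p in paths for c in children[p[-1]]]
    return paths

def kgram_share(children, node, max_k=3):
    """Number of k-grams (k = 1..max_k) in which `node` occurs."""
    return sum(node in path
               for k in range(1, max_k + 1)
               for path in paths_with_k_nodes(children, k))

for d in (2, 8):
    tree = build_tree(d, 2)
    root_count = kgram_share(tree, ())
    leaf_count = kgram_share(tree, (0, 0))
    print(d, root_count, leaf_count)
```

With h = 2 the root takes part in 1 + d + d² k-grams, while a leaf takes part in only 3, so an error at the root touches far more of the extracted material as d grows.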

Eliminating biases
We have shown that SEMBLEU can exhibit a large bias towards concepts of highly connected variable nodes. The main reason for this lies in the breadth-first traversal, which implies that such concepts will be overly frequently contained in the extracted k-grams. A possible solution for this problem could be to weigh the extracted k-gram matches according to the degree of the contained nodes. However, this would imply that we assume some k-grams (and thus also some nodes and edges) to be of much higher importance than others. In other words, we would eliminate one bias by introducing another. Since the breadth-first traversal is the metric's backbone, this issue may be hard or impossible to address well. When BLEU is used for evaluating machine translation, there is no such bias because the k-grams in a sentence appear linearly.

VI. Graph matching: symbolic perspective
This principle requires that our metric's score should grow with increasing overlap of the conditions that are simultaneously contained in A and B. SMATCH fulfills this principle since it essentially does what is required: it matches two AMR graphs inexactly (Yan et al., 2016; Riesen et al., 2010) by aligning variable nodes such that the (symbolic) triple matches are maximized. A slight imprecision is encountered only in cases where the alignment is non-optimal (we have quantified this imprecision in Table 3 and discussed means for minimizing it in §3). In fact, SMATCH can be seen as a general graph matching algorithm that works on any pair of graphs that contain (some) nodes that represent variables. It fulfills the Jaccard-index-based graph overlap objective, which symmetrically measures the number of triples on which two graphs agree when normalized by their respective sizes (this follows from the fact that SMATCH F1 is monotonically related to the Jaccard index: F1 = 2J/(1 + J)). Since SEMBLEU obeys neither the identity of indiscernibles nor the symmetry principle, it is a corollary that it cannot comply with the Jaccard-index-based overlap objective. Generally, SEMBLEU does not compare and match two AMR graphs per se; instead, it matches the results of a graph-to-set-of-paths projection function (§2.1). This function is surjective only, which implies that the input may not be recoverable from the output. Thus, matching the outputs of this function cannot be equated to matching the inputs on a graph level.
This section focuses on principle VII, semantically graded graph matching, a principle that none of the AMR metrics considered so far satisfies: they do not admit that a given pair of concepts can be judged more or less similar in meaning. Consider Figure 8 with three different graphs. Two of them (A, B) are almost equivalent in meaning and differ significantly from C.
However, both SMATCH and SEMBLEU yield the same result in the sense that metric(A, B) = metric(A, C). Put differently, neither metric takes into account that a giraffe and a kitten are two quite different concepts, while cat and kitten are more similar. However, we would like this to be reflected by our metrics and obtain metric(A, B) > metric(A, C) in such a case. In the following, we will address this issue.
S2MATCH We propose a new metric, S2MATCH, that builds on SMATCH but differs from it in one important aspect: instead of maximizing the number of (hard) triple matches between two graphs during alignment search, we maximize the (soft) triple matches by taking into account the degree of semantic similarity of concepts. Recall that an AMR graph in its canonical form contains two types of triples: instance and relation triples. In the left graph of Figure 8, ⟨a, is-instance, cat⟩ constitutes an instance triple and ⟨c, ARG1, a⟩ a relation triple. In SMATCH, two triples can only be matched if they are identical. In S2MATCH, we allow these triples to soft-match, even though they may not be exactly identical. Therefore, soft matching has the potential to yield a different, and possibly better, variable alignment. For example, in SMATCH we matched an instance triple ⟨a, instance, x⟩ ∈ t(A) against an instance triple ⟨map(a), instance, y⟩ ∈ t(B) as follows:
$$hardMatch = I(x = y),$$
where I(c) equals 1 if c is true and 0 otherwise (hard-match). S2MATCH relaxes this condition:
$$softMatch = 1 - d(x, y), \qquad (9)$$
where d is an arbitrary distance function that maps a pair of concepts to [0, 1]. For example, if the concepts are represented as vectors x, y ∈ R^n, we can use
$$d(x, y) = \min\Big\{1,\; 1 - \frac{y^{\top} x}{\lVert x \rVert_2 \, \lVert y \rVert_2}\Big\}, \qquad (10)$$
which, when plugged into Eq. 9, results in the cosine similarity bounded by 0 and 1. We can also use a variant of the Euclidean distance, d(x, y) = 1 − 1/(1 + ‖x − y‖₂) or d(x, y) = 1 − 1/e^{‖x − y‖₂}. In some cases, it may be suitable to set a threshold τ, to only consider the similarity between two concepts if it is above τ (e.g., τ = 0.5). In fact, S2MATCH is agnostic to the particular distance calculation. In the following pilot experiments, we use cosine (Eq. 10) with τ = 0.5 over 100-dimensional GloVe vectors (Pennington et al., 2014) to assess graded semantic similarity with S2MATCH.
S2MATCH: A use-case We now want to investigate use cases in which S2MATCH may offer benefits. A classical use case is AMR parsing, where AMR metrics evaluate a parser's output against a gold standard.
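The soft concept match with a cosine-based distance and a threshold τ can be sketched as follows. This is a minimal illustration, not the released S2MATCH code: the function names are hypothetical, the two-dimensional vectors are made-up stand-ins for GloVe embeddings, and the lower clamp at 0 is an added numerical safeguard.

```python
import math

def cosine_distance(x, y):
    """d(x, y) = min(1, 1 - cos(x, y)), clamped to [0, 1]; 1 - d is the clipped cosine."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    if norm == 0.0:
        return 1.0  # treat zero vectors as maximally distant
    return min(1.0, max(0.0, 1.0 - dot / norm))

def soft_match(concept_a, concept_b, vectors, tau=0.5):
    """Graded match of two concepts: 1 for identical labels, else thresholded similarity."""
    if concept_a == concept_b:
        return 1.0
    if concept_a not in vectors or concept_b not in vectors:
        return 0.0  # fall back to the hard match when no embedding is available
    similarity = 1.0 - cosine_distance(vectors[concept_a], vectors[concept_b])
    return similarity if similarity > tau else 0.0

# Made-up 2-dimensional embeddings, for illustration only.
vectors = {
    "cat": [1.0, 0.0],
    "kitten": [0.8, 0.6],
    "giraffe": [-0.6, 0.8],
}

print(soft_match("cat", "kitten", vectors))
print(soft_match("giraffe", "kitten", vectors))
```

Inside an alignment search, such a graded value would replace the 0/1 contribution of an instance triple, so that a mapping pairing cat with kitten can score higher than one pairing giraffe with kitten.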
S2MATCH is designed to either yield the same score as SMATCH or a slightly increased score when it aligns concepts that are symbolically distinct but semantically similar. An example is shown in Figure 9. Here, S2MATCH increases the score to 63 F1 (+10 F1 percentage points, pp.) by detecting a more adequate alignment that accounts for the graded similarity of the AMR concepts. We believe that this is appropriate since the two graphs are very similar and an F1 score of 53 is too low, doing the parser injustice.
Table 4: Examples where S2MATCH assigns a higher score, accounting for the similarity of aligned concepts.
line 7 | gold: :ARG0 (d / develop-01 :mod (u / usual :polarity -)) | parser: :ARG0 (d1 / develop-02 :mod (u0 / unusual)) | cosine: 0.60 | increase: 14 pp. | annotation: dissimilar
Two annotators judged the degree of similarity of the two aligned concepts: Are the concepts dissimilar, similar or extremely similar? When concepts are judged dissimilar, we conclude that S2MATCH erroneously increased the score; when the concepts are (extremely) similar, we conclude that S2MATCH was justified in increasing the score. We calculated three agreement statistics that all show large consensus among our annotators: Cohen's kappa equals 0.79, Cohen's squared kappa 0.87 and Pearson's ρ 0.91. According to the annotations, S2MATCH's decision to increase the score is justified in most situations: in 56% and 12% of cases both annotators voted that the newly aligned concepts are extremely similar and similar, respectively (#agree in dissimilar/similar/extremely-similar: 25/12/56). Table 4 lists examples of good or ill-founded score increases. We observe, e.g., that S2MATCH accounts for the similarity of two concepts of different number: bacterium (gold) vs. bacteria (parser) (line 3). It also captures abbreviations (km - kilometer) and closely related concepts (farming - agriculture). SEMBLEU and SMATCH would penalize the corresponding triples in exactly the same way as predicting a truly dissimilar concept.
An interesting phenomenon is seen in line 7. Here, usual and unusual were correctly annotated as dissimilar, since they are opposite concepts. S2MATCH is equipped with GloVe embeddings to measure the similarity of the aligned concepts and measures a cosine of 0.6, above the chosen threshold, which results in a large increase of the overall score (14 pp. F1; the increase is high since the graphs are small). It is well known that synonyms and antonyms are difficult to distinguish with distributional word representations, since they often share similar contexts. One option is to choose word embeddings that better distinguish antonyms (Ono et al., 2015). However, the case at hand is orthogonal to this problem, in that the concept usual in the gold graph is modified with the polarity '−', whereas the predicted graph assigned the (non-negated) opposite concept unusual. Hence, considering the context in the gold graph, the predicted graph is semantically almost equivalent.
This points to an aspect of principle VII that is not yet covered by S2MATCH: it assesses graded similarity at the lexical, but not at the phrasal level, and hence cannot account for compositional phenomena. In future work, we aim at alleviating this issue by extending S2MATCH to measure semantic similarity for larger contexts, covering compositional phenomena and paraphrases, in order to fully satisfy all seven principles.

Quantitative study: metrics vs. human raters
This study investigates to what extent the judgments of the three metrics under discussion resemble human judgments, based on the following assumption: the more a human rates two sentences to be semantically related in their meaning (maximum: equivalence), the higher the metric should rate the corresponding (Abstract) Meaning Representation parses. We obtain the ground truth for this task from the SICK dataset (Marelli et al., 2014), which is built on paraphrase data and contains 9,840 sentence pairs. The sentence pairs are annotated for graded relatedness in meaning based on human judgments on a 5-point rating scale (they also carry a relation-type label, which is not the primary focus of this experiment). Additionally, we manually verified that the data is of high quality and that sentence relatedness in meaning expresses sentence similarity in meaning to the utmost degree (this is also supported by the view that semantic relatedness generalizes the concept of semantic similarity, cf. Budanitsky and Hirst (2006)). [10, 11] We proceed as follows: we use a competitive parser [12] to parse the sentence tuples (s_i, s'_i, r_i).
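The comparison between metric scores and human ratings in this study relies on Pearson's ρ and the root-mean-squared error (RMSE). The following is a minimal sketch of both statistics. It assumes that the human ratings lie between 1 and 5 and are linearly rescaled to [0, 1] before the RMSE is computed; the rescaling, the function names and the toy numbers are illustrative assumptions and not taken from the experiment.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def rmse(xs, ys):
    """Root-mean-squared error between two equally long score lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

def rescale_rating(rating, low=1.0, high=5.0):
    """Map a rating from [low, high] to [0, 1] (assumed normalization)."""
    return (rating - low) / (high - low)

# Hypothetical human relatedness ratings and metric scores for four pairs.
human_ratings = [5.0, 3.0, 1.0, 4.0]
metric_scores = [0.9, 0.5, 0.2, 0.6]

human_rescaled = [rescale_rating(r) for r in human_ratings]
correlation = pearson(metric_scores, human_rescaled)
error = rmse(metric_scores, human_rescaled)
```

Pearson's ρ is invariant to linear rescaling of either list, so only the RMSE depends on how the human ratings are normalized.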
When considering Pearson's ρ correlation coefficients with respect to the human rater, we find that both SMATCH and S2MATCH yield higher correlation scores with human raters than SEMBLEU. When considering the root-mean-squared error (RMSE), we find that SEMBLEU exhibits a large difference from the scores assigned by the human. S2MATCH, on the other hand, exhibits a considerably smaller difference. The reduced RMSE is also visible in the score density distributions plotted in Figure 10. From this figure we see that SEMBLEU drastically underrates a good proportion of parse pairs where the input sentences were assigned a high semantic relatedness by the human. This may well relate to the biases of different node types: root node vs. leaves. Overall, S2MATCH appears to provide a better fit to the 'camel-hump'-shaped score distribution of the human raters, with some regions where it is notably closer to the human reference than the otherwise similar SMATCH. We also see that neither SMATCH nor S2MATCH is perfectly aligned with human scores. This may in part be due to the fact that gradedness of meaning is not yet fully captured by S2MATCH, and shows that more research is required to extend its scope.

[10] The following is an example from the SICK data. Maximum meaning relatedness score: A man is cooking pancakes. vs. The man is cooking pancakes. Minimum meaning relatedness score: Two girls are playing outdoors near a woman. vs. The elephant is being ridden by the man.
[11] To further enhance the soundness of this experiment, we discard pairs with a contradiction relation and retain the 8,416 pairs with a neutral or entailment relation.
[12] GPLA (Lyu and Titov, 2018).

Summary of our metric analyses
Table 5 summarizes our analyses' integral results. Principle I is fulfilled by all metrics as they exhibit continuity, non-negativity and an upper bound. Principle II, however, is violated by SEMBLEU since it may mistakenly judge two AMR graphs of different meaning to be equivalent. The reason for this is that it simplifies AMR graphs by removing variables, and thus cannot capture facets of coreference. A positive outcome of this simplification, however, is that SEMBLEU is fast to compute: this could make it the first choice in some recent AMR parsing approaches that are based on reinforcement learning (Naseem et al., 2019), where continuous and rapid feedback is required for a large number of predicted graphs. SEMBLEU marks a point by fully satisfying Principle IV, in that, due to its variable-free representation, it yields fully deterministic results. SMATCH, by contrast, either needs to resort to a costly ILP solution or (in practice) applies an approximate hill-climbing algorithm for variable alignment that generates small divergences; with 3 or more restarts, however, the divergence is negligible.
An important insight brought out by our analysis is that the simplification of AMR in SEMBLEU influences Principle V negatively, since it introduces biases. This is caused by two (interacting) factors: (i) The extraction of k-grams is applied on the graph from top to bottom and thus visits some nodes more frequently than others. (ii) It raises some (but not all) leaf nodes to highly connected nodes, and thus these nodes will be contained overly frequently in the k-grams extracted during breadth-first graph traversal. We have shown that these two factors in combination lead to large biases that researchers can now be aware of when using SEMBLEU (principle V, cf. §3.1). Its ancestor metric BLEU does not suffer from such bias issues since it extracts k-grams linearly from a sentence.
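Factor (i) can be illustrated with a strongly simplified sketch: it enumerates all downward node paths of up to k nodes in a toy tree and counts how often each node occurs in them. The toy graph, the function names and the restriction to node-only paths (edge labels are ignored) are assumptions made for illustration; this is not the k-gram extraction procedure of SEMBLEU itself.

```python
from collections import Counter

def downward_paths(graph, start, length):
    """All top-down paths with exactly `length` nodes that begin at `start`."""
    if length == 1:
        return [(start,)]
    paths = []
    for child in graph.get(start, []):
        for tail in downward_paths(graph, child, length - 1):
            paths.append((start,) + tail)
    return paths

def node_frequencies(graph, max_k):
    """Count how often each node occurs in all k-grams for k = 1..max_k."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    counts = Counter()
    for k in range(1, max_k + 1):
        for node in nodes:
            for path in downward_paths(graph, node, k):
                counts.update(path)  # one count per node on the path
    return counts

# Hypothetical toy tree: r has children a and b; a has children c and d.
toy_graph = {"r": ["a", "b"], "a": ["c", "d"]}
frequencies = node_frequencies(toy_graph, max_k=3)
```

In this toy tree, the inner node a occurs in six extracted k-grams while its sibling b occurs in only two, so an error on a would affect three times as many k-grams as an error on b.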
Given that SEMBLEU is built on BLEU, it is inherently asymmetric. However, we have shown that SEMBLEU differs from BLEU in that the asymmetry measured for BLEU in MT is strongly amplified by SEMBLEU in AMR, which is, presumably, due to the biases it incurs. The fact that SEMBLEU does not satisfy symmetry can arguably be tolerated in parser evaluation if outputs are compared against gold standard references in a standardized manner. However, it is difficult to apply an asymmetric metric to measure IAA or to compare parser outputs in tri-parsing, where no reference is available. If the asymmetry is amplified by a bias that arises from different factors, the scores may become even more difficult to judge. Finally, considering that SEMBLEU does not match AMR graphs on the graph level but matches extracted k-grams, it turns out that it cannot be categorized as a symbolic graph matching algorithm as defined in Principle VI.
Principle VI is fulfilled by SMATCH, which does not perform any transformation on the AMR graphs. Instead, it searches for an optimal variable alignment and then counts the matching graph triples. As a corollary, it also fulfills principles I, II, III and V. An ε-error has to be expected because of the non-optimal alignment search in its implementation. Yet we have estimated the expected ε-error in §3, finding that it can be reduced to a negligible amount when performing multiple random initializations. In sum, the fact that SMATCH fulfills six of seven principles backs up many experiments of prior literature that use SMATCH as the sole criterion for IAA calculation and parsing.
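The idea of aligning variables and counting matching triples can be sketched for very small graphs as follows. Note that this sketch replaces the hill-climbing search mentioned above with an exhaustive search over all variable alignments, which is only feasible for toy inputs; the triple format, the function names and the example graphs are assumptions for illustration and do not reproduce the SMATCH implementation.

```python
from itertools import permutations

def rename(triple, mapping):
    """Replace the variables of graph A in a triple by their aligned counterparts."""
    return tuple(mapping.get(x, x) for x in triple)

def best_alignment_f1(triples_a, vars_a, triples_b, vars_b):
    """F1 over matching triples under the best one-to-one variable alignment,
    found by exhaustive search (toy-sized graphs only)."""
    # Pad with None so that surplus variables of A can stay unaligned.
    targets = list(vars_b) + [None] * max(0, len(vars_a) - len(vars_b))
    set_b = set(triples_b)
    best_matches = 0
    for perm in permutations(targets, len(vars_a)):
        mapping = dict(zip(vars_a, perm))
        matches = sum(1 for t in triples_a if rename(t, mapping) in set_b)
        best_matches = max(best_matches, matches)
    if best_matches == 0:
        return 0.0
    precision = best_matches / len(triples_a)
    recall = best_matches / len(triples_b)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy graphs as (relation, source, target) triples.
graph_a = [("instance", "a0", "want-01"), ("instance", "a1", "boy"),
           ("ARG0", "a0", "a1")]
graph_b = [("instance", "b0", "want-01"), ("instance", "b1", "girl"),
           ("ARG0", "b0", "b1")]

score = best_alignment_f1(graph_a, ["a0", "a1"], graph_b, ["b0", "b1"])
```

Because the number of alignments grows factorially with the number of variables, an exhaustive search of this kind does not scale, which is the motivation for ILP or hill-climbing with restarts in practice.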
Our principles also let us detect a weakness from which all present AMR metrics suffer: they operate on a purely symbolic level and cannot assess graded meaning differences. Therefore, we designed S2MATCH as a first step towards accounting for such differences: it preserves the beneficial properties of SMATCH but is less harsh on only slight lexical meaning deviations.
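A graded concept match of this kind can be sketched as follows. The sketch assumes that two aligned concepts receive their cosine similarity as a score if it reaches a threshold and 0 otherwise; the threshold value of 0.5, the function names and the two-dimensional toy vectors are illustrative assumptions (they are not GloVe embeddings and not necessarily the exact scoring rule of S2MATCH).

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def soft_concept_match(concept_a, concept_b, embeddings, threshold=0.5):
    """Graded match score in [0, 1] for two aligned concepts.
    The threshold value is an assumption made for this sketch."""
    if concept_a == concept_b:
        return 1.0  # identical concepts match fully, as in a symbolic metric
    if concept_a not in embeddings or concept_b not in embeddings:
        return 0.0  # no embedding available: fall back to a hard mismatch
    similarity = cosine(embeddings[concept_a], embeddings[concept_b])
    return similarity if similarity >= threshold else 0.0

# Hypothetical two-dimensional toy vectors.
toy_embeddings = {
    "km": [1.0, 0.0],
    "kilometer": [0.8, 0.6],
    "banana": [0.0, 1.0],
}

close_score = soft_concept_match("km", "kilometer", toy_embeddings)
far_score = soft_concept_match("km", "banana", toy_embeddings)
```

A purely symbolic comparison would assign 0 to both pairs; the graded variant separates the near-equivalent pair from the unrelated one while leaving exact matches untouched.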

Related work
Meaning representation metrics. The availability of sound metrics for graph-based meaning representations is important, as it affects the evaluation of semantic parsers and the computation of IAA statistics, e.g., when creating a new corpus. Meaning representations in general extend beyond AMR; they are designed to represent the meaning of text in a well-defined, interpretable form that is able to identify meaning differences and support inference. Recently, Bos (2016) has shown how AMR can be translated to FOL, a well-established formalism for meaning representation. Discourse Representation Theory (DRT, Kamp (1981); Kamp and Reyle (1993)) is based on and extends FOL, with special focus on the representation of discourse. A recent shared task on DRS parsing has applied the COUNTER metric (Abzianidze et al., 2019; Evang, 2019), which is an adaptation of SMATCH to DRT. This demonstrates the general applicability of SMATCH. Thus, its extension S2MATCH may also prove beneficial for evaluating DRSs.
Other research in AMR metrics aims at making AMR evaluation fairer by normalizing graphs that express the same or similar meanings with different structures (Goodman, 2019). Furthermore, Anchieta et al. (2019) argue that one should not, e.g., insert an extra is-root node when comparing AMR graphs (as is done in SEMBLEU and SMATCH). Damonte et al. (2017) extend the SMATCH metric to various sub-tasks (coreference, WSD, polarity detection, etc.). Cai and Lam (2019) provide two extensions of the SMATCH metric: SMATCH-weighted, which takes into account the relative distance of the triple to the root, and SMATCH-core, which only considers the regions of the parse that are close to the root. Our metric S2MATCH allows for easy integration of all these approaches.
Computational AMR tasks. Since the introduction of AMR in 2013, several computational AMR tasks have emerged. Most prominent is AMR parsing (Wang et al., 2015; Damonte et al., 2017; Konstas et al., 2017; Lyu and Titov, 2018; Zhang et al., 2019). The inverse task aims to generate text from AMR graphs (Song et al., 2017, 2018; Damonte and Cohen, 2019). Opitz and Frank (2019) rate the quality of machine-generated AMR graphs without costly gold data.

Conclusion
We motivated seven principles for metrics that compare graph-based (Abstract) Meaning Representations, from mathematical, linguistic and engineering perspectives. The fulfilment of all principles lends a metric the capacity to be successfully applied in a wide spectrum of use cases, ranging from the evaluation of parsers to sound IAA calculation. Therefore, (i) our principles can inform (A)MR researchers who desire to compare and select among metrics, or (ii) our principles may ease and guide the development of new metrics.
We provided examples for both scenarios. First, in order to showcase (i), we utilized our principles as guidelines for an in-depth analysis of two AMR metrics: the canonical SMATCH and the recent SEMBLEU metric, two fundamentally different approaches. In our analysis, we uncovered that the latter metric does not satisfy some of these principles, which bears the potential to reduce its safety and applicability. Second, for addressing (ii), we aimed at the fulfilment of all seven principles and proposed S2MATCH, a metric that accounts for graded similarity of concepts that are realized as atomic graph components. In future work, we would like to move beyond this and build a metric that accounts for graded compositional similarity of graph substructures of any size.