Abstract
Several metrics have been proposed for assessing the similarity of (abstract) meaning representations (AMRs), but little is known about how they relate to human similarity ratings. Moreover, the current metrics have complementary strengths and weaknesses: Some emphasize speed, while others make the alignment of graph structures explicit, at the price of a costly alignment step.
In this work we propose new Weisfeiler-Leman AMR similarity metrics that unify the strengths of previous metrics, while mitigating their weaknesses. Specifically, our new metrics are able to match contextualized substructures and induce n:m alignments between their nodes. Furthermore, we introduce a Benchmark for AMR Metrics based on Overt Objectives (Bamboo), the first benchmark to support empirical assessment of graph-based MR similarity metrics. Bamboo maximizes the interpretability of results by defining multiple overt objectives that range from sentence similarity objectives to stress tests that probe a metric’s robustness against meaning-altering and meaning-preserving graph transformations. We show the benefits of Bamboo by profiling previous metrics and our own metrics. Results indicate that our novel metrics may serve as a strong baseline for future work.
1 Introduction
Meaning representations aim at capturing the meaning of text in an explicit graph format. A prominent framework is abstract meaning representation (AMR), proposed by Banarescu et al. (2013). AMR views sentences as rooted, directed, acyclic, labeled graphs. Their nodes are variables, attributes, or (open-class) concepts and are connected with edges that express semantic relations.
There are many use cases in which we need to compare or relate two AMR graphs. A common situation is found in parser evaluation, where AMR metrics are widely applied (May, 2016; May and Priyadarshi, 2017).1 Yet, there are more situations where we need to measure similarity of meaning as expressed in AMR graphs. For example, Bonial et al. (2020) leverage AMR metrics in a semantic search engine for COVID-19 queries, Naseem et al. (2019) use metric feedback to reinforce AMR parsers, Opitz (2020) emulates metrics for referenceless AMR ranking and rating, and Opitz and Frank (2021) use AMR metrics for NLG evaluation.
So far, multiple AMR metrics (Cai and Knight, 2013; Cai and Lam, 2019; Song and Gildea, 2019; Anchiêta et al., 2019; Opitz et al., 2020) have been proposed to assess AMR similarity. However, due to a lack of an appropriate evaluation benchmark, we have no empirical evidence that could tell us more about their strengths and weaknesses or offer insight about which metrics may be preferable over others in specific use cases.
Additionally, we would like to move beyond the aforementioned metrics and develop new metrics that account for graded similarity of graph substructures, which is not an easy task. However, it is crucial when we need to compare AMR graphs in a deeper way. Consider Figure 1, which shows two AMRs that convey very similar meanings. All aforementioned metrics assign this pair a low similarity score, and—if alignment-based, as is Smatch (Cai and Knight, 2013)—find only subpar alignments.2 In this case, we want a metric that provides us with a high similarity score and, ideally, an explanatory alignment.
The structure of this paper is as follows. In §2 we discuss related work. In §3 we describe our first contribution: new AMR metrics that aim at unifying the strengths of previous metrics while mitigating their weaknesses. Specifically, our new metrics are capable of matching larger substructures and provide valuable n:m alignments in polynomial time. In §4 we introduce Bamboo, our second contribution: It is the first benchmark data set for AMR metrics and includes novel robustness objectives that probe the behavior of AMR metrics under meaning-preserving and meaning-altering transformations of the inputs (§5). In §6 we use Bamboo for a detailed, multi-faceted empirical study of previous and our proposed AMR metrics.
We release Bamboo and our new metrics.3
2 Related Work
The Classical AMR Metric and its Adaptions
The ‘canonical’ and widely applied AMR metric is Smatch (Semantic match) (Cai and Knight, 2013). It approximately solves an NP-hard graph alignment problem with a hill-climber and scores matching triples. Smatch has been adapted to S2match (Soft Semantic match) by Opitz et al. (2020) to account for graded similarity of concept nodes (e.g., cat—kitten), using word embeddings. Smatch has also been adapted by Cai and Lam (2019) in W(eighted)Smatch (WSmatch), which penalizes errors relative to their distance to the root. This is motivated by the hypothesis that “core semantics” tend to be located near a graph’s root.
BFS-based and Alignment-free AMR Metrics
Recently, two new AMR metrics have been proposed: Sema by Anchiêta et al. (2019) and SemBleu by Song and Gildea (2019). Common to both is a mechanism that traverses the graph. Both start from the root, and collect structures with a breadth-first traversal (BFS). Also, both ablate the variable alignment of (W)S(2)match and only consider their attached concepts, which increases computation speed. Apart from this, the metrics differ significantly: SemBleu extracts bags of k-hop paths (k≤3) from the AMR graphs and thereupon calculates BLEU (Papineni et al., 2002). Sema, on the other hand, is somewhat simpler and provides us with an F1 score that it achieves by comparing extracted triples.
From Measuring Structure Overlap to Measuring Meaning Similarity
Most AMR metrics have been designed for semantic parser evaluation, and therefore determine a score for structure overlap. While this is legitimate, with extended use cases for AMR metrics arising, there is increased awareness that structural matching of labeled nodes and edges of an AMR graph is not sufficient for assessing the meaning similarity expressed by two AMRs (Kapanipathi et al., 2021). This insufficiency has also been observed in cross-lingual AMR parsing evaluation (Blloshmi et al., 2020; Sheth et al., 2021; Uhrig et al., 2021), but is most prominent when attempting to compare the meaning of AMRs that represent different sentences (Opitz et al., 2020; Opitz and Frank, 2021). This work argues that in cases like Figure 1, the available metrics do not sufficiently reflect the similarity of the two AMRs and their underlying sentences.
How Do Humans Rate Similarity of Sentence Meaning?
STS (Baudiš et al., 2016a, 2016b; Cer et al., 2017) and SICK (Marelli et al., 2014) elicited human ratings of sentence similarity on a Likert scale. While STS annotates semantic similarity, SICK annotates semantic relatedness. These two aspects are closely related, but not identical (Budanitsky and Hirst, 2006; Kolb, 2009). Only the highest scores on the Likert scales of SICK and STS can be seen as reflecting the equivalence of meaning of two sentences. Other data sets contain binary annotations of paraphrases (Dolan and Brockett, 2005), which cover a wide spectrum of semantic phenomena.
Benchmarking Metrics
Metric benchmarking is an active topic in NLP research and has led to the emergence of metric benchmarks in various areas, most prominently MT and NLG (Gardent et al., 2017; Zhu et al., 2018; Ma et al., 2019). These benchmarks are useful since they help to assess and select metrics and encourage their further development (Gehrmann et al., 2021). However, there is currently no established benchmark that defines a ground truth of graded semantic similarity between pairs of AMRs, and how to measure it in terms of their structural representations. Also, we do not have an established ground truth to assess what alternative AMR metrics such as (W)S(2)match or SemBleu really measure, and how their scores correlate with human judgments of the semantic similarity of sentences represented by AMRs.
3 Grounding Novel AMR Metrics in the Weisfeiler-Leman Graph Kernel
Previous AMR metrics have complementary strengths and weaknesses. Therefore, we aim to propose new AMR metrics that are able to mitigate these weaknesses, while unifying their strengths, aiming at the best of all worlds. We want:
- i) an interpretable alignment (Smatch);
- ii) a fast metric (Sema, SemBleu);
- iii) matching of larger substructures (SemBleu);
- iv) assessment of graded similarity of AMR subgraphs (extending S2match).
This section proposes to make use of the Weisfeiler-Leman graph kernel (WLK) (Weisfeiler and Leman, 1968; Shervashidze et al., 2011) to assess AMR similarity. The idea is that WLK provides us with SemBleu-like matches of larger sub-structures, while bypassing potential biases induced by the BFS-traversal (Opitz et al., 2020). We then describe the Wasserstein Weisfeiler Leman kernel (WWLK) (Togninalli et al., 2019) that is similar to WLK but provides (i) an alignment of atomic and non-atomic substructures (going beyond Smatch) and (ii) a graded match of substructures (going beyond S2match). Finally, we further adapt WWLK to WWLKΘ, a variant that we tailor to learn semantic edge parameters, to better assess AMR graphs.
3.1 Basic Weisfeiler-Leman Kernel (WLK)
The Weisfeiler-Leman kernel (WLK) method (Shervashidze et al., 2011) derives sub-graph features from two input graphs. WLK has shown its power in many tasks, ranging from protein classification to movie recommendation (Togninalli et al., 2019; Yanardag and Vishwanathan, 2015). However, so far, it has not been applied to (A)MR graphs. In the following, we describe the WLK method.
Specifically, we use the cosine similarity kernel and two iterations (K = 2), which implies that every node receives information from its neighbors and their immediate neighbors. For simplicity we will first treat edges as undirected, but later will experiment with various directionality parameterizations.
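To make the procedure concrete, below is a minimal, self-contained sketch of a symbolic WL kernel over small node-labeled graphs with K = 2 iterations and a cosine similarity over the resulting bag-of-labels features. It is an illustration under simplifying assumptions (undirected edges, toy graph encoding), not the released implementation.

```python
from collections import Counter
import math

def wl_features(graph, K=2):
    """Weisfeiler-Leman label features for an undirected, node-labeled graph.

    graph: dict mapping node id -> (label, list of neighbor ids).
    Returns a Counter over all labels collected across K relabeling iterations.
    """
    labels = {n: lab for n, (lab, _) in graph.items()}
    feats = Counter(labels.values())  # iteration 0: the original node labels
    for _ in range(K):
        # new label = own label + sorted multiset of the neighbors' current labels
        labels = {n: labels[n] + "|" + "_".join(sorted(labels[m] for m in nbrs))
                  for n, (_, nbrs) in graph.items()}
        feats.update(labels.values())
    return feats

def cosine_kernel(f1, f2):
    """Cosine similarity between two sparse bag-of-label feature vectors."""
    dot = sum(c * f2[k] for k, c in f1.items())
    norm = math.sqrt(sum(c * c for c in f1.values())) * math.sqrt(sum(c * c for c in f2.values()))
    return dot / norm if norm else 0.0

# Toy graphs for "The cat drinks milk" vs. "The kitten drinks milk".
g1 = {"d": ("drink-01", ["c", "m"]), "c": ("cat", ["d"]), "m": ("milk", ["d"])}
g2 = {"d": ("drink-01", ["k", "m"]), "k": ("kitten", ["d"]), "m": ("milk", ["d"])}
print(round(cosine_kernel(wl_features(g1), wl_features(g2)), 3))
# Only symbolic overlap counts: cat vs. kitten receives no credit,
# which motivates the Wasserstein variant described next.
```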
3.2 Wasserstein Weisfeiler-Leman (WWLK)
S2match differs from all other AMR metrics in that it accepts close concept synonyms for alignment (up to a similarity threshold). But it comes with a restriction and a downside: i) it cannot assess graded similarity of (non-atomic) AMR subgraphs, which is crucial for assessing partial meaning agreement between AMRs (as illustrated in Figure 1), and ii) the alignment is costly to compute.
We hence propose to adopt a variant of WLK, the Wasserstein-Weisfeiler Leman kernel (WWLK) (Togninalli et al., 2019), for the following two reasons: (i) WWLK can assess non-atomic subgraphs on a finer level, and (ii) it provides graph alignments that are faster to compute than any of the existing Smatch metrics: (W)S(2)match.
WWLK works in two steps: (1) Given its initial node embeddings, we use WL to project the graph into a latent space, in which the final node embeddings describe varying degrees of contextualization. (2) Given a pair of such (WL) embedded graphs, a transportation plan is found that describes the minimum cost of transforming one graph into the other. In the top graph of Figure 3, f indicates the first step, while Wasserstein distance indicates the second. Now, we describe the steps in closer detail.
Step 1: WL Graph Projection into Latent Space
Step 2: Computing the Wasserstein Distance Between two WL-embedded Graphs
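The following is a simplified numpy sketch of the two steps, assuming GloVe-initialized node vectors and the pyemd package mentioned in the notes for the earth mover's distance; the WL update shown here is a plain averaging scheme and is not meant to reproduce the paper's exact equations.

```python
import numpy as np
from pyemd import emd  # earth mover's distance; see the pyemd link in the notes

def wl_embed(node_vecs, adj, K=2):
    """Step 1 (sketch): contextualize node embeddings with a continuous WL update.

    node_vecs: (n, d) initial node embeddings (e.g., GloVe vectors of AMR concepts).
    adj: (n, n) symmetric edge-weight matrix. Returns the concatenation of all
    K+1 refinement stages per node, shape (n, (K+1)*d).
    """
    x = node_vecs.astype(np.float64)
    stages = [x]
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1e-9)
    for _ in range(K):
        x = 0.5 * (x + (adj @ x) / deg)  # average own and (weighted) neighbor embeddings
        stages.append(x)
    return np.concatenate(stages, axis=1)

def wasserstein_graph_distance(emb_a, emb_b):
    """Step 2 (sketch): earth mover's distance between two WL-embedded graphs."""
    n, m = len(emb_a), len(emb_b)
    both = np.vstack([emb_a, emb_b])
    # ground costs: pairwise Euclidean distances over the union of all nodes
    dists = np.sqrt(((both[:, None, :] - both[None, :, :]) ** 2).sum(-1))
    h_a = np.concatenate([np.full(n, 1.0 / n), np.zeros(m)])  # uniform node mass, graph A
    h_b = np.concatenate([np.zeros(n), np.full(m, 1.0 / m)])  # uniform node mass, graph B
    return emd(h_a, h_b, dists)  # the optimal transport plan induces an n:m alignment
```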
3.3 From WWLK to WWLKθ with Zeroth-Order Optimization
Motivation: AMR Edge Labels Have Meaning
The WL-embedding mechanism of WWLK (Eq. 4) associates a weight w(u, v) ∈ℝ with each edge. For unlabeled graphs, w(u, v) is simply set to one. To distinguish between the discrete AMR edge labels, in WWLK we have used random weights. However, AMR edge labels encode complex relations between nodes, and simply choosing random weights may not be enough. In fact, we hypothesize that different edge labels may impact the meaning similarity of AMR graphs in different ways. Whereas a modifier relation in an AMR graph configuration may or may not have a significant influence on the overall AMR graph similarity, an edge representing negation is bound to have a significant influence on the similarity of different AMR graphs. Consider the example in Figure 3: In the top figure, we embed AMRs for The pretty warbler sings and The bird sings gently, which have similar meanings. In the bottom figure, the second AMR has been changed to express the meaning of The bird doesn’t sing, which clearly reduces the meaning similarity of the two AMRs. Hence, we hypothesize that learning edge parameters for different AMR relation types may help to better adjust the graph embeddings, such that the Wasserstein distance may increase or decrease, depending on the specific meaning of AMR relation labels, and thus to better capture global meaning differences between AMRs (as outlined in Figure 3: fθ).
Formally, to make the Wasserstein Weisfeiler-Leman kernel better account for edge-labeled AMR graphs, we learn a parameter set Θ that consists of parameters θ_edgeLabel, where edgeLabel indicates the semantic relation, i.e., edgeLabel ∈ L = {:arg0, :arg1, …, :polarity, …}. Hence, in Eq. 4, we can set w(u, v) = θ_label(u,v) and apply the multiplication θ_label(u,v) ⋅ x_k(u). To facilitate the multiplication, we may either learn a matrix Θ ∈ ℝ^(|L|×d) or a parameter vector Θ ∈ ℝ^(|L|). In this paper, we constrain ourselves to the latter setting, that is, our goal is to learn a parameter vector Θ ∈ ℝ^(|L|).
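A minimal sketch of how such per-relation parameters can enter the weighting is given below; the relation inventory and helper name are illustrative, and the resulting matrix can be plugged into a WL update like the `wl_embed` sketch above.

```python
import numpy as np

# Illustrative relation inventory; in practice L covers all AMR edge labels.
EDGE_LABELS = [":arg0", ":arg1", ":mod", ":polarity"]
theta = {lab: 1.0 for lab in EDGE_LABELS}  # one scalar per label, tuned later via SPSA

def weighted_adjacency(n_nodes, edges, theta):
    """Build w(u, v) = theta[label(u, v)] for an edge-labeled AMR graph.

    edges: list of (u, v, label) triples with integer node indices.
    """
    adj = np.zeros((n_nodes, n_nodes))
    for u, v, lab in edges:
        adj[u, v] = adj[v, u] = theta.get(lab, 1.0)  # undirected, label-parameterized weight
    return adj
```

A large learned weight for, say, :polarity then pushes negated subgraphs further apart in the embedding space, which is exactly the effect motivated above.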
Learning Edge Labels with Direct Feedback
To find suitable edge parameters Θ, we propose a zeroth order (gradient-free [Conn et al., 2009]) optimization setup, which has the advantage that we can explicitly teach our metric to better correlate with human ratings, optimizing the desired correlation objective without detours. In our case, we apply a simultaneous perturbation stochastic approximation (SPSA) procedure to estimate gradients (Spall, 1987, 1998; Wang, 2020).5
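As a sketch of the idea (parameter names and step sizes are illustrative, not the paper's settings), one SPSA step estimates a full gradient from just two evaluations of the correlation objective:

```python
import numpy as np

def spsa_step(theta, objective, a=0.1, c=0.05):
    """One SPSA update for a parameter vector theta (one weight per AMR edge label).

    objective: callable returning the score to maximize, e.g., Pearson correlation
    between metric scores and human ratings on a training batch. Only two objective
    evaluations are needed per gradient estimate, regardless of the dimension of theta.
    """
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # random perturbation direction
    f_plus = objective(theta + c * delta)
    f_minus = objective(theta - c * delta)
    grad_est = (f_plus - f_minus) / (2.0 * c * delta)  # simultaneous-perturbation estimate
    return theta + a * grad_est  # ascend the estimated gradient
```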
4 Bamboo: Creating the First Benchmark for AMR Similarity Metrics
We now describe the creation of Bamboo, which aims to provide the first benchmark that allows researchers to empirically (i) assess AMR metrics, (ii) compare AMR metrics, and possibly (iii) train AMR metrics.
Grounding AMR Similarity Metrics in Human Ratings of Semantic Sentence Similarity
Similarity Objectives
Accordingly, we select, as evaluation targets for AMR metrics, three notions of sentence similarity, which have previously been operationalized in terms of human-rated evaluation datasets: (i) the semantic textual similarity (STS) objective from Baudiš et al. (2016a, b); (ii) the sentence relatedness objective (SICK) from Marelli et al. (2014); (iii) the paraphrase detection objective (PARA) by Dolan and Brockett (2005).
Each of these three evaluation data sets can be seen as a set of pairs of sentences (si, si′) with an associated score humanScore(⋅) that reflects semantic similarity (STS), semantic relatedness (SICK), or whether the sentences are paraphrastic (PARA). Hence, each of these data sets can be described as {(si, si′, humanScore(si, si′))}. Both STS and SICK offer scores on Likert scales, ranging from equivalence (max) to unrelated (min), while PARA scores are binary, judging sentence pairs as being paraphrases (1), or not (0). We min-max normalize the Likert scale scores to the range [0,1] to facilitate standardized evaluation.
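For concreteness, the rescaling is the standard min-max normalization (a trivial helper, shown here only for clarity):

```python
def min_max_normalize(scores):
    """Map Likert-scale ratings onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]
```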
For Bamboo, we replace each pair (si, si′) with their AMR parses (pi = parse(si), pi′ = parse(si′)), transforming the data into {(pi, pi′, humanScore(si, si′))}. This provides the main partition of the benchmarking data for Bamboo, henceforth denoted as Main.6 Statistics of Main are shown in Table 1. The sentences in PARA are longer compared to STS and SICK. The corresponding AMR graphs are, on average, much larger in number of nodes, but less complex with respect to the average density.7
Table 1: Statistics of the Main partition: data instances, sentence length, and graph statistics (number of nodes, density).

| source | data instances (train/dev/test) | s. length avg. | s. length 50th | # nodes avg. | # nodes 50th | density avg. | density 50th |
|---|---|---|---|---|---|---|---|
| STS | 5749/1500/1379 | 9.9 | 8 | 14.1 | 12 | 0.10 | 0.08 |
| SICK | 4500/500/4927 | 9.6 | 9 | 10.7 | 10 | 0.11 | 0.1 |
| PARA | 3576/500/1275 | 18.9 | 19 | 30.6 | 30 | 0.04 | 0.04 |
AMR Construction
We choose a strong parser that achieves high scores in the range of human-human inter-annotator agreement estimates in AMR banking: The parser yields 0.80–0.83 Smatch F1 on AMR2 and AMR3. The parser, henceforth denoted as T5S2S, is based on an AMR fine-tuned T5 language model (Raffel et al., 2019) and produces AMRs in a sequence-to-sequence fashion.8 It is on par with the current state-of-the-art that similarly relies on seq-to-seq (Xu et al., 2020), but the T5 backbone alleviates the need for massive MT pre-training. To obtain a better picture of the graph quality, we perform manual quality inspections.
Manual Data Quality Assessment: Three-way Graph Quality Ratings
From each data set (SICK, STS, PARA) we randomly select 100 sentences and create their parses with T5S2S. Additionally, to establish a baseline, we also parse the same sentences with the GPLA parser of Lyu and Titov (2018), a neural graph prediction system that uses latent alignments (which reports 74.4 Smatch score on AMR2). This results in 300 GPLA parses and 300 T5S2S parses. A human annotator9 inspects the (shuffled) sample and assigns three-way labels: flawed—an AMR contains critical errors that distort the meaning significantly; silver—an AMR contains small errors that can potentially be neglected; gold—an AMR is acceptable.
Results in Table 2 show that the quality of T5S2S parses is substantially better than the baseline in all three data sets. The percentage of gold parses increases considerably (STS: +11pp, SICK: +10pp, PARA: +11pp) while the percentage of flawed parses drops notably (STS: −15pp, SICK: −8pp, PARA: −23pp). The increases in gold parses and decreases in flawed parses are significant in all data sets (p < 0.05, 10,000 bootstrap samples of the sample means).10
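The significance test can be pictured with a small resampling sketch (the exact setup follows the hypotheses stated in the notes; the helper below is a simplified illustration):

```python
import random

def bootstrap_p(gold_t5s2s, gold_gpla, n_resamples=10_000, seed=0):
    """One-sided bootstrap test on the difference of sample means (simplified sketch).

    gold_t5s2s, gold_gpla: 0/1 indicators (e.g., 1 if a parse was rated 'gold').
    Returns the fraction of resampled mean differences that do not exceed zero.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_resamples):
        a = [rng.choice(gold_t5s2s) for _ in gold_t5s2s]
        b = [rng.choice(gold_gpla) for _ in gold_gpla]
        if sum(a) / len(a) - sum(b) / len(b) <= 0:
            count += 1
    return count / n_resamples
```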
Table 2: Manual three-way quality ratings of GPLA vs. T5S2S parses (percentages; interval estimates in brackets).

| | Parser | %gold ↑ | %silver | %flawed ↓ |
|---|---|---|---|---|
| STS | GPLA | 43 [33,53] | 37 [28,46] | 20 [12,27] |
| | T5S2S | 54 [44,64]†‡ | 41 [31,50] | 5 [0,9]†‡ |
| SICK | GPLA | 38 [28,47] | 49 [39,59] | 13 [6,19] |
| | T5S2S | 48 [38,58]† | 47 [37,57] | 5 [0,9]†‡ |
| PARA | GPLA | 9 [3,14] | 52 [43,62] | 39 [29,48] |
| | T5S2S | 21 [13,29]†‡ | 63 [54,73]†‡ | 16 [8,23]†‡ |
| ALL | GPLA | 30 [25,35] | 46 [40,52] | 24 [19,29] |
| | T5S2S | 41 [35,46]†‡ | 50 [45,56] | 9 [5,12]†‡ |
5 Bamboo: Robustness Challenges
Besides benchmarking AMR metric scores against human ratings, we are also interested in assessing a metric’s robustness under meaning-preserving and -altering graph transformations. Assume we are given any pair of AMRs from paraphrases. A small change in structure or node content can lead to two outcomes: The graphs still represent paraphrases, or they do not. We consider a metric to be robust if its ratings correctly reflect such changes.
Specifically, we propose three transformation strategies. (i) Reification (Reify↯), which changes the graph’s surface structure, but not its meaning; (ii) Concept synonym replacement (Syno↯), which also preserves meaning and may or may not change the graph surface structure; (iii) Role confusion (Arg↯), which applies small changes to the graph structure that do not preserve its meaning.
5.1 Meaning-preserving Transforms
Reification Transform (Reify↯)
Table 3: Statistics of applied transformation operations per graph.

| | STS mean | STS th | SICK mean | SICK th | PARA mean | PARA th |
|---|---|---|---|---|---|---|
| Reify↯-OPS | 2.74 | [1, 2, 4] | 1.17 | [0, 1, 2] | 5.14 | [3, 5, 7] |
| Syno↯-OPS | 0.80 | [0, 1, 2] | 1.31 | [0, 1, 2] | 1.30 | [0, 1, 2] |
| Arg↯-OPS | 1.33 | [1, 1, 2] | 1.11 | [1, 1, 1] | 1.80 | [1, 2, 2] |
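As an illustration of the kind of rewrite Reify↯ performs, the sketch below reifies relation triples into their frames; the single-entry table is only an example, and the full inventory follows the AMR guidelines linked in the notes.

```python
# Example entry: per the AMR guidelines, ':location' reifies to 'be-located-at-91'
# (:ARG1 = thing located, :ARG2 = location). The table here is deliberately tiny.
REIFY = {":location": ("be-located-at-91", ":ARG1", ":ARG2")}

def reify(triples):
    """Meaning-preserving rewrite of relation triples into reified frames.

    triples: list of (source, relation, target). E.g., (c, :location, k) becomes
    (n, instance, be-located-at-91), (n, :ARG1, c), (n, :ARG2, k) for a fresh node n.
    """
    out, fresh = [], 0
    for s, rel, t in triples:
        if rel in REIFY:
            frame, r1, r2 = REIFY[rel]
            n = f"ri{fresh}"
            fresh += 1
            out += [(n, "instance", frame), (n, r1, s), (n, r2, t)]
        else:
            out.append((s, rel, t))
    return out
```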
Synonym Concept Node Transform (Syno↯)
Here, we iterate over AMR concept nodes. For any node that involves a predicate from PropBank, we consult a manually created database of (near-)synonyms that are also contained in PropBank, and sample one for replacement. For example, some sense of fall is near-equivalent to a sense of decrease (car prices fell/decreased). For concepts that are not predicates we run an ensemble of four WSD solvers12 (based on the concept and the sentence underlying the AMR) to identify its WordNet synset. From this synset we sample an alternative lemma.13 If an alternative lemma consists of multiple tokens where modifiers precede the noun, we replace the node with a graph substructure. So, if the concept is man and we sample adult_male, we replace ‘instance(x, man)’ with ‘mod(x, y) ∧ instance(y, adult) ∧ instance(x, male)’. Data processing statistics are shown in Table 3.
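A much simplified sketch of the non-predicate branch is given below; it naively takes the first WordNet synset, whereas the actual setup disambiguates with the four-solver WSD ensemble listed in the notes and only replaces when all solvers agree.

```python
import random
from nltk.corpus import wordnet as wn  # assumes the NLTK WordNet corpus is installed

def sample_synonym(concept, sentence=None):
    """Sample an alternative lemma for a non-predicate AMR concept (simplified).

    'sentence' is where the WSD context would be used in the real pipeline; here
    we simply fall back to the first synset for illustration.
    """
    synsets = wn.synsets(concept)
    if not synsets:
        return concept
    alternatives = [l for l in synsets[0].lemma_names() if l.lower() != concept.lower()]
    return random.choice(alternatives) if alternatives else concept

# sample_synonym("man") may yield "adult_male", which is then expanded into the
# subgraph mod(x, y), instance(y, adult), instance(x, male) as described above.
```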
5.2 Meaning-altering Graph Transforms
Role Confusion (Arg↯)
A naïve AMR metric could be one that treats an AMR as a bag-of-nodes, omitting structural information, such as edges and edge-labels. Such metrics could exhibit misleadingly high correlation scores with human ratings, solely due to a high overlap in concept content.
In some cases g may create a meaning that still makes sense (The tiger bites the snake. → The snake bites the tiger.), while in others, g may induce a non-sensical meaning (The tiger jumps on the rock. → The rock jumps on the tiger.). However, this is not our primary concern, since in all cases, applying g achieves our main goal: It returns a different meaning that turns a paraphrase-relation between two AMRs into a non-paraphrastic one.
To implement Arg↯, for each data set (PARA, STS, SICK) we create one new data subset. First, (i) we collect all paraphrases from the initial data (in SICK and STS these are pairs with maximum human score).14 (ii) We iterate over the AMR pairs and randomly select the first or second AMR from the tuple. We then collect all n nodes with more than one outgoing edge. If n = 0, we skip this AMR pair (the pair will not be contained in the data). If n > 0, we apply the meaning-altering function g and randomly flip edge labels. Finally, we add the original pair to our data with the label paraphrase, and the altered pair with the label non-paraphrase (cf. Figure 5). Per graph, we allow a maximum of 3 role confusion operations (see Table 3 for processing statistics).
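A minimal sketch of this role confusion procedure (the triple encoding and helper name are illustrative):

```python
import random

def role_confusion(triples, max_ops=3):
    """Meaning-altering transform: flip role labels on nodes with >1 outgoing edge.

    triples: list of (source, relation, target). Returns None if no node qualifies
    (the AMR pair is then skipped), else a copy with up to max_ops label flips.
    """
    outgoing = {}
    for i, (s, rel, _) in enumerate(triples):
        if rel != "instance":
            outgoing.setdefault(s, []).append(i)
    candidates = [idxs for idxs in outgoing.values() if len(idxs) > 1]
    if not candidates:
        return None
    new = list(triples)
    for idxs in random.sample(candidates, min(max_ops, len(candidates))):
        i, j = random.sample(idxs, 2)
        (s1, r1, t1), (s2, r2, t2) = new[i], new[j]
        new[i], new[j] = (s1, r2, t1), (s2, r1, t2)  # swap the two role labels
    return new
```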
5.3 Discussion
Safety of Robustness Objectives
We have proposed three challenging robustness objectives. Reify↯ changes the graph structure, but preserves the meaning. Arg↯ keeps the graph structure (modulo edge labels) while changing the meaning. Syno↯ changes node labels and possibly the graph structure and aims at preserving the meaning.
Reify↯ and Arg↯ are fully safe: they are well defined and are guaranteed to fulfill our goal (Eq. 12 and 17): meaning-preserving or -altering graph transforms. Syno↯ is more experimental and has (at least) three failure modes. In the first mode, depending on context, a human similarity judgment could change when near-synonyms are chosen (sleep → doze, a young cat → kitten, etc.). The second mode occurs when WSD commits an error (e.g., minister (political sense) → priest). A third mode concerns societal biases found in WordNet (e.g., the node girl may be mapped onto its ‘synonym’ missy). The third mode may not really be a failure, since it may not change the human rating, but, nevertheless, it may be undesirable.
In conclusion, Reify↯ and Arg↯ constitute safe robustness challenges, while results on Syno↯ have to be taken with a grain of salt.
Status of the Challenges in Bamboo and Outlook
We believe that a key benefit of the robustness challenges lies in their potential to provide complementary performance indicators, in addition to evaluation on the Main partition of Bamboo (cf. §4). In particular, the challenges may serve to assess metrics more deeply, uncover potential weak spots, and help select among metrics, for example, when performance differences on Main are small. In this work, however, the complementary nature of Reify↯, Syno↯ or Arg↯ versus Main is only reflected in the name of the partitions, and in our experiments, we consider all partitions equally. Future work may deviate from this setup.
Our proposed robustness challenges are also by no means exhaustive, and we believe that there is ample room for developing more challenges (extending Bamboo) or experimenting with different setups of our challenges (varying Bamboo15). For these reasons, it is possible that future work may justify alternative or enhanced setups, extensions, and variations of Bamboo.
6 Experimental Insights
Questions Posed to Bamboo
Bamboo allows us to address several open questions: The first set of questions aims to gain more knowledge about previously released metrics. That is, we would like to know: What semantic aspects of AMR does a metric measure? If a metric has hyper-parameters (e.g., SemBleu), which hyper-parameters are suitable (for a specific objective)? Does the costly alignment of Smatch pay off, by yielding better predictions, or do the faster alignment-free metrics offer a ‘free-lunch’? A second set of questions aims to evaluate our proposed novel AMR similarity metrics, and to assess their potential advantages.
Experimental Setup
We evaluate all metrics on the test set of Bamboo. The two hyper-parameters of S2match, that determine when concepts are similar, are set with a small search on the development set (by contrast, S2matchdefault denotes the default setup). WWLKθ is trained with batch size 16 on the training data. S2match, WWLK and WWLKθ all make use of GloVe embeddings (Pennington et al., 2014).
Our main evaluation metric is Pearson’s ρ between a metric’s output and the human ratings. Additionally, we consider two global performance measures to better rank AMR metrics: the arithmetic mean (amean) and the harmonic mean (hmean) over a metric’s results achieved in all tasks. Hmean is always ≤ amean and is driven by low outliers. Hence, a large difference between amean and hmean serves as a warning light for a metric that is extremely vulnerable in a specific task.
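The evaluation loop thus reduces to per-task correlations plus two aggregates; a small sketch (helper name is ours):

```python
from statistics import harmonic_mean
from scipy.stats import pearsonr

def metric_profile(metric_scores_per_task, human_ratings_per_task):
    """Correlate metric outputs with human ratings per task, then aggregate.

    Both arguments are lists of per-task score lists. Returns the per-task Pearson
    correlations, their arithmetic mean (amean), and their harmonic mean (hmean).
    A large amean-hmean gap flags a collapse on at least one task.
    """
    rs = [pearsonr(m, h)[0] for m, h in zip(metric_scores_per_task, human_ratings_per_task)]
    amean = sum(rs) / len(rs)
    hmean = harmonic_mean(rs)  # assumes all correlations are positive
    return rs, amean, hmean
```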
6.1 Bamboo Studies Previous Metrics
Table 4 shows AMR metric results on Bamboo across all three human similarity rating types (STS, SICK, PARA) and our four challenges: Main represents the standard setup (cf. §4), whereas Reify↯, Syno↯, and Arg↯ test the metric robustness (cf. §5).
Table 4: AMR metric results on Bamboo (Pearson’s ρ × 100) across Main, Reify↯, Syno↯, and Arg↯ for STS, SICK, and PARA, with arithmetic (amean) and harmonic (hmean) means over all tasks.

| metric | speed | align | Main STS | Main SICK | Main PARA | Reify↯ STS | Reify↯ SICK | Reify↯ PARA | Syno↯ STS | Syno↯ SICK | Syno↯ PARA | Arg↯ STS | Arg↯ SICK | Arg↯ PARA | amean | hmean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Smatch | – | ✓ | 58.45 | 59.72 | 41.25 | 57.98 | 61.81 | 39.66 | 56.14 | 57.39 | 39.58 | 48.05 | 70.53 | 24.75 | 51.28 | 47.50 |
| WSmatch | – | ✓ | 53.06 | 59.24 | 38.64 | 53.39 | 61.17 | 37.49 | 51.41 | 57.56 | 37.85 | 42.47 | 66.79 | 22.68 | 48.48 | 44.58 |
| S2matchdefault | – | ✓ | 56.38 | 58.15 | 42.16 | 55.65 | 60.04 | 40.41 | 56.05 | 57.17 | 40.92 | 46.51 | 70.90 | 26.58 | 50.91 | 47.80 |
| S2match | – | ✓ | 58.82 | 60.42 | 42.55 | 58.08 | 62.25 | 40.60 | 56.70 | 57.92 | 41.22 | 48.79 | 71.41 | 27.83 | 52.22 | 49.07 |
| Sema | ++ | ✗ | 55.90 | 53.32 | 33.43 | 55.51 | 56.16 | 32.33 | 50.16 | 48.87 | 29.11 | 49.73 | 68.18 | 22.79 | 46.29 | 41.85 |
| SemBleu k=1 | ++ | ✗ | 66.03 | 62.88 | 39.72 | 61.76 | 62.10 | 38.17 | 61.83 | 58.83 | 37.10 | 1.99 | 1.47 | 1.40 | 41.11 | 5.78 |
| SemBleu k=2 | ++ | ✗ | 60.62 | 59.86 | 36.88 | 57.68 | 59.64 | 36.24 | 57.34 | 56.18 | 33.26 | 44.54 | 67.54 | 16.60 | 48.87 | 42.13 |
| SemBleu k=3 | ++ | ✗ | 56.49 | 57.76 | 32.47 | 54.84 | 57.70 | 33.25 | 52.82 | 53.47 | 28.44 | 49.06 | 69.49 | 24.27 | 47.50 | 42.82 |
| SemBleu k=4 | ++ | ✗ | 53.19 | 56.69 | 29.61 | 52.28 | 56.12 | 30.11 | 49.31 | 52.11 | 25.56 | 49.75 | 69.58 | 29.44 | 46.15 | 41.75 |
| WLK (ours) | ++ | ✗ | 64.86 | 61.52 | 37.35 | 62.69 | 62.55 | 36.49 | 59.41 | 56.60 | 33.71 | 45.89 | 64.70 | 19.47 | 50.44 | 44.35 |
| WWLK (ours) | + | ✓ | 63.15 | 65.58 | 37.55 | 59.78 | 65.53 | 35.81 | 59.40 | 59.98 | 32.86 | 13.98 | 42.79 | 7.16 | 45.30 | 28.83 |
| WWLKΘ (ours) | + | ✓ | 66.94 | 67.64 | 37.91 | 64.34 | 65.49 | 39.23 | 60.11 | 62.29 | 35.15 | 55.03 | 75.06 | 29.64 | 54.90 | 50.26 |
Smatch and S2match Rank 1st and 2nd Among Previous Metrics
Smatch, our baseline metric, provides strong results across all tasks (Table 4, amean: 51.28). With default parameters, S2matchdefault performs slightly worse on the main data for STS and SICK, but improves upon Smatch on PARA, achieving a slight overall improvement with respect to hmean (+0.30), but not amean (−0.37). S2match is more robust against Syno↯ (e.g., +4.6 on Syno↯ STS vs. Smatch), and when confronted with reified graphs (Reify↯ STS +3.3 vs. Smatch).
Finally, S2match, after setting its two hyper-parameters with a small search on the development data16, consistently improves upon Smatch over all tasks (amean: +0.94, hmean: +1.57).
WSmatch: Are Nodes Near the Root More Important?
The hypothesis underlying WSmatch is that concepts that are located near the top of an AMR have a higher impact on AMR similarity ratings. Interestingly, WSmatch mostly falls short of Smatch, offering substantially lower performance on all main tasks and all robustness checks, resulting in reduced overall amean and hmean scores (e.g., main STS: −5.39 vs. Smatch, amean: −2.8 vs. Smatch, hmean: −2.9 vs. Smatch). This contradicts the ‘core-semantics’ hypothesis and provides novel evidence that semantic concepts that influence human similarity ratings are not necessarily located close to AMR roots.17
BFS-based Metrics I: Sema Increases Speed but Pays a Price
Next, we find that Sema achieves lower scores in almost all categories, when compared with Smatch (amean: −4.99, hmean −5.65), ending up at rank 7 (according to hmean and amean) among prior metrics. It is similar to Smatch in that it extracts triples from graphs, but differs by not providing an alignment. Therefore, it can only loosely model some phenomena, and we conclude that the increase in speed comes at the cost of a substantial drop in modeling capacity.
BFS-based Metrics II: SemBleu is Fast, but is Sensitive to k
Results for SemBleu show that it is very sensitive to the parameterization of k. Notably, k = 1, which means that the method only extracts bags of nodes, achieves strong results on SICK and STS. On PARA, however, SemBleu is outperformed by S2match for all settings of k (best k (k = 2): −2.8 amean, −4.7 hmean). Moreover, all variants of SemBleu are vulnerable to robustness checks. For example, k = 2 and, naturally, k = 1 are easily fooled by Arg↯, where performance drops massively. k = 4, on the other hand, is most robust against Arg↯, but overall it falls behind k = 2.
Since SemBleu is asymmetric, we also re-compute the metric in a ‘symmetric’ way by averaging the metric result over different argument orders. We find that this can slightly increase its performance ([k, amean, hmean]: [1, +0.8, +0.6]; [2, +0.5, +0.4]; [3, +0.2, +0.2]; [4, +0.1, +0.0]).
In sum, our conclusions concerning SemBleu are: (i) SemBleuk =1 (but not SemBleuk =3) performs well when measuring similarity and relatedness. However, SemBleuk =1 is naïve and easily fooled (Arg↯). (ii) Hence, we recommend k = 2 as a good tradeoff between robustness and performance, with overall rank 4 (amean) and 6 (hmean).18
6.2 Bamboo Assesses Novel Metrics
We now discuss results of our proposed metrics based on the Weisfeiler-Leman Kernel.
Standard Weisfeiler-Leman (WLK) is Fast and a Strong Baseline for AMR Similarity
First, we visit the classic Weisfeiler-Leman kernel. Like SemBleu and Sema, the (alignment-free) method is very fast. However, it outperforms these metrics in almost all tasks (score difference against the second-best alignment-free metric, [a∣h]mean: +1.6, +1.5), but falls behind alignment-based Smatch ([a∣h]mean: −0.8, −3.2). Specifically, WLK proves robust against Reify↯ but appears more vulnerable against Syno↯ (−5 points on STS and SICK) and Arg↯ (notably PARA, with −10 points).19
The better performance, compared to SemBleu and Sema, may be due to the fact that WLK (unlike SemBleu and Sema) does not perform BFS traversal from the root, which may reduce biases.
WWLK and WWLKθ Obtain First Ranks
Basic WWLK exhibits strong performance on SICK (ranking second on main and first on Reify↯). However, it has large vulnerabilities, as exposed by Arg↯, where only SemBleuk =1 ranks lower. This can be explained by the fact that WWLK (7.2 Pearson’s ρ on PARA Arg↯) only weakly considers the semantic relations (whereas SemBleuk =1 does not consider semantic relations in the first place).
WWLKΘ, our proposed algorithm for edge label learning, mitigates this vulnerability (29.6 Pearson’s ρ on PARA Arg↯, 1st rank). Learning edge labels also helps assessing similarity (STS) and relatedness (SICK), with substantial improvements over standard WWLK and Smatch (STS: 66.94, +3.9 vs. WWLK and +10.6 vs. Smatch; SICK +2.1 vs. WWLK and +8.4 vs. Smatch).
In sum, WWLKθ occupies rank 1 of all considered metrics (amean and hmean), outperforming all non-alignment based metrics by large margins (amean +4.5 vs. WLK and +6.0 vs. SemBleuk =2; hmean +5.9 vs. WLK and +8.1 vs. SemBleuk =2), but also the alignment-based ones, albeit by lower margins (amean +2.7 vs. S2match; hmean + 1.2 vs. S2match).
6.3 Analyzing Hyper-parameters of (W)WLK
Setting K in (W)WLK
How does setting the number of iterations in Weisfeiler-Leman affect predictions? Table 5 shows K = 2 is a good choice for all WLK variants. K = 3 slightly increases performance in the latent variants (WWLK: +0.4 amean; WWLKθ: +0.3 amean), but lowers performance for the fast symbolic matching WLK (−2.8 amean). This drop is somewhat expected: K > 2 introduces much sparsity in the symbolic WLK feature space.
Table 5: Effect of the number of WL iterations K (amean/hmean over all tasks).

| | basic (K=2) amean | basic (K=2) hmean | K=1 amean | K=1 hmean | K=3 amean | K=3 hmean | K=4 amean | K=4 hmean |
|---|---|---|---|---|---|---|---|---|
| WLK | 50.4 | 44.4 | 49.8 | 44.2 | 47.6 | 42.4 | 46.4 | 41.5 |
| WWLK | 45.3 | 28.8 | 43.4 | 15.3 | 45.7 | 31.4 | 42.3 | 24.0 |
| WWLKθ | 54.9 | 50.3 | 52.2 | 35.4 | 55.2 | 51.1 | 50.8 | 47.3 |
WL Message Passing Direction
Even though AMR defines directional edges, it was not a priori clear in which directions the node contextualization should be restricted for optimal similarity ratings. Therefore, so far, our WLK variants have treated AMR graphs as undirected graphs. In this experiment, we study three alternate scenarios: ‘TOP-DOWN’ (forward), where information is only passed in the direction that AMR edges point at; ‘BOTTOM-UP’ (backwards), where information is exclusively passed in the opposite direction; and ‘2WAY’, where information is passed forwards, but for every edge edge(x, y) we insert an edge−1(y, x). 2WAY facilitates more node interactions than either TOP-DOWN or BOTTOM-UP, while preserving directional information.
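The four regimes only differ in how the labeled edge set is rewritten before message passing; a small sketch (mode names mirror Table 6):

```python
def directionalize(edges, mode="undirected"):
    """Rewrite labeled AMR edges for the message-passing regimes compared in Table 6.

    edges: list of (u, label, v) with u --label--> v in AMR edge direction.
    """
    out = []
    for u, lab, v in edges:
        if mode == "top-down":
            out.append((u, lab, v))                     # only along the AMR edge direction
        elif mode == "bottom-up":
            out.append((v, lab, u))                     # only against the AMR edge direction
        elif mode == "undirected":
            out += [(u, lab, v), (v, lab, u)]           # same label in both directions
        elif mode == "2way":
            out += [(u, lab, v), (v, lab + "-inv", u)]  # inverse edge keeps direction info
    return out
```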
Our findings in Table 6 show a clear trend: treating AMR graphs as graphs with undirected edges offers better results than TOP-DOWN (e.g., WWLK −1.6 amean; −6.6 hmean) and considerably better results than BOTTOM-UP (e.g., WWLK −3.7 amean; −18.9 hmean). Overall, 2WAY behaves similarly to the standard setup, with a slight improvement for WWLKθ. Notably, the symbolic WLK variant, which does not use word embeddings, appears more robust in this experiment, and differences between the three directional setups are small.
Table 6: Effect of the WL message passing direction (amean/hmean over all tasks).

| | undirected amean | undirected hmean | TOP-DOWN amean | TOP-DOWN hmean | BOTTOM-UP amean | BOTTOM-UP hmean | 2WAYS amean | 2WAYS hmean |
|---|---|---|---|---|---|---|---|---|
| WLK | 50.4 | 44.4 | 50.3 | 44.3 | 50.2 | 43.8 | 49.5 | 41.8 |
| WWLK | 45.3 | 28.8 | 43.7 | 22.0 | 41.6 | 9.9 | 44.8 | 24.1 |
| WWLKθ | 54.9 | 50.3 | 53.8 | 46.1 | 50.2 | 18.7 | 55.3 | 51.0 |
6.4 Revisiting the Data Quality in Bamboo
Initial quality analyses (§4) suggested that the quality of Bamboo is high, with a large proportion of AMR graphs that are of gold or silver quality. In this experiment, we study how metric rankings and predictions could change when confronted with AMRs corrected by humans. From every data set, we randomly sample 50 AMR graph pairs (300 AMRs in total). In each AMR, the human annotator searched for mistakes, and corrected them.20
We study two settings. (i) Intra metric agreement (IMA): For every metric, we calculate the correlation of its predictions for the initial graph pairs versus the predictions for the graph pairs that are ensured to be correct. Note that, on one hand, a high IMA for all metrics would further corroborate the trustworthiness of Bamboo results. However, on the other hand, a high IMA for a single metric cannot be interpreted as a marker for this metric’s quality. That is, a maximum IMA (1.0) could also indicate that a metric is completely insensitive to the human corrections. Furthermore, we study (ii) Metric human agreement (MHA): Here, we correlate the metric scores against human ratings, once when fed the fully gold-ensured graph pairs and once when fed the standard graph pairs. Both measures, IMA and MHA, can provide us with an indicator of how much metric ratings would change if Bamboo were fully human corrected.
Results are shown in Table 7. All metrics exhibit high IMA, suggesting that potential changes in their ratings, when fed gold-ensured graphs, are quite small. Furthermore, on average, all metrics tend to exhibit slightly better correlation with the human ratings when computed on the gold-ensured graph pairs. However, supporting the assessment of IMA, the increments in MHA appear small, ranging from a minimum increment of +0.3 (SemBleu) to a maximum increment of +2.8 (S2match), whereas WWLK yields an increment of +1.8. Generally, while this assessment has to be taken with a grain of salt due to the small sample size, it overall supports the validity of Bamboo results.
Table 7: Metric human agreement (MHA) and intra metric agreement (IMA) on the re-annotated sample; MHA brackets show correlations computed on the standard and on the gold-corrected graph pairs.

| | STS MHA | STS IMA | SICK MHA | SICK IMA | PARA MHA | PARA IMA | AVERAGE MHA | AVERAGE IMA |
|---|---|---|---|---|---|---|---|---|
| SM | [71, 73] | 97.9 | [66, 66] | 99.9 | [44, 44] | 97.9 | [60, 61] | 98.6 |
| WSM | [64, 65] | 99.2 | [67, 67] | 99.8 | [47, 49] | 98.7 | [59, 60] | 99.2 |
| S2Mdef | [69, 70] | 97.7 | [62, 63] | 99.3 | [44, 47] | 97.7 | [58, 60] | 98.2 |
| S2M | [71, 73] | 97.8 | [69, 70] | 98.6 | [41, 46] | 98.0 | [60, 63] | 98.1 |
| SE | [66, 66] | 97.7 | [55, 55] | 100 | [42, 46] | 99.0 | [55, 56] | 98.9 |
| SB2 | [68, 68] | 97.2 | [62, 62] | 99.8 | [41, 42] | 98.8 | [57, 58] | 98.6 |
| SB3 | [66, 66] | 98.4 | [63, 63] | 99.7 | [33, 34] | 99.3 | [54, 54] | 99.1 |
| WLK | [72, 72] | 98.2 | [65, 65] | 99.8 | [43, 46] | 97.9 | [60, 61] | 98.6 |
| WWLK | [77, 78] | 97.8 | [65, 67] | 98.1 | [42, 46] | 97.8 | [61, 63] | 97.9 |
| WWLKθ | [78, 78] | 96.8 | [67, 68] | 98.1 | [48, 48] | 96.7 | [64, 65] | 97.2 |
6.5 Discussion
Align or not Align?
We can group metrics for graph-based meaning representations into whether they compute an alignment between AMRs or not (Liu et al., 2020). A computed alignment, as in Smatch, has the advantage that it lets us assess finer-grained AMR graph similarities and divergences, by creating and exploiting a mapping that shows which specific substructures of two graphs are more or less similar to each other. On the other hand, it was still an open question whether such an alignment is worth its computational cost and enhances similarity judgments.
Experiments on Bamboo provide novel evidence on this matter: alignment-based metrics may be preferred for better accuracy. Non-alignment-based metrics may be preferred if speed matters most. The latter situation may occur, for example, when AMR metrics must be executed over a large cross-product of parses (for instance, to semantically cluster sentences from a corpus). For a balanced approach, WWLKΘ offers a good trade-off: polynomial-time alignment and high accuracy.
Example Discussion I: Wasserstein Transportation Analysis Explains Disagreement
Figure 6 (top) shows an example where the human-assigned similarity score is relatively low (rank 1164 of 1379). Because the graphs have the same structure (x arg0 y; x arg1 z), the previous metrics (except Sema) tend to assign similarities that are relatively too high. In particular, S2match finds the exact same alignments in this case, but cannot assess the concept relations more deeply. WWLK yields more informative alignments, since they explain its decision to assign a more appropriate lower rank (1253 of 1379): Substantial work is needed to transport, for example, carry-01 to slice-01.
Example Discussion II: The Value of n:m Alignments
Figure 6 (bottom) shows that WWLK produces valuable n:m alignments (play-11 vs. make-01 and music), which are needed to properly reflect similarity (note that Smatch, WSmatch, and S2match only provide 1-1 alignments). Yet, the example also shows that there is still a way to go. While humans assess this near-equivalence easily, providing a relatively high score (rank 331 of 4972), all metrics considered in this paper, including ours, assign relative ranks that are too low (WWLK: 2624). Future work may incorporate external PropBank (Palmer et al., 2005) knowledge into AMR metrics. In PropBank, sense 11 of play is defined as equivalent to making music.
7 Conclusion
Our contributions in this work are three-fold: (i) We propose a suite of novel Weisfeiler-Leman AMR similarity metrics that are able to reconcile a performance conflict between precision of AMR similarity ratings and the efficiency of computing alignments. (ii) We release Bamboo, the first benchmark that allows researchers to assess AMR metrics empirically, setting the stage for future work on graph-based meaning representation metrics. (iii) We showcase the utility of Bamboo, by applying it to profile existing AMR metrics, uncovering hitherto unknown strengths or weaknesses, and to assess the strengths of our newly proposed metrics that we derive and further develop from the classic Weisfeiler-Leman Kernel. We show that through Bamboo we are able to gain novel insight regarding suitable hyperparameters of different metric types, and to gain novel perspectives on how to further improve AMR similarity metrics to achieve better correlation with the degree of meaning similarity of paired sentences, as perceived by humans.
Acknowledgments
We are grateful to three anonymous reviewers and Action Editor Yue Zhang for their valuable comments that have helped to improve this paper. We are also thankful to Philipp Wiesenbach for giving helpful feedback on a draft of this paper. This work has been partially funded by the DFG through the project ACCEPT as part of the Priority Program “Robust Argumentation Machines” (SPP1999).
Notes
For example, in Figure 1, Smatch aligns drink-01 to slurp-01 and kitten to cat, resulting in a single matching triple (x, arg0, y).
We use https://pypi.org/project/pyemd.
It improves upon a classic Kiefer-Wolfowitz approximation (Kiefer et al., 1952) by requiring, per gradient estimate, only 2 objective function evaluations instead of 2n.
The other partitions, which are largely based on this data, will be introduced in §5.
The lower average density could be caused, e.g., by the fact that the PARA data is sampled from news sources, which means that the AMRs contain more named entity structures that usually have more terminal nodes.
The human annotator is a proficient English speaker and has worked several years with AMR.
ℋ0(gold): amount of gold graphs T5S2S ≤ amount of gold graphs GPLA; ℋ0(silver): amount of silver graphs T5S2S ≤ amount of gold graphs GPLA; ℋ0(flawed): amount of gold graphs T5S2S ≥ amount of gold graphs GPLA.
A complete list of reifications are given in the official AMR guidelines: https://github.com/amrisi/amr-guidelines/blob/master/amr.md.
‘Adapted lesk’, ‘Simple Lesk’, ‘Cosine Lesk’, ‘max sim’ (Banerjee and Pedersen, 2002; Lesk, 1986; Pedersen, 2007): https://github.com/alvations/pywsd.
To increase precision, we only perform this step if all solvers agree on the predicted synset.
This shrinks the train/dev/test size of STS (now: 474/106/158) and SICK (now: 246/50/238).
For example, we may reify only selected relations, or create more data by adjusting Eq. 13 such that f is applied to only one graph.
STS/SICK: τ = 0.90, τ′ = 0.95; PARA: τ = 0.0, τ′ = 0.95
Manual inspection of examples shows that low similarity can frequently be explained with differences in concrete concepts that tend to be distant to the root. For example, the low similarity (0.16) of Morsi supporters clash with riot police in Cairo vs. Protesters clash with riot police in Kiev arises mostly from Kiev and Cairo and Morsi, however, these names (as are names in general in AMR) are distant to the root region, which is similar in both graphs (clash, riot, protesters, supporters).
Setting k = 2 stands in contrast to the original paper that recommended k = 3, the common setting in MT. However, lower k in SemBleu reduces biases (Opitz et al., 2020), which may explain the better result on Bamboo.
Similar to SemBleu, we can mitigate this performance drop on Arg↯ PARA by increasing the amount of passes K in WLK, however, this decreases overall amean and hmean.
Overall, few corrections were necessary, as reflected in a high Smatch between corrected and uncorrected graphs: 95.1 (STS), 96.8 (SICK), 97.9 (PARA).