Abstract
As in many natural language processing tasks, data-driven models based on supervised learning have become the method of choice for semantic role labeling. These models tend to perform well when given sufficient amounts of labeled training data. Producing this data is costly and time-consuming, however, thus raising the question of whether unsupervised methods offer a viable alternative. The working hypothesis of this article is that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences based on three linguistic principles: (1) arguments in the same syntactic position (within a specific linking) bear the same semantic role, (2) arguments within a clause bear a unique role, and (3) clusters representing the same semantic role should be more or less lexically and distributionally equivalent. We present a method that implements these principles and formalizes the task as a graph partitioning problem, whereby argument instances of a verb are represented as vertices in a graph whose edges express similarities between these instances. The graph consists of multiple edge layers, each one capturing a different aspect of argument-instance similarity, and we develop extensions of standard clustering algorithms for partitioning such multi-layer graphs. Experiments for English and German demonstrate that our approach is able to induce semantic role clusters that are consistently better than a strong baseline and are competitive with the state of the art.
1. Introduction
Recent years have seen increased interest in the shallow semantic analysis of natural language text. The term is often used to describe the automatic identification and labeling of the semantic roles conveyed by sentential constituents (Gildea and Jurafsky 2002). Semantic roles describe the relations that hold between a predicate and its arguments (e.g., “who” did “what” to “whom”, “when”, “where”, and “how”) abstracting over surface syntactic configurations. This type of semantic information is shallow but relatively straightforward to infer automatically and useful for the development of broad-coverage, domain-independent language understanding systems. Indeed, the analysis produced by existing semantic role labelers has been shown to benefit a wide spectrum of applications ranging from information extraction (Surdeanu et al. 2003) and question answering (Shen and Lapata 2007), to machine translation (Wu and Fung 2009) and summarization (Melli et al. 2005).
The examples illustrate the fact that predicates can license several alternate mappings or linkings between their semantic roles and their syntactic realization. Pairs of linkings allowed by a single predicate are often called diathesis alternations (Levin 1993). Sentence pair (1a,b) is an example of the instrument subject alternation, and pair (1b,c) illustrates the causative alternation. Resolving the mapping between the syntactic dependents of a predicate (e.g., subject, object) and the semantic roles that they each express is one of the major challenges faced by semantic role labelers.
The semantic roles in the examples are labeled in the style of PropBank (Palmer, Gildea, and Kingsbury 2005), a broad-coverage human-annotated corpus of semantic roles and their syntactic realizations. Under the PropBank annotation framework each predicate is associated with a set of core roles (named A0, A1, A2, and so on) whose interpretations are specific to that predicate and a set of adjunct roles such as location or time whose interpretation is common across predicates (e.g., last night in sentence (1c)). The availability of PropBank and related resources (e.g., FrameNet; Ruppenhofer et al. 2006) has sparked the development of a variety of semantic role labeling systems, most of which conceptualize the task as a supervised learning problem and rely on role-annotated data for model training. Most of these systems implement a two-stage architecture consisting of argument identification (determining the arguments of the verbal predicate) and argument classification (labeling these arguments with semantic roles). Current approaches deliver reasonably good performance: a system will recall around 81% of the arguments correctly and 95% of those will be assigned a correct semantic role (see Màrquez et al. [2008] for details), although only on languages and domains for which large amounts of role-annotated training data are available.
Unfortunately, the reliance on labeled data, which is both difficult and expensive to produce, presents a major obstacle to the widespread application of semantic role labeling across different languages and text genres. Although corpora with semantic role annotations exist nowadays in other languages (e.g., German, Spanish, Catalan, Chinese, Korean), they tend to be smaller than their English equivalents and of limited value for modeling purposes. Even within English, a language for which two major annotated corpora are available, systems trained on PropBank demonstrate a marked decrease in performance (by approximately 10%) when tested on out-of-domain data (Pradhan, Ward, and Martin 2008). The data requirements for supervised systems and the current paucity of such data have given impetus to the development of unsupervised methods that learn from unlabeled data. If successful, unsupervised approaches could lead to significant resource savings and the development of semantic role labelers that require less engineering effort. Besides being interesting in their own right from a theoretical and linguistic perspective, unsupervised methods can provide valuable features for downstream (supervised) processing and serve as a preprocessing step for applications that require broad-coverage understanding. In this article we study the potential of unsupervised methods for semantic role labeling. As in the supervised case, we decompose the problem into an argument identification step and an argument classification step. Our work primarily focuses on argument classification, which we term role induction, because there is no predefined set of semantic roles in the unsupervised case, and these must be induced from data. The goal is to assign argument instances to clusters such that each cluster contains arguments corresponding to a specific semantic role and each role corresponds to exactly one cluster.
Unsupervised learning is known to be challenging for many natural language processing problems and role induction is no exception. Firstly, it is difficult to define a learning objective function whose optimization will yield an accurate model. This contrasts with the supervised setting, where the objective function can directly reflect training error (i.e., some estimate of the mismatch between model output and the gold standard) and the model can be tuned to replicate human output for a given input under mathematical guarantees regarding the accuracy of the trained model. Secondly, it is also more difficult to incorporate rich feature sets into an unsupervised model (Berg-Kirkpatrick et al. 2010). Unless we explicitly know exactly how features interact, more features may not necessarily lead to a more accurate model and may even decrease performance. In the supervised setting, feature interactions relevant for a particular learning task can be determined to a large extent automatically and thus a large number of them can be included even if their significance is not clear a priori.
The lack of an extensional definition (in the form of training examples) of the target concept makes a strong case for the development of unsupervised methods that use problem specific prior knowledge. The idea is to derive a strong inductive bias (Gordon and Desjardins 1995) based on this prior knowledge that will guide the learning towards the correct target concept. For semantic role induction, we propose to build on the following linguistic principles:
1. Semantic roles are unique within a particular frame.
2. Arguments occurring in a specific syntactic position within a specific linking all bear the same semantic role.
3. The (asymptotic) distribution over argument heads is the same for two clusters that represent the same semantic role.
We hypothesize that these three principles are, at least in theory, sufficient for inducing high-quality semantic role clusters. A challenge, of course, lies in adequately operationalizing them so that they guide the unsupervised learner towards meaningful solutions. The approach taken in this article translates these principles into estimates of similarity (or dissimilarity) between argument instances and/or clusters of argument instances. Principle (1) states that argument instances occurring in the same frame (i.e., clause) cannot bear the same semantic role, and are thus dissimilar. From Principle (2) it follows that arguments occurring in the same syntactic position within the same linking can be considered similar (leaving aside for the moment the difficulty of representing linkings through syntactic cues observable in a corpus). Principle (3) states that two clusters of instances containing similar distributions over head words should be considered similar.
Based on these similarity estimates we construct a graph whose vertices represent argument instances and whose edges express similarities between these instances. The graphs consist of multiple edge layers, each capturing one particular type of argument-instance similarity. For example, one layer will be used to represent whether argument instances occur in the same frame, and another layer will represent whether two arguments have a similar head word, and so on. Given this graph representation of the data, we formalize role induction as the problem of partitioning the graph into clusters of similar vertices. We present two algorithms for partitioning multi-layer graphs, which are adaptations of standard graph partitioning algorithms to the multi-layer setting. The algorithms differ in the way they exploit the similarity information encoded in the graph. The first one is based on agglomeration, where two clusters containing similar instances are grouped into a larger cluster. The second one is based on propagation, where role-label information is transferred from one cluster to another based on their similarity.
To understand how the aforementioned principles might allow us to handle the ambiguity stemming from alternate linkings, consider again Example (1). The most important thing to note is that, whereas the subject position is ambiguous with respect to the semantic roles it can express (it can be A0, A1, or A2), we can resolve the ambiguity by exploiting overt syntactic cues of the underlying linking. For example, the predicate break is transitive in sentences (1a) and (1b), and intransitive in sentence (1c). Thus, by taking into account the argument's syntactic position and the predicate's transitivity, we can guess that the semantic role expressed by the subject in sentence (1c) is different from the roles expressed by the subjects in sentences (1a,b). Now consider the more difficult case of distinguishing between the subjects in sentences (1a) and (1b). One linking cue that could help here is the prepositional phrase in sentence (1a), which results in a syntactic frame different from sentence (1b). Were the prepositional phrase omitted, we would attempt to disambiguate the linkings by resorting to lexical-semantic cues (e.g., by taking into account whether the subject is animate). In sum, if we encode sufficiently many linking cues, then the resulting fine-grained syntactic information will discriminate ambiguous semantic roles. In cases where syntactic cues are not discerning enough, we can exploit lexical information and group arguments together based on their lexical content.
The remainder of this article is structured as follows. Section 2 provides an overview of unsupervised methods for semantic role labeling. Sections 3 and 4 present the details of our method, that is, how the graphs are constructed and partitioned. Role induction experiments on English and German are described in Sections 5 and 6, respectively. Section 7 concludes with a discussion of future work.
2. Related Work
The bulk of previous work on semantic role labeling has focused on supervised methods (Màrquez et al. 2008), although a few semi-supervised and unsupervised approaches have been proposed. The majority of semi-supervised models have been developed within a framework known as annotation projection. The idea is to combine labeled and unlabeled data by projecting annotations from a labeled source sentence onto an unlabeled target sentence within the same language (Fürstenau and Lapata 2009) or across different languages (Padó and Lapata 2009). Beyond annotation projection, Gordon and Swanson (2007) propose to increase the coverage of PropBank to unseen verbs by finding syntactically similar (labeled) verbs and using their annotations as surrogate training data.
Swier and Stevenson (2004) were the first to introduce an unsupervised semantic role labeling system. Their algorithm induces role labels following a bootstrapping scheme where the set of labeled instances is iteratively expanded using a classifier trained on previously labeled instances. Their method starts with a data set containing no role annotations at all, but crucially relies on VerbNet (Kipper, Dang, and Palmer 2000) for identifying the arguments of predicates and making initial role assignments. VerbNet is a manually constructed lexicon of verb classes, each of which is explicitly associated with argument realization and semantic role specifications.
In this article we will not assume the availability of any role-semantic resources, although we do assume that sentences are syntactically analyzed. There have been two main approaches to role induction from parsed data. Under the first approach, semantic roles are modeled as latent variables in a (directed) graphical model that relates a verb, its semantic roles, and their possible syntactic realizations (Grenager and Manning 2006). Role induction here corresponds to inferring the state of the latent variables representing the semantic roles of arguments. Following up on this work, Lang and Lapata (2010) reformulate role induction as the process of detecting alternations and finding a canonical syntactic form for them. Verbal arguments are then assigned roles, according to their position in this canonical form, because each position references a specific role. Their model extends the logistic classifier with hidden variables and is trained in a manner that takes advantage of the close relationship between syntactic functions and semantic roles. More recently, Garg and Henderson (2012) extend the latent-variable approach by modeling the sequential order of roles.
The second approach is similarity-driven and based on clustering. Lang and Lapata (2011a) propose an algorithm that first splits the set of all argument instances of a verb according to their syntactic position within a particular linking and then iteratively merges clusters. A different clustering algorithm is adopted in Lang and Lapata (2011b). Specifically, they induce semantic roles via graph partitioning: Each vertex in the graph corresponds to an argument instance and edges represent a heuristically defined measure of their lexical and syntactic similarity. The similarity-driven approach has been recently adopted by Titov and Klementiev (2012a), who propose a Bayesian clustering algorithm based on the Chinese Restaurant Process. In addition, they present a method that shares linking preferences across verbs using a distance-dependent Chinese Restaurant Process prior that encourages similar verbs to have similar linking preferences. Titov and Klementiev (2012b) further introduce the use of multilingual data for improving role induction.
There has also been work on unsupervised methods for argument identification. Abend, Reichart, and Rappoport (2009) devise a method for recognizing the arguments of predicates that relies solely on part of speech annotations, whereas Abend and Rappoport (2010a) distinguish between core and adjunct roles, using an unsupervised parser and part-of-speech tagger. More generally, shallow semantic representations induced from syntactic information are commonly used in lexicon acquisition and information extraction tasks. For example, Lin and Pantel (2001) cluster syntactic relations between pairs of words as expressed by parse tree paths into semantic relations by exploiting lexical distributional similarity. Although not compatible with PropBank or semantic roles as such, Poon and Domingos (2009) and Titov and Klementiev (2011) also induce semantic information from dependency parses and apply it to a question answering task for the biomedical domain. Another example is the work by Gamallo, Agustini, and Lopes (2005), who cluster similar syntactic positions in order to develop models of selectional preferences to be used for word sense induction and the resolution of attachment ambiguities.
The work described here unifies the two clustering methods presented in Lang and Lapata (2011a and 2011b) by reformulating them as graph partitioning algorithms. It also extends them by utilizing multi-layer graphs which separate the similarities between instances on different features (e.g., part-of-speech, argument head) into different layers. This has the advantage that similarity scores on individual features do not have to be eagerly combined into a similarity score between instances. Instead, one can first aggregate the similarity scores on each feature layer between two clusters and then combine them into a similarity score between clusters. This is more robust, as the feature-wise similarity scores between clusters can be computed in a principled way and the heuristic combination step is deferred to the end (see Section 4 for details). Besides providing a general modeling framework for semantic role induction, we discuss in detail the linguistic principles guiding our modeling choices and assess their applicability across languages. Specifically, we show that the framework presented here (and the aforementioned principles) can be readily applied to English and German with identical parametrizations for both languages and without fundamentally changing the underlying model features, despite major syntactic differences between the two languages.
3. Graph Construction
We begin by explaining how we construct a graph that represents verbs and their arguments. Next, we describe how edge weights are computed—these translate to similarity scores between argument instances—and then move on to provide the details of our graph-partitioning algorithms.
As mentioned earlier, we formalize semantic role induction as a clustering problem. Clustering algorithms (see Jain, Murty, and Flynn [1999] for an overview) commonly take a matrix of pairwise similarity scores between instances as input and produce a set of output clusters, often satisfying some explicitly defined optimality criterion. The success or failure of the clustering approach is closely tied to the adequacy of the employed similarity function for the task at hand. The graph partitioning view of clustering (see Schaeffer [2007] for a detailed treatment) arises when instances are represented as the vertices of a graph and the similarity matrix is interpreted as the weight matrix of the graph. For semantic role induction, a straightforward application of clustering would be to construct a graph for each verbal predicate such that vertices correspond to argument instances of the verb and edge weights quantify the similarity between these instances.
3.1 Feature Similarity Functions
Similarities for a specific feature f are measured with a function φf(vi,vj) which assigns a [−1,1] value to any pair of instances (vi,vj). We assume similarities are measured on an interval scale—that is, while sums, differences, and averages of the values of some similarity function φf express meaningful quantities, products and ratios do not. Moreover, the values of two distinct similarity functions cannot necessarily be meaningfully compared without rescaling. Positive similarity values indicate that the semantic roles are likely to be the same, negative values indicate that roles are likely to differ, and zero values indicate that there is no evidence for either case. The magnitude of φf expresses the degree of confidence in the similarity judgment, with extreme values (i.e., −1 and 1) indicating maximal confidence.
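To make the shape of these functions concrete, here is a minimal sketch (with hypothetical field names; the actual definitions may be graded rather than binary) of three feature similarity functions of this kind:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArgumentInstance:
    """One argument occurrence of a verb (field names are illustrative)."""
    head_lemma: str   # lexical head of the argument
    head_pos: str     # part-of-speech tag of the head
    clause_id: int    # identifies the clause (frame) the instance occurs in

def phi_lex(a: ArgumentInstance, b: ArgumentInstance) -> float:
    """Lexical layer: identical head lemmas are positive evidence (Principle 3)."""
    return 1.0 if a.head_lemma == b.head_lemma else 0.0

def phi_pos(a: ArgumentInstance, b: ArgumentInstance) -> float:
    """Part-of-speech layer: matching head tags are weak positive evidence."""
    return 1.0 if a.head_pos == b.head_pos else 0.0

def phi_frame(a: ArgumentInstance, b: ArgumentInstance) -> float:
    """Frame layer: arguments of the same clause cannot share a role
    (Principle 1), so co-occurrence is maximally negative evidence."""
    return -1.0 if a.clause_id == b.clause_id else 0.0
```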
The syntactic position of an argument is directly given by the parse tree and can be encoded, for example, by the full path from predicate to argument head, or for practical purposes, in order to reduce sparsity, simply through the relation governing the argument head and its linear position relative to the predicate (left or right). In contrast, linkings are not directly observed, but we can resort to overt syntactic cues as a proxy. Examples include the verb's voice (active/passive), whether it is transitive, the part-of-speech of the subject, and so on. We argue that in principle, if sufficiently many cues are taken into account, they will capture one particular linking, although there may be several encodings for the same linking. Note that syntactic similarity is not used to construct another graph layer; rather, it will be used for deriving initial clusters of instances, as we explain in Section 4.1.
4. Graph Partitioning
The graph partitioning problem consists of finding a set of clusters {c1, … , cS} that form a partition of the vertex-set, namely, ∪ici = V and ci ∩ cj = ∅ for all i ≠ j, such that (ideally) each cluster contains argument instances of only one particular semantic role, and the instances for a particular role are all assigned to one and the same cluster. In the following sections we provide two algorithms for multi-layer graph partitioning, based on standard clustering algorithms for single-layer graphs. Both algorithms operate on the same graph but differ in terms of the underlying clustering mechanism they use. The first algorithm is an adaptation of agglomerative clustering (Jain, Murty, and Flynn 1999) to the multi-layer setting: Starting from an initial clustering, the algorithm iteratively merges vertex clusters in order to arrive at increasingly accurate representations of semantic roles. Rather than greedily merging clusters, our second algorithm is based on propagating cluster membership information among the set of initial clusters (Abney 2007).
4.1 Agglomerative Graph Partitioning
The agglomerative algorithm induces clusters in a bottom–up manner starting from an initial cluster assignment that we will subsequently discuss in detail. Our initialization results in a clustering that has high purity but low collocation, that is, argument instances in each cluster tend to belong to the same role but argument instances of a particular role are scattered among many clusters. The algorithm then improves collocation by iteratively merging pairs of clusters. The agglomeration procedure is described in Algorithm 1. As can be seen, pairs of clusters are merged iteratively until a termination criterion is met. The decision of which cluster pair to merge at each step is made by scoring a set of candidate cluster pairs and choosing the highest-scoring pair (line 5). The scoring function s(ci, cj) quantifies how likely two clusters are to contain arguments of the same role. A key question is how to define this scoring function on the basis of the underlying graph representation, that is, with reference to the instance similarities expressed by the edges. In order to collect evidence for or against a merge, we take into account the connectivity of a cluster pair at each feature layer of the graph. This crucially involves aggregating over all edges that connect the two clusters, and allows us to infer a cluster-level similarity score from the individual instance-level similarities encoded in the edges. The evidence collected at each layer is then combined in order to arrive at an overall decision (see Figure 1 for an illustration).
Although it would be possible to enumerate and score all possible cluster pairs at each step, we apply a more efficient and effective procedure in which the set of candidates consists of pairs formed by combining a fixed cluster ci with all clusters larger than ci. This requires comparing only O(|C|) rather than O(|C|²) scores and, more importantly, it favors merges between large clusters whose score can be computed more reliably. As mentioned earlier, our scoring function implements an averaging procedure over the instances contained in the clusters, and thus yields less noisy scores when clusters are large (i.e., contain many instances). This prioritization promotes reliable merges over less reliable ones in the earlier phases of the algorithm with a positive effect on merges in the later phases. Moreover, by keeping ci fixed, we only require that scores s(ci,x) and s(ci,z) are comparable (i.e., scores that share one cluster as an argument), rather than requiring comparability between scores of arbitrary cluster pairs (e.g., s(w,x) and s(y,z)). In the following, we will provide details on the initialization of the algorithm and the computation of the similarity scoring function.
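Before turning to those details, the merge loop just described can be sketched as follows; this is a simplification of Algorithm 1 (the names and the single fixed threshold are illustrative; the actual procedure anneals its thresholds as described below in connection with Algorithm 2):

```python
def agglomerate(clusters, score, threshold=0.0):
    """Greedy agglomerative partitioning over an initial clustering.

    clusters  : list of lists of argument instances (the syntactic init)
    score     : function (cluster_a, cluster_b) -> merge score
    threshold : minimum score required for a merge to take place
    """
    merged = True
    while merged:
        merged = False
        clusters.sort(key=len)                    # visit small clusters first
        for i, c_i in enumerate(clusters):
            larger = clusters[i + 1:]             # candidates: all larger clusters
            if not larger:
                continue
            best = max(larger, key=lambda c: score(c_i, c))
            if score(c_i, best) > threshold:
                best.extend(c_i)                  # merge c_i into the larger cluster
                del clusters[i]
                merged = True
                break                             # re-sort and re-score after a merge
    return clusters
```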
A standard agglomerative clustering algorithm forms clusters bottom–up by initially placing each item of interest in its own cluster. In our case, initializing the algorithm with as many clusters as argument instances would result in a clustering with maximal purity and minimal collocation. There are two reasons that justify a more sophisticated initialization procedure for our problem. Firstly, the scoring function we use is more reliable for larger clusters than for smaller clusters (see the subsequent discussion). In fact, the standard initialization that creates clusters with a single instance would not yield useful results as our scoring function crucially relies on initial clusters containing several instances on average. Secondly, the similarity scores for different features are not directly comparable. Recall from Section 3.1 that we introduced different types of similarities based on the arguments' head words (φlex), parts-of-speech (φpos), syntactic positions (φsyn), and frame constraints (φframe). As discussed earlier, engineering a scoring function that integrates these into a single score without resorting to heuristic judgments on how to weight them poses a major challenge. In particular, it is difficult to weight the contribution of the two forms of positive evidence given by lexical and syntactic similarity. This motivates the idea of using syntactic similarity for initialization, and lexical similarity (as well as the frame constraint) for scoring. This separation avoids the difficulty of defining the exact interaction between the two. Specifically, we obtain an initial clustering by grouping together all instances which occur in the same syntactic position within a linking—that is, all pairs (vi, vj) for which φsyn(vi, vj) = 1 are grouped into the same cluster, assuming that arguments occurring in a specific syntactic position under a specific linking share the same role.
We specify the syntactic position of an argument using four cues: the verb's voice (active/passive), the argument's linear position relative to the predicate (left/right), the syntactic relation of the argument to its governor (e.g., subject or object), and the preposition used for realizing the argument (if any). Each argument is assigned a four-tuple consisting of these cues and two syntactic positions are assumed equal iff they agree on all cues.
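A minimal sketch of this initialization, assuming each instance exposes the four cues as attributes (attribute names are ours):

```python
from collections import defaultdict

def initial_clusters(instances):
    """Group argument instances that occupy the same syntactic position,
    encoded as the tuple (voice, side, relation, preposition).  Two instances
    end up in the same initial cluster exactly when phi_syn would equal 1."""
    groups = defaultdict(list)
    for inst in instances:
        key = (inst.voice, inst.side, inst.relation, inst.preposition)
        groups[key].append(inst)
    return list(groups.values())
```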
Whereas the similarity functions defined in Section 3.1 measure role-semantic similarity between instances on a particular feature, the scoring function measures role-semantic similarity between clusters. Naturally, the similarity between two clusters is defined in terms of the similarities of the instances contained in the clusters. This involves two aggregation stages. Initially, instance similarities are aggregated in each feature layer, resulting in an aggregate score for each feature. These layer-specific scores are then integrated into a single score, which quantifies the overall similarity between the two clusters (see Figure 1).
An obvious way to determine the similarity between two clusters (with respect to a particular feature f) would be to analyze their connectivity. For example, we could use edge density (Schaeffer 2007) to average over the weights of edges between two clusters. However, edge density is an inappropriate measure of similarity in our case, because we cannot assume that arbitrary pairs of instances are similar with respect to a particular feature, even if two clusters represent the same semantic role. Consider for example lexical similarity: Most head words will not agree (even within a cluster) and therefore averaging between all pairs would yield low scores, regardless of whether the clusters represent the same role or not. Analogously, the vast majority of instance pairs from any two clusters will belong to different frames, and thus averaging over all possible pairs of instances would not yield indicative scores.
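One aggregation scheme that avoids this problem aligns every instance in one cluster with its maximally similar (or dissimilar) counterpart in the other cluster and averages over these alignments; this is the avgmax variant referred to in Section 5.5. The following is a sketch of this idea, assuming phi_f is a pairwise feature similarity as in Section 3.1 (the exact definition may differ):

```python
def avgmax(cluster_a, cluster_b, phi_f):
    """Layer-wise cluster similarity: for each instance in cluster_a take the
    strongest piece of evidence (largest magnitude, positive or negative)
    offered by any instance in cluster_b, then average.  A single matching
    head word or a single frame conflict thus remains visible instead of
    being washed out by averaging over all cross-cluster pairs."""
    def strongest(a):
        return max((phi_f(a, b) for b in cluster_b), key=abs)
    scores = [strongest(a) for a in cluster_a]
    return sum(scores) / len(scores)
```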
Although our definition avoids weighting, it has introduced threshold parameters α, β, and γ that must somehow be estimated. We propose a scheme in which parameters β and γ are iteratively adjusted, and α, the threshold determining the extent to which the frame constraints can be violated, is kept fixed. We heuristically set α to −0.05, based on the intuition that in principle frame constraints must be satisfied, although in practice, due to noise, we expect a small number of violations (i.e., at most 5% of instances can violate the constraint). Parameters β and γ are initially set to their maximal value 1, thereby ruling out all merges except those with maximal confidence. The parameters then decrease iteratively according to a routine whose pseudo-code is specified in Algorithm 2. The parameter β decreases at each iteration by a small amount (0.025) until it reaches ε = 0.025, at which point its value is reset to 1.0 and γ is discounted by a factor close to one (0.9). This is repeated until γ falls below ε, upon which the algorithm terminates.
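The annealing routine of Algorithm 2 can be written compactly as a generator over successive threshold settings; a minimal sketch of the schedule just described:

```python
ALPHA = -0.05   # fixed tolerance for violations of the frame constraint

def threshold_schedule(eps=0.025, beta_step=0.025, gamma_discount=0.9):
    """Yield successive (beta, gamma) pairs as in Algorithm 2: beta sweeps
    from 1.0 down to eps; whenever it bottoms out it is reset to 1.0 and
    gamma is discounted, until gamma itself falls below eps."""
    gamma = 1.0
    while gamma >= eps:
        beta = 1.0
        while beta >= eps:
            yield beta, gamma
            beta -= beta_step
        gamma *= gamma_discount
```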
4.2 Multi-Layer Label Propagation
Our second graph partitioning algorithm is based on the idea of propagating cluster membership information along the edges of a graph, subsequently referred to as propagation graph. As we explain in more detail subsequently, compared with agglomerative clustering, this algorithm in principle is less prone to making false greedy decisions that cannot be later revoked. Moreover, it has lower runtime and thus scales better to larger data sets.
Each edge (ai, aj) ∈ Bf in layer f is accordingly weighted by sf(ai, aj). Each vertex ai is associated with a label li, indicating the partition to which ai and all the vertices of the original graph that have been collapsed into ai belong.
Note that the label propagation algorithm is informed by the same similarity functions as agglomerative clustering and uses an identical initialization procedure but provides an alternative means of cluster inference. Initially, each vertex of the propagation graph belongs to its own cluster, that is, we let the number of clusters L = |A| and set li ← i. Given this initial vertex labeling, the algorithm proceeds by iteratively updating the label for each vertex (lines 4–10 in Algorithm 3). This crucially relies on a scoring procedure in which a score s(l) is computed for each possible label l. We discuss the details of the scoring procedure below.
The label scoring procedure required in line 5 of Algorithm 3 has parallels to the cluster pair scoring procedure of the agglomerative algorithm. It also consists of two stages: Initially, evidence is collected independently on each feature layer by computing label score aggregates with respect to each feature and then these feature scores are combined in order to arrive at an overall score.
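A simplified sketch of this update loop follows; the per-layer aggregation by summation and the function signatures are our assumptions, and Algorithm 3 gives the actual procedure:

```python
def propagate_labels(vertices, neighbors, layer_score, combine, max_iters=50):
    """Multi-layer label propagation over the collapsed propagation graph.

    vertices    : list of (hashable) collapsed vertices, i.e., initial clusters
    neighbors   : dict mapping a vertex to its neighboring vertices
    layer_score : function (vertex, neighbor, layer) -> edge weight s_f
    combine     : function mapping {layer: aggregated score} -> overall score
    """
    labels = {v: i for i, v in enumerate(vertices)}   # each vertex starts in its own cluster
    for _ in range(max_iters):
        changed = False
        for v in vertices:
            evidence = {}                             # candidate label -> per-layer scores
            for u in neighbors[v]:
                per_layer = evidence.setdefault(labels[u], {})
                for f in ("lex", "pos", "frame"):
                    per_layer[f] = per_layer.get(f, 0.0) + layer_score(v, u, f)
            if not evidence:
                continue
            best = max(evidence, key=lambda l: combine(evidence[l]))
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break                                     # converged: no label changed
    return labels
```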
We now analyze the runtime of our algorithm. Let T denote the number of iterations of the outer loop starting at line 1 of Algorithm 3. The inner loop starting at line 4 iterates over |A| clusters and for each one of them it has to evaluate at most |A| neighboring nodes. Additionally, there are the one-time costs of computing the similarities between atomic clusters, which take O(|V|²) time. The total runtime is therefore O(T|A|² + |V|²). Because |A|² ≪ |V|², label propagation is substantially faster than agglomerative clustering.
4.3 Relationship to Single-Layer Graph Partitioning
Clustering algorithms typically assume instance-wise similarities as input (i.e., single-layer graphs). For our role induction problem, this would require a heuristically defined similarity function that combines the similarities on individual features into a single similarity score between instances. In other words, we would collapse the multiple graph layers into a single layer and then partition the resulting single-layer graph according to a standard clustering algorithm. A main difference between the two approaches is the order in which similarities are aggregated: Whereas multi-layer graph partitioning aggregates similarities on each feature layer first and then combines them into an overall cluster-wise similarity score, in the single-layer case feature similarities are eagerly combined into an overall instance-wise similarity score and then aggregated. Thus, in the multi-layer setting, aggregation can be done in a principled way by considering the individual feature layers in isolation. For large clusters the resulting scores for each feature layer will provide reliable evidence for or against a merge. Combining these cluster-wise similarity scores is much less error-prone than the eager combination at the instance-level used by the single-layer approach. We experimentally confirm this intuition (see Section 5.5) by comparing against the single-layer partitioning algorithm presented in Lang and Lapata (2011b).
5. Role Induction Experiments on English
We adopt the general architecture of supervised semantic role labeling systems where argument identification and argument classification are treated separately. Our role labeler is fully unsupervised with respect to both tasks—it does not rely on any role annotated data or semantic resources. However, our system does not learn from raw text. In common with most semantic role labeling research, we assume that the input is syntactically analyzed. Our approach is not tied to a specific syntactic representation—both constituent- and dependency-based representations can be used. The bulk of our experiments focus on English data and a dependency-based representation that simplifies argument identification considerably and is consistent with the CoNLL 2008 benchmark data set used for evaluation in our experiments. To show that our method can be applied to other languages and across varying syntactic representations, we also report experiments on German using a constituent-based representation (see Section 6).
Given the parse of a sentence, our system identifies argument instances and assigns them to clusters. Thereafter, argument instances can be labeled with an identifier corresponding to the cluster they have been assigned to, similar to PropBank core labels (e.g., A0, A1). We view argument identification as a syntactic processing step that can be largely undertaken deterministically through analysis of the syntactic tree. We therefore use a small set of rules to detect arguments with high precision and recall. In the following, we first describe the data set (Section 5.1) on which our experiments were carried out. Next, we present the argument identification component of our system (Section 5.2) and the baseline method against which we compare our approach (Section 5.3). Finally, we explain how system output was evaluated (Section 5.4).
5.1 Data
For evaluation purposes, we ran our method on the CoNLL 2008 shared task data set (Surdeanu et al. 2008), which provides PropBank style gold standard annotations. As our algorithm induces verb-specific roles, PropBank annotations are a natural choice of gold standard for our problem. The data set contains annotations for verbal and nominal predicate-argument constructions, but we only considered the former. The CoNLL data set was taken from the Wall Street Journal portion of the Penn Treebank and converted into a dependency format (Surdeanu et al. 2008). Input sentences are represented in the dependency syntax specified by the CoNLL 2008 shared task (see Figure 2 for an example). In addition to gold standard dependency parses, the data set also contains automatic parses obtained from the MaltParser (Nivre et al. 2007), which we will use as an alternative in our experiments in order to assess the impact of parse quality. For each argument only the head word is annotated with the corresponding semantic role, rather than the whole constituent. We assume that argument heads are content words (e.g., the head of a prepositional phrase is the nominal head rather than the preposition). We do not treat split arguments or co-referential arguments (e.g., in relative clauses). Specifically, we ignore arguments with roles preceded by the C- or R- prefix in the gold standard. All argument lemmas were normalized to lower case; we also replaced numerical quantities with a placeholder; to further reduce data sparsity, we identified the head of proper noun phrases heuristically as the most frequent lemma contained in the phrase.
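The lemma normalization steps can be collected in a small helper; the following is a sketch of the preprocessing just described, in which the placeholder token, the regular expression, and the frequency source are assumptions about the exact implementation:

```python
import re
from collections import Counter

NUM_PLACEHOLDER = "<num>"   # hypothetical placeholder token for numerical quantities

def normalize_lemma(lemma: str) -> str:
    """Lower-case argument lemmas and collapse numerical quantities."""
    lemma = lemma.lower()
    return NUM_PLACEHOLDER if re.fullmatch(r"[\d.,:/-]+", lemma) else lemma

def proper_noun_head(phrase_lemmas, corpus_counts: Counter) -> str:
    """Heuristic head of a proper noun phrase: its most frequent lemma."""
    return max(phrase_lemmas, key=lambda l: corpus_counts[l])
```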
5.2 Argument Identification
In the supervised setting, a classifier is used to decide for each node in the parse tree whether it represents a semantic argument or not. Nodes classified as arguments are then assigned a semantic role. In the unsupervised setting, we slightly reformulate argument identification as the task of discarding as many non-arguments as possible. This means that the argument identification component does not make a final positive decision for any of the argument candidates; instead, this decision is deferred to role induction. We assume here that predicate identification is a precursor to argument identification and can be done relatively straightforwardly based on part-of-speech information.
The rules given in Table 1 are used to discard or select argument candidates for English. They primarily take into account the parts of speech and the syntactic relations encountered when traversing the dependency tree from predicate to argument. A priori, all words in a sentence are considered argument candidates for a given predicate. Then, for each candidate, the rules are inspected sequentially and the first matching rule is applied. We will exemplify how the argument identification component works for the predicate expect in the sentence The company said it expects its sales to remain steady, whose parse tree is shown in Figure 2. Initially, all words except the predicate itself are treated as argument candidates. Then, the rules from Table 1 are applied as follows. Firstly, the words the and to are discarded based on their part of speech (Rule 1); then, remain is discarded because the path ends with the relation IM and said is discarded as the path ends with an upward-leading OBJ relation (Rule 2). Rule 3 matches the word it, which is therefore kept as a candidate. Next, steady is discarded because there is a downward-leading OPRD relation along the path, and the words company and its are also discarded because of the OBJ relations along the path (Rule 4). Rule 5 does not apply, but the word sales is kept as a likely argument (Rule 6). Finally, Rule 7 does not apply, because there are no candidates left.
Table 1. Rules for selecting or discarding argument candidates.

1. Discard a candidate if it is a coordinating conjunction or punctuation.
2. Discard a candidate if the path of relations from predicate to candidate ends with coordination, subordination, etc. (see Appendix A for the full list of relations).
3. Keep a candidate if it is the closest subject (governed by the subject-relation) to the left of a predicate and the relations from predicate p to the governor g of the candidate are all upward-leading (directed as g → p).
4. Discard a candidate if the path between the predicate and the candidate, excluding the last relation, contains a subject relation, adjectival modifier relation, etc. (see Appendix A for the full list of relations).
5. Discard a candidate if it is an auxiliary verb.
6. Keep a candidate if it is directly connected to the predicate.
7. Keep a candidate if the path from predicate to candidate leads along several verbal nodes (verb chain) and ends with an arbitrary relation.
8. Discard all remaining candidates.
On the CoNLL 2008 training set, our argument identification rules obtain a precision of 87.0% and a recall of 92.1% on gold standard parses. On automatic parses, precision is 79.3% and recall 84.8%. Here, precision measures the percentage of selected arguments that are actual semantic arguments, and recall measures the percentage of actual arguments that are not filtered out.
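The cascade in Table 1 amounts to an ordered list of tests where the first match decides; the sketch below mirrors that control flow without reproducing the exact relation lists of Appendix A (the predicate functions and node attributes are placeholders):

```python
def identify_arguments(candidates, rules):
    """Apply the first matching rule to each candidate, as in Table 1.

    candidates : iterable of candidate nodes for a given predicate
    rules      : ordered list of (matches, keep) pairs, where matches(node)
                 tests the rule's condition and keep indicates whether a
                 matching candidate is selected (True) or discarded (False)
    """
    kept = []
    for node in candidates:
        for matches, keep in rules:
            if matches(node):
                if keep:
                    kept.append(node)
                break                        # the first matching rule decides
    return kept

# Illustrative (partial) rule list; conditions stand in for the tests of Table 1.
example_rules = [
    (lambda n: n.is_conjunction_or_punct, False),    # Rule 1: discard
    (lambda n: n.path_ends_in_subordination, False), # Rule 2: discard
    (lambda n: n.is_closest_left_subject, True),     # Rule 3: keep
    # ... Rules 4-7 ...
    (lambda n: True, False),                         # Rule 8: discard the rest
]
```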
Grenager and Manning (2006) also devise rules for argument identification, unfortunately without providing any details on their implementation. More recently, attempts have been made to identify arguments without relying on a treebank-trained parser (Abend and Rappoport 2010b; Abend, Reichart, and Rappoport 2009). The idea is to combine a part-of-speech tagger and an unsupervised parser in order to identify constituents. Likely arguments can be in turn identified based on a set of rules and the degree of collocation with the predicate. Perhaps unsurprisingly, this method does not match the quality of a rule-based component operating over trees produced by a supervised parser.
5.3 Baseline Method for Semantic Role Induction
The linking between semantic roles and syntactic positions is not arbitrary; specific semantic roles tend to map onto specific syntactic positions such as subject or object (Levin and Rappaport 2005; Merlo and Stevenson 2001). We further illustrate this observation in Table 2, which shows how often individual semantic roles map onto certain syntactic positions. The latter are simply defined as the relations governing the argument. The frequencies in the table were obtained from the CoNLL 2008 data set and are aggregates across predicates. As can be seen, semantic roles often approximately correspond to a single syntactic position. For example, A0 is commonly mapped onto subject (SBJ), whereas A1 is often realized as object (OBJ).
This motivates a baseline that directly assigns instances to clusters according to their syntactic position. The pseudo-code is given in Algorithm 4. For each verb we allocate N = 22 clusters (the maximal number of gold standard clusters together with a default cluster). Apart from the default cluster, each cluster is associated with a syntactic position and all instances occurring in that position are mapped into the cluster. Despite being relatively simple, this baseline has been previously used as a point of comparison by other unsupervised semantic role labeling systems (Grenager and Manning 2006; Lang and Lapata 2010) and has proven difficult to outperform. This is partly due to the fact that almost two thirds of the PropBank arguments are either A0 or A1. Correctly identifying these two roles is therefore the most important distinction to make, and because this can be largely achieved on the basis of the arguments' syntactic position (see Table 2), the baseline yields high scores.
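A compact rendering of this baseline is sketched below; assigning the N−1 most frequent positions their own clusters and mapping everything else to the default cluster is one plausible reading of Algorithm 4:

```python
from collections import Counter, defaultdict

def syntactic_baseline(instances, n_clusters=22):
    """Baseline role induction: one cluster per syntactic position per verb.

    Instances in the verb's n_clusters-1 most frequent syntactic positions
    each get their own cluster; all remaining instances are mapped into a
    single default cluster."""
    freq = Counter(inst.relation for inst in instances)   # position = governing relation
    own_cluster = {pos for pos, _ in freq.most_common(n_clusters - 1)}
    clusters = defaultdict(list)
    for inst in instances:
        key = inst.relation if inst.relation in own_cluster else "<default>"
        clusters[key].append(inst)
    return list(clusters.values())
```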
5.4 Evaluation
In this section we describe how we assess the quality of a role induction method that assigns labels to units that have been identified as likely arguments. We also discuss how we measure whether differences in model performance are statistically significant.
Arguments are labeled based on the cluster they have been assigned to, which means that in contrast to the supervised setting we cannot verify the correctness of these labels directly (e.g., by comparing them to the gold standard). Instead, we will look at the induced clusters as a whole and assess their quality in terms of how well they reflect the assumed gold standard. Specifically, for each verb, we determine the extent to which argument instances in the clusters share the same gold standard role (purity) and the extent to which a particular gold standard role is assigned to a single cluster (collocation).
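Concretely, with induced clusters and gold roles represented as sets of instance identifiers, per-verb purity, collocation, and their harmonic mean can be computed as in the following sketch, which is a straightforward reading of the definitions above:

```python
def purity_collocation_f1(clusters, gold):
    """Per-verb clustering evaluation.

    clusters : list of sets of instance ids (induced role clusters)
    gold     : dict mapping each gold role to its set of instance ids
    Purity rewards clusters whose instances share a gold role; collocation
    rewards gold roles whose instances end up in a single cluster."""
    n = sum(len(c) for c in clusters)
    pu = sum(max(len(c & g) for g in gold.values()) for c in clusters) / n
    co = sum(max(len(c & g) for c in clusters) for g in gold.values()) / n
    f1 = 2 * pu * co / (pu + co) if pu + co else 0.0
    return pu, co, f1
```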
Purity and collocation measure essentially the same properties of the data as precision and recall, which in the context of clustering are, however, defined over pairs of instances (Manning, Raghavan, and Schütze 2008) and are therefore harder to interpret intuitively. We prefer purity and collocation, but they must be assessed in combination (or together with F1) because they can be traded off against each other: purity can be trivially maximized by mapping each instance into its own cluster, and collocation by mapping all instances into a single cluster.
Although it is desirable to report performance with a single score such as F1, it is equally important to assess how purity and collocation contribute to this score. In particular, if a hypothetical system were to be used for automatically annotating data, low collocation would result in higher annotation effort and low purity would result in lower data quality. Therefore, high purity is imperative for an effective system, whereas high collocation contributes to efficient data labeling. For assessing our methods we introduce the following terminology. If a model attains higher purity than the baseline, we will say that it is adequate, because the clusters it induces represent semantic roles more faithfully. If a model attains higher F1 than the baseline, we will say that it is non-trivial, because it strikes a better tradeoff between collocation and purity than the trivial solutions described above. Our goal then is to find models that are both adequate and non-trivial.
5.5 Results
Our results are summarized in Tables 3–5, which report cluster purity (PU), collocation (CO), and their harmonic mean (F1) for the baseline and our two multi-layer graph partitioning algorithms. We present scores on four data sets that result from the combination of automatic parses with automatically identified arguments (auto/auto), gold parses with automatic arguments (gold/auto), automatic parses with gold arguments (auto/gold), and gold parses with gold arguments (gold/gold). We show how performance varies for our methods when measuring cluster similarity in the two ways described above: (a) by finding for each instance in one cluster the instance in the other cluster that is maximally similar or dissimilar and averaging over the scores of these alignments (avgmax) and (b) by using cosine similarity (see Section 4.1). We also report results for the single-layer algorithm proposed in Lang and Lapata (2011b). Given a verbal predicate, they construct a single-layer graph whose edge weights express instance-wise similarities directly. The graph is partitioned into vertex clusters representing semantic roles using a variant of Chinese Whispers, a graph clustering algorithm proposed by Biemann (2006). The algorithm iteratively assigns cluster labels to graph vertices by greedily choosing the most common label among the neighbors of the vertex being updated.
Both agglomerative partitioning and multi-layered label propagation algorithms systematically achieve higher F1 scores than the baseline—that is, induce non-trivial clusterings and more adequate semantic roles (by attaining higher purity). For example, on the auto/auto data set, the agglomerative algorithm using cosine similarity increases F1 by 2.3 points over the baseline and by 7.2 points in terms of purity. This increase in purity is achieved by trading off against collocation, although in a favorable ratio as indicated by the overall higher F1. All improvements over the baseline are statistically significant (q < 0.001 according to the test described in Section 5.4). In general, we observe that cosine similarity outperforms avgmax similarity. We conjecture that cosine is a more appropriate measure of cluster similarity for features where it is beneficial to capture the distributional similarity of clusters. The two algorithms perform comparably—differences in F1 are not statistically significant (except in the gold/auto setting). Nevertheless, agglomerative partitioning obtains higher purity and F1 than label propagation. The latter trades off more purity and in return obtains higher collocation. The single-layer method is inferior to the multi-layer algorithms, in particular because it is less robust to noise, as demonstrated by the markedly worse results on automatic parses. On the auto/auto data set the single-layered algorithm is on a par with the baseline and marginally outperforms it on the auto/gold and gold/gold configurations.
To help put our results in context, we compare our methods with Titov and Klementiev's (2012a) Bayesian clustering models. They report results on the CoNLL 2008 data sets with two model variants, a factored model that models each verb independently and a coupled model where model parameters are shared across verbs. In an attempt to reduce the sparsity of the argument fillers, they also present variants of the factored and coupled models where the argument heads have been replaced by lexical cluster ids stemming from Brown et al.'s (1992) clustering algorithm on the RCV1 corpus. In Table 6 we follow Titov and Klementiev (2012a) and show results on the gold/gold and gold/auto settings. As can be seen, both the agglomerative clustering and label propagation perform comparably to their coupled model, even though they do not implement any specific mechanism for sharing clustering preferences across verbs. Versions of their models that use Brown word clusters (i.e., Factored+Br and Coupled+Br) yield overall best results. We expect this type of preprocessing to also increase the performance of our models, however we leave this to future work. Finally, we should point out that Titov and Klementiev (2012a) do not cluster adjunct-like modifier arguments that are already explicitly represented in syntax (e.g., TMP, LOC, DIR). Thus, their Coupled+Mods model is most comparable to ours in terms of the clustering objective as it treats both core and adjunct arguments and does not make use of the Brown clustering. Table 6 shows the performance of Coupled+Mods on the gold/gold setting only because auto/gold results are not reported.
We further examined the output of the baseline and our best performing model in order to better understand where the performance gains are coming from. Table 7 shows how the two approaches differ when it comes to individual roles. We observe that the agglomerative clustering algorithm performs better than the baseline on all core roles. There are some adjunct roles for which the baseline obtains a higher F1. This is not surprising because the parser directly outputs certain labels such as LOC and TMP which results in high baseline scores for these roles. A word of caution is necessary here since core roles are defined individually for each verb and need not have a uniform corpus-wide interpretation. Thus, conflating per-role scores across verbs is only meaningful to the extent that these labels actually signify the same role (which is mostly true for A0 and A1). Furthermore, the purity scores we provide in this context are averages over the clusters for which the specified role is the majority role.
We further investigated the degree to which the baseline and the agglomerative clustering algorithm agree in their role assignments. The overall mean overlap was 46.03%. Figure 3a shows the percentage of verbs for which the baseline and our algorithm have no, some, or complete overlap. We discretized overlap into 10 bins of equal size ranging from 0 to 100. We observe that the role assignments produced by the two methods have nothing in common for approximately 13.6% of verbs, whereas assignments are identical for 18.1% of verbs. Aside from these two bins (see the 0 and 100 bins in Figure 3a), a large number of verbs exhibit overlap in the range of 40–60%. Figure 3b shows how the overlap in the cluster assignments varies with verb frequency. Perhaps unsurprisingly, we can see that overlap is higher for the least frequent and therefore less ambiguous verbs. In general, although the two methods have some degree of overlap, agglomerative clustering does indeed manage to change and improve the original role assignments of the baseline.
An interesting question concerns precisely the type of changes effected by the agglomerative clustering algorithm over the assignments of the baseline. To characterize these changes, we first examined the consistency of the role assignments created by the two algorithms. Specifically, we would expect a verb-argument pair to be mostly assigned to the same cluster (i.e., an argument to bear the same role label for the same verb). Of course, this is not a hard constraint, as arguments and predicates can be ambiguous and their roles may vary in specific syntactic configurations and contexts. To give an idea of an upper bound, in our gold standard, an argument instance of the same verb bears on average 2.23 distinct roles. For comparison, the baseline creates (on average) 2.9 role clusters for an argument, whereas agglomerative clustering yields more consistent assignments, with an average of 2.34 role clusters per argument.
We further grouped the verbs in our data set into different bins according to their polysemy and allowable argument realizations. Specifically, we followed Levin's (1993) taxonomy and grouped verbs according to the number of semantic classes they inhabit (e.g., one, two, and so on). We also binned verbs according to the number of alternations they exhibit. To give an example, the verb donate is a member of the contribute class and participates in the causative/inchoative and dative alternations, whereas the verb shower is a member of four classes (i.e., spray/load, pelt, dress, and weather) and participates in the understood reflexive object and spray/load alternations. Figures 4a,b show the overlap in role assignments between the baseline and agglomerative clustering and how it varies according to verb class ambiguity and argument structure; Figures 4c,d illustrate the same for role assignments and their consistency. As can be seen, there is less overlap between the two methods when the verbs in question are more polysemous (Figure 4a) or exhibit more variation in their argument structure (Figure 4b). As far as consistency in role assignments is concerned, agglomerative clustering appears overall more consistent than the baseline. As expected, the mean number of role clusters per argument is slightly higher for polysemous verbs because differences in meaning manifest themselves in different argument realizations.
Figure 5 shows how purity, collocation, and F1 vary across alternations and verb classes. Perhaps unsurprisingly, performance is generally better for the least ambiguous verbs, that is, those exhibiting a small number of alternations. In general, agglomerative clustering achieves higher purity across the board whereas the baseline achieves higher collocation. Although agglomerative clustering achieves a consistently higher F1 over the baseline, the performance of the two algorithms converges for the most polysemous verbs (i.e., those inhabiting more than six semantic classes; see Figure 5f). Interestingly, F1 is also comparable for verbs with less varied argument structure (i.e., verbs participating in only one alternation; see Figure 5c). For such verbs the performance gap between the baseline and the agglomerative algorithm is narrower both in terms of purity and collocation. Overall, we observe that agglomerative clustering is able to change some of the role assignments of the baseline for verbs exhibiting a fair degree of alternation and polysemy.
Table 8 reports results for 12 individual verbs for the best performing method (i.e., agglomerative partitioning using cosine similarity) on the auto/auto data set. These verbs were selected so as to exhibit varied occurrence frequencies and alternation patterns. As can be seen, the macroscopic result—higher F1 due to significantly higher purity—also holds consistently across individual verbs. An important exception is the verb say, for which the baseline attains high scores due to the limited variation in its syntactic realization within the corpus. Example output is given in Table 9, which shows the five largest clusters produced by the baseline and agglomerative partitioning for the verb increase. For each cluster we list the 10 most frequent argument head lemmas. In this case, our method managed to induce an A0 cluster that is not present in the top five clusters of the baseline, although the cluster also incorrectly contains some A1 arguments that stem from a false merge.
6. Role Induction Experiments on German
The applicability of our method to arbitrary languages is important from a theoretical and practical perspective. On the one hand, linguistic theory calls for models which are universal and generalize across languages. This is especially true for models operating on the (frame-) semantic level, which is a generalization over surface structure and should therefore be less language specific (Boas 2005). On the other hand, a language-independent model can be applied to arbitrary languages, genres, and domains and is thus of greater practical benefit. Because our approach is based on the language-independent principles discussed in Section 1, we argue that it can easily generalize to other languages. To test this claim, we further applied our methods to German data.
Although at a high level German clauses do not differ drastically from English ones with respect to their frame-semantic make-up, there are differences in how frame elements are mapped onto specific positions in the linear surface structure of a sentence, beyond the variation observed among English verbs. In general, German places fewer constraints on word order (more precisely, phrase order) and instead relies on richer morphology to help disambiguate the grammatical functions of linguistic units. In particular, nominal arguments of the verb are marked with grammatical case6, which directly indicates their grammatical function. Although in main declarative clauses the inflected part of the verb has to occur in second position, German is commonly considered a verb-final language: the verb often takes the final position in subordinate clauses, as do infinitival verbs (Brigitta 1996).
6.1 Data
We conducted our experiments on the SALSA corpus (Burchardt et al. 2006), a lexical resource for German, which, like FrameNet for English, associates predicates with frames. SALSA is built as an extra annotation layer over the TIGER corpus (Brants et al. 2002), a treebank for German consisting of approximately 40,000 sentences (700,000 tokens) of newspaper text taken from the Frankfurter Rundschau, although to date not all predicate-argument structures have been annotated. The frame and role inventory of SALSA was taken from FrameNet, but has been extended and adapted where necessary due to lack of coverage and cross-lingual divergences.
The syntactic structure of a sentence is represented through a constituent tree whose terminal nodes are tokens and whose non-terminal nodes are phrases (see Figure 6). In addition to labeling each node with a constituent type such as Sentence, Noun Phrase, or Verb Phrase, the edges between a parent and a child node are labeled according to the function of the child within the parent constituent, for example, Accusative Object, Noun Kernel, or Head. Edges can cross, allowing local and non-local dependencies to be encoded in a uniform way and eliminating the need for traces. This approach has significant advantages for non-configurational languages such as German, which exhibit a rich inventory of discontinuous constituents and considerable freedom with respect to word order (Smith 2003). Compared with the Penn Treebank (Marcus, Santorini, and Marcinkiewicz 1993), TIGER tree structures are relatively flat. For example, the tree does not encode whether a constituent is a verbal argument or an adjunct; this information is instead encoded through the edge labels.
The frame annotations contained in SALSA do not cover all of the predicate-argument structures of the underlying TIGER corpus. Only a subset of around 550 predicates with approximately 18,000 occurrences in the corpus have been annotated. Moreover, only core roles are annotated, whereas adjunct roles are not, resulting in a smaller number of arguments per predicate (1.96 on average) compared with the CoNLL 2008 data set (2.57 on average) described in Section 5.1. Because our method is designed to induce verb-specific frames, we converted the SALSA frames into PropBank-like frames by splitting each frame into several verb-specific frames and accordingly mapping frame roles onto verb-specific roles. Our data set is comparable to the German data set released as part of the CoNLL 2009 shared task (Hajič et al. 2009), which was also derived from the SALSA corpus. However, we did not convert the original constituent-based SALSA representation into dependencies, as we wanted to assess whether our methods are also compatible with phrase structure trees.
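Schematically, the conversion lexicalizes each SALSA frame and each of its roles by the predicate lemma, so that role labels become verb-specific. The sketch below illustrates the idea; the record fields are placeholders and corpus-specific details of the actual conversion are omitted.

```python
def to_verb_specific(annotations):
    """annotations: iterable of dicts with (hypothetical) keys
    'lemma', 'frame', 'role', and 'argument'.  Each SALSA frame is split into
    verb-specific frames by prefixing frame and role names with the predicate
    lemma, yielding PropBank-like verb-specific role labels."""
    for ann in annotations:
        yield {
            "frame": f"{ann['lemma']}.{ann['frame']}",
            "role": f"{ann['lemma']}.{ann['frame']}.{ann['role']}",
            "argument": ann["argument"],
        }
```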
6.2 Experimental Set-up
Although we follow the same experimental set-up as described in Section 5 for English, there are some deviations due to differences in the data sets used for the two languages. Firstly, in contrast to the CoNLL 2008 data set, the SALSA data set (and the underlying TIGER corpus) does not supply automatic parse trees, and we therefore conducted our experiments on gold parses only. Moreover, because adjunct arguments are not annotated in SALSA, and because argument identification is not the central issue of this work, we considered only gold argument identification. Thus, all our experiments for German were carried out on the gold/gold data set.
A substantial linguistic difference between the German and English data sets is the sparsity of the argument head lemmas, which is significantly higher for German than for English: In the CoNLL 2008 data set, the average number of distinct head lemmas per verb is only 3.69, whereas in the SALSA data set it is 20.12. This is partly due to the fact that the Wall Street Journal text underlying the English data is topically more focused than the Rundschau newspaper text, which covers a broader range of news beyond economics and politics. Moreover, noun compounding is more commonly used in German than in English (Corston-Oliver and Gamon 2004), which leads to higher lexical sparsity.
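As a simple illustration, the head-lemma statistic contrasted above can be computed as in the following minimal sketch (the input format, a list of verb and head-lemma pairs, is a hypothetical simplification of the actual corpus representation):

```python
from collections import defaultdict

def mean_distinct_heads_per_verb(instances):
    """instances: iterable of (verb, arg_head_lemma) pairs.  Returns the
    average number of distinct argument head lemmas per verb, the lexical
    sparsity statistic contrasted for the CoNLL 2008 and SALSA data sets."""
    heads = defaultdict(set)
    for verb, head in instances:
        heads[verb].add(head)
    return sum(len(h) for h in heads.values()) / len(heads) if heads else 0.0
```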
Data sparsity affects our method, which crucially relies on lexical similarity for determining the role-equivalence of clusters. We therefore reduced the number of syntactic cues used for cluster initialization in order to avoid creating too many small clusters for which similarities cannot be reliably computed. Specifically, only the syntactic position and the function word served as cues for initializing our clusters. Note that, as in English, the relatively small number of syntactic cues used to determine the syntactic position within a linking is a consequence of the rather small size of our evaluation data set and not an inherent limitation of our method. On larger data sets, more syntactic cues could and should be incorporated in order to increase performance.
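A minimal sketch of this reduced initialization is given below, assuming each argument instance carries its verb, syntactic position, and function word as features (the field names are hypothetical): all argument instances of a verb that share both cues start out in the same cluster.

```python
from collections import defaultdict

def initialize_clusters(instances):
    """instances: iterable of dicts with (hypothetical) keys 'id', 'verb',
    'syntactic_position', and 'function_word'.  For German, initial clusters
    are seeded from two cues only, syntactic position and function word, so
    instances of a verb sharing both cues are grouped together."""
    clusters = defaultdict(list)
    for inst in instances:
        key = (inst["verb"], inst["syntactic_position"], inst["function_word"])
        clusters[key].append(inst["id"])
    return list(clusters.values())
```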
In our experiments we compared the baseline introduced in Section 5.3 against agglomerative partitioning and the label propagation algorithm, using both cosine and avgmax similarity. The parameters α, β, and γ, which determine the thresholds used in defining the overall similarity scores, were set and updated in the same way as for English (i.e., these parameters can be considered language-independent).
6.3 Results
Table 10 reports results for the baseline and our role induction methods, namely, agglomerative clustering and multi-layered label propagation (using the avgmax and cosine similarity functions) on the SALSA gold/gold data set. For comparison, we also include results on the English CoNLL-2008 gold/gold data set. As can be seen, the baseline obtains a similar F1 for German and English, although the contributions of purity and collocation are different for the two languages. In English, purity is noticeably higher than in German, whereas collocation is higher in German. This is not surprising when taking into account the distribution of syntactic relations governing an argument. A few frequent relation labels absorb most of the probability mass in German (see Figure 7b), whereas the mass is distributed more evenly among the labels in English (Figure 7a), thus leading to higher purity but lower collocation.
In German, our role induction algorithms improve over the baseline in terms of F1. All four methods perform comparably and strike a non-trivial tradeoff between collocation and purity that represents semantic roles adequately. Compared with English, the difference between the baseline and our algorithms is narrower. This is because we use fewer syntactic cues for initialization in German, due to the increased data sparsity discussed in the previous section. This also explains why there is little variation in the collocation and purity results across methods. Qualitatively, however, the tradeoff between purity and collocation is the same as for English (i.e., purity is increased at the cost of collocation).
Tables 11 and 12 show per-verb and per-role results, respectively, for agglomerative clustering using cosine similarity. We report per-verb scores for a selection of 10 verbs (Table 11), which in some cases are translations of the verbs used for English. With respect to per-role scores, we make use of the fact that roles have a common meaning across predicates (like A0 and A1 in PropBank), and report scores for a selection of 15 different roles (Table 12) with varied occurrence frequencies. Per-verb results confirm that data sparsity affects performance in German. As can be seen, agglomerative clustering outperforms the baseline on high-frequency verbs, which are less affected by sparsity, although this is not always the case for lower-frequency verbs. Analogously, the method tends to perform better on high-frequency roles, whereas there is no clear trend for lower-frequency roles. In contrast to English, for more than half of the verbs the method outperforms the baseline in terms of both purity and collocation, which is consistent with our macroscopic result, where the tradeoff between purity and collocation is not as pronounced as for English.
The experiments show that our methods can be successfully applied to languages other than English, thereby supporting the claim that they are based on a set of language-independent assumptions and principles. Despite substantial differences between German and English grammar, both generally and in terms of the specific syntactic representation that was used, our methods increased F1 over the baseline for both languages and resulted in a similar tradeoff between purity and collocation. Improvements were observed in spite of pronounced data sparsity in the case of German. Recall that we had to reduce the number of syntactic initialization cues in order to be able to obtain results on the relatively small amount of gold-standard data. We would also like to note that porting our system to German did not require any additional feature engineering or algorithmic changes.
7. Conclusions
In this article we described an unsupervised method for semantic role induction in which argument-instance graphs are partitioned into clusters representing semantic roles. A major hypothesis underlying our work has been that semantic roles can be induced without human supervision from a corpus of syntactically parsed sentences based on three linguistic principles: (1) arguments in the same syntactic position (within a specific linking) bear the same semantic role, (2) arguments within a clause bear a unique role, and (3) clusters representing the same semantic role should be more or less lexically and distributionally equivalent. Based on these principles, we have formulated a similarity-driven model and introduced a multi-layer graph partitioning approach that represents similarity between clusters on multiple feature layers, whose connectivity can be analyzed separately and then combined into an overall cluster-similarity score.
Our work has challenged the established view that supervised learning is the method of choice for the semantic role labeling task. Although the proposed unsupervised models do not yet achieve results comparable to their supervised counterparts, we have been able to show that they consistently outperform the syntactic baseline across several data sets that combine automatic and gold parses with gold and automatic argument identification, in English and German. Our methods obtain F1 scores that are systematically above the baseline, and the purity of the induced clusters is considerably higher, although in most cases this increase in purity is achieved by decreasing collocation. In sum, these results provide strong empirical evidence for the soundness of our methods and the principles they are based on.
In terms of modeling, we have contributed to the body of work on similarity-driven models by demonstrating their suitability for this problem, their effectiveness, and their computational efficiency. The models are based on judgments regarding the similarity of argument instances with respect to their semantic roles. We showed that these judgments are comparatively simple to formulate and to incorporate into a graph representation of the data. We have introduced the idea of separating different similarity features into different graph layers, which resolves the problem faced by many similarity-based approaches of having to heuristically define an instance-wise similarity function, and brings the advantage that cluster similarities can be computed in a more principled way. Beyond semantic role labeling, we hope that the multi-layered graph representation described here may be of relevance to other unsupervised problems such as part-of-speech tagging or coreference resolution. The approach is general and amenable to other graph partitioning algorithms besides agglomeration and label propagation.
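To make the multi-layer idea concrete, the following schematic sketch represents each cluster by one feature distribution per layer (here a lexical and a syntactic layer), scores cluster pairs with a per-layer cosine similarity combined by a simple weighted sum, and merges the best-scoring pair greedily. The weights, threshold, and combination rule are illustrative placeholders rather than the scoring functions actually used in this article, and the sketch omits the clause-level role-uniqueness constraint enforced by our full method.

```python
import math
from collections import Counter

def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two feature-count vectors."""
    dot = sum(c1[f] * c2[f] for f in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def overall_similarity(a, b, weights=(0.7, 0.3)):
    """a, b: clusters represented as {'lexical': Counter, 'syntactic': Counter}.
    Each graph layer contributes its own cluster-level similarity; here they
    are combined by a placeholder weighted sum."""
    layers = ("lexical", "syntactic")
    return sum(w * cosine(a[l], b[l]) for w, l in zip(weights, layers))

def agglomerate(clusters, threshold=0.5):
    """Greedily merge the most similar cluster pair until no pair scores above
    the threshold: a schematic version of agglomerative partitioning.
    (The full method additionally prevents merges that would assign the same
    role to two arguments of one clause; omitted here.)"""
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = [(overall_similarity(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        score, i, j = max(pairs)
        if score < threshold:
            break
        merged = {l: clusters[i][l] + clusters[j][l] for l in clusters[i]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

A cluster in this toy representation could be, for instance, {'lexical': Counter({'price': 3, 'rate': 2}), 'syntactic': Counter({'SBJ': 5})}; keeping the layers separate is what allows each layer's connectivity to be assessed on its own before combination.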
There are two forms of data sparsity that arise in the context of our work, namely, the lexical sparsity of argument head lemmas and the sparsity of specific combinations of linking and syntactic position. As our methods are unsupervised, the conceptually simple solution to sparsity is to train on larger data sets. Because, with some modifications, our graph partitioning approaches could be scaled to data sets that are orders of magnitude larger, this is an obvious next step and would address both instances of data sparsity. Firstly, it would allow us to incorporate a richer set of syntactic features for initialization and would therefore necessarily result in initial clusterings of higher purity. Secondly, the larger size of clusters would result in more reliable similarity scores. Augmenting the data set would therefore almost surely increase the quality of the induced clusterings; however, we leave this to future work.
Another interesting future direction would be to eliminate the model's reliance on a syntactic parser, which prohibits its application to languages for which parsing resources are not available. It would therefore be worthwhile, albeit challenging, to build models that operate on more readily available forms of syntactic analysis or even raw text. For example, existing work (Abend and Rappoport 2010b; Abend, Reichart, and Rappoport 2009) attempts to identify arguments and to distinguish between core and adjunct ones through unsupervised part-of-speech and grammar induction. Besides reducing our model's reliance on annotated resources, it would also be interesting to see whether some form of weak supervision might help induce higher-quality semantic roles without incurring a major labeling effort. The ideas conveyed in this article and the proposed methods extend naturally to this setting: introducing labels on some of the graph vertices would translate into a semi-supervised graph-based learning task, akin to Zhu, Ghahramani, and Lafferty (2003).
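As an illustration of how such weak supervision might look, the sketch below implements a generic single-graph version of label propagation in the spirit of Zhu, Ghahramani, and Lafferty (2003): a handful of seed vertices carry gold role labels and remain clamped, while the remaining vertices repeatedly absorb the label distributions of their neighbours. The single similarity matrix is a simplifying assumption; a multi-layer variant would need to combine several such graphs.

```python
import numpy as np

def propagate_labels(W, seed_labels, n_labels, n_iter=100):
    """Schematic Zhu-style label propagation.  W: (n, n) symmetric similarity
    matrix over argument instances; seed_labels: dict mapping a few vertex
    indices to role-label indices (the weak supervision); n_labels: number of
    distinct roles.  Returns a role index for every vertex."""
    n = W.shape[0]
    Y = np.full((n, n_labels), 1.0 / n_labels)   # uniform label distributions
    for v, lab in seed_labels.items():
        Y[v] = np.eye(n_labels)[lab]             # clamp the seed vertices
    D_inv = 1.0 / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    for _ in range(n_iter):
        Y = D_inv * (W @ Y)                      # propagate along graph edges
        for v, lab in seed_labels.items():
            Y[v] = np.eye(n_labels)[lab]         # re-clamp the seeds
    return Y.argmax(axis=1)
```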
Appendix A. Argument Identification Rules
This appendix specifies the full set of relations used by Rules (2) and (4) of the argument identification rules given for English in Section 5.2, Table 1. The symbols ↑ and ↓ denote the direction of the dependency relation (upward and downward, respectively). The dependency relations are explained in Surdeanu et al. (2008), in their Table 4.
The relations in Rule (2) from Table 1 are IM↑↓, PRT↓, COORD↑↓, P↑↓, OBJ↑, PMOD↑, ADV↑, SUB↑↓, ROOT↑, TMP↑, SBJ↑, OPRD↑.
The relations in Rule (4) are ADV↑↓, AMOD↑↓, APPO↑↓, BNF↑↓, CONJ↑↓, COORD↑↓, DIR↑↓, DTV↑↓, EXT↑↓, EXTR↑↓, HMOD↑↓, IOBJ↑↓, LGS↑↓, LOC↑↓, MNR↑↓, NMOD↑↓, OBJ↑↓, OPRD↑↓, POSTHON↑↓, PRD↑↓, PRN↑↓, PRP↑↓, PRT↑↓, PUT↑↓, SBJ↑↓, SUB↑↓, SUFFIX↑↓, TMP↑↓, VOC↑↓.
Acknowledgments
We are grateful to the anonymous referees, whose feedback helped to substantially improve this article. We also thank the members of the Probabilistic Models reading group at the University of Edinburgh for helpful discussions and comments. We acknowledge the support of EPSRC (grant EP/K017845/1).
Notes
More precisely, A0 and A1 have a common interpretation across predicates as proto-agent and proto-patient in the sense of Dowty (1991).
We include parts of speech as a simple means of alleviating the sparsity of head words.
We define the terms purity and collocation more formally in Section 5.4.
A few supervised systems implement a similar definition (Koomen et al. 2005), although in most cases the argument identification component makes a final positive or negative decision regarding the status of an argument candidate.
German has (partially ambiguous) markers for Nominative, Accusative, Dative, and Genitive.
Author notes
Department of Computer Science, University of Geneva, 7 route de Drize, 1227 Carouge, Switzerland, E-mail: [email protected].
Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK, E-mail: [email protected].