## Abstract

The task of event coreference resolution plays a critical role in many natural language processing applications such as information extraction, question answering, and topic detection and tracking. In this article, we describe a new class of unsupervised, nonparametric Bayesian models with the purpose of probabilistically inferring coreference clusters of event mentions from a collection of unlabeled documents. In order to infer these clusters, we automatically extract various lexical, syntactic, and semantic features for each event mention from the document collection. Extracting a rich set of features for each event mention allows us to cast event coreference resolution as the task of grouping together the mentions that share the same features (they have the same participating entities, share the same location, happen at the same time, etc.).

Some of the most important challenges posed by the resolution of event coreference in an unsupervised way stem from (a) the choice of representing event mentions through a rich set of features and (b) the ability to model events described both within the same document and across multiple documents. Our first unsupervised model that addresses these challenges is a generalization of the hierarchical Dirichlet process. This new extension preserves the hierarchical Dirichlet process's ability to capture the uncertainty regarding the number of clustering components and, additionally, takes into account any finite number of features associated with each event mention. Furthermore, to overcome some of the limitations of this extension, we devised a new hybrid model, which combines an infinite latent class model with a discrete time series model. The main advantage of this hybrid model lies in its capability to automatically infer the number of features associated with each event mention from data and, at the same time, to perform an automatic selection of the most informative features for the task of event coreference. The evaluation performed for solving both within- and cross-document event coreference shows significant improvements of these models when compared against two baselines for this task.

## 1. Introduction

Event coreference resolution consists of grouping together the text expressions that refer to real-world events (also called **event mentions**) into a set of clusters such that all the mentions from the same cluster correspond to a unique event. The problem of event coreference is not new. It was originally studied in philosophy, where researchers tried to determine when two events are identical and when they are different. One relevant theory in this direction was proposed by Davidson (1969), who argued that two events are identical if they have the same causes and effects. Later on, a different theory was proposed by Quine (1985), who considered that each event is associated with a physical object (which is well defined in space and time), and therefore, two events are identical if their corresponding objects have the same spatiotemporal location. According to Malpas (2009), in the same year, Davidson abandoned his suggestion to embrace the Quinean theory on event identity (Davidson 1985).

Resolving event coreference is an essential requirement for many natural language processing (NLP) applications. For instance, in topic detection and tracking, event coreference resolution is required in order to identify new seminal events in broadcast news that have not been mentioned before (Allan et al. 1998). In information extraction, event coreference information was used for filling predefined template structures from text documents (Humphreys, Gaizauskas, and Azzam 1997). In question answering, a novel method of mapping event structures was used in order to provide answer justification (Narayanan and Harabagiu 2004). The same idea of mapping event structures was used in a graph-matching approach for enhancing textual entailment (Haghighi, Ng, and Manning 2005). Event coreference information was also used for detecting contradictions in text (de Marneffe, Rafferty, and Manning 2008).

Previous NLP approaches for solving event coreference relied on supervised learning methods that explore various linguistic properties in order to decide if a pair of event mentions is coreferential or not (Humphreys, Gaizauskas, and Azzam 1997; Bagga and Baldwin 1999; Ahn 2006; Chen and Ji 2009; Chen, Su, and Tan 2010b). In spite of being successful for a particular labeled corpus, in general, these pairwise models are dependent on the domain or language that they are trained on. For instance, in order to adapt a supervised system to run over a collection of documents written in a different language or belonging to a different domain of interest, at least a minimal annotation effort needs to be performed (Daumé III 2007). Furthermore, because these models are dependent on local pairwise decisions, they are unable to capture a global event distribution at the topic- or document-collection level.

To address these limitations, we departed from the idea of using supervised approaches for event coreference resolution and explored how a new class of unsupervised, nonparametric Bayesian models can be used to probabilistically infer coreference clusters of event mentions from a collection of unlabeled documents. In addition, because an event can be mentioned multiple times in a document collection and its mentions may occur both in the same document and across multiple documents, we designed our unsupervised models to solve the two subproblems of **within-document** and **cross-document** event coreference resolution. In order to evaluate the unsupervised models for these two subproblems, we annotated a new data set encoding both within- and cross-document event coreference information.

Besides our contribution of using unsupervised methods to solve within- and cross-document event coreference, in this article we present novel Bayesian models that provide a more flexible framework for representing data than current models. By starting from the generic problem of clustering observable linguistic objects (i.e., event mentions) encoded into a large collection of text documents where the clusters (i.e., events) can be shared across documents, we devised our unsupervised models such that they provide solutions to the following four desiderata:

1) We prefer the number of clusters (denoted by *K*) to be probabilistically inferred from data rather than to be assigned to an a priori fixed value. This desideratum of allowing *K* to be a free parameter in the Bayesian models devised for our problem constitutes a more realistic approach because, in general, document collections encode an unspecified number of latent linguistic structures.

2) We redefine the task of finding clusters of mentions that refer to the same events as the task of identifying those mentions that share the same **event participants** and the same **event properties**. For example, the same entity must participate in all the event mentions that are coreferential; also, all the coreferential mentions must have the same spatiotemporal location. These characteristics extracted for each event mention from text are also called **linguistic features** and, in general, the event mentions corresponding to each of these clusters are characterized by a large set of features. Because of this, we want the generative process associated with each Bayesian model to adapt automatically every time a new feature is added in the feature extraction phase.

3) Although each event mention is represented as a feature-rich linguistic object, there is no guarantee that all the features that describe event mentions have a positive impact for the task of event coreference. Some of these features may be redundant or may increase the complexity of the Bayesian models solving this task and, consequently, they may contribute to lowering the overall performance of event coreference. To address these problems, we wish to incorporate into the Bayesian models a feature selection mechanism that is able to automatically build a set of the most salient features from the initial feature set such that only these salient features will participate in the process of clustering event mentions. In this regard, we assume that a feature is salient if it corresponds to a large number of samples in the generative process. We denote the size of the salient feature set by *M*. Furthermore, in spite of the fact that the initial feature space describing event mentions can have an unbounded number of features, we want the set of salient features to be finite (i.e., *M*–finite) at any given point in time during the generative process corresponding to each Bayesian model.

4) Finally, we also want our Bayesian models to capture the structural dependencies of the observable objects. In this way, the models can take advantage of the sequential order in which the event mentions are generated inside each document.

It is worth pointing out that the generic problem described here can be instantiated by tasks not only from computational linguistics but also from other research areas. For instance, in biomedical informatics, clinical researchers can use the new Bayesian models to perform studies over various cohorts of patients. In this configuration, the observations to be clustered correspond to patients, and the features associated with the patients can be extracted from clinical reports or can be represented by structured clinical information (e.g., white blood cells, temperature, heart rate, respiratory rate, sputum culture). Another instance of the generic problem described here is from data mining. In this domain, clustering tasks can be performed over structured information stored in large tables (e.g., products, restaurants, hotels). For this type of problem, each object is associated with a row in a table and the features correspond to table columns.

## 2. Related Work

Unlike entity coreference resolution, event coreference resolution is a less-studied task. One reason is that events are expressed in many more varied linguistic constructs. For example, event mentions are typically predications that require more complex lexico-semantic processing, and furthermore, the capability of extracting features that characterize them has been available only since semantic parsers based on PropBank (Palmer, Gildea, and Kingsbury 2005) and FrameNet (Baker, Fillmore, and Lowe 1998) corpora have been developed. In contrast, entity coreference resolution has been intensively studied and many successful techniques for identifying mention clusters have been developed (Cardie and Wagstaff 1999; Haghighi and Klein 2009; Stoyanov et al. 2009; Haghighi and Klein 2010; Raghunathan et al. 2010; Rahman and Ng 2011).

Although entity coreference resolution has received much attention from computational linguistics researchers, there is only limited work that incorporates event-related information to solve entity coreference, typically by considering the verbs that are present in the context of a referring entity as features. For instance, Haghighi and Klein (2010) include the governor of the head of nominal mentions as features in their model. Rahman and Ng (2011) used event-related information by looking at which semantic role the entity mentions can have and the verb pairs of their predicates. More recently, Lee et al. (2012) proposed an approach to jointly model event and entity coreference by allowing information from event coreference to help entity coreference, and the other way around. Their supervised method uses a high-precision entity resolution method based on a collection of deterministic models (called sieves) to produce both entity and event clusters that are optimally merged using linear regression. A similar technique that treated entity and event coreference resolution jointly was reported in He (2007) using narrative clinical data.

Research that aimed at resolving only event coreference was initiated by the template merging task required in MUC evaluations and was primarily focused on scenario-specific events (Humphreys, Gaizauskas, and Azzam 1997; Bagga and Baldwin 1999). More recently, various supervised approaches using a mention-pair probabilistic framework (Ahn 2006), spectral graph clustering (Chen and Ji 2009), and tree kernel–based methods (Chen, Su, and Tan 2010b) have been used to solve event coreference. Tree kernel–based methods have also been used to solve a special case of event coreference resolution called event pronoun resolution (Chen, Su, and Tan 2010a; Kong and Zhou 2011). To the best of our knowledge, the framework for solving event coreference presented in this article, extending the approach reported in Bejan and colleagues (Bejan et al. 2009; Bejan and Harabagiu 2010), is the only line of research on event coreference resolution that uses fully unsupervised methods and is based on Bayesian models.

Over the past years, Bayesian models have been extensively used for the purpose of solving similar problems or subproblems of the generic problem presented in the previous section. In 2003, Blei, Ng, and Jordan proposed a parametric approach, called **latent Dirichlet allocation** (LDA), for automatically learning probability distributions of words corresponding to a specific number of latent classes (or **topics**) from a large collection of text documents. In this latent class model, documents are expressed as probabilistic mixtures of topics, while each topic is assigned a multinomial distribution over the words from the entire document collection. This approach also uses an exchangeability assumption by modeling the documents as bags of words. The LDA model and variations of it have been used in many applications such as topic modeling (Blei, Ng, and Jordan 2003; Griffiths and Steyvers 2004), word sense disambiguation (Boyd-Graber, Blei, and Zhu 2007), object categorization from a collection of images (Sivic et al. 2005; Sivic et al. 2008), image classification into scene categories (Li and Perona 2005), discovery of event scenarios from text documents (Bejan 2008; Bejan and Harabagiu 2008b), and attachment of attributes to a concept ontology (Reisinger and Paşca 2009). The LDA model, although attractive, has the disadvantage of requiring a priori knowledge regarding the number of latent classes.

A more suitable approach for solving our problem is the **hierarchical Dirichlet process** (HDP) model described in Teh et al. (2006). Like LDA, this model considers problems that involve groups of data, where each observable object is sampled from a mixture model and each mixture component is shared across groups. However, the HDP mixture model is a nonparametric generalization of LDA that is also able to automatically infer the number of clustering components *K* (the first desideratum for our problem). It consists of a set of **Dirichlet processes** (DPs) (Ferguson 1973), in which each DP is associated with a group of data. In addition, these DPs are coupled through a common random base measure which is itself distributed according to a DP. Because a DP provides a nonparametric prior for the number of classes *K*, the HDP setting allows for this number to be unbounded in each group. More recently, various approaches have been proposed to improve the existing HDP inference algorithms (Wang, Paisley, and Blei 2011; Bryant and Sudderth 2012). HDP has been used in a wide variety of applications such as maneuvering target tracking (Fox, Sudderth, and Willsky 2007), visual scene analysis (Sudderth et al. 2008), information retrieval (Cowans 2004), entity coreference resolution (Haghighi and Klein 2007; Ng 2008), event coreference resolution (Bejan et al. 2009; Bejan and Harabagiu 2010), word segmentation (Goldwater, Griffiths, and Johnson 2006), and construction of stochastic context-free grammars (Finkel, Grenager, and Manning 2007; Liang et al. 2007).
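To make the role of a DP prior concrete, the following minimal sketch simulates the Chinese restaurant process, a standard sequential view of a single DP. It is not the HDP inference procedure used in this article; the function name `crp_assignments` and the concentration value `alpha=1.0` are illustrative assumptions. The sketch only shows that the number of clusters *K* emerges from the draw instead of being fixed in advance.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Draw cluster assignments for n_items from a Chinese restaurant process.

    alpha is the DP concentration parameter (illustrative value below).
    Returns (assignments, sizes), where sizes[k] is the size of cluster k.
    """
    rng = random.Random(seed)
    assignments = []
    sizes = []
    for _ in range(n_items):
        # An existing cluster k is chosen with weight sizes[k];
        # a new cluster is opened with weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)      # open a new cluster
        else:
            sizes[k] += 1        # join an existing cluster
        assignments.append(k)
    return assignments, sizes

assignments, sizes = crp_assignments(n_items=100, alpha=1.0)
print("number of clusters K drawn from the prior:", len(sizes))
```

Larger values of `alpha` tend to open more clusters; in the HDP, one such process per document is tied to a shared top-level process so that clusters can be reused across documents.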

Although infinite latent class models like HDP have the advantage of automatically inferring the number of categorical outcomes *K*, they are still limited in representing feature-rich objects. Specifically, in their original form, they are not able to model the data such that each observable object can be generated from a combination of multiple features. For example, in HDP, each data point is represented only by its corresponding word. For this reason, we built new Bayesian models on top of already-existing models with the main goal of providing a more flexible framework for representing data. The first model extends the HDP model such that it takes into account additional linguistic features associated with event mentions. This extension is performed by using a conditional independence assumption between the observed random variables corresponding to object features. Thus, instead of considering as features only the words that express the event mentions (which is the way an observable object is represented in the original HDP model), we devised an HDP extension that is also able to represent features such as location, time, and agent for each event mention. This extension was inspired by the fully generative Bayesian model proposed by Haghighi and Klein (2007). However, Haghighi and Klein's model was strictly customized for the task of entity coreference resolution. As also noted in Ng (2008) and Poon and Domingos (2008), whenever new features need to be considered in Haghighi and Klein's model, the extension becomes a challenging task. Also, Daumé III and Marcu (2005) performed related work in this direction by proposing a generative model for solving supervised clustering problems.
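The conditional independence assumption can be read as a factorized likelihood: given its cluster, the probability of a mention is the product of one term per feature type. The sketch below illustrates this reading under explicit assumptions and is not the authors' sampler: it uses symmetric additive smoothing over per-cluster counts, and the names `ClusterCounts`, `beta`, and `vocab_sizes`, as well as the toy feature values, are hypothetical.

```python
import math
from collections import defaultdict

class ClusterCounts:
    """Per-cluster counts of feature values, kept separately per feature type."""

    def __init__(self, vocab_sizes, beta=0.1):
        self.vocab_sizes = vocab_sizes   # feature type -> number of distinct values
        self.beta = beta                 # hypothetical smoothing hyperparameter
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def add(self, mention):
        """mention: dict mapping feature type -> observed feature value."""
        for ftype, value in mention.items():
            self.counts[ftype][value] += 1
            self.totals[ftype] += 1

    def log_likelihood(self, mention):
        """log P(mention | cluster) as a sum of per-feature-type log terms."""
        total = 0.0
        for ftype, value in mention.items():
            num = self.counts[ftype].get(value, 0) + self.beta
            den = self.totals[ftype] + self.beta * self.vocab_sizes[ftype]
            total += math.log(num / den)
        return total

# Toy cluster with two mentions sharing the same agent.
cluster = ClusterCounts(vocab_sizes={"hl": 3, "agent": 2})
cluster.add({"hl": "buy", "agent": "HP"})
cluster.add({"hl": "acquire", "agent": "HP"})

same_agent = cluster.log_likelihood({"hl": "buy", "agent": "HP"})
other_agent = cluster.log_likelihood({"hl": "buy", "agent": "AMD"})
print(same_agent > other_agent)
```

Under this factorization, supporting a new feature type only requires adding another key to each mention dictionary, which is the kind of flexibility the second desideratum asks for.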

As an alternative to the HDP model, an important extension of latent class models that are able to represent feature-rich objects is the **Indian buffet process** (IBP) model presented in Griffiths and Ghahramani (2005). The IBP model defines a distribution over infinite binary sparse matrices that can be used as a nonparametric prior on the features associated with observable objects. Moreover, extensions of this model were considered in order to provide a more flexible approach for modeling the data. For example, the **Markov Indian buffet process** (mIBP) (Van Gael, Teh, and Ghahramani 2008) was defined as a distribution over an unbounded set of binary Markov chains, where each chain can be associated with a binary latent feature that evolves over time according to Markov dynamics. Also, the **phylogenetic Indian buffet process** (pIBP) (Miller, Griffiths, and Jordan 2008) was created as a non-exchangeable, nonparametric prior for latent feature models, where the dependencies between objects were expressed as tree structures. Examples of applications that utilized these models are: identification of protein complexes (Chu et al. 2006), modeling of dyadic data (Meeds et al. 2006), modeling of choice behavior (Görür, Jäkel, and Rasmussen 2006), and event coreference resolution (Bejan et al. 2009; Bejan and Harabagiu 2010).

Our extension of the HDP model still does not fulfill all the desiderata for the generic problem introduced in Section 1. It still requires a mechanism to automatically select a finite set of salient features that will be used in the clustering process (third desideratum) as well as a mechanism for capturing the structural dependencies between objects (fourth desideratum). To overcome these limitations, we created two additional models. First, we incorporated the mIBP framework into our HDP extension to create the **mIBP–HDP** model. And second, we coupled an infinite latent feature model with an infinite latent class model into a new discrete time series model. For the infinite latent feature model, we chose the **infinite factorial hidden Markov model** (iFHMM) (Van Gael, Teh, and Ghahramani 2008) coupled with the mIBP mechanism in order to represent the latent features as an infinite set of parallel Markov chains; for the infinite latent class model, we chose the **infinite hidden Markov model** (iHMM) (Beal, Ghahramani, and Rasmussen 2002). We call this new hybrid the **iFHMM–iHMM** model.

### 2.1 Contribution

This article represents an extension of our previous work on unsupervised event coreference resolution (Bejan et al. 2009; Bejan and Harabagiu 2010). In this work, we present more details on the problem of solving both within- and cross-document event coreference as well as describe a generic framework for solving this type of problem in an unsupervised way. As data sets, we consider three different resources, including our own corpus (which is the only corpus available that encodes event coreference annotations across and within documents). In the next section, we provide additional information on how we performed the annotation of this corpus. Another major contribution of this article is an extended description of the unsupervised models for solving event coreference. In particular, we focused on providing further explanations about the implementation of the mIBP framework as well as its integration into the HDP and iHMM models. Finally, in this work, we significantly extended the experimental results section, which also includes a novel set of experiments performed over the OntoNotes English corpus (LDC-ON 2007).

## 3. Event Coreference Data Sets

Because our nonparametric Bayesian models are also unsupervised, they do not require the data set(s) on which they are trained to be annotated with event coreference information. The only requirement for them to infer coreference clusters of event mentions is to have the observable objects (i.e., the event mentions) identified in the order they occur in the documents as well as to have all the linguistic features associated with these objects extracted. However, in order to see how well these models perform, we need to compare their results with manually annotated clusters of event mentions. For this purpose, we evaluated our models on three different data sets annotated with event coreference information.

The first data set was used for the event coreference evaluations performed in the automatic content extraction (ACE) task (LDC-ACE 2005). This resource contains only a restricted set of event types such as life, business, conflict, and justice. As a second data set, we used the OntoNotes English corpus (release 2.0), a more diverse resource that provides a larger coverage of event (and entity) annotations. The utilization of the ACE and OntoNotes corpora for evaluating our event coreference models is, however, limited because these resources provide only within-document event coreference annotations. For this reason, as a third data set, we created the **EventCorefBank** (ECB) corpus^{1} to increase the diversity of event types and to be able to evaluate our models for both within- and cross-document event coreference resolution. Recently, Lee et al. (2012) extended the EventCorefBank corpus with entity coreference information and additional annotations of event coreference.

One important step in the creation process of the ECB corpus consists of finding sets of related documents that describe the same **seminal event**^{2} such that the annotation of coreferential event mentions across documents is possible. In this regard, we searched the Google News archive^{3} for various topics whose description contains keywords such as *commercial transaction*, *attack*, *death*, *sports*, *announcement*, *terrorist act*, *election*, *arrest*, *natural disaster*, and so on, and manually selected sets of Web documents describing the same seminal event for each of these topics. In a subsequent step, for every Web document, we automatically tokenized and split the textual content into sentences, and saved the preprocessed data in a uniquely identified text file. Next, we manually annotated a limited set of events in each text file in accordance with the TimeML specification (Pustejovsky et al. 2003a). To mark the event mentions and the coreferential relations between them we utilized the Callisto^{4} and Tango^{5} annotation tools, respectively. Additional details regarding the annotation process for creating the ECB resource are described in Bejan and Harabagiu (2008a).

Several annotation fragments from ECB are shown in Example (1). In this example, event mentions are annotated at the sentence level, sentences are grouped into documents, and the documents describing the same seminal event are organized into topics. The topics shown in Example (1) describe the seminal event of arresting sea pirates by a Navy warship (topic 12), the event of buying ATI by AMD (topic 43), the event of buying EDS by HP (topic 44), and the event of arresting a reputed football player (topic 55). When taken out of context, the event mentions annotated in this example refer only to two **generic events**: *arrest* and *buy*. On the other hand, when these mentions are contextually associated with the event properties expressed in Example (1), five **individuated events** can be distinguished: *e*_{1}={*em*_{2}, *em*_{3}}, *e*_{2}={*em*_{4 − 7}, *em*_{9}}, *e*_{3}={*em*_{8}}, *e*_{4}={*em*_{1}}, and *e*_{5}={*em*_{10}, *em*_{11}, *em*_{12}}. For example, *em*_{4 − 7} are event mentions referring to the same real event (of buying EDS by HP), whereas *em*_{2} (*buy*) and *em*_{4}(*buy*) correspond to different individuated events because they have a different Agent (i.e., Buyer(*em*_{2})=*AMD* is different from Buyer(*em*_{4})=*HP*). Similarly, the mentions *em*_{1}(*nabbed*) and *em*_{12} (*apprehended*) do not corefer because they correspond to different spatial and temporal locations (e.g., Location(*em*_{1})=*Gulf of Aden* is different from Location(*em*_{12})=*San Diego*).

This organization of event mentions leads to the idea of creating an event hierarchy such as the one illustrated in Figure 1. Specifically, this figure depicts the hierarchy of the events described in Example (1). In this hierarchy, the nodes on the first level correspond to event mentions (e.g., *em*_{11} corresponds to *arrested*), the nodes on the second level correspond to individuated events (e.g., *e*_{5} subsumes all the event mention nodes that refer to the arrest of Vincent Jackson), and, finally, the nodes on the third level correspond to generic events (e.g., the node *arrest* contains all possible arrest events). In this article, our focus is to discover the nodes on the second level of this hierarchy.

As can be seen from Example (1), solving the event coreference problem poses many interesting challenges. For instance, in order to solve the coreference chain of event mentions that refer to the event *e*_{2}, we need to take into account the following issues: (i) a coreference chain can encode both within- and cross-document coreference information; (ii) two mentions from the same chain can have different word classes (e.g., *em*_{4}(*buy*)–verb, *em*_{5}(*purchase*)–noun); (iii) not all the mentions from the same chain are synonymous (e.g., *em*_{4}(*buy*) and *em*_{9}(*acquire*)), although a semantic relation might exist between them (e.g., in WordNet [Fellbaum 1998], the genus of *buy* is *acquire*); (iv) not all the properties associated with an event mention are expressed in text (e.g., all the properties of *em*_{5}(*purchase*) are omitted). In Section 7, we discuss additional challenges of the event coreference problem that are not observed in Example (1).

## 4. Linguistic Features for Event Coreference Resolution

The main idea for solving event coreference is to identify the event mentions (from the same or different documents) that share the same characteristics (e.g., all the mentions in a cluster convey the same meaning in text, have the same participants, and happen at the same spatial and temporal location). Moreover, finding clusters of event mentions that share the same characteristics is identical to finding clusters of mention features that correspond to the same real event. For instance, Figure 2 depicts five clusters of linguistic features that characterize the five individuated events from Example (1). As can be observed, each individuated event corresponds to a subset of features that are usually common to all the mentions referring to it. For this purpose, we extracted various linguistic features associated with each event mention from the ACE, OntoNotes, and ECB corpora.

Before describing in detail all the categories of linguistic features considered for solving event coreference, we would like to emphasize that we make a clear distinction between the notions of **feature type** and **feature value** throughout this article. A feature type is represented by a characteristic that can be extracted with a specific methodology and is associated with at least two feature values. For instance, the feature values corresponding to the feature type word consist of all the distinct words extracted from a given data set. In order to differentiate between identical values of different feature types, we prefix each feature value with the name of its corresponding feature type (e.g., word:*play*).

### 4.1 Lexical Features (LF)

We capture the lexical context of an event mention by extracting the following features: the head word (hw), the lemmatized head word (hl), the lemmatized left and right words surrounding the mention (lhl, rhl), and the hl features corresponding to the left and right mentions (lhe, rhe). For instance, the lexical features extracted for the event mention *em*_{8}(*bought*) from our example are hw:*bought*, hl:*buy*, lhl:*it*, rhl:*Compaq*, lhe:*acquisition*, and rhe:*acquire*.
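The lexical features above can be sketched as a small extraction routine. This is only an illustration under stated assumptions: the left and right words are taken to be the immediate neighbors of the head word, the lemma table `LEMMAS` is a toy stand-in for a real lemmatizer, the function name `lexical_features` is hypothetical, and the token sequence is an invented toy sentence (not the ECB text) chosen so that the output matches the feature values listed for *em*_{8}.

```python
# Toy lemma table; a real system would use a proper lemmatizer.
LEMMAS = {"bought": "buy"}

def lemma(token):
    return LEMMAS.get(token, token)

def lexical_features(tokens, mention_positions, idx):
    """Lexical features for the mention whose head is at mention_positions[idx].

    tokens: list of word tokens.
    mention_positions: sorted token indices of event mention head words.
    """
    pos = mention_positions[idx]
    feats = {"hw": tokens[pos], "hl": lemma(tokens[pos])}
    if pos > 0:                                   # lemmatized left word
        feats["lhl"] = lemma(tokens[pos - 1])
    if pos + 1 < len(tokens):                     # lemmatized right word
        feats["rhl"] = lemma(tokens[pos + 1])
    if idx > 0:                                   # hl of the left mention
        feats["lhe"] = lemma(tokens[mention_positions[idx - 1]])
    if idx + 1 < len(mention_positions):          # hl of the right mention
        feats["rhe"] = lemma(tokens[mention_positions[idx + 1]])
    # Prefix every value with its feature type, e.g. "hl:buy".
    return {ftype: f"{ftype}:{value}" for ftype, value in feats.items()}

tokens = ["the", "acquisition", "came", "after", "it", "bought",
          "Compaq", "to", "acquire", "servers"]
mention_positions = [1, 5, 8]   # acquisition, bought, acquire
feats = lexical_features(tokens, mention_positions, 1)
print(feats)
```

Mentions at the start or end of a document simply lack the corresponding left or right features in this sketch; how such boundary cases are encoded in the actual system is not specified here.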

### 4.2 Class Features (CF)

This category of features aims to group mentions into several types of classes: the part-of-speech of the hw feature (pos), the word class of the hw feature (hwc), and the event class of the mention (ec). The hwc feature type is associated with the following four feature values: verb, noun, adjective, and other. As feature values for the ec feature type, we consider the seven event classes defined in the TimeML specification language (Pustejovsky et al. 2003a): occurrence, perception, reporting, aspectual, state, i_action, and i_state. To extract all these event classes for all the event mentions, we used an event identifier trained on the TimeBank corpus (Pustejovsky et al. 2003b), a linguistic resource encoding temporal elements such as events, time expressions, and temporal relations. More details about this event identifier are described in Bejan (2007).

### 4.3 WordNet Features (WF)

In our efforts to create clusters of attributes corresponding to event mentions as close as possible to the true attribute clusters of the individuated events, we built two sets of word clusters using the entire lexical information from the WordNet database. After creating these sets of clusters, we associated each event mention with only one cluster from each set. For the first set, we used the transitive closure of the WordNet synonymy relation to form clusters with all the words from WordNet (wns). For instance, the verbs *buy* and *purchase* correspond to the same cluster ID because there exists a chain of synonymy relations between them in WordNet. For the second set, we considered as grouping criteria the categorization of words from the WordNet lexicographer's files (wnl). In addition, for each word that is not represented in WordNet, we created a new cluster ID in each set of clusters.
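The transitive closure used for the wns feature can be computed with a union-find structure. The sketch below is a minimal illustration, not the authors' implementation: the function name `synonym_clusters` is hypothetical, and the synonym pairs are toy data standing in for relations read from WordNet.

```python
def synonym_clusters(synonym_pairs, vocabulary):
    """Map every word to a cluster ID via the transitive closure of synonymy.

    Words in `vocabulary` that appear in no pair keep a cluster of their own,
    mirroring the treatment of words that are not represented in WordNet.
    """
    parent = {w: w for w in vocabulary}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]   # path halving
            w = parent[w]
        return w

    for a, b in synonym_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra                 # merge the two clusters

    roots = {}
    return {w: roots.setdefault(find(w), len(roots)) for w in parent}

# Toy synonym pairs (assumed for illustration, not extracted from WordNet).
pairs = [("buy", "purchase"), ("purchase", "take"), ("arrest", "apprehend")]
vocab = ["buy", "purchase", "take", "arrest", "apprehend", "EDS"]
ids = synonym_clusters(pairs, vocab)
print(ids)
```

Here *buy* and *take* share a cluster ID even though they are linked only through *purchase*, which is exactly the effect of taking the transitive closure.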

### 4.4 Semantic Features (SF)

To extract features that characterize participants and properties of event mentions, we used the semantic parser described in Bejan and Hathaway (2007). One category of semantic features that we identified for event mentions is the **predicate argument structures** encoded in the PropBank annotations (Palmer, Gildea, and Kingsbury 2005). The predicate argument structures in PropBank are represented by events (or verbs) and by the semantic roles (or **predicate arguments**) associated with these events. For example, arg0 annotates a specific type of semantic role which represents the agent, doer, or actor of a specific event. Another argument is arg1, which plays the role of the patient, theme, or experiencer of an event. In Example (1), for instance, the predicate arguments associated with the event mention *em*_{8}(*bought*) are arg0:[*it*], arg1:[*Compaq Computer Corp.*], arg3:[*for $19 billion*], and arg-tmp:[*in 2002*].

Event mentions are not only expressed as verbs in text, but they also can occur as nouns and adjectives. Therefore, for a better coverage of semantic features, we also used the semantic annotations encoded in the FrameNet corpus (Baker, Fillmore, and Lowe 1998). FrameNet annotates word expressions capable of evoking conceptual structures, or **semantic frames**, which describe specific situations, objects, or events. The semantic roles associated with a word in FrameNet, or **frame elements**, are locally defined for the semantic frame evoked by the word. In general, the words annotated in FrameNet are expressed as verbs, nouns, and adjectives.

To preserve the consistency of the semantic role features, we aligned the frame elements to the predicate arguments by running the PropBank semantic parser on the manual annotations from FrameNet as well as running the FrameNet parser on the PropBank annotations. Moreover, to obtain a better alignment for each semantic role, we ran both parsers on a large amount of unlabeled text. The result of this process is a map with all frame elements statistically aligned to all predicate arguments. For instance, in 99.7% of the cases the frame element buyer of the semantic frame Commerce Buy is mapped to arg0, and in the remaining 0.3% of the cases to arg1. Additionally, we used this map to create a more general semantic feature that assigns a frame element label to each predicate argument. Examples of semantic features for the *em*_{8} mention are arg0:buyer, arg1:goods, arg3:money, and arg-tmp:time.

Another two semantic features used in our experiments are: (1) the semantic frame (fr) evoked by every mention in the data set, since in general, frames are able to capture properties of generic events (Lowe, Baker, and Fillmore 1997); and (2) the wns feature applied to the head word of every semantic role (e.g., wsarg0, wsarg1).

### 4.5 Feature Combinations (FC)

We also explored various combinations of the given features. For instance, the feature resulting from the combination of the hw and hwc feature types for *em*_{8}(*bought*) in Example (1) is hw+hwc:*bought*+verb. Examples of additional feature combinations we experimented with are hl+fr, hw+pos, fr+pos+ec, fe+arg1, and so forth.
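A combined feature is just the concatenation of the feature type names and of the corresponding feature values. The sketch below shows one possible encoding; the helper `combine` and the values stored for `pos`, `fr`, and `ec` are assumptions made for illustration only.

```python
def combine(mention, feature_types):
    """Build a combined feature such as 'hw+hwc:bought+verb'."""
    name = "+".join(feature_types)
    value = "+".join(mention[ft] for ft in feature_types)
    return f"{name}:{value}"


# Illustrative feature values for one mention (pos, fr, and ec are assumed).
em8 = {
    "hw": "bought",
    "hl": "buy",
    "hwc": "verb",
    "pos": "VBD",
    "fr": "Commerce_buy",
    "ec": "occurrence",
}

hw_hwc = combine(em8, ["hw", "hwc"])
fr_pos_ec = combine(em8, ["fr", "pos", "ec"])
```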

## 5. Finite Feature Models

In this section, we first present HDP, a nonparametric Bayesian model that is capable of clustering objects based on one feature type (i.e., word); then, we introduce a novel extension of this model that describes an algorithm for clustering objects characterized by multiple feature types.

The HDP models take as input a collection of *I* documents, where each document *i* has *J*_{i} event mentions. Each event mention is characterized by *L* feature types (ft), and each feature type is represented by a finite vocabulary of feature values (*fv*). For example, the feature values extracted from an event coreference data set and associated with the feature type hw constitute all possible head words of the event mentions annotated in the data set. Therefore, we can represent the observable properties of an event mention as a vector of pairs 〈(ft_{1}:*fv*_{1i}), …, (ft_{L}:*fv*_{Li})〉, where each feature value index *i* ranges over the feature value space of its corresponding feature type. In the description of these models, we also consider **Z**: the set of indicator random variables for indices of events (i.e., an array of size equal to the number of event mentions in the document collection, where *Z*_{i,j} represents the event index of the event mention *j* from the document *i*); *φ*_{z}: the set of parameters associated with an event *z*; *φ*: a notation for all model parameters; and **X**: a notation for all random variables that represent observable features. As already introduced in Section 1, we denote by *K* the total number of latent events.

### 5.1 The HDP_{1f} Model

The one-feature model, denoted here as HDP_{1f}, constitutes the simplest representation of an HDP model. In this model, depicted graphically in Figure 3(a), the observable components are characterized by only one feature type (e.g., the head lemma corresponding to each event mention). The distribution over events associated with each document, β, is generated by a Dirichlet process with a concentration parameter *α* > 0. Because this setting enables a clustering of event mentions at the document level, it is desirable that events be shared across documents and the number of events, *K*, be inferred from data. To ensure this flexibility, a global nonparametric DP prior with a hyperparameter *γ* and a global base measure *H* can be considered for β (Teh et al. 2006). The global distribution drawn from this DP prior, denoted as β_{0} in Figure 3(a), encodes the event mixing weights. Thus, the same global events are used for each document, but each event has a document-specific distribution *β*_{i} that is drawn from a DP prior centered on β_{0}.

To infer the posterior *P*(**Z**|**X**), we followed Teh et al. (2006) and used a **Gibbs sampling algorithm** (Geman and Geman 1984) based on the direct assignment sampling scheme. In this sampling scheme, the β and φ parameters are integrated out analytically. The formula for sampling an event index for mention *j* from document *i*, *Z*_{i,j}, is given by:^{6}

$$P(Z_{i,j} \mid \mathbf{Z}^{-i,j}, \mathbf{HL}) \propto P(Z_{i,j} \mid \mathbf{Z}^{-i,j}) \, P(HL_{i,j} \mid \mathbf{Z}, \mathbf{HL}^{-i,j})$$

where *HL*_{i,j} is the head lemma of event mention *j* from document *i*.

The event index *z* is sampled by using a mechanism that facilitates sampling from a prior for infinite mixture models called the **Chinese restaurant franchise** (CRF) representation, as reported in Teh et al. (2006):

$$P(Z_{i,j} = z \mid \mathbf{Z}^{-i,j}, \beta_0) \propto \begin{cases} \alpha \beta_0^{u}, & \text{if } z = z_{new} \\ n_z + \alpha \beta_0^{z}, & \text{otherwise} \end{cases}$$

In this formula, *n*_{z} is the number of event mentions with event index *z*, *z*_{new} is a new event index not used already in **Z**^{−i,j}, *β*_{0}^{z} are the global mixing proportions associated with the *K* events, and *β*_{0}^{u} is the weight for the unknown mixture component.

For the likelihood factor (**X** = 〈**HL**〉), the event *z* is associated with a multinomial emission distribution over the hl feature values having the parameters *φ*_{z}. We assume that this emission distribution is drawn from a symmetric Dirichlet distribution with concentration *λ*_{HL}:

$$P(HL_{i,j} = hl \mid \mathbf{Z}, \mathbf{HL}^{-i,j}) \propto n_{hl,z} + \lambda_{HL} \tag{5}$$

where *HL*_{i,j} is the head lemma of mention *j* from document *i*, and *n*_{hl,z} is the number of times the feature value *hl* has been associated with the event index *z* in (**Z**, **HL**^{−i,j}).
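One Gibbs step of the one-feature model multiplies a CRF-style prior by a Dirichlet-multinomial likelihood and normalizes over the existing events plus one new event. The sketch below is a minimal illustration, not the authors' implementation. It assumes the standard normalized predictive (count plus concentration, divided by the event total plus vocabulary size times concentration), per-document event counts in the prior as in the direct assignment scheme of Teh et al. (2006), and toy numbers for every argument; all names are hypothetical.

```python
def event_posterior(doc_counts, beta0, beta_u, alpha,
                    hl_counts, event_totals, hl, lam, vocab_size):
    """Normalized P(Z_ij = z | ...) over existing events; the last entry is a new event."""
    weights = []
    for z in range(len(beta0)):
        # Prior factor: mentions of event z in this document plus alpha * beta_0^z.
        prior = doc_counts[z] + alpha * beta0[z]
        # Likelihood factor: smoothed count of head lemma hl under event z.
        like = (hl_counts[z].get(hl, 0) + lam) / (event_totals[z] + vocab_size * lam)
        weights.append(prior * like)
    # New event: no counts yet, so the predictive is uniform over the vocabulary.
    weights.append(alpha * beta_u * (1.0 / vocab_size))
    total = sum(weights)
    return [w / total for w in weights]


# Toy state with two existing events (all numbers are assumed).
probs = event_posterior(
    doc_counts=[3, 0],
    beta0=[0.5, 0.3],
    beta_u=0.2,
    alpha=1.0,
    hl_counts=[{"buy": 4, "acquire": 1}, {"say": 6}],
    event_totals=[5, 6],
    hl="buy",
    lam=0.1,
    vocab_size=10,
)
```

A new event index would then be drawn from `probs`, for example with `random.choices`, and the counts updated accordingly.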

### 5.2 The HDP_{flat} Model

A model in which observable components are represented only by one feature type has the tendency to cluster these components based on their corresponding feature values. This model may produce good results for tasks such as topic discovery where the linguistic objects rely only on lexical information. Because event coreference involves clustering complex objects characterized by a large number of features, it is desirable to extend the HDP_{1f} model with a generalized model where additional feature types can be easily incorporated. Moreover, this extension should allow multiple feature types to be added simultaneously.

In the HDP_{flat} model, we assume that the feature variables are conditionally independent given **Z**. This assumption considerably reduces the complexity of computing *P*(**Z**|**X**). For example, if we want to incorporate into the previous model the feature type associated with the semantic frame evoked by every event mention (i.e., *FR*), the formula becomes:

$$P(Z_{i,j} \mid \mathbf{HL}, \mathbf{FR}) \propto P(Z_{i,j}) \, P(HL_{i,j} \mid \mathbf{Z}) \, P(FR_{i,j} \mid \mathbf{Z})$$

In this formula, we omit the conditioning components of **Z**, **HL**, and **FR** for the sake of clarity. The graphical representation corresponding to this model is illustrated in Figure 3(b). In general, if **X** consists of *L* feature variables, the inference formula for the Gibbs sampler is defined as:

$$P(Z_{i,j} \mid \mathbf{X}) \propto P(Z_{i,j}) \prod_{FT \in \mathbf{X}} P(FT_{i,j} \mid \mathbf{Z})$$

The graphical model for this general setting is depicted in Figure 3(c). Drawing an analogy, the graphical representation involving **Z** and feature variables resembles the graphical representation of a naive Bayes classifier.
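Because the feature variables are treated as conditionally independent given the event index, adding a feature type only adds one more factor to the product. The sketch below accumulates these factors in log space. It is an illustration under assumptions: each factor uses the usual smoothed Dirichlet-multinomial predictive, missing feature values are simply left out of `mention` (the soft counts mentioned later in this section are not modeled), and all names and numbers are hypothetical.

```python
import math


def log_posterior_flat(log_prior, mention, counts, totals, lam, vocab_sizes):
    """Unnormalized log P(Z_ij = z | X) for each existing event z."""
    scores = []
    for z, lp in enumerate(log_prior):
        s = lp
        for ft, fv in mention.items():
            n = counts[ft][z].get(fv, 0)
            # One likelihood factor per feature type, as in a naive Bayes model.
            s += math.log((n + lam[ft]) / (totals[ft][z] + vocab_sizes[ft] * lam[ft]))
        scores.append(s)
    return scores


# Toy counts for two events and two feature types (all values are assumed).
counts = {
    "hl": [{"buy": 4}, {"say": 6}],
    "fr": [{"Commerce_buy": 3}, {"Statement": 5}],
}
totals = {"hl": [4, 6], "fr": [3, 5]}
lam = {"hl": 0.1, "fr": 0.1}
vocab_sizes = {"hl": 10, "fr": 5}
log_prior = [math.log(0.5), math.log(0.5)]

scores = log_posterior_flat(
    log_prior, {"hl": "buy", "fr": "Commerce_buy"}, counts, totals, lam, vocab_sizes
)
hl_only = log_posterior_flat(log_prior, {"hl": "buy"}, counts, totals, lam, vocab_sizes)
```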

### 5.3 The HDP_{struct} Model

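The HDP_{struct} model additionally allows dependencies between the feature variables when computing *P*(**Z**|**X**). For the model depicted in Figure 3(d), for instance, the posterior probability is given by:

$$P(Z_{i,j} \mid \mathbf{X}) \propto P(Z_{i,j}) \, P(FR_{i,j} \mid HL_{i,j}, \theta) \prod_{FT \in \mathbf{X}} P(FT_{i,j} \mid \mathbf{Z})$$

In this model, *P*(*FR*_{i,j}|*HL*_{i,j}, θ) is a global distribution parameterized by θ, and *FT* is a feature type variable from the set **X** = 〈**HL**, **POS**, **FR**〉. However, one limitation of this particular model is that it requires domain knowledge in order to establish the dependencies between the feature type variables.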

For all the HDP extended models, we computed the prior and likelihood factors as described in the HDP_{1f} model. In the inference mechanism, we assigned soft counts to those likelihood factors whose corresponding feature values cannot be extracted for a given event mention (e.g., unspecified predicate arguments). It is worth noting that there exist event mentions for which not all the features can be extracted. For instance, the feature types corresponding to the left and right lemmatized head words (denoted in Section 4 as lhe and rhe, respectively) are missing for the first and last event mentions in a document. Also, many semantic roles can be absent for an event mention in a given context.

## 6. Infinite Feature Models

One of the main limitations of the HDP extensions presented in the previous section is that these models have limited capabilities in representing the observable objects characterized by a large number of feature types. This is because, in order to sample the event indices into the set of indicator random variables **Z**, the HDP models need to store in memory large matrices that encode the significant statistics for the observable components associated with each cluster. More specifically, in order to compute the likelihood factors in Equation (5), for each feature type ft_{i}, *i* = 1 … *L*, we assigned a counting matrix having the number of rows equal to the number of distinct feature values corresponding to ft_{i} and *K* + 1 columns, where *K* represents the number of inferred events. For instance, the counting matrix corresponding to the head lemma feature type (HL) stores the number of times each feature value of the HL feature type has been associated with each event index during the HDP generative process. The number *n*_{hl,z} in Equation (5), for example, is stored in a cell of this matrix.

To give an idea of how much memory the HDP models require to infer the events from OntoNotes, we made the following calculation. In OntoNotes, we automatically identified a total number of 81,938 event mentions for which we extracted 454,170 distinct feature values. For all data sets, we considered *L* = 132 feature types, which means that, on average, each feature type is associated with approximately 3,440 feature values. Because *K* is bounded by the total number of event mentions considered (i.e., the case when each event mention is associated with a different event), the maximum value that it can reach when inferring the event indices from OntoNotes is 81,938. If we consider that each cell from the counting matrices associated with each feature type is represented in memory by one byte, the total space required to store only one such matrix is, on average, 81,938 × 3,440 bytes. By a simple computation, the total amount of memory to store all 132 matrices is ∼ 34.6 gigabytes (GB, with 1 GB = 2^{30} bytes). Furthermore, by adding more data, the amount of memory needed by the HDP models increases considerably. For instance, if we consider all three data sets (with a total number of 148,402 event mentions and 832,611 distinct feature values), the memory space required increases to 115 GB. Because in our implementation we used the int type (4 bytes) to represent the counting matrices, the total amount of memory required by the HDP extensions to infer the event indices from OntoNotes and all three data sets when considering all 132 feature types is in fact 4 × 34.6 = 138.4 GB and approximately 460.3 GB (four times the unrounded 115.07 GB), respectively.
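The arithmetic behind these estimates can be checked with a few lines of Python. The sketch assumes, as the figures in the text imply, that one GB is 2^30 bytes and that the extra column for the new event (*K* + 1) is ignored; the function name is hypothetical.

```python
def counting_matrix_gb(num_mentions, num_feature_values, bytes_per_cell=1):
    """Upper bound on memory (in units of 2**30 bytes) for all counting matrices.

    Summed over the feature types, the matrices have num_feature_values rows
    in total, and the number of columns K is bounded by num_mentions.
    """
    return num_mentions * num_feature_values * bytes_per_cell / 2**30


# OntoNotes, using the rounded per-type average of 3,440 feature values.
onto = 132 * 81_938 * 3_440 / 2**30

# All three data sets, with one byte and with four bytes (int) per cell.
all_three = counting_matrix_gb(148_402, 832_611)
all_three_int = counting_matrix_gb(148_402, 832_611, bytes_per_cell=4)

print(f"{onto:.2f} GB, {all_three:.2f} GB, {all_three_int:.2f} GB")
```

The bound is the product of the number of mentions and the number of distinct feature values, so doubling both quadruples the memory requirement.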

Due to this limitation, the HDP extensions can run only on a restricted, manually selected set of feature types.^{7} A methodology that considers a much smaller subset of representative feature values from the entire feature space is therefore needed. For this purpose, we devised two novel approaches that provide a more flexible representation of the data by modeling event mentions with an infinite number of features and by using a mechanism to automatically select a finite set of the most salient features for each mention in the inference process. The first approach uses the **Markov Indian buffet process** (mIBP) to represent each object as a sparse subset of a potentially unbounded set of latent features (Griffiths and Ghahramani 2006; Ghahramani, Griffiths, and Sollich 2007; Van Gael et al. 2008), and combines it with the HDP extension presented in the previous section. We call this hybrid the **mIBP–HDP model**. The second approach uses the **infinite factorial hidden Markov model** (iFHMM), which is an extension of mIBP, and combines it with the **infinite hidden Markov model** (iHMM) to form the **iFHMM–iHMM model**.

### 6.1 The mIBP–HDP Model

In this section, we describe a model that is able to represent event mentions characterized by an unbounded set of feature values into the HDP framework. Although the feature space describing event mentions is unbounded, this approach is able to model the uncertainty in the number of feature values *M* that will be used for clustering event mentions and, at the same time, is able to guarantee that this number is finite at any point in time during the generative process. First, we use mIBP to describe a mechanism for assigning to each event mention a sparse subset of feature values from the set of *M* observed feature values used in the clustering process. We will use the set of notations introduced in this description when presenting both mIBP–HDP and iFHMM–iHMM models. Then, we will show how this mechanism is coupled into the HDP framework.

#### 6.1.1 The Markov Indian Buffet Process

The Markov Indian buffet process (Van Gael, Teh, and Ghahramani 2008) defines a distribution over an unbounded set of independent hidden Markov chains, where each chain is associated with a binary latent feature value that evolves over time according to Markov dynamics. Specifically, if we denote by *M* the total number of Markov chains associated with the latent feature values and by *T* the number of observations, mIBP defines a probability distribution over a binary matrix **F** with an unbounded number of rows *M* (*M*→ ∞) and *T* columns.

In our framework, we use mIBP to incrementally build the set of *M* observed feature values that will be used for clustering event mentions (denoted as {*f*^{1},*f*^{2}, …, *f*^{M}}), as well as to determine which of these feature values will be selected to explain each event mention. The sequence of observations is associated with the sequence of event mentions, *y*_{1},*y*_{2}, …, *y*_{T}, and each latent feature value in the mIBP framework is associated with one observed feature value from the unbounded set of features that characterize our event mentions. It is worth mentioning that, at any given time point during the mIBP generative process, from the unbounded set of observed features, we index only those *M* observed feature values that correspond to the set of hidden feature values.

The selection of the observed feature values which will represent each event mention in the clustering process is determined by the indicator random variables of the binary matrix **F**. For instance, the selection of the observed feature value *f*^{i} for the event mention *y*_{t} is indicated by an assignment of the binary random variable *F*_{t}^{i} to 1 in the mIBP generative process. More specifically, the set of observed feature values that will represent the event mention *y*_{t} is indicated in the matrix by the column vector of binary random variables 〈*F*_{t}^{1}, *F*_{t}^{2}, …, *F*_{t}^{M}〉. Therefore, **F** decomposes the event mentions and represents them as feature value factors, which can then be associated with hidden variables in an iFHMM model as described in Van Gael, Teh, and Ghahramani (2008).

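In mIBP, each Markov chain *m* evolves according to the transition matrix

$$W^{(m)} = \begin{pmatrix} 1 - a_m & a_m \\ 1 - b_m & b_m \end{pmatrix}$$

with the parameters *a*_{m} ∼ Beta(*α*′/*M*, 1) and *b*_{m} ∼ Beta(*γ*′, *δ*′), and the initial state *F*_{0}^{m} = 0. In the mIBP process, the hidden variable *F*_{t}^{m} associated with an observed feature value *f*^{m} and an event mention *y*_{t} is generated from the following Bernoulli distribution:

$$F_t^m \sim \text{Bernoulli}\left(a_m^{\,1 - F_{t-1}^m} \, b_m^{\,F_{t-1}^m}\right)$$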

Based on these definitions, we computed the probability of the feature matrix **F**^{8} (in which the parameters **a** and **b** are integrated out analytically) by recording the number of 0→0, 0→1, 1→0, and 1→1 transitions for each binary chain *m* into the counting variables *c*_{m}^{00}, *c*_{m}^{01}, *c*_{m}^{10}, and *c*_{m}^{11}, respectively. For example, the *c*_{m}^{11} associated with the feature value representing the verb class (*f*^{m} = hwc:verb) counts how many times this feature value was assigned to the event mention *y*_{t} when it was also assigned to the previous event mention *y*_{t − 1} during the generative process.

The mIBP generative process starts by sampling Poisson(*α*′) latent features for the first component. In our implementation, this statement is equivalent to the process of randomly selecting for the first event mention a number of Poisson(*α*′) observed feature values. In the general case, the sampling of the binary variable *F*_{t}^{m} from the *m*^{th} Markov chain and associated with the *t*^{th} event mention depends on the value assigned to the hidden variable in the previous *t* − 1 step. With the Beta-distributed transition parameters *a*_{m} and *b*_{m} integrated out (and *M* → ∞), the transition probabilities are expressed through the counts of 0→0, 0→1, 1→0, and 1→1 transitions of chain *m*, written here as *c*_{m}^{00}, *c*_{m}^{01}, *c*_{m}^{10}, and *c*_{m}^{11}:

$$P(F_t^m = 1 \mid F_{t-1}^m = 0) = \frac{c_m^{01}}{c_m^{00} + c_m^{01} + 1}, \qquad P(F_t^m = 1 \mid F_{t-1}^m = 1) = \frac{c_m^{11} + \gamma'}{c_m^{10} + c_m^{11} + \gamma' + \delta'} \tag{11}$$

As a result, in our implementation, the observed feature value *f*^{m} is selected for the *t*^{th} event mention according to the probabilities presented in Equation (11). For example, in order to select the feature value which indicates that the *t*^{th} event mention has the occurrence event class (i.e., *f*^{m} = ec:occurrence), we need to determine whether or not the event mention *t* − 1 from the document collection selected this feature value. In the cases when ec:occurrence was previously selected for the event mention *t* − 1 (*F*_{t−1}^{m} = 1), we select this feature value according to *P*(*F*_{t}^{m} = 1|*F*_{t−1}^{m} = 1). Otherwise, the selection is determined according to *P*(*F*_{t}^{m} = 1|*F*_{t−1}^{m} = 0). Furthermore, in the *t*^{th} step of the generative process, the same sampling mechanism is repeated until all *M* latent feature values are generated. After sampling all these feature values for the *t*^{th} event mention, an additional number of Poisson(*α*′/*t*) new feature values are assigned to this mention, and *M* gets incremented accordingly.
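The generative process described above can be sketched as a short simulation. This is a sketch under stated assumptions rather than the authors' implementation: the on/off probabilities are the Beta-Bernoulli predictives obtained by integrating out the transition parameters under their Beta priors in the limit of infinitely many chains, features are identified by integer indices instead of real feature values, and the hyperparameter values in the example call are arbitrary.

```python
import math
import random


def poisson(rng, rate):
    """Draw from Poisson(rate) with Knuth's method (adequate for small rates)."""
    limit, k, p = math.exp(-rate), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_mibp(T, alpha, gamma, delta, seed=0):
    """Return (columns, counts); columns[t-1] holds the chains m with F_t^m = 1."""
    rng = random.Random(seed)
    counts = []   # counts[m] = [c00, c01, c10, c11] for chain m
    prev = set()  # chains that were on at step t - 1
    columns = []
    for t in range(1, T + 1):
        current = set()
        for m, c in enumerate(counts):
            was_on = m in prev
            if was_on:
                p_on = (c[3] + gamma) / (c[2] + c[3] + gamma + delta)
            else:
                p_on = c[1] / (c[0] + c[1] + 1)
            on = rng.random() < p_on
            # Index 0..3 encodes the transition: 00, 01, 10, 11.
            c[2 * was_on + on] += 1
            if on:
                current.add(m)
        for _ in range(poisson(rng, alpha / t)):
            # A new chain was off for the previous t - 1 steps and turns on now.
            counts.append([t - 1, 1, 0, 0])
            current.add(len(counts) - 1)
        columns.append(current)
        prev = current
    return columns, counts


columns, counts = simulate_mibp(T=50, alpha=2.0, gamma=1.0, delta=1.0)
M = len(counts)  # number of feature values instantiated so far
```

Each chain records exactly one transition per step, so the four counts of every chain sum to the number of processed mentions, and the number of times a feature value is selected equals the sum of its 0→1 and 1→1 counts.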

As an observation regarding the mIBP generative process, it has been shown that *M* grows logarithmically with the number of observed components (in our case, event mentions) (Ghahramani, Griffiths, and Sollich 2007; Doshi-Velez 2009). This type of growth is desirable because it provides a scalable solution for our models to work in an efficient way on fairly large data sets.
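The logarithmic growth can be read off the generative process itself. Assuming, as described above, that Poisson(*α*′/*t*) new feature values are introduced at step *t* (with Poisson(*α*′) at the first step), the expected number of instantiated feature values after *T* mentions is a harmonic sum:

$$\mathbb{E}[M] = \sum_{t=1}^{T} \frac{\alpha'}{t} = \alpha' H_T \approx \alpha' \left(\ln T + 0.5772\right)$$

As a worked example, for the *T* = 81,938 system mentions of OntoNotes, $H_T \approx 11.31 + 0.58 = 11.89$, so the expected value of *M* is only about 11.89 *α*′, whatever value is chosen for *α*′.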

#### 6.1.2 Integration of mIBP into HDP

One direct application of the mIBP model is to integrate it into the framework of the HDP extension model described in the previous section. In this way, the new nonparametric extension will have the benefits of capturing the uncertainty regarding the number of mixture components that are characterized by a potentially infinite number of feature values. However, to make this hybrid work, we have to devise a mechanism in which only a finite set of relevant feature values will be selected to explain each observation (i.e., event mention) in the HDP inference process.

Specifically, denoting by *y*_{t} an event mention, *f*^{m} one of the feature values that characterizes *y*_{t}, *q*_{m} the number of times *f*^{m} was selected for all mentions during mIBP, and *v*_{t} a threshold variable for *y*_{t} such that *v*_{t} ∼ Uniform(1, max{*q*_{m} : *F*_{t}^{m} = 1}), we define the finite set of feature values *B*_{t} corresponding to the observation *y*_{t} as:

$$B_t = \{ f^m \mid F_t^m = 1 \wedge q_m \geq v_t \}$$

A pictorial representation of this idea is illustrated in Figure 4, where only the feature values *f*^{m} with the corresponding counts *q*_{m} above the threshold indicated by *v*_{t} are selected in *B*_{t}. The finiteness of this feature set is based on the observation that, at any time point during the generative process of the mIBP model, only a finite set of latent features are assigned a value of 1 for an event mention. Furthermore, based on the assumption that the more a feature value is selected during the mIBP generative process the more relevant it is for the event coreference task, each set *B*_{t} contains the most informative feature values that are able to explain its corresponding event mention *y*_{t}. This last property is ensured by the second constraint imposed when building each set *B*_{t} (i.e., *q*_{m} ≥ *v*_{t}). Because the threshold variables are sampled using a uniform distribution, we denote this model as mIBP–HDP_{uniform}.

The feature values selected by this mechanism are used to represent the event mentions in the clustering process of the HDP. The main difference from the original implementation of the HDP extensions is that, in this new model, instead of representing the event mentions by the entire set of feature values from the initial feature space (which can be arbitrarily large), only a restricted subset of these feature values is considered. Furthermore, due to the random process of selecting the feature values, the number of feature values associated with each event mention can vary significantly. We adapted the implementation of the HDP framework to this modification by truncating all counting matrices such that they represent only the feature values selected in mIBP. More specifically, we removed from each counting matrix the rows corresponding to all the feature values that were not selected during the mIBP generative process. Because *M* grows as *O*(log*T*), it now becomes feasible for the HDP extension models to represent event mentions using the entire set of feature types. It is important to mention that this modification does not affect the implementation of the Gibbs sampler in the HDP framework because we always normalize the probabilities corresponding to the likelihood factors in Equation (5) when computing the posterior distribution over event indices.

Moreover, using the assumption that the relevance of a feature value is proportional to the number of times it was selected during the mIBP generative process, we explored additional heuristics for building the sets of feature values *B*_{t} for each event mention. In general, we chose these new heuristics to be biased towards selecting more relevant feature values *f*^{m} for each event mention *y*_{t} (i.e., their counts *q*_{m} to be closer to the maximum count observed for that mention). One such heuristic is based on the method that considers for each event mention *y*_{t} all feature values *f*^{m} with the counts *q*_{m} ≥ 1 (i.e., *v*_{t} = 1). In this case, each set *B*_{t} contains all the observed feature values selected for each event mention *y*_{t} during the mIBP process, and therefore it represents a subset of the set of observed feature values {*f*^{1},*f*^{2}, …, *f*^{M}}. It is worth mentioning that all the subsets of {*f*^{1},*f*^{2}, …, *f*^{M}} are finite due to the fact that *M* is finite at any given point in time during mIBP. In consequence, all the *B*_{t} sets derived using this heuristic are finite. Because no feature value is filtered out after it was assigned to an event mention during mIBP, we denote the model implementing this heuristic as mIBP–HDP_{unfiltered}. Starting from the distribution of the counting variables *q*_{m} corresponding to those feature values *f*^{m} selected during the mIBP generative process for an event mention *y*_{t}, another heuristic considers for building each set *B*_{t} only the feature values with the counts above the median of this distribution (mIBP–HDP_{median}). Finally, the last heuristic we experimented with is based on the idea of sampling the threshold variables *v*_{t} directly from the distribution of the counting variables associated with each event mention *y*_{t} (mIBP–HDP_{discrete}). The implementation of these three heuristics is possible due to the observation that in the mIBP–HDP framework the size of each set *B*_{t} is not required to be known in advance.
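The four ways of building *B*_{t} (uniform, unfiltered, median, and discrete) differ only in how the threshold is chosen. The sketch below is one possible reading of these heuristics, with assumptions labeled in the comments: the uniform threshold is drawn as an integer, "above the median" is taken as strictly above, and the feature values and counts are toy data.

```python
import random
import statistics


def select_features(active, q, heuristic, rng):
    """Return B_t, the subset of active feature values that pass the threshold.

    active: feature values f^m selected for mention y_t during mIBP
    q: dict mapping each feature value to its global selection count q_m
    """
    counts = [q[f] for f in active]
    if not counts:
        return set()
    if heuristic == "median":
        # Assumption: keep counts strictly above the median.
        v = statistics.median(counts)
        return {f for f in active if q[f] > v}
    if heuristic == "unfiltered":
        v = 1
    elif heuristic == "uniform":
        # Assumption: integer threshold drawn uniformly from 1..max count.
        v = rng.randint(1, max(counts))
    elif heuristic == "discrete":
        # Threshold drawn from the empirical distribution of the counts.
        v = rng.choice(counts)
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    return {f for f in active if q[f] >= v}


# Toy counts (assumed) for the feature values active on one mention.
q = {"hwc:verb": 40, "ec:occurrence": 25, "hl:buy": 3, "fr:Commerce_buy": 7}
active = list(q)
rng = random.Random(0)
selected = {h: select_features(active, q, h, rng)
            for h in ("unfiltered", "uniform", "median", "discrete")}
```

With these toy counts, the unfiltered variant keeps all four feature values, the median variant keeps the two with counts above 16, and the two sampled variants always keep at least the most frequent feature value.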

### 6.2 The iFHMM–iHMM Model

Over the years, the **hidden Markov model** (HMM) (Rabiner 1989) has proven to be one of the most commonly used statistical tools for modeling time series data. Due to the efficiency in estimating its parameters, various HMM generalizations were proposed for a better representation of the latent structure encoded in this type of data. Figure 5 illustrates a hierarchy of HMM extensions whose main criterion of expansion is relaxing the constraints on the parameters *M* (the number of state chains) and *K* (the number of clustering components). In the **factorial hidden Markov model** (FHMM), Ghahramani and Jordan (1997) introduced the idea of factoring the hidden state space into a finite number of state variables, in which each of these variables has its own Markovian dynamics. Later on, Van Gael, Teh, and Ghahramani (2008) introduced the **infinite factorial hidden Markov model** (iFHMM) with the purpose of allowing the number of parallel Markov chains *M* to be learned from data. Although the iFHMM provides a more flexible representation of the latent structure, it cannot be used as a framework where the number of clustering components *K* is infinite. In this direction, Beal, Ghahramani, and Rasmussen (2002) proposed the **infinite hidden Markov model** (iHMM) in order to perform inferences with an infinite number of states *K*. To further increase the representational power for modeling discrete time series data, we introduce a novel nonparametric extension that combines the best of the iFHMM and iHMM models (denoted as iFHMM–iHMM) and lets both parameters *M* and *K* be learned from data.

As shown in Figure 5, the graphical representation of this new model consists of a sequence of hidden state variables, (*s*_{1}, …, *s*_{T}), that corresponds to the sequence of event mentions (*y*_{1}, …, *y*_{T}). Each hidden state *s*_{t} can be assigned to one of the *K* latent events, *s*_{t} ∈ {1, …, *K*}, and each mention *y*_{t} is represented by a column vector of binary random variables 〈*F*_{t}^{1}, …, *F*_{t}^{M}〉. One element of the transition probability π is defined as *π*_{ij} = *P*(*s*_{t} = *j*|*s*_{t − 1} = *i*), and a mention *y*_{t} is generated according to a likelihood model that is parameterized by a state-dependent parameter *φ*_{s_{t}}. The observation parameters φ are drawn independently and identically from a prior base distribution *H*.

#### 6.2.1 Inference

The main idea of the inference mechanism corresponding to this new model is illustrated in Figure 6. As depicted in this figure, each step in the generative process of the new hybrid model is performed in two consecutive phases. In the first phase, the binary random variables associated with each feature value from the iFHMM framework are sampled using the mIBP mechanism, and consequently, the most salient feature values are selected for each event mention (Figure 6: Phase I). Of note, the *B*_{t} sets of feature values associated with each event mention *y*_{t} are determined using the same set of heuristics as described in Section 6.1. In the second phase, the feature values sampled so far, which become observable during this phase, are used in an adapted **beam sampling algorithm** (Van Gael et al. 2008) to infer the clustering components or latent events (Figure 6: Phase II).

Because we utilized the same mechanism for determining the sets of relevant feature values for each event mention (as described in Section 6.1), in this section we focus on describing our implementation of the beam sampling algorithm. The beam sampling algorithm (Van Gael et al. 2008) combines the ideas of slice sampling (Neal 2003) and dynamic programming for an efficient sampling of state trajectories. Because in time series models the transition probabilities have independent priors (Beal, Ghahramani, and Rasmussen 2002), Van Gael et al. (2008) also used the HDP mechanism to allow couplings across transitions. For sampling the whole hidden state trajectory **s**, this algorithm uses a **forward filtering-backward sampling technique**.

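In the forward filtering step, an auxiliary variable *u*_{t} is sampled for each mention *y*_{t}, with *u*_{t} ∼ Uniform(0, *π*_{s_{t−1}s_{t}}). The auxiliary variables **u** are used to filter only those trajectories **s** for which *π*_{s_{t−1}s_{t}} ≥ *u*_{t}, for all *t*. Also, in this step, for all *t*, the probabilities *P*(*s*_{t}|*y*_{1:t}, *u*_{1:t}) are computed as follows:

$$P(s_t \mid y_{1:t}, u_{1:t}) \propto P(y_t \mid s_t) \sum_{s_{t-1} \,:\, u_t < \pi_{s_{t-1} s_t}} P(s_{t-1} \mid y_{1:t-1}, u_{1:t-1})$$

In this formula, the dependencies involving parameters π and φ are omitted for clarity.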
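A single sweep of forward filtering-backward sampling with slice variables can be sketched as follows. This is a simplified illustration, not the adapted algorithm used in the paper: it assumes a fixed number of states *K* and a fully specified transition matrix (in the infinite model the slice variables are what keep the set of considered states finite), it fixes the initial state to index 0, and the transition and emission tables are toy values.

```python
import random


def beam_sweep(prev_states, pi, emission, rng):
    """Resample the trajectory s_1..s_T given the previous trajectory.

    pi[i][j] = P(s_t = j | s_{t-1} = i); emission[t][k] = P(y_t | s_t = k).
    State 0 is used as a fixed initial state s_0 (an assumption of this sketch).
    """
    T, K = len(prev_states), len(pi)
    # Slice variables: u_t ~ Uniform(0, pi[s_{t-1}][s_t]) on the current trajectory.
    u, last = [], 0
    for s in prev_states:
        u.append(rng.uniform(0.0, pi[last][s]))
        last = s
    # Forward filtering: only transitions with pi > u_t contribute to the sum.
    filt, before = [], [1.0] + [0.0] * (K - 1)
    for t in range(T):
        row = [emission[t][k] * sum(before[i] for i in range(K) if pi[i][k] > u[t])
               for k in range(K)]
        norm = sum(row)
        before = [x / norm for x in row]
        filt.append(before)
    # Backward sampling: draw s_T, then each s_t given the sampled s_{t+1}.
    states = [rng.choices(range(K), weights=filt[-1])[0]]
    for t in range(T - 2, -1, -1):
        nxt = states[-1]
        w = [filt[t][i] if pi[i][nxt] > u[t + 1] else 0.0 for i in range(K)]
        states.append(rng.choices(range(K), weights=w)[0])
    states.reverse()
    return states, u, filt


# Toy model with two latent events and four mentions (all values assumed).
pi = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
states, u, filt = beam_sweep([0, 0, 1, 1], pi, emission, random.Random(1))
```

Every transition of the resampled trajectory has probability above its slice variable, which is the filtering property the auxiliary variables are meant to enforce.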

We consider the base distribution *H* to be conjugate with the data distribution in a Dirichlet-multinomial model with the multinomial parameters (*o*_{1}, …, *o*_{K}) defined as:

$$o_k = \sum_{t=1}^{T} \sum_{f^m \in B_t} n_{mk}$$

where *n*_{mk} counts how many times the feature value *f*^{m} was assigned in the generative process to event *k*, and *B*_{t} stores a finite set of feature values for *y*_{t} as defined in Section 6.1. As can be noticed, the multinomial parameters defined here are finite due to the fact that each set of feature values *B*_{t} is finite and the number of event mentions *T* is fixed. This allows us to define a proper emission distribution for the new hybrid model. As with the mIBP–HDP model, we name each variant of the iFHMM–iHMM model after the heuristic used for selecting the feature values.

## 7. Evaluation

In this section, we present the evaluation framework of the Bayesian models for both within-document (wd) and cross-document (cd) coreference resolution. We start by briefly describing the experimental set-up and coreference evaluation measures, and then continue by showing the experimental results on the ACE, OntoNotes, and EventCorefBank data sets. Finally, we conclude with an analysis of the most common errors made by the Bayesian models.

### 7.1 The Experimental Set-up

In the data processing phase, we extracted the linguistic features described in Section 4 for each event mention annotated in the three data sets. As a result of this phase, in the ACE corpus, we identified 6,553 event mentions grouped into 4,946 events, and in the OntoNotes corpus, we identified 11,433 event mentions grouped into 3,393 events. Likewise, in the new ECB corpus, we distinguished 1,744 event mentions, 1,302 within-document events, 339 cross-document events, and 43 seminal events (or topics). Table 1 lists additional statistics extracted from these three data sets after performing this phase.

| Statistic | ACE | OntoNotes | ECB |
|---|---|---|---|
| Number of true mentions | 6,553 | 11,433 | 1,744 |
| Number of system mentions | 45,289 | 81,938 | 21,175 |
| Number of within-document events | 4,946 | 3,393 | 1,302 |
| Number of cross-document events | – | – | 339 |
| Number of documents | 745 | 1,540 | 482 |
| Number of seminal events | – | – | 43 |
| Average number of true mentions/within-document event | 1.32 | 3.37 | 1.34 |
| Average number of true mentions/document | 8.79 | 7.42 | 3.62 |
| Average number of true mentions/seminal event | – | – | 40.55 |
| Average number of system mentions/document | 60.79 | 53.2 | 43.93 |
| Average number of within-document events/document | 6.63 | 2.20 | 2.70 |
| Average number of within-document events/seminal event | – | – | 30.27 |
| Average number of cross-document events/seminal event | – | – | 7.88 |
| Average number of documents/seminal event | – | – | 11.20 |
| Number of distinct feature values for system mentions | 391,798 | 454,170 | 237,197 |


Processing OntoNotes required additional effort. Although OntoNotes provides coreference annotations for both entity and event mentions, these annotations do not specify which mentions refer to entities and which refer to events. Therefore, to identify only the event mentions in OntoNotes, we first ran our event identifier (Bejan 2007) and then marked as event mentions only those annotated mentions that overlap with the mentions extracted by the event identifier. Using this procedure, we marked 4,940 of the 67,500 mentions annotated in OntoNotes as event mentions. In a second processing step, we extended the number of event mentions to 11,433 by marking all the mentions that share a cluster with at least one of the 4,940 previously identified event mentions. Of the 6,493 event mentions marked in this second step, the majority correspond to nouns (4,707) and to the pronoun *it* (767).
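The two-step marking procedure can be sketched as follows. This is only an illustration under simplifying assumptions: mentions are represented as token spans within a single document, and the names `overlaps` and `mark_event_mentions` are hypothetical rather than part of the event identifier of Bejan (2007).

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed representation)


def overlaps(a: Span, b: Span) -> bool:
    """Return True when the two spans share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def mark_event_mentions(clusters: Dict[str, List[Span]],
                        detected: List[Span]) -> Set[Span]:
    """Mark annotated mentions as event mentions in two steps."""
    # Step 1: keep annotated mentions that overlap a detected event span.
    seeds = {m for mentions in clusters.values() for m in mentions
             if any(overlaps(m, d) for d in detected)}
    # Step 2: add every mention that shares a cluster with a seed mention.
    marked = set(seeds)
    for mentions in clusters.values():
        if any(m in seeds for m in mentions):
            marked.update(mentions)
    return marked


# Toy input: cluster c1 contains one detected mention, cluster c2 contains none.
toy_clusters = {"c1": [(0, 1), (5, 6)], "c2": [(10, 12)]}
toy_marked = mark_event_mentions(toy_clusters, detected=[(0, 1)])
print(sorted(toy_marked))
```

In the toy input, the span (5, 6) is marked only in the second step because it shares a cluster with the detected span (0, 1), which mirrors how nouns and pronouns missed by the identifier are recovered.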

Although only a small subset of event mentions was manually annotated with event coreference information in the three data sets (also called the set of **true** or **gold event mentions**), during the generative process, we considered all possible event mentions that are expressed in the data sets for every specific event. We believe this is a more realistic approach, even though we evaluated only the manually annotated events. For this purpose, we ran the event identifier described in Bejan (2007) on the ACE, OntoNotes, and ECB corpora, and extracted 45,289, 81,938, and 21,175 event mentions, respectively. The set of event mentions obtained from running the event identifier (also called the set of **system event mentions**) on ACE and ECB includes more than 98% of the true event mentions. In terms of feature space dimensionality over the two data sets, we performed experiments with a set of 132 feature types, where each feature type consists, on average, of 6,300 distinct feature values.

In the evaluation phase, we considered only the true mentions from the ACE test data set and from the test sets of a five-fold cross validation scheme on the OntoNotes and ECB data sets. For evaluating the cross-document coreference annotations from EventCorefBank, we adopted the same approach as described in Bagga and Baldwin (1999) by merging all the documents from the same topic into a meta-document and then scoring this document as performed for within-document evaluation. To compute the final results of our experiments, we averaged the results over five runs of the generative models.
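As a rough sketch of the meta-document construction used for cross-document scoring, the snippet below merges the mentions of all documents in a topic and groups them by cross-document event. The tuple layout `(doc_id, mention_id, event_id)` and the function name `build_meta_documents` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Mention = Tuple[str, str]  # (doc_id, mention_id), assumed identifier scheme


def build_meta_documents(mentions: List[Tuple[str, str, str]],
                         doc_topic: Dict[str, str]) -> Dict[str, List[Set[Mention]]]:
    """Merge the documents of each topic into one meta-document.

    Each meta-document is returned as a list of coreference chains, where a
    chain may now contain mentions that come from different documents.
    """
    meta: Dict[str, Dict[str, Set[Mention]]] = defaultdict(lambda: defaultdict(set))
    for doc_id, mention_id, event_id in mentions:
        meta[doc_topic[doc_id]][event_id].add((doc_id, mention_id))
    return {topic: list(chains.values()) for topic, chains in meta.items()}


toy_mentions = [("d1", "m1", "e1"), ("d2", "m1", "e1"), ("d2", "m2", "e2")]
toy_meta = build_meta_documents(toy_mentions, {"d1": "t1", "d2": "t1"})
print(toy_meta)
```

Once the chains of a topic are merged in this way, any within-document scorer can be applied to the meta-document without modification.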

### 7.2 Coreference Resolution Metrics

Because there is no agreement on the best coreference resolution metric, we used four metrics for our evaluation: the **link**-based MUC metric (Vilain et al. 1995), the **mention**-based B^{3} metric (Bagga and Baldwin 1998), the **entity**-based CEAF metric (Luo 2005), and the pairwise (PW) metric. These metrics report results in terms of recall (R), precision (P), and F-score (F) by comparing the true set of coreference chains *T* (i.e., the manually annotated coreference chains) against the set of chains *S* predicted by a coreference resolution system. Here, a **coreference link** represents a pair of coreferential mentions whereas a **coreference chain** represents all the event mentions from the same cluster with coreference links between consecutive mentions.

The MUC recall computes the number of coreference links common to the true chains *T* and the system chains *S*, divided by the number of links in *T*, and the MUC precision computes the number of common links in *T* and *S* divided by the number of links in *S*. As was previously noted (Luo et al. 2004; Denis and Baldridge 2008; Finkel and Manning 2008), this metric favors systems that group mentions into a smaller number of clusters (or, in other words, systems that predict large coreference chains) and does not take into account single mention clusters. For instance, a system that groups all entity mentions into the same cluster achieves a MUC score that surpasses any published results of known systems developed for the task of entity coreference resolution.
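The following sketch illustrates this behavior. It follows the usual partition-based formulation of Vilain et al. (1995), in which the number of common links for a chain equals its size minus the number of pieces it is split into by the other partition. The function names are illustrative, and the code is not the scorer used in our experiments.

```python
from typing import List, Set, Tuple


def muc_recall(key: List[Set[str]], response: List[Set[str]]) -> float:
    """Link-based recall of `response` measured against `key`."""
    chain_of = {m: i for i, chain in enumerate(response) for m in chain}
    numerator = denominator = 0
    for chain in key:
        # Response chains touched by this key chain; mentions absent from
        # the response each count as their own singleton piece.
        linked = {chain_of[m] for m in chain if m in chain_of}
        missing = sum(1 for m in chain if m not in chain_of)
        numerator += len(chain) - (len(linked) + missing)
        denominator += len(chain) - 1
    return numerator / denominator if denominator else 0.0


def muc(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (recall, precision, F-score); precision swaps the two roles."""
    r = muc_recall(key, response)
    p = muc_recall(response, key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f


true_chains = [{"a", "b"}, {"c", "d"}]
one_big_chain = [{"a", "b", "c", "d"}]
muc_r, muc_p, muc_f = muc(true_chains, one_big_chain)
print(muc_r, muc_p, muc_f)
```

On this toy input, merging everything into a single chain still yields perfect recall and a precision of 2/3, which shows how weakly the link-based count penalizes overly large chains. Singleton chains contribute zero links to both numerator and denominator, so they have no effect on the score.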

The B^{3} metric was designed to overcome some of the MUC metric's shortcomings. This metric computes the recall and precision for each mention and then estimates the overall score by averaging over all mention scores. For a given mention *m*, the scorer compares the true coreference chain that contains the mention *m* (*T*_{m}) against the system chain that contains the same mention *m* (*S*_{m}). Thus, the recall for *m* is the ratio of the number of common elements in *S*_{m} and *T*_{m} over the number of elements in *T*_{m}. Similarly, the precision corresponding to the mention *m* is the ratio of the number of common elements in *S*_{m} and *T*_{m} over the number of elements in *S*_{m}. Because this metric computes the precision and recall for each mention, it will penalize in precision the systems that predict a small number of clusters. For the same reason, this metric includes single mention clusters in the evaluation.
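A minimal sketch of the per-mention computation is given below. It assumes that the true and system partitions cover the same set of mentions, as is the case when only true mentions are scored; the function name `b_cubed` is illustrative.

```python
from typing import List, Set, Tuple


def b_cubed(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float]:
    """Return (recall, precision) averaged over all mentions."""
    key_of = {m: chain for chain in key for m in chain}        # T_m for each m
    resp_of = {m: chain for chain in response for m in chain}  # S_m for each m
    mentions = key_of.keys() & resp_of.keys()
    if not mentions:
        return 0.0, 0.0
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions)
    return recall / len(mentions), precision / len(mentions)


true_chains = [{"a", "b"}, {"c", "d"}]
one_big_chain = [{"a", "b", "c", "d"}]
b3_r, b3_p = b_cubed(true_chains, one_big_chain)
print(b3_r, b3_p)
```

For the same single-chain response that keeps a precision of 2/3 under the link-based metric, the per-mention precision drops to 0.5, because every mention is charged for the two unrelated mentions in its system chain.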

The Constrained Entity-Alignment F-Measure (CEAF) scorer finds the best alignment between the set of true coreference chains *T* and the set of predicted coreference chains *S*. This is equivalent to finding the best mapping in a weighted bipartite graph. We computed the weight of a pair of coreference chains (*T*_{i}, *S*_{j}), with *T*_{i} ∈ *T* and *S*_{j} ∈ *S*, by using the *φ*_{4} similarity measure described in Luo et al. (2004). Therefore, the CEAF recall and precision measures are computed as the overall similarity score of the best alignment divided by the self-similarity score of the coreference links in *T* and *S*, respectively.
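The alignment step can be illustrated with the sketch below, which uses the *φ*_{4} similarity 2|*T*_{i} ∩ *S*_{j}| / (|*T*_{i}| + |*S*_{j}|). To stay self-contained, it searches all one-to-one alignments by brute force, which is only feasible for a handful of chains; a practical scorer would use an assignment algorithm such as Kuhn-Munkres instead. The function names are illustrative.

```python
from itertools import permutations
from typing import List, Set, Tuple


def phi4(t: Set[str], s: Set[str]) -> float:
    """Similarity between one true chain and one system chain."""
    return 2 * len(t & s) / (len(t) + len(s))


def ceaf_phi4(key: List[Set[str]], response: List[Set[str]]) -> Tuple[float, float]:
    """Return (recall, precision) for the best one-to-one chain alignment."""
    if not key or not response:
        return 0.0, 0.0
    # Pad the shorter side with empty chains so that every one-to-one
    # alignment corresponds to a permutation of indices.
    n = max(len(key), len(response))
    k = list(key) + [set()] * (n - len(key))
    r = list(response) + [set()] * (n - len(response))
    best = max(
        sum(phi4(k[i], r[j]) for i, j in enumerate(perm) if k[i] and r[j])
        for perm in permutations(range(n))
    )
    # Under phi4 each chain has self-similarity 1, so the normalizers
    # reduce to the number of chains on each side.
    return best / len(key), best / len(response)


true_chains = [{"a", "b"}, {"c", "d"}]
one_big_chain = [{"a", "b", "c", "d"}]
ceaf_r, ceaf_p = ceaf_phi4(true_chains, one_big_chain)
print(ceaf_r, ceaf_p)
```

Because each true chain can be aligned with at most one system chain, the single large chain is credited for only one of the two true chains, so recall falls to 1/3 on this toy input.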

The last coreference metric that we considered, the PW metric, matches every pair of mentions (*m*_{i}, *m*_{j}) linked by a true chain against the system chains, and every pair linked by a system chain against the true chains, checking in each case whether the other set of chains also links *m*_{i} and *m*_{j}. As can be noticed, this metric overpenalizes those systems that predict too many or too few clusters when compared with the number of true clusters.
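The pairwise computation can be sketched as follows; the helper names `linked_pairs` and `pairwise_prf` are illustrative.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def linked_pairs(chains: List[Set[str]]) -> Set[FrozenSet[str]]:
    """All unordered mention pairs that share a chain."""
    return {frozenset(pair) for chain in chains
            for pair in combinations(sorted(chain), 2)}


def pairwise_prf(key: List[Set[str]],
                 response: List[Set[str]]) -> Tuple[float, float, float]:
    """Return (recall, precision, F-score) over coreferent mention pairs."""
    key_pairs, resp_pairs = linked_pairs(key), linked_pairs(response)
    common = len(key_pairs & resp_pairs)
    r = common / len(key_pairs) if key_pairs else 0.0
    p = common / len(resp_pairs) if resp_pairs else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f


true_chains = [{"a", "b"}, {"c", "d"}]
one_big_chain = [{"a", "b", "c", "d"}]
pw_r, pw_p, pw_f = pairwise_prf(true_chains, one_big_chain)
print(pw_r, pw_p, pw_f)
```

The number of pairs grows quadratically with chain size, so the single chain of four mentions produces six pairs, only two of which are correct. This quadratic growth is the source of the overpenalization noted above.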

### 7.3 Experimental Results

Tables 2, 3, 4, and 5 list the results obtained by our proposed baselines (rows 1–2), by the HDP models (rows 3–8), by the mIBP–HDP model (row 9), and by the iFHMM–iHMM model (rows 10–13). We discuss the performance achieved by these models in the remaining part of this section.

#### 7.3.1 Baseline Results

A simple baseline for event coreference, which was proposed by Ahn (2006), consists of grouping event mentions by their event classes (BL_{eclass}). To compute this baseline, we grouped mentions into clusters according to their corresponding ec feature value. Consequently, this baseline categorizes events into a small number of clusters, because the event identifier for extracting the ec features is trained to predict the seven event classes annotated in TimeBank. A second baseline that we implemented groups two event mentions if there is a (transitive) synonymous relation between their corresponding head lemmas (BL_{syn}). To implement this baseline, we used the clusters built over the WordNet synonymous relations as described in Section 4. Similarly to the MUC results reported for entity coreference resolution, the baselines that group event mentions into very few clusters are overestimated by the MUC metric (e.g., the MUC F-scores of BL_{eclass} in Table 5).
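Both baselines amount to simple grouping operations, sketched below. The transitive synonym relation is computed with a union-find structure over head lemmas. The synonym pairs and class labels in the example are invented for illustration and are not taken from WordNet or TimeBank, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def group_by_event_class(mentions: List[Tuple[str, str]]) -> List[Set[str]]:
    """Class-based baseline: one cluster per event class value."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for mention_id, event_class in mentions:
        clusters[event_class].add(mention_id)
    return list(clusters.values())


def group_by_synonymy(mentions: List[Tuple[str, str]],
                      synonym_pairs: List[Tuple[str, str]]) -> List[Set[str]]:
    """Synonym baseline: cluster mentions whose head lemmas are transitively synonymous."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in synonym_pairs:
        parent[find(a)] = find(b)  # union the two synonym sets

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for mention_id, lemma in mentions:
        clusters[find(lemma)].add(mention_id)
    return list(clusters.values())


# Hypothetical head lemmas and synonym pairs, for illustration only.
toy_lemmas = [("m1", "buy"), ("m2", "purchase"), ("m3", "acquire"), ("m4", "say")]
toy_pairs = [("buy", "purchase"), ("purchase", "acquire")]
syn_clusters = group_by_synonymy(toy_lemmas, toy_pairs)
class_clusters = group_by_event_class(
    [("m1", "OCCURRENCE"), ("m2", "OCCURRENCE"), ("m4", "REPORTING")])
print(syn_clusters, class_clusters)
```

In the toy input, *buy* and *acquire* end up in the same cluster only through *purchase*, which is the transitive step that distinguishes this baseline from a direct synonym lookup.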

#### 7.3.2 HDP Results

Due to memory limitations, we evaluated the HDP models on a restricted set of manually selected feature types. For the HDP_{1f} model, which plays the role of baseline for the HDP_{flat} and HDP_{struct} models, we considered hl as the most representative feature type for performing the clustering of event mentions. In this configuration, the HDP_{1f} model outperforms the BL_{eclass} and BL_{syn} baselines. For the HDP_{flat} models (rows 4–7 in Tables 2–5), we classified the experiments according to the set of manually selected feature types. We found that the best configuration of features for this model consists of a combination of feature types from all the categories of features described in Section 4 (row 7 in Tables 2–5). For the experiments of the HDP_{struct} model, we considered the set of features of the best HDP_{flat} experiment as well as the conditional dependencies between the hl, fr, and fea feature types.

In general, the HDP_{flat} model achieved the best performance results on the ACE test data set (the results in Table 2), whereas the HDP_{struct} model, which also accounts for dependencies between feature types, proved to be more effective on the ECB data set for both within- and cross-document event coreference evaluation (as shown in Tables 4 and 5). On the OntoNotes data set, as listed in Table 3, HDP_{flat} shows better results than HDP_{struct} when considering the B^{3} and PW metrics, whereas HDP_{struct} outperforms HDP_{flat} when considering the MUC and CEAF metrics. Moreover, the results of the HDP_{flat} and HDP_{struct} models show an F-score increase of 4–10 percentage points over the HDP_{1f} model, and therefore indicate that the HDP extensions provide a more flexible representation for clustering objects characterized by rich properties than the original HDP model.

We also plot the evolution of the generative process associated with an HDP model. For instance, Figure 7 shows that the HDP_{flat} model corresponding to the experiment from row 7 in Table 2 converges in 350 iteration steps to a posterior distribution over event mentions from the ACE corpus with around 2,000 latent events.

#### 7.3.3 mIBP–HDP Results

In spite of its advantage of working with a potentially infinite number of features in an HDP framework, the mIBP–HDP model (row 9 in Tables 2, 4, and 5) did not achieve a satisfactory performance in comparison with the other proposed models. However, the results were obtained by automatically selecting only 2% of distinct feature values from the entire set of values extracted from both corpora. When compared with the restricted set of features considered by the HDP_{flat} and HDP_{struct} models, the percentage of values selected by the mIBP–HDP model is only 6%.

#### 7.3.4 iFHMM–iHMM Results

The results achieved by the iFHMM–iHMM model using automatic selection of feature values remain competitive against the results of the HDP models, where the feature types were manually tuned. When comparing the strategies for filtering feature values in the iFHMM–iHMM framework, we could not find a distinct separation between the results obtained by the iFHMM–iHMM_{unfiltered}, iFHMM–iHMM_{discrete}, iFHMM–iHMM_{median}, and iFHMM–iHMM_{uniform} models. As observed from Tables 2, 4, and 5, most of the iFHMM–iHMM results fall in between the HDP_{flat} and HDP_{struct} results. Moreover, the results listed in these tables indicate that the iFHMM–iHMM model is a better framework than the HDP framework for capturing the event mention dependencies simulated by the mIBP feature sampling scheme.

Figure 8 presents a study of how the parameter *α*′, which controls the number of feature values selected in the iFHMM–iHMM framework, affects the performance results. The results plotted in this figure show a small variation in performance for different values of *α*′, indicating that the iFHMM–iHMM model is able to successfully handle the feature values that introduce additional noise in the data. Figure 8 also shows that the iFHMM–iHMM model achieves the best results on the ACE data set for a relatively small value of *α*′ (*α*′ = 10), which corresponds to 0.05% of feature values sampled from the total number of feature values considered. However, because the number of event mentions in the ECB corpus is smaller than the number of mentions in the ACE corpus, the iFHMM–iHMM model utilizes a larger number of feature values (0.91% of feature values selected for *α*′ = 150) extracted from the new corpus in order to obtain most of its best results.

The experiments depicted in Figure 8 were performed by using the *unfiltered* heuristic for selecting feature values in the iFHMM–iHMM model. Similar results were obtained when considering the rest of the heuristics integrated in the iFHMM–iHMM framework. Also, the iFHMM–iHMM experiments using the *unfiltered*, *discrete*, *median*, and *uniform* heuristics from Tables 2–5 were performed by setting *α*′ to 10, 100, 100, and 50, respectively. For the parameters *γ*′ and *δ*′, we considered a default value of 0.5.

To gain a deeper insight into the behavior of the iFHMM–iHMM model, we show in Figure 9 the performance results obtained by this model for different sets of feature types. For this purpose, we ran the iFHMM–iHMM_{uniform} model with a fixed value of the *α*′ parameter (*α*′ = 50) on increasing fractions of feature types.^{9} The results confirm that the sampling scheme of the feature values used in the iFHMM–iHMM framework does not guarantee the selection of the most salient features. However, the constant trend in the performance values shown in Figure 9 indicates that iFHMM–iHMM is a robust generative model for handling noisy and redundant features. For instance, noisy features for our problem can be generated from errors in semantic parsing, event class extraction, POS tagging, and disambiguation of polysemous semantic frames. To strengthen this statement, we also compare in Table 6 the results obtained by an iFHMM–iHMM model that considers all the feature values associated with an observable object (iFHMM–iHMM_{all}) against the iFHMM–iHMM models that use the mIBP sampling scheme and the *unfiltered*, *discrete*, *median*, and *uniform* heuristics. Because of memory limitations, we performed the experiments listed in Table 6 by selecting only a subset of feature types from the ones that proved to be salient in the HDP experiments. As listed in Table 6, all the iFHMM–iHMM models that used a heuristic approach for selecting feature values significantly outperform the iFHMM–iHMM_{all} model; therefore, all the feature selection approaches considered in the iFHMM–iHMM framework are able to successfully filter out a significant number of noisy and redundant feature values.

### 7.4 Error Analysis

We performed an error analysis by manually inspecting both system and gold-annotated data in order to track the most common errors made by our models. One frequent error occurs when a more complex form of semantic inference is needed to find a correspondence between two event mentions of the same individuated event. For instance, because all properties and participants of *em*_{3}(*acquisition*) are omitted in Example (1), and no common features exist between *em*_{2}(*buy*) and *em*_{3}(*acquisition*) to indicate a similarity between these mentions, they will most probably be assigned to different clusters. This example also suggests the need for a better modeling of the discourse salience for event mentions.

Another common error is made when matching the semantic roles corresponding to coreferential event mentions. Although we simulated entity coreference by using various semantic features, the task of matching participants and properties associated with coreferential event mentions is not completely solved. This is because, in many coreferential cases, partonomic relations between semantic roles need to be inferred.^{10} Examples of such relations extracted from ECB are , , . Similarly for event properties, many coreferential examples do not specify a clear location and time interval (e.g., , ). In future work, we plan to build relevant clusters using partonomies and taxonomies such as the WordNet hierarchies built from meronymy/holonymy and hypernymy/hyponymy relations, respectively.^{11}

## 8. Conclusion

We have described a new class of unsupervised, nonparametric Bayesian models designed for the purpose of solving the problem of event coreference resolution. Specifically, we have shown how already existing models can be extended in order to relax some of their limitations and how to better represent the event mentions from a particular document collection. In this regard, we have focused on devising models for which the number of clusters and the number of feature values corresponding to event mentions can be automatically inferred from data.

Our experimental results for solving the problem of event coreference showed that these models are able to successfully handle such types of requirements on a real data application. Based on these results, we also demonstrated that the new HDP extension, which is able to model observable objects characterized by multiple properties, is a better fit for this type of problem than the original HDP model. Moreover, we believe that the HDP extension can be used for solving clustering problems that involve a small number of feature types and a priori known facts about the salience of these feature types. On the other hand, when no such prior information is known with respect to the number of feature types, or the total number of features is relatively large, we believe that the iFHMM–iHMM model is a more suitable choice. The main reason is that the new hybrid model is able to perform an automatic selection of feature values. As shown in our experiments, this model was capable of achieving competitive results even when only 2% of feature values were selected from the entire set of features encoded in the ACE, OntoNotes, and ECB data sets.

## Acknowledgments

The authors would like to thank the anonymous reviewers, whose insightful comments and suggestions considerably improved the quality of this article.

## Notes

The ECB corpus is available at http://www.hlt.utdallas.edu/~ady/data/ECB1.0.tar.gz.

A seminal event in a document is the event that triggers the topic of the document and has interconnections with the majority of events from its surrounding textual context. Furthermore, the set of documents describing the same seminal event defines a **topic**. A more detailed description of seminal events can be found in topic detection and tracking literature (Allan 2002).

Tango is a tool designed for annotating relations between the event mentions encoded in the TimeML format and is available at http://timeml.org/site/tango/tool.html.

**Z**^{−i,j} denotes **Z** − {*Z*_{i,j}}.

Because, in general, most of the counts corresponding to each feature value are assigned to a single cluster, a partial solution for this problem would be an efficient way of managing the sparsity in the counting matrices. However, the main issue of representing the entire set of features into the HDP models remains unaddressed.

Technical details for computing this probability are described in Van Gael, Teh, and Ghahramani (2008).

The selection of features into the increasing fractions of feature types was randomly performed. The fraction corresponding to the 100% experiment in Figure 9 contains all 132 feature types.

This observation was also reported in Hasler and Orasan (2009).

This task is not trivial, because if applying the transitive closure on these relations, all words will end up being part of the same cluster with *entity* for instance.

## References

*Essays on Actions and Events*. 2001, Oxford: Clarendon Press

*Reply to Quine on Events*.

*Events*. 1996, Aldershot, Dartmouth, pages 107–116

## Author notes

Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 400 Eskind Biomedical Library, 2209 Garland Avenue, Nashville, TN 37232, USA. E-mail: adi.bejan@vanderbilt.edu.

Human Language Technology Research Institute, Department of Computer Science, University of Texas at Dallas, 800 West Campbell Road, Richardson, TX 75080, USA. E-mail: sanda@hlt.utdallas.edu.