Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora

Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multidocument applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns about their generalizability—a must-have for downstream applications, where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To investigate this assumption, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus, and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Although inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora, whereas the neural system is hit-or-miss. Via model introspection, we find that the importance of event actions, event time, and so forth for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future—the most important being that evaluation on multiple CDCR corpora is essential. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public. 1


Introduction
To move beyond interpreting documents in isolation in multidocument NLP tasks such as multidocument summarization or question answering, a text understanding technique is needed to connect statements from different documents. A strong contender for this purpose is cross-document event coreference resolution (CDCR). In this task, systems need to (1) find mentions of events in a collection of documents and (2) cluster those mentions together that refer to the same event (see Figure 1). An event refers to an action taking place at a certain time and location with certain participants (Cybulska and Vossen 2014b). CDCR requires deep text understanding and depends on a multitude of other NLP tasks such as semantic role labeling (SRL), temporal inference, and spatial inference, each of which is still being researched and not yet solved. Furthermore, CDCR systems need to correctly predict the coreference relation between any pair of event mentions in a corpus. Because the number of pairs grows quadratically with the number of mentions, achieving scalable text understanding becomes an added challenge in CDCR.
In recent years, new CDCR corpora such as the Gun Violence Corpus (GVC) and the Football Coreference Corpus (FCC) (Bugert et al. 2020) have been developed, and the state-of-the-art performance on the most commonly used corpus, ECB+ (Cybulska and Vossen 2014b), has risen steadily (Kenyon-Dean, Cheung, and Precup 2018; Barhom et al. 2019; Meged et al. 2020). We believe that CDCR can play a vital role for downstream multidocument tasks, as do other researchers in this area (Bejan and Harabagiu 2014; Yang, Cardie, and Frazier 2015; Upadhyay et al. 2016; Choubey and Huang 2017; Choubey, Raju, and Huang 2018; Choubey and Huang 2018; Kenyon-Dean, Cheung, and Precup 2018; Barhom et al. 2019). Yet, despite the progress made so far, we are not aware of a study that demonstrates that using a recent CDCR system is indeed helpful downstream. We make the key observation that all existing CDCR systems (Kenyon-Dean, Cheung, and Precup 2018; Mirza, Darari, and Mahendra 2018; Barhom et al. 2019; Cremisini and Finlayson 2020; Meged et al. 2020) were designed, trained, and evaluated on a single corpus respectively. This points to a risk of systems overspecializing on their target corpus instead of learning to solve the overall task, rendering such systems unsuitable for downstream applications where generality and robustness are required. The fact that CDCR annotation efforts annotated only a subset of all coreference links to save costs (Bugert et al. 2020) further aggravates this situation.

Figure 1
Cross-document event coreference resolution (CDCR) example with excerpts of three documents from our token-level reannotation of the Football Coreference Corpus (FCC-T). The seven indicated event mentions refer to four different events. For the "victory" event mention, three participant mentions and one temporal mention are additionally marked.
We are, to the best of our knowledge, the first to investigate this risk. In this work, we determine the state of generalizability in CDCR with respect to corpora and systems, identify the current issues, and formulate recommendations on how CDCR systems that are robustly applicable in downstream scenarios can be achieved in the future. We divide our analysis into five successive stages:

1. Cross-dataset modeling of CDCR is made difficult by annotation differences between the ECB+, FCC, and GVC corpora. We establish compatibility by annotating the FCC-T, an extension of the FCC reannotated on the token level.

2. Analyzing generalizability across corpora is best performed with an interpretable CDCR system that is equally applicable to all corpora. To fulfill this requirement, we develop a conceptually simple mention-pair CDCR system that uses the union of features found in related work.

3. To compare the generalization capabilities of CDCR system architectures, we train and test this system and a close to state-of-the-art neural system (Barhom et al. 2019) on the ECB+, FCC-T, and GVC corpora. We find that the neural system does not robustly handle CDCR on all corpora because its input features and architecture require ECB+-like corpora.

4. There is a lack of knowledge on how the CDCR task manifests itself in each corpus, especially with regard to which pieces of information (out of event action, participants, time, and location) are the strongest signals for event coreference. Via model introspection, we observe significant differences between corpora, finding that decisions in ECB+ are strongly driven by event actions whereas FCC-T and GVC are more balanced and additionally require text understanding of event participants and time.

5. Finally, we evaluate our feature-based system in a cross-dataset transfer scenario to analyze the generalization capabilities of trained CDCR models. We find that models trained on a single corpus do not perform well on other unseen corpora.
Based on these findings, we conclude with recommendations for the evaluation of CDCR, which will pave the way for more general and comparable systems in the future. Most importantly, the results of our analysis unmistakably show that evaluation on multiple corpora is imperative given the current set of available CDCR corpora.
Article Structure. The next section provides background information on the CDCR task, corpora, and systems, followed by related work on feature importance in CDCR (Section 3). Section 4 covers the re-annotation and extension of the FCC corpus. We explain the feature-based CDCR system in Section 5, before moving on to a series of experiments: We compare this system and the neural system of Barhom et al. (2019) in Section 6. In Section 7 we analyze the signals for event coreference in each corpus. Lastly, we test model generalizability across corpora in Section 8. We discuss the impact of these experiments and offer summarized recommendations on how to achieve general CDCR systems in the future in Sections 9 and 10. We conclude with Section 11.

Background on CDCR
We explain the CDCR task in greater detail, report on the most influential CDCR datasets, and cover notable coreference resolution systems developed for each corpus.

Task Definition
The CDCR task is studied for several domains, including news events in (online) news articles, events pertaining to the treatment of patients in physicians' notes (Raghavan et al. 2014; Wright-Bettner et al. 2019), and the identification and grouping of biomedical events in research literature (Van Landeghem et al. 2013). In this work, we restrict ourselves to the most explored variant of CDCR in the news domain. We follow the task definition and terminology of Cybulska and Vossen (2014b). Here, events consist of four event components: an action, several human or nonhuman participants, a time, and a location. Each of these components can be mentioned in text, that is, an action mention would be the text span referencing the action of an event instance. An example is shown in Figure 1, where the rightmost document references a football match between England and Sweden. The action mention for this event is "victory," alongside three entity mentions "England" (the population of England), "fans" (English football fans), and "Sweden" (the Swedish national football team) who took part in the event. The temporal expression "on Saturday" grounds the event mention to a certain time, which in this case depends on the date the news article was published on.

Different definitions have been proposed for the relation of event coreference. Efforts such as ACE (Walker et al. 2006) only permit the annotation of identity between event mentions, whereas Hovy et al. (2013) further distinguish subevent or membership relations. Definitions generally need to find a compromise between complexity and ease of annotation, particularly for the cross-document case (see Wright-Bettner et al. [2019] for a detailed discussion). We follow the (comparatively simple) definition of Cybulska and Vossen (2014b), in which two action mentions corefer if they refer to the same real-world event, meaning their actions and their associated participants, time, and location are semantically equivalent.
Relevant examples are shown in Figure 1, where all action mentions of the same color refer to the same event. The two steps a CDCR system needs to perform therefore are (1) the detection of event actions and event components and (2) the disambiguation of event actions to produce a cross-document event clustering. A challenging aspect of CDCR is the fact that finding mentions of all four event components in the same sentence is rare, meaning that information may have to be inferred from the document context or, in some cases, it may not be present in the document at all. The second challenge is efficiently scaling the clustering process to large document collections with thousands of event mentions since every possible pair of event mentions could together form a valid cluster.
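The task definition above lends itself to a simple data-structure sketch. The following is purely illustrative and not the paper's code: an event mention holds the four components, and a deliberately naive coreference check requires all four to match. Real systems must judge semantic equivalence rather than string equality, which is exactly what makes the task hard.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical representation of an event mention; all names are assumptions.
@dataclass(frozen=True)
class EventMention:
    action: str                   # e.g. "victory"
    participants: FrozenSet[str]  # e.g. {"England", "fans", "Sweden"}
    time: Optional[str] = None    # e.g. the grounded date of "on Saturday"
    location: Optional[str] = None

def corefer(a: EventMention, b: EventMention) -> bool:
    """Naive stand-in for the coreference definition: two action mentions
    corefer if action, participants, time, and location are equivalent.
    Here equivalence is string equality; in reality it is semantic."""
    return (a.action == b.action
            and a.participants == b.participants
            and a.time == b.time
            and a.location == b.location)
```

Under this sketch, two reports of the same match with identical component values corefer, while a mention of a different match does not.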

System Requirements
The requirements that downstream applications place on systems resolving cross-document event coreference can be diverse. We establish high-level requirements that a system performing CDCR on news text should meet:

• Datasets may consist of many interwoven topics. Systems should perform well on a broad selection of event types with different properties (punctual events such as accidents, longer-term events such as natural disasters, pre-planned events such as galas or sports competitions).
• To provide high-quality results, systems should fully support the definition of event coreference mentioned previously, meaning they find associations between event mentions at a level human readers would be able to by inferring temporal and spatial clues from the document context and reasoning over event action and participants.
• Datasets may consist of a large number of documents containing many event mentions. We expect CDCR systems to be scalable enough to handle 100k event mentions in a reasonable amount of time (less than one day on a single-GPU workstation).
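The scalability requirement is driven by quadratic growth in the number of mention pairs. A quick back-of-the-envelope sketch (the equal-sized-bucket preclustering is an illustrative assumption):

```python
def mention_pairs(n: int) -> int:
    # number of unordered mention pairs a pairwise CDCR system must score
    return n * (n - 1) // 2

# 100k mentions yield roughly 5 billion pairs when compared exhaustively
print(mention_pairs(100_000))  # 4999950000

# Preclustering the mentions into k equal buckets (assumed here) and scoring
# pairs only within buckets cuts the workload roughly by a factor of k:
def pairs_with_preclustering(n: int, k: int) -> int:
    return k * mention_pairs(n // k)

print(pairs_with_preclustering(100_000, 100))  # 49950000
```

This is why several systems described later perform a document preclustering step before pairwise scoring.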

Corpora
The corpus most commonly associated with CDCR is EventCorefBank+ (ECB+). Originally developed as the EventCorefBank (ECB) corpus (Bejan and Harabagiu 2010), it was enriched with entity coreference annotations by Lee et al. (2012) to form the Extended EventCorefBank corpus. This corpus was later extended with 500 additional documents by Cybulska and Vossen (2014b) to create the ECB+ corpus. This most recent version contains 982 news articles on 43 topics. The topics were annotated separately, meaning that there are no coreference links across topics. For each topic (e.g., "bank explosions"), there are two main events ("Bank explosion in Oregon 2008" and "Bank explosion in Athens 2012") and several news documents that report on either of those two events. The set of documents reporting on the same event is commonly referred to as a subtopic. ECB+ is the only corpus of those discussed here that does not provide the publication date for each document. It does, however, contain annotations for all four event components as well as additional cross-document entity coreference annotations for participants, time, and location mentions.

The Football Coreference Corpus (FCC) (Bugert et al. 2020) contains 451 sports news articles on football tournaments annotated with cross-document event coreference. The annotation was carried out via crowdsourcing and focused on retrieving cross-subtopic event coreference links. Following the nomenclature of Bugert et al. (2020), a within-subtopic coreference link is defined by a pair of coreferring event mentions which originate from two documents reporting about the same overall event. For example, in ECB+, two different news articles reporting about the same bank explosion in Athens in the year 2012 may both mention the event of the perpetrators fleeing the scene. For a cross-subtopic event coreference link, two event mentions from articles on different events need to corefer.
A sports news article summarizing a quarterfinal match of a tournament could for example recommend watching the upcoming semifinal, whereas an article written weeks later about the grand final may refer to the same semifinal in an enumeration of a team's past performances in the tournament.
A concrete example is shown in Figure 1, where the mentions "beat" and "test" corefer while belonging to different subtopics. Cross-subtopic coreference links are a crucial aspect of CDCR since they connect mentions from documents with low content overlap, forming far-reaching coreference clusters that should prove particularly beneficial for downstream applications (Bugert et al. 2020). In FCC, event mentions are annotated only at the sentence level, contrary to ECB+ and GVC, which feature token-level annotations.
The Gun Violence Corpus (GVC) is a collection of 510 news articles covering 241 gun violence incidents. The goal was to create a challenging CDCR corpus with many similar event mentions. Each news article belongs to the same topic (gun violence) and only event mentions related to gun violence were annotated ("kill," "wounded," etc.). Cross-subtopic coreference links were not annotated.

Table 1 presents further insights into these corpora. There, we report the total number of event coreference links in each corpus and categorize them by type. Note that in ECB+ and GVC, nearly all cross-document links are of the within-subtopic kind, whereas FCC focused on annotating cross-subtopic links. The stark contrast in the number of coreference links between FCC and ECB+/GVC can be attributed to the facts that (1) the number of coreference links grows quadratically with the number of mentions in a cluster and (2) FCC contains clusters with more than 100 mentions (see Figure 2).
While the annotation design of each of these corpora has had different foci, they share commonalities. The structure of each corpus can be framed as a hierarchy with three levels: there are one or more topics/event types, which each contain subtopics/event instances, which each contain multiple documents. Both ECB+ and GVC annotate event mentions on the token level in a similar manner. Because FCC is the only CDCR corpus missing token-level event mention annotations, we add these annotations in this work to produce the FCC-T corpus (see Section 4). With this change made, it is technically and theoretically possible to examine these CDCR corpora jointly.
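Given this three-level framing, the link categories discussed above can be computed mechanically from mention metadata. A small sketch; the data layout and identifiers are our assumptions, not the corpora's actual formats:

```python
from collections import Counter, defaultdict
from itertools import combinations

def categorize_links(mentions):
    """mentions: iterable of (doc_id, subtopic_id, event_cluster_id) triples.
    Every pair of mentions in the same cluster is one coreference link; we
    categorize it as within-document, within-subtopic, or cross-subtopic."""
    by_cluster = defaultdict(list)
    for doc, subtopic, cluster in mentions:
        by_cluster[cluster].append((doc, subtopic))
    counts = Counter()
    for members in by_cluster.values():
        for (d1, s1), (d2, s2) in combinations(members, 2):
            if d1 == d2:
                counts["within-document"] += 1
            elif s1 == s2:
                counts["within-subtopic"] += 1
            else:
                counts["cross-subtopic"] += 1
    return counts

# Three mentions of one event: two in the same document, one in another subtopic.
print(categorize_links([("d1", "s1", "e1"), ("d1", "s1", "e1"), ("d2", "s2", "e1")]))
```

Counting links this way also makes the quadratic growth visible: a cluster of m mentions contributes m(m-1)/2 links, which explains the large link counts of FCC's 100+-mention clusters.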

Systems
We here summarize the principles of CDCR systems, followed by the state-of-the-art systems for each CDCR corpus.
2.4.1 System Principles. Given a collection of event mentions, a discrete or vectorized representation needs to be created for each mention so that the mentions can be clustered. Following the definition of the CDCR task, a representation should contain information on the action, participants, time, and location of the event mention. This information may be scattered throughout the document and needs to be extracted first. To do this, CDCR systems may preprocess documents via SRL, temporal tagging, or entity linking.

Two general strategies exist for computing the distances between mentions that are needed for clustering: representation learning and metric learning (Hermans, Beyer, and Leibe 2017). Representation learning approaches produce a vector representation for each event mention independently. The final event clustering is obtained by computing the cosine distance between each vector pair, followed by agglomerative clustering on the resulting distance matrix. Most approaches belong to the group of conceptually simpler metric learners, which predict the semantic distance between two mentions or clusters based on a set of features. By applying the metric on all n² pairs for n mentions, a distance matrix is obtained that is then fed to a clustering algorithm. Any probabilistic classifier or regression model may be used to obtain the mention distances. Metric learning approaches can be further divided into mention-pair approaches, which compute the distance between each mention pair once, and cluster-pair approaches, which recompute cluster representations and distances after each cluster merge.

Computing the distance between all mention pairs can be a computationally expensive process. Some metric learning approaches therefore perform a separate document preclustering step to break down the task into manageable parts. The metric learning approach is then applied on each individual cluster of documents and its results are combined to produce the final coreference clustering.
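The metric-learning strategy can be sketched end to end: score every mention pair, fill a symmetric distance matrix, and cluster agglomeratively. A minimal illustration using SciPy; the pair scorer and the distance threshold are placeholders, not any cited system's values:

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_mentions(mentions, pair_scorer, threshold=0.5):
    """pair_scorer returns P(corefer) for two mentions; 1 - P serves as the
    pairwise distance fed to average-linkage agglomerative clustering."""
    n = len(mentions)
    dist = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        dist[i, j] = dist[j, i] = 1.0 - pair_scorer(mentions[i], mentions[j])
    condensed = squareform(dist)  # SciPy expects the condensed distance form
    labels = fcluster(linkage(condensed, method="average"),
                      t=threshold, criterion="distance")
    return labels

# Toy scorer: mentions with the same action string are deemed coreferent.
same_action = lambda a, b: 1.0 if a == b else 0.0
labels = cluster_mentions(["shot", "shot", "arrested"], same_action)
print(labels)
```

The toy run groups the two "shot" mentions into one cluster and leaves "arrested" in its own, mirroring the distance-matrix-plus-clustering pipeline described above.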
Common types of features used by CDCR systems are text similarity features (string matching between event mention actions), semantic features (the temporal distance between mentions), features using world knowledge (the spatial distance between the locations of mentions), or discourse features (the position of a mention in the document), as well as latent neural features. 2 Table 2 shows the types of features that existing CDCR systems rely on.

Table 2
Preprocessing steps, representations, and features used by CDCR systems. We mark implicitly learned neural features separately. emb. = embeddings. w.r.t. = with respect to. [Columns compare the systems CR2020, ME2020, BA2019, KE2018, VO2016, CY2015, YA2015, LE2012, MI2018, VO2018, and ours; preprocessing rows include fact KB entity linking, lexical KB entity linking, semantic role labeling, and temporal tagging.]
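To make these feature families concrete, here is a hypothetical extractor for one mention pair. The names and formulas are our illustrations, not those of any cited system:

```python
def pair_features(m1: dict, m2: dict) -> dict:
    """m1/m2 are assumed to carry an action string, an event day (ordinal),
    a location string, and a sentence index within their document."""
    return {
        # text similarity: string matching between event actions
        "same_action": float(m1["action"] == m2["action"]),
        # semantic: temporal distance between the mentions' events, in days
        "time_distance": abs(m1["day"] - m2["day"]),
        # world knowledge: location match, a stand-in for spatial distance
        "same_location": float(m1["location"] == m2["location"]),
        # discourse: difference of the mentions' positions in their documents
        "position_diff": abs(m1["sent_idx"] - m2["sent_idx"]),
    }

m1 = {"action": "beat", "day": 100, "location": "Moscow", "sent_idx": 0}
m2 = {"action": "victory", "day": 100, "location": "Moscow", "sent_idx": 5}
print(pair_features(m1, m2))
```

A feature vector of this kind would then be fed to the probabilistic classifier or regression model that predicts the mention distance.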
2.4.2 Notable CDCR Systems. Table 3 shows a comparison of the core principles of several CDCR systems in terms of their mention distance computation, learning approach, and more. We compare the systems of Cremisini and Finlayson 2020 (CR2020), Meged et al. 2020 (ME2020), and the further systems abbreviated in Table 2.

At the time of writing, the state-of-the-art system on ECB+ is Meged et al. (2020), a cluster-pair approach in which a multilayer perceptron is trained to jointly resolve entity and event coreference. It is an extension of Barhom et al. (2019), adding paraphrasing features. The system performs document preclustering prior to the coreference resolution step.
GVC was used in SemEval 2018 Task 5, which featured a CDCR subtask. The best performing system was Mirza, Darari, and Mahendra (2018), which clusters documents using the output of a word sense disambiguation system, person and location entities, and event times. Based on the assumption that each document mentions up to one event of each event type, the system puts all event mentions of the same event type in the same cross-document event coreference cluster. Due to the nature of the shared task, the system is specialized on a limited number of event types.

VO2016 and VO2018 are based on the NewsReader pipeline, which contains several preprocessing stages to perform event mention detection, entity linking, word sense disambiguation, and more. Using this information, one rule-based system was defined per corpus (ECB+ and GVC) that is tailored to the topics and annotations present in the respective corpus.
The FCC is the most recently released corpus of the three. We are not aware of any publications reporting results for this corpus.

On the Application of Event Mention Detection.
With respect to the two steps a CDCR system needs to perform (event mention detection and event coreference resolution), several authors have decided to omit the first step and work on gold mentions alone (Cybulska and Vossen 2015;Kenyon-Dean, Cheung, and Precup 2018;Barhom et al. 2019;Meged et al. 2020), which simplifies the task and system development. Systems that include a mention detection step (Lee et al. 2012;Yang, Cardie, and Frazier 2015;Vossen and Cybulska 2016;Choubey and Huang 2017;Cremisini and Finlayson 2020) are more faithful to the task but risk introducing another source of error. Compared to using gold event mentions, performance drops from 20 percentage points (pp) CoNLL F1 (Vossen and Cybulska 2016) to 40 pp CoNLL F1 (Cremisini and Finlayson 2020) have been observed on ECB+. Vossen and Cybulska derive from these results that event detection "is the most important factor for improving event coreference" (Vossen and Cybulska 2016, page 518).
We think that the root cause for these losses in performance is not the event detection approaches themselves but rather intentional limitations in the event mention annotations of CDCR corpora. We take the ECB+ corpus as an example. Based on the event definition stated in the annotation guidelines, several hundred event mentions would qualify for annotation in each news document. To keep the annotation effort manageable, only event mentions of the document's seminal event (the main event the article is reporting about) and mentions of other events in the same sentence were annotated (Cybulska and Vossen 2014a, page 9). Consequently, the corpus contains a large number of valid event mentions that were deliberately left unannotated. 3 A mention detection system will (unaware of this fact) predict these event mentions anyway and will be penalized for producing false positive predictions. In the subsequent mention clustering step, coreference chains involving these surplus mentions increase the risk of incorrect cluster merges between valid mentions and will overall lead to lower precision. A general purpose mention detection system may perform poorly on the FCC and GVC corpora in similar fashion. For these corpora, affordability of the annotation process was achieved by restricting event mentions to certain action types, which lowers the overall number of to-be-annotated event mentions.
We therefore think that, as long as no CDCR corpus exists in which every single event mention is annotated, event detection and event coreference resolution should be treated separately, meaning that event coreference resolution performance should be reported on gold event mentions. For this reason, and because of the different approaches for limiting the number of event mentions in each of the three corpora, we perform all experiments on gold event mention spans in this work.

Related Work
Prior work has examined feature importance in CDCR systems. Cybulska and Vossen (2015) tested different combinations of features with a decision tree classifier on ECB+. They find that system performance largely stems from a lemma overlap feature and that adding discourse, entity coreference, and word sense disambiguation features improves BLANC F1 by only 1 pp. Cremisini and Finlayson (2020) conducted a study in which they built a feature-based mention-pair approach for ECB+ to gain deeper insights into the importance of features and the performance impact of document preclustering. Among four features (fastText [Bojanowski et al. 2017] word embedding similarity between event actions, event action word distribution, sentence similarity, and event action part-of-speech comparison), the embedding similarity feature was found to be the most important by far. The use of document preclustering caused an improvement of 3 pp CoNLL F1, leading Cremisini and Finlayson to encourage future researchers in this field to report experiments with and without document preclustering.
Our work significantly deepens these earlier analyses. Because research on CDCR systems has so far focused only on resolving cross-document event coreference in individual corpora, we tackle the issue of generalizability across multiple corpora. We use a broader set of features and compare two CDCR approaches, whereas previous work focused on the ECB+ corpus using the aforementioned smaller sets of features. We (1) develop a general feature-based CDCR system, (2) apply it on each of the corpora mentioned above, and (3) analyze the information sources in each corpus that are most informative for cross-document event coreference. We thereby provide the first comparative study of CDCR approaches, paving the way for general CDCR, which will aid downstream multidocument tasks.

FCC Reannotation
We reannotate the Football Coreference Corpus (FCC) to improve its interoperability with ECB+ and the Gun Violence Corpus (GVC). 4 The FCC was recently introduced by Bugert et al. (2020) as a CDCR corpus with sentence-level event mention annotations (see Section 2.3). We reannotate all event mentions on token level, add annotations of event components, and annotate additional event mentions to produce the FCC-T corpus (T for token level). The following sections cover our annotation approach, inter-annotator agreement, and the properties of the resulting corpus.

Annotation Task Definition
In the original FCC annotation, crowd annotators were given a predefined set of events and sentences of news articles to work on. Each sentence had to be marked with the subset of events referenced in the sentence. We take these sentences and annotate the action mention of each referenced event on token level. For each event, we additionally annotate the corresponding participants, time, and location mentions appearing in the same sentence as the action mention. To achieve maximum compatibility with existing corpora, we adopted the ECB+ annotation guidelines (Cybulska and Vossen 2014a). 5 We distinguish between different subtypes of participants (person, organization, etc.), time, and location as done by Cybulska and Vossen (2014a). We do not differentiate between action types because all events (pre-)annotated in FCC should belong to the OCCURRENCE type (see Cybulska and Vossen 2014a, page 14). We do not annotate (cross-document) entity coreference. We do annotate a rudimentary kind of semantic roles that we found are crucially missing in ECB+: We instruct annotators to link mentions of participants, time, and location to their corresponding action mention.
While developing the guidelines, we noticed cases where sentence-level mentions are evidently easier to work with than token-level mentions. For example, enumerations or aggregated statements over events (such as "Switzerland have won six of their seven meetings with Albania, drawing the other.") are difficult to break down into token-level event mentions. Cases like these are not covered by the ECB+ annotation guidelines and were removed in the conversion process. A similar issue is caused by coordinate structures such as "Germany beat Algeria and France in the knockout stages," where two football match events are referenced by the same verb. To handle these cases, we annotated two separate event mentions sharing the same action mention ("beat"). Because superimposed mention spans are not supported by coreference evaluation metrics, we additionally provide a version of the corpus in which these mentions are removed.
In FCC, crowdworkers identified a further 1,100 sentences that mention one or more football-related events outside of the closed set of events they were provided with during the annotation. These event mentions were left unidentified by Bugert et al. (2020). We instructed annotators to manually link each event mention in this extra set of sentences to a database of 40k international football matches 6 and again marked and linked the token spans of actions, participants, times, and locations.
Annotators were given the option to mark sentences they found unclear or that were incorrectly annotated by crowdworkers in the original dataset. We manually resolved the affected sentences on a case-by-case basis.

Annotation Procedure and Results
The annotation was carried out with the INCEpTION annotation tool (Klie et al. 2018). We trained two student annotators on a set of 10 documents. The students were given feedback on their work and afterward annotated a second batch of 22 documents independently. Table 4 shows the inter-annotator agreement on this second batch. We report Krippendorff's α_U (Krippendorff 1995), which measures the agreement in span overlap on character level as the micro average over all documents.

Table 4
Inter-annotator agreement (α_U).
action mentions: 0.80
participants, time, location (spans only): 0.67
participants, time, location (incl. subtype): 0.57
For the annotation of action mention extents, which is the most important step in our re-annotation effort, we reach 0.80 α_U, indicating good reliability between annotators (Carletta 1996; Artstein and Poesio 2008). The agreement for the annotation of participants, time, and location is lower, at 0.57 α_U. We found that this mostly stems from the annotation of participants: In the guidelines, we specify that annotators should only mark an entity as a participant of an event if it plays a significant role in the event action. The larger and coarser an event is, the more difficult this decision becomes for annotators. One such case is shown in Example 1, where it is debatable whether "Christian Teinturier" is or is not significantly involved in the tournament event.
Example 1
"Earlier today, French Football Federation vice-president Christian Teinturier said if there was any basis to the reports about Anelka then he should be sent home from the tournament immediately."

A second reason is that we do not annotate entity coreference, so only a single entity mention is meant to be annotated for each entity participating in an event. In case the same entity appears twice in a sentence, we instruct annotators to choose the more specific description. If the candidates are identical in surface form, annotators are meant to choose the one closer (in word distance) to the event action. There remains a level of subjectivity in these decisions, leading to disagreement.
Overall, we concluded that the annotation methodology produced annotations of sufficient quality. The remaining 419 documents were divided among both annotators. The corpus re-annotation required 120 working hours from annotators (including training and the burn-in test). We fixed a number of incorrect annotations in the crowdsourced FCC corpus. For example, we removed several mentions of generic events ("winning a World Cup final is every player's dream") that were incorrectly marked as referring to a concrete event.

Table 1 shows the properties of the resulting FCC-T corpus alongside ECB+, GVC, and our cleaned version of the sentence-level FCC corpus. Compared with the original FCC corpus, our token-level reannotation offers 50% more event mentions and twice as many annotated events. With respect to the SRL annotations, we analyzed how frequently event components of each type were attached to action mentions. We found that 95.7% of action mentions have at least one participant attached, 41.6% at least one time mention, and 15.8% at least one location mention.

We mentioned earlier that cases exist where two or more action mentions share the same token span. A total of 340 out of all 3,563 annotated event mentions in FCC-T fall into this category. A further 154 event mentions did not have a counterpart in the event database (such as matches from 586 national football leagues). We jointly assigned these mentions to the coreference cluster "other event".
By creating the FCC-T, a reannotation and extension of FCC on token level, we provide the first CDCR corpus featuring a large body of cross-subtopic event coreference links that is compatible with the existing ECB+ and GVC corpora. 7 This greatly expands the possibilities for CDCR research over multiple corpora, as we will demonstrate in Sections 6 to 8.

Defining a General CDCR System
Recent CDCR approaches such as neural end-to-end systems or cluster-pair approaches were shown to offer great performance (Barhom et al. 2019), yet their black-box nature and their complexity make it difficult to analyze their decisions. In particular, our goal of identifying which aspects of a CDCR corpus provide the strongest signals for event coreference cannot be adequately pursued with such systems. We therefore propose a conceptually simpler mention pair CDCR approach that uses a broad set of handcrafted features for resolving event coreference in different environments. We thus focus on developing an interpretable system, whereas reaching state-of-the-art performance is of secondary importance. This section explains the inner workings of the proposed system.

Basic System Definition
We resolve cross-document event coreference by considering pairs of event mentions. At training time, we sample a collection of training mention pairs. For each pair, we extract handcrafted features with which we train a probabilistic binary classifier that learns the coreference relation between a pair (coreferring or not coreferring). The classifier is followed by an agglomerative clustering step that uses each pair's coreference probability as the distance matrix. At prediction time, all n(n-1)/2 mention pairs are classified without prior document preclustering. For the reasons outlined in Section 2.4.3, we choose to omit the mention detection step and work with the gold event mentions of each corpus throughout all experiments.
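The classify-then-cluster pipeline can be sketched as follows. This is a minimal sketch, not our actual implementation: `featurize` and the classifier interface are hypothetical, and SciPy's hierarchical clustering stands in for the scikit-learn setup.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_mentions(mentions, featurize, classifier, threshold=0.5):
    """Cluster event mentions from pairwise coreference probabilities."""
    n = len(mentions)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            feats = featurize(mentions[i], mentions[j]).reshape(1, -1)
            p_coref = classifier.predict_proba(feats)[0, 1]
            # agglomerative clustering treats 1 - P(coreferring) as a distance
            dist[i, j] = dist[j, i] = 1.0 - p_coref
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=1.0 - threshold, criterion="distance")
```

Mentions whose pairwise coreference probability stays above the threshold along the linkage end up in the same flat cluster.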

Pair Generation for Training
We explain three issues that arise when sampling training mention pairs and how we address them in our system.
The straightforward technique for sampling training pairs is to sample a fixed number of all possible coreferring and non-coreferring mention pairs. Due to the sparsity of the CDCR relation, the resulting set of pairs would mostly consist of non-coreferring pairs when using this technique, with the majority of coreferring pairs left unused. This issue has been partially addressed in the past with weighted sampling to increase the ratio of coreferring pairs (Lee et al. 2012; Barhom et al. 2019).
We identified a second issue, namely, the underrepresentation of mention pairs from the long tail, which weighted sampling does not address: We previously established that cluster sizes in corpora are imbalanced (see Figure 2). If all n(n-1)/2 coreferring pairs are generated for each cluster, the generated pairs will largely consist of pairs from the largest clusters. 8 Manual inspection revealed that the variation in how events are expressed is limited, with large clusters exhibiting many action mentions with (near-)identical surface forms. 9 Consequently, with common pair generation approaches, there is a high chance of generating many mention pairs that carry little information for the classifier, while mention pairs from clusters in the long tail are unlikely to be included.
Another issue we have not yet seen addressed in related work is the distribution of link types in the body of sampled pairs: In terms of the number of mention pair candidates available for sampling, the cross-topic link candidates strongly outnumber the cross-subtopic link candidates, which in turn strongly outnumber the within-subtopic link candidates (and so on) by nature of combinatorics. This particularly concerns the large body of non-coreferring pairs. An underrepresentation of one of these types during training could cause deficiencies for the affected type at test time; hence, care must be taken to achieve a balanced sampling.
We address these three issues as follows: (1) We use the distribution of cluster sizes in the corpus to smoothly transition from generating all n(n-1)/2 coreferring pairs for the smallest clusters to generating (n − 1) · c pairs for the largest clusters, where c ∈ R+ is a hyperparameter.
(2) For each type of coreference link (within-document, within-subtopic, etc.), we sample up to k non-coreferring mention pairs for each coreferring pair previously sampled for this type. Details on the sampling approach are provided in Appendix H.
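The two sampling mechanisms can be sketched as follows. This is a simplified sketch: where the system described above transitions smoothly based on the corpus-wide cluster size distribution, the sketch applies a hard cap of (n − 1) · c per cluster; all names are illustrative.

```python
import random
from itertools import combinations

def sample_coreferring_pairs(clusters, c=8, rng=None):
    """Cap the coreferring pairs drawn from large clusters.

    For a cluster of n mentions, all n*(n-1)/2 pairs are kept while that
    number stays below the cap (n-1)*c; larger clusters are subsampled
    down to the cap so long-tail clusters are not drowned out.
    """
    rng = rng or random.Random(0)
    pairs = []
    for cluster in clusters:
        n = len(cluster)
        all_pairs = list(combinations(cluster, 2))
        cap = (n - 1) * c
        if len(all_pairs) > cap:
            all_pairs = rng.sample(all_pairs, cap)
        pairs.extend(all_pairs)
    return pairs

def sample_negatives_per_type(positives_by_type, candidates_by_type, k=8, rng=None):
    """For each link type, draw up to k non-coreferring pairs per positive."""
    rng = rng or random.Random(0)
    negatives = {}
    for link_type, positives in positives_by_type.items():
        pool = candidates_by_type.get(link_type, [])
        want = min(len(pool), k * len(positives))
        negatives[link_type] = rng.sample(pool, want)
    return negatives
```

Sampling negatives separately per link type keeps within-document, within-subtopic, and cross-subtopic pairs in balance regardless of how strongly their candidate counts differ.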

Features and Preprocessing
Related work has demonstrated a great variety in the representations and features used to resolve cross-document event coreference (see Table 2), yet it remains unclear which features contribute the most to the coreference resolution performance on each of the three corpora. We therefore chose to implement a series of preprocessing steps and feature extractors that cover the majority of features used in previous systems.

Preprocessing.
We perform lemmatization and temporal expression extraction with CoreNLP (Chang and Manning 2012; Manning et al. 2014), using document publication dates to ground temporal expressions for GVC and FCC-T. We manually converted complex TIMEX expressions into date and time (so that 2020-01-01TEV becomes 2020-01-01T19:00). For ECB+ and GVC, where participant, time, and location mentions are not linked to the action mention, we applied the SRL system by Shi and Lin (2019) as implemented in AllenNLP (Gardner et al. 2018). We map spans with labels ARGM-DIR or ARGM-LOC to the location, ARGM-TMP to the time, and ARG0 or ARG1 to the participants of each respective event mention. For all corpora, we perform entity linking to DBpedia 10 via DBpedia Spotlight (Mendes et al. 2011).
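Assuming BIO-tagged output for one verb frame, as produced by the AllenNLP SRL predictor, the role mapping might look as follows. The span-grouping helper is an illustrative sketch, not the actual implementation.

```python
# Mapping from PropBank-style SRL labels to event components.
ROLE_MAP = {
    "ARGM-LOC": "location",
    "ARGM-DIR": "location",
    "ARGM-TMP": "time",
    "ARG0": "participants",
    "ARG1": "participants",
}

def attach_srl_arguments(srl_tags, tokens):
    """Group BIO-tagged SRL argument spans by event component."""
    components = {"location": [], "time": [], "participants": []}
    span, label = [], None
    # the trailing ("O", "") sentinel flushes the final open span
    for tag, tok in list(zip(srl_tags, tokens)) + [("O", "")]:
        if tag.startswith("B-") or tag == "O":
            if span and label in ROLE_MAP:
                components[ROLE_MAP[label]].append(" ".join(span))
            span, label = [], None
        if tag.startswith("B-"):
            span, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            span.append(tok)
    return components
```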

Features.
The list of handcrafted mention pair features includes (1) string matching on action mention spans, (2) cosine similarity of TF-IDF vectors for various text regions, (3) the temporal distance between mentions, (4) the spatial distance between event actions based on DBPedia, and (5) multiple features comparing neural mention representations. These include representations of action mentions, embeddings of the surrounding sentence, and embeddings of Wikidata entities that we obtained via the DBPedia entity linking step. Details on each feature are reported in Appendix C.
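A few of these feature groups can be illustrated with a sketch. The mention dictionaries and their keys (`action`, `sentence`, `time`) are hypothetical; the full feature set in Appendix C is considerably broader.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_features(m1, m2):
    """Illustrative subset of the handcrafted mention-pair features."""
    feats = {}
    # (1) string matching on the action mention spans
    feats["action_exact_match"] = float(m1["action"].lower() == m2["action"].lower())
    # (2) TF-IDF cosine similarity of a text region (here: the sentences)
    tfidf = TfidfVectorizer().fit([m1["sentence"], m2["sentence"]])
    vecs = tfidf.transform([m1["sentence"], m2["sentence"]])
    feats["sentence_tfidf_cosine"] = float(cosine_similarity(vecs[0], vecs[1])[0, 0])
    # (3) temporal distance in hours, with a sentinel for unknown times
    if m1["time"] and m2["time"]:
        feats["time_distance_hours"] = abs((m1["time"] - m2["time"]).total_seconds()) / 3600
    else:
        feats["time_distance_hours"] = -1.0
    return feats
```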

Implementation Details
We implemented the system using scikit-learn (Pedregosa et al. 2011). To obtain test predictions, we applied the following steps separately for each corpus: We performed feature selection via recursive feature elimination (Guyon et al. 2002) on the respective development split, using a random forest classifier tasked with classifying mention pairs as "coreferring" / "not coreferring" as an auxiliary model for this stage. We then identified the best classification algorithm to use as the probabilistic mention pair classifier, testing logistic regression, a multi-layer perceptron, a probabilistic SVM, and XGBoost (Chen and Guestrin 2016). We tuned the hyperparameters of each classifier via repeated 6-fold cross-validation for 24 hours on the respective training split. 11 Using the best classifier, we optimized the hyperparameters of the agglomerative clustering step (i.e., the linkage method, cluster criterion, and threshold) for another 24 hours on the training split. For each experiment, we train five models with different random seeds to account for non-determinism. At test time, we evaluate each of the five models and report the mean of each evaluation metric. 12
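The selection-then-tuning procedure can be sketched as follows. This is a reduced sketch: `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-free, and the 24-hour search budgets are replaced by a small fixed number of parameter draws; all parameter grids are illustrative.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV, RepeatedKFold

def fit_pair_classifier(X_dev, y_dev, X_train, y_train, n_features=10, seed=0):
    """Feature selection on the dev split, then tuning on the train split."""
    # 1) recursive feature elimination with a random forest as auxiliary model
    rfe = RFE(RandomForestClassifier(random_state=seed),
              n_features_to_select=n_features).fit(X_dev, y_dev)
    X_train_sel = X_train[:, rfe.support_]
    # 2) hyperparameter search with repeated k-fold cross-validation
    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=seed),
        {"n_estimators": [50, 100], "max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
        n_iter=4,
        cv=RepeatedKFold(n_splits=6, n_repeats=1, random_state=seed),
        random_state=seed,
    ).fit(X_train_sel, y_train)
    return search.best_estimator_, rfe.support_
```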

Generalizability of CDCR Systems
We train and test two CDCR systems and several baselines separately on the ECB+, FCC-T, and GVC corpora to evaluate how flexibly these systems can be applied to different corpora (i.e., whether their overall design is sufficiently general for resolving cross-document event coreference in each corpus). The two systems are our proposed general system (see Section 5) and the system of Barhom et al. (2019) (BA2019). We chose BA2019 because it is the best-performing ECB+ system for which an implementation is available.
In Sections 6.1 and 6.2, we define evaluation metrics and baselines. We then establish the performance of the feature-based system (Section 6.3) on the three corpora, including a detailed link-level error analysis which we can only perform with this system. In Section 6.4, we explain how we apply BA2019 and compare its results to those of the feature-based system, analyzing the impact of document preclustering on the coreference resolution performance in the process.

Evaluation Metrics
Related work on CDCR has so far only scored predictions with the CoNLL F1 (Pradhan et al. 2014) metric (and its constituent parts MUC [Vilain et al. 1995], CEAFₑ [Luo 2005], and B³ [Bagga and Baldwin 1998]). We additionally score predictions with the LEA metric (Moosavi and Strube 2016). LEA is a link-based metric which, in contrast to other metrics, takes the size of coreference clusters into account. The metric penalizes incorrect merges between two large clusters more than incorrect merges of mentions from two singleton clusters. As we have shown that cluster sizes in CDCR corpora vary considerably (see Figure 2), this is a particularly important property. LEA was furthermore shown to be more discriminative than the established metrics MUC, CEAFₑ, B³, and CoNLL F1 (Moosavi and Strube 2016).
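LEA's cluster-size weighting is visible in a minimal implementation. This sketch skips singleton entities; the original metric handles them by adding a self-link per singleton (Moosavi and Strube 2016).

```python
def _link(n):
    # number of coreference links in a cluster of n mentions
    return n * (n - 1) // 2

def lea(key, response):
    """LEA recall/precision/F1 for clusterings given as lists of mention sets."""
    def score(entities, others):
        num, denom = 0.0, 0
        for e in entities:
            if len(e) < 2:          # singleton handling omitted in this sketch
                continue
            resolution = sum(_link(len(e & o)) for o in others) / _link(len(e))
            num += len(e) * resolution   # entity importance = its size
            denom += len(e)
        return num / denom if denom else 0.0

    r = score(key, response)
    p = score(response, key)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f1
```

Because each entity's resolution score is weighted by its size, splitting a large cluster costs far more recall than splitting a small one.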

Baselines
We report the two commonly chosen baselines lemma and lemma-δ as well as a new lemma-time baseline based on temporal information:

1. lemma: Action mentions with identical lemmas are placed in the same coreference cluster.
2. lemma-δ: Document clusters are created by applying agglomerative clustering with threshold δ on the TF-IDF vectors of all documents, then lemma is applied to each document cluster. For hyperparameter δ, we choose the value that produces the best LEA F1 score on the training split.
3. lemma-time: A variant of lemma-δ based on document-level temporal information. To obtain the time of the main event described by each document, we use the first occurring temporal expression or alternatively the publication date of each document. We create document clusters via agglomerative clustering where the distance between two documents is defined as the difference of their dates in hours. We then apply lemma to each document cluster. Here, the threshold δ represents a duration, which is optimized as for lemma-δ.
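The first two baselines can be sketched as follows; lemma-time is obtained analogously by swapping the TF-IDF distance for the publication-date difference in hours. The mention keys (`id`, `doc`, `lemma`) are illustrative.

```python
from collections import defaultdict
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

def lemma_baseline(mentions):
    """lemma: cluster action mentions whose lemmas are identical."""
    clusters = defaultdict(list)
    for m in mentions:
        clusters[m["lemma"]].append(m["id"])
    return list(clusters.values())

def lemma_delta_baseline(documents, mentions, delta):
    """lemma-δ: precluster documents by TF-IDF cosine distance with
    threshold delta, then run the lemma baseline per document cluster."""
    doc_ids = list(documents)
    vecs = TfidfVectorizer().fit_transform([documents[d] for d in doc_ids]).toarray()
    labels = fcluster(linkage(vecs, method="average", metric="cosine"),
                      t=delta, criterion="distance")
    doc_cluster = dict(zip(doc_ids, labels))
    clusters = defaultdict(list)
    for m in mentions:
        clusters[(doc_cluster[m["doc"]], m["lemma"])].append(m["id"])
    return list(clusters.values())
```

The document clustering acts as a hard filter: mentions in different document clusters can never corefer, which is exactly why this baseline breaks down on FCC-T (see below).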

Establishing the Feature-based System
We run in-dataset experiments to determine the performance of the feature-based CDCR approach on each individual corpus. Details on the splits used for each corpus are reported in Appendix D. When generating mention pairs for training, we undersample coreferring pairs (see Section 5.2) using hyperparameters c = 8 and k = 8. In experiments involving FCC-T, we use c = 2 and k = 8 to compensate for the large clusters in this corpus. Details on the choice of hyperparameters are provided in Appendix H. On all three corpora, the best mention pair classification results were obtained with XGBoost, which led us to use it for all subsequent experiments with this system.
6.3.1 Mention Clustering Results.
The results are shown in Table 5. For brevity, we only report cross-document performance. It is obtained by applying the evaluation metrics on modified key and response files in which all documents were merged into a single meta-document (Upadhyay et al. 2016). As initially shown by Upadhyay et al. (2016), lemma-δ is a strong baseline on the ECB+ corpus. The feature-based system performs on par with this baseline.
For FCC-T, the optimal δ produces a single cluster of all documents, which leads to identical results for the lemma and lemma-δ baselines. This is a direct consequence of the fact that in this corpus, the majority of event coreference links connect documents from different subtopics. In contrast to ECB+, where preclustering documents by textual content produces document clusters that are near-identical to the gold subtopics (Barhom et al. 2019; Cremisini and Finlayson 2020), such a strategy is disadvantageous for FCC-T because the majority of coreference links would be irretrievably lost after the document clustering step. The lemma-time baseline performs worse on FCC-T than lemma-δ, indicating that the document publication date is less important than the document content. The feature-based approach outperforms the baselines on FCC-T, showing higher recall but lower precision, which indicates a tendency to overmerge clusters.
The lemma baselines perform worse on GVC than on ECB+ in absolute numbers, which can be attributed to the fact that Vossen et al. (2018) specifically intended to create a corpus with ambiguous event mentions. Furthermore, the baseline results show that knowing about a document's publication date is worth more than knowing its textual content (at least for this corpus). The feature-based system mostly improves over the baselines in terms of recall.
Another noteworthy aspect of Table 5 is the difference in scores between CoNLL F1 and LEA F1. In the within-document entity coreference evaluations performed by Moosavi and Strube (2016) alongside the introduction of the LEA metric, the maximum difference observed between CoNLL F1 and LEA F1 was roughly 10 pp. Our experiments exhibit differences of 14 pp for systems and up to 20 pp for baselines due to imbalanced cluster sizes in CDCR corpora.

Mention Pair Classifier Results.
To evaluate the probabilistic mention pair classifier in isolation for different corpora and coreference link types, we compute binarized recall, precision, and F1 with respect to gold mention pairs. 13 The results are reported in Table 6.

Table 6
Mention pair classifier performance of the feature-based system for each cross-document coreference link type. "Coreferring" is used as the positive class. The "Links" column shows the total number of links (coreferring and non-coreferring) per type and corpus based on which P/R/F1 were calculated.

The GVC test split does not contain coreferring cross-subtopic event coreference links; therefore these cells are marked with "n/a." Five links of this type are present in the ECB+ test split, of which none were resolved by the in-dataset ECB+ model. For FCC-T and GVC, the performance in resolving within-document, within-subtopic, and cross-subtopic event coreference links decreases gradually from link type to link type. This suggests that the greater the distance covered by an event coreference link in terms of the topic-subtopic-document hierarchy of a corpus, the more difficult it becomes to resolve it correctly.

13 Note that this approach puts higher weight on large clusters, as these produce a greater number of mention pairs. It is nonetheless the only evaluation approach we are aware of that permits analyzing performance per link type. Link-based coreference metrics such as MUC (Vilain et al. 1995) cannot be used as a replacement, as these (1) require a full clustering as opposed to one score per pair and (2) by design abstract away from individual links in a system's response.
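The binarized pair-level evaluation can be sketched as follows. This is a minimal sketch with illustrative mention ids and cluster ids; restricting the candidate pairs to those of a single link type yields the per-type breakdown.

```python
from itertools import combinations

def pairwise_prf(gold_clusters, predicted_pairs, mentions):
    """Binarized mention-pair scores, with "coreferring" as positive class.

    gold_clusters: mention id -> gold cluster id
    predicted_pairs: set of pairs the classifier labeled coreferring
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(mentions), 2):
        gold = gold_clusters[a] == gold_clusters[b]
        pred = (a, b) in predicted_pairs or (b, a) in predicted_pairs
        tp += gold and pred
        fp += pred and not gold
        fn += gold and not pred
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```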

Error Analysis.
To gain a better understanding of the system's limitations, we manually analyzed predictions of the mention pair classifier. We analyzed five false-positive and five false-negative cases for each link type and corpus (roughly 90 mention pairs in total). We found that textual similarity between action mentions accounts for a large portion of mistakes on the ECB+ and GVC corpora: Unrelated but similar action mentions caused false-positive cases and, vice versa, coreferring but merely synonymous action mentions led to false-negative cases. The FCC-T model did not resolve coreference well between mentions like "the tournament," "this year's cup," and "2018 World Cup," contributing to false-negative cases. Also, the model showed a tendency to merge event mentions prematurely when the action mention and at least one participant matched (see example 1 in Table 7), which would explain the high recall and low precision results seen in Table 5. For all three models, we noticed misclassifications when a sentence contained multiple event mentions (see example 2 in Table 7). In the given example, it is likely that information from the unrelated "shot in his leg" mention leaked into the representation of the "grazed" event, which contributed to the incorrect classification. For ECB+, we noticed that the lack of document publication date information makes certain decisions considerably harder. For example, the earthquake events seen in example 3 are unrelated and took place four years apart. Although one could come to this conclusion with geographic knowledge alone (the provinces lie on opposite sides of Indonesia), date information would have made this decision easier.
It is reassuring that many of the shortcomings we found would be fixable with a cluster-level coreference resolution approach, (joint) resolution of entity coreference, injection of corpus-specific world knowledge (a football match must take place between exactly two teams, etc.), or with annotation-specific knowledge (e.g., knowledge of Vossen et al.'s [2018] domain model for the annotation of GVC). Our system could be improved by incorporating these aspects, though at the cost of becoming more corpus-specific and less interpretable.

Comparison to Barhom et al. (2019)
We test an established CDCR system, the former state-of-the-art neural CDCR approach of Barhom et al. (2019) (BA2019), for its generalization capabilities on the three corpora.
6.4.1 Experiment Setup.
We trained one model of BA2019 for each corpus. BA2019 can resolve event and entity coreference jointly. For the sake of comparability, we only use the event coreference component of this system in all experiments, since FCC-T and GVC do not contain entity coreference annotations. We replicate the exact data preprocessing steps originally used for ECB+ on FCC-T and GVC. This includes the prediction of semantic roles with the SwiRL SRL system (Surdeanu et al. 2007). The FCC-T corpus mainly consists of cross-subtopic event coreference links (see Section 4.2). The trainable part of the BA2019 system (mention representation and agglomerative mention clustering) is meant to be trained separately on each subtopic of a corpus. This is because at prediction time, the partitioning of documents into subtopics will already be handled by a foregoing and separate document preclustering step. In order not to put BA2019 at a disadvantage for FCC-T, we train it on three large groups of documents that correspond to the three football tournaments present in the FCC-T training split instead of training it on the actual FCC-T subtopics. For this corpus, we also apply undersampling with the same parameters as for the feature-based system.

6.4.2 Document Preclustering.
Following Cremisini and Finlayson (2020), we report results for several document preclustering strategies to better distinguish the source of performance gains or losses. We compare (1) no preclustering, (2) the gold document clustering, and (3) the k-means clustering approach used in Barhom et al. (2019). The gold document clusters are defined via the transitive closure of all event coreference links. For GVC, this gold document clustering is identical to the corpus subtopics. For the FCC-T test split, the gold clustering is a single cluster containing all documents, which is equivalent to not applying document clustering at all.
For ECB+, the gold clustering largely corresponds to the corpus subtopics, with the exception of some subtopics that are merged due to cross-subtopic event coreference links. In the k-means approach, all n input documents are represented by TF-IDF vectors, based on which all possible k-means clusterings for k = 2, ..., n are created. From these clusterings, the one with the highest silhouette score (Rousseeuw 1987) is used.
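The silhouette-based model selection can be sketched as follows. Note one deviation from the description above: the silhouette score is only defined for 2 ≤ k ≤ n − 1, so the sketch stops at k = n − 1.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def precluster_documents(texts, max_k=None, seed=0):
    """Pick the k-means clustering of TF-IDF document vectors with the
    highest silhouette score (sketch of the BA2019-style preclustering)."""
    X = TfidfVectorizer().fit_transform(texts)
    max_k = max_k or len(texts) - 1   # silhouette needs 2 <= k <= n-1
    best_labels, best_score = None, -1.0
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels
```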
6.4.3 Results.
Due to long runtimes of BA2019, 14 results reported for this system stem from a single execution only. We do not report experiments without preclustering on ECB+ and GVC using this system due to scalability issues caused by the greater number of event mentions in these corpora (see Table 1). The results are shown in Table 8. We comment on the most remarkable results. The BA2019 system architecture performs well on GVC, reaching 65.6 LEA F1. Compared to the ECB+ results, there is a notable score difference between the k-means and gold preclustering variants on this corpus. The reason is the same one that led to the lemma-time baseline outperforming lemma-δ on this corpus: preclustering documents by textual content is less effective on a corpus with a single topic, and BA2019 does not make use of document publication date annotations. For FCC-T, applying BA2019 out of the box with k-means preclustering performs much worse than when the preclustering step is omitted, due to the large number of cross-subtopic links being cut off.
When comparing systems against each other, BA2019 performs better than the feature-based approach on ECB+ and GVC. The opposite is the case for FCC-T, where the neural model shows greatly reduced recall in comparison to the feature model. This is surprising to some extent since BA2019 is a more powerful cluster-level approach compared to the mention pair approach. A plausible explanation for the performance drop on FCC-T is the narrower set of features in BA2019. Notably, this system lacks world knowledge on locations and participants and does not explicitly model temporal information, all of which would make intuitive sense to have for a corpus mentioning a variety of football players and matches happening on specific dates. The next section adds evidence to this intuition by analyzing in greater depth the information necessary for resolving event coreference in each corpus.
With respect to the experiments conducted in this section, we have shown that it cannot be taken for granted that CDCR systems are sufficiently general to perform equally well on different corpora. This concerns both the quality of their results (which can fluctuate) as well as more fundamental aspects such as their computational complexity (which may preclude their applicability). In the concrete case of BA2019, this comes down to the choice of the input features and the dependency on document preclustering.

Identifying the Signals for Event Coreference
According to the CDCR task definition (see Section 2.1), coreference between a pair of event mentions requires a match between each of their components (action, participants, time, location). We analyze to which extent corpora satisfy this definition in practice, namely, whether inference over all event components is indeed required, or whether certain event components suffice as signals for resolving event coreference. We approach this analysis in two ways: (1) We investigate the most important features per corpus at training time via model introspection, and (2) we mask the mentions of certain event components in the test split and measure the impact on test performance. We explain the two approaches and present their results (Sections 7.1 and 7.2), then jointly discuss their outcome in Section 7.3.

Feature Importance
Our main reason for developing a feature-based system was that, compared to neural systems, it allows one to directly analyze which input information a model is making use of.
In our system architecture, the agglomerative clustering step is preceded by a mention pair classifier, which we found worked best with the decision tree boosting framework XGBoost (see Section 6). For decision trees, feature importance metrics can be derived from trained models. In Table 9, we report the top features selected during feature selection for each corpus. Alongside, we report the importance of each feature at training time according to the gain metric of XGBoost. 15 For ECB+, the selected features only cover event actions and context representations. Few features were selected overall. For FCC-T, event action and temporal information received the greatest attention. There is a notable absence of document-level features, which we attribute to the fact that document similarity is not of prime importance in this corpus.

Masking of Event Components
We want to analyze the impact of each type of event component at test time. To do so, we create variants of the test data where the spans of certain event mentions are masked. We then predict with the models trained in the in-dataset scenario (Section 6) and measure the score delta. When masking mention spans, we replace each token with a unique dummy token. 16 This is to ensure that the string similarity between two mentions is entirely random. For action components, we replace all gold annotated mention spans. For participant, time, and location components, we replace gold annotations as well as any additional entities identified by DBpedia Spotlight. Masked spans are also removed from semantic role arguments. For the FCC-T and GVC corpora, we additionally mask the document publication date. This masking approach is not without limitations. In FCC-T and GVC, only a subset of all events was annotated. For participants and actions, the three corpora annotate only the head of the phrase. Both these cases may lead to information-bearing tokens leaking into the masked dataset. We nevertheless believe that our approach is effective for analyzing the impact of specific event components at test time.
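The token replacement can be sketched as follows. This is a minimal sketch; in a full corpus the counter would be shared across all documents so that every masked token is globally unique.

```python
import itertools

def mask_spans(tokens, spans):
    """Replace every token inside an annotated span with a unique dummy
    token, so string overlap between masked mentions is chance-level.

    spans: list of (start, end) token offsets, end exclusive.
    """
    counter = itertools.count()
    masked = list(tokens)
    for start, end in spans:
        for i in range(start, end):
            masked[i] = f"MASK{next(counter)}"
    return masked
```

Because the two occurrences of "scored" below receive different dummy tokens, string-matching features can no longer link them.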
Table 10
Impact on test performance when masking spans of certain event components. We report the score deltas of the feature-based system with respect to the scores from Table 5.

The results are shown in Table 10. On the ECB+ corpus, masking event actions has the strongest impact on performance. This is to be expected, since the majority of features used by the model are action-related. For FCC-T, masking intensifies the issue of cluster overmerging. Between action, participants, and time, the drop in LEA F1 is comparable. When interpreting the effects of masking on FCC-T, it is important to keep in mind that all events annotated in this corpus are planned events whose time, location, and (to some extent) participants are known in advance. This increases the frequency with which these event components are mentioned in text and, on the flip side, should cause stronger losses in performance compared to ECB+ and GVC, which contain a smaller proportion of planned events. The fact that scores drop for FCC-T only when both the document publication date and temporal expressions are removed indicates that the document publication date was not used for grounding temporal expressions in text. For GVC, action and time stand out as the event components with the highest impact, whereas location and participant information barely contribute to the results. With regard to temporal information, we were surprised to see that temporal expressions in text carry more information than the document publication date. Manual inspection revealed that in sentences like "Two-year-old child shot in the chest in Palm Harbor," the entity linker would frequently misclassify "Two-year-old" as a temporal expression instead of a person entity. A portion of the performance loss from masking time expressions may therefore arise from masked participants.

Summary on the Signals for Event Coreference
Answering our initial question, whether established CDCR corpora match the CDCR task definition in that they require inference over each event component, we conclude that this is not the case.
Our experiments demonstrate that CDCR decisions in ECB+ are strongly driven by action mentions. GVC, designed to overcome this shortcoming of ECB+, necessitates inference over action mentions, time, and, to some level, participants, but does not challenge systems on spatial inference. FCC-T is the most balanced of the three corpora, based on the facts that at training time, feature selection yielded a broad selection of features and, at test time, performance decreases similarly when action, participant, or time information is removed. Overall, none of the corpora requires inference over all four event components which define the cross-document event coreference relation.
This indicates that CDCR systems that focus on solving a single corpus model only a subset of the entire CDCR task, which severely limits their downstream use on unseen data, as this data may reflect the task differently from what was observed at training time. The findings further raise the question to which degree it is possible to resolve cross-document event coreference in the three corpora with a single model, as this would be the most desirable usage scenario for applying CDCR downstream. We address this question in the following section.

Generalizability of Trained CDCR Models
All preceding experiments in this work have addressed the ECB+, FCC-T, and GVC corpora in isolation, training a separate model per corpus. In downstream application scenarios, such a differentiation is not possible: here, a CDCR system is expected to resolve cross-document event coreference in a robust manner regardless of the selection of topics or the underlying structure of a given collection of documents. To gain insights into the performance to expect in such a scenario, we test models of the feature-based system in a cross-dataset transfer scenario on unseen corpora. 17 Furthermore, recent research on the question answering (QA) task has shown that training systems jointly on multiple datasets can improve model robustness and boost performance (Fisch et al. 2019; Talmor and Berant 2019; Guo et al. 2021). In this work, we have established compatibility between the ECB+, FCC-T, and GVC corpora and have identified the different ways in which each corpus models event coreference. We test whether benefits similar to those observed for question answering are possible for CDCR by training the feature-based system on multiple CDCR corpora.

Experiment Setup
We use the same splits for all corpora as in previous experiments. In ECB+, a number of topics cover sports news or news related to gun violence. We refer to these corpus subsets as ECB+ sports and ECB+ guns , respectively, and treat them separately in our experiments. Their contents are shown in Table 11. The ECB+ test split remains unchanged.
When combining two corpora, we use the union of features previously selected during feature selection. The increase in training data leads to an increase in mention pairs, which prolongs the training process. We therefore optimize the hyperparameters of the mention pair classifier and the agglomerative clustering step for three days each.

Results
The results of our experiments, evaluated with LEA, are shown in Table 12. As we have shown in preceding sections, the requirements for resolving cross-document event coreference vary between the corpora, which strongly influences the feature selection and model training processes. We hence expected models trained on a single corpus to perform poorly when evaluated on unseen corpora. This is confirmed by the top rows of Table 12, where significant gaps between in-dataset and cross-dataset performance can be observed. When looking at the performance on individual corpora, models trained on multiple corpora perform consistently worse than those trained on a single corpus, with few exceptions (mixing FCC-T with ECB+ sports or GVC during training leads to more balanced LEA recall and precision). However, the best overall result in terms of the whole task (i.e., across all corpora) was achieved with joint training: The model trained on FCC-T and GVC scores 40.9 mean LEA F1 over all corpora, whereas the best single-corpus model trained on GVC alone only reached 36.4. In conclusion, training on multiple corpora did not boost performance on individual corpora. Nevertheless, joint training on multiple corpora has emerged as an important strategy for reaching general CDCR systems.

17 We did not test BA2019 in this scenario because of the scalability issues reported in Section 6.4.

We have only scratched the surface of joint training for CDCR. Further improvements may be achieved with more sophisticated training approaches, for example, by mixing together different amounts of each corpus (potentially aiming for certain distributions of coreference link types), by testing the effects of joint training on CDCR systems beyond mention pair approaches, or by performing training data augmentation with data from other NLP tasks.

Discussion
Despite its importance for downstream applications, the generalizability of CDCR systems over different corpora has not received attention in the past.
Our experiments showed that a system achieving state-of-the-art-level performance on ECB+ does not consistently produce results of the same quality when trained and tested on other CDCR corpora (see Section 6.4). This raises the suspicion that similar systems developed for a single corpus lack the capacity to generalize to unseen corpora. This suspicion is substantiated by the results of our cross-dataset experiments: Training a general, feature-based CDCR system on a single corpus yields good results on the test split of that corpus, whereas performance on the other corpora falls short of these results (see Section 8).
This is because the ECB+, FCC-T, and GVC corpora test systems on different, yet equally important, parts of the overall task of performing CDCR in news text (cf. our requirements posed in Section 2.2). Beyond established knowledge, such as ECB+ testing systems on a greater number of topics while offering low variation in event instances, we found that:
• The distribution of coreference links in each corpus varies significantly, with GVC offering roughly the same number of within-document and within-subtopic links, whereas FCC-T offers many cross-subtopic links (see Table 1).
• Structural differences between corpora (such as the number of subtopics or mentions) can pose a problem for established CDCR techniques such as document preclustering and lead to performance drops (see Section 6.4.2), and expose or amplify scalability issues in systems (see Section 6.4.3).
• Between the corpora, the relevance of the four event components (action, participants, time, location) for resolving cross-document event coreference varies strongly. In particular, ECB+ stands out for requiring inference over event actions almost exclusively (see Section 7).
This means that by designing a system against a single corpus, significant aspects of CDCR are disregarded. Doing so introduces a bias toward corpora with specific properties, which severely limits a system's usefulness for downstream applications on data that exhibits different properties. Therefore, when claiming that a system is capable of resolving cross-document event coreference in the general case, it is imperative to report its performance on multiple CDCR corpora to certify its robustness to all aspects of CDCR annotated therein.
Related to this finding is the recent trend of ECB+ systems applying document preclustering prior to mention-level event coreference resolution, which deserves special attention. Throughout this work, we have pointed out that by preclustering documents via TF-IDF, one can reproduce the subtopics of a corpus. At test time, this yields an increase in precision for the resolution of within-document and within-subtopic links but has the downside of precluding the resolution of cross-subtopic or cross-topic links. This downside is negligible on ECB+ because in this corpus, within-document and within-subtopic links outnumber cross-subtopic links by a factor of 100 (see Table 1). Many recent ECB+ systems apply preclustering (Lee et al. 2012; Barhom et al. 2019; Cremisini and Finlayson 2020; Meged et al. 2020), yet scores drop sharply when such a system is applied out-of-the-box on a corpus with a different distribution of coreference links (see Table 8). The performance boost from document preclustering therefore comes at the cost of an overspecialization on the coreference link distribution in ECB+, which we consider to be a form of overfitting. This is again a strong point for the evaluation of CDCR systems on multiple corpora, by which this (inadvertent) oversight in existing systems can be revealed.
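To make the preclustering step concrete, the following is a minimal, pure-Python sketch: a toy TF-IDF computation plus greedy single-link merging under a hypothetical similarity threshold. The cited systems typically use library implementations (e.g., K-means or agglomerative clustering over TF-IDF vectors); the threshold value and the greedy merge strategy here are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors (dicts) for tokenized documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def precluster(docs, threshold=0.2):
    """Greedy single-link preclustering: a document joins the first
    cluster containing a member whose TF-IDF cosine similarity to it
    exceeds the threshold, otherwise it starts a new cluster."""
    vecs = tfidf_vectors(docs)
    clusters = []  # list of lists of document indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if any(cosine(v, vecs[j]) > threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

On a corpus like ECB+, clusters recovered this way approximate the subtopics, which is precisely why cross-subtopic links become unresolvable afterwards: they connect mentions in different preclusters.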

Evaluation Recommendations
We summarize our findings with respect to the evaluation of CDCR with four recommendations for future research that pave the way for more general, comparable, and reliably evaluated CDCR systems.
1. CDCR systems should be tested on more than one corpus. The ECB+, FCC-T, and GVC corpora are each unique with respect to their topic structure, selection of topics, and distribution of event coreference links, all of which can have an impact on the performance of a CDCR system. Furthermore, the importance of action, participants, time, and location information varies in each corpus. For systems seeking to solve the CDCR task in general (i.e., where the application scenario does not necessitate the choice of one domain-specific corpus), this calls for joint evaluation on multiple CDCR corpora to reveal a system's strengths and weaknesses. Where possible, the performance for each link type (within-document, within-subtopic, etc.) should be reported.
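Reporting scores per link type presupposes classifying each link by the position of its two mentions in the corpus hierarchy. A minimal sketch, assuming mentions carry (topic, subtopic, document) identifiers (the triple format is hypothetical):

```python
def link_type(m1, m2):
    """Classify a coreference link by mention distance in the
    topic > subtopic > document hierarchy. Mentions are
    (topic, subtopic, document) identifier triples."""
    t1, s1, d1 = m1
    t2, s2, d2 = m2
    if d1 == d2:
        return "within-document"
    if s1 == s2:
        return "within-subtopic"
    if t1 == t2:
        return "cross-subtopic"
    return "cross-topic"
```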

2. The LEA evaluation metric (Moosavi and Strube 2016) should be used as an additional performance indicator for CDCR.
This metric was previously shown to be more discriminative than previous coreference resolution metrics such as CoNLL F1 and takes size differences between clusters into account. We showed that cluster sizes in CDCR corpora vary significantly and observed score deltas between CoNLL F1 and LEA F1 of up to 20 pp, which motivates its use for CDCR over CoNLL F1.
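For illustration, a compact implementation of LEA over gold and system clusters follows. This sketch omits the singleton self-link extension from the original metric (singleton clusters are simply skipped), so it is a simplification, not a reference implementation.

```python
def _links(size):
    """Number of coreference links within a cluster of the given size."""
    return size * (size - 1) // 2

def lea(key, response):
    """Simplified LEA (Moosavi and Strube 2016): each key cluster
    contributes its size as importance, weighted by the fraction of its
    links that are resolved in the response. Clusters are sets of
    mention identifiers; singletons are ignored in this sketch."""
    def score(sys_a, sys_b):
        num = den = 0.0
        for k in sys_a:
            if len(k) < 2:
                continue  # singleton self-links omitted here
            resolved = sum(_links(len(k & r)) for r in sys_b)
            num += len(k) * resolved / _links(len(k))
            den += len(k)
        return num / den if den else 0.0

    recall = score(key, response)
    precision = score(response, key)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```

Because every cluster's contribution is weighted by its size and every internal link counts, splitting a large gold cluster is penalized far more heavily than under link-agnostic metrics, which is the behavior that makes LEA discriminative for CDCR.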
3. In addition to a blind prediction, system performance should be reported when using gold document clusters. Cremisini and Finlayson (2020) request that future system development efforts report scores with and without document clustering. We agree with this suggestion and refine it further. Given that in the research area of within-document coreference it is commonplace to report separate scores for mention identification and coreference resolution, we think it makes sense for CDCR to likewise distinguish scores obtained with and without knowledge of the gold corpus structure. Researchers must take care to define the gold document clusters based on the transitive closure of event coreference links (see Section 6.4). Using the topics or subtopics of a corpus for this purpose produces incorrect results because cross-subtopic (or cross-topic) coreference links in the corpus are not taken into account.
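Deriving gold document clusters from the transitive closure can be sketched with union-find. The input format assumed here, pairs of document IDs connected by at least one event coreference link, is an illustration, not the paper's data format:

```python
def gold_document_clusters(links):
    """Compute gold document clusters as the transitive closure of
    cross-document event coreference links, using union-find.
    `links` is an iterable of (doc_a, doc_b) ID pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    clusters = {}
    for doc in parent:
        clusters.setdefault(find(doc), set()).add(doc)
    return sorted(map(sorted, clusters.values()))
```

Note how a chain of links across subtopics (d1–d2, d2–d3) merges all three documents into one gold cluster, which is exactly what partitioning by subtopic would miss.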
4. Event detection performance should be evaluated carefully on CDCR corpora. From the point of view of a general-purpose event mention detection system, the event mention annotations in the ECB+, FCC-T, and GVC corpora are incomplete by design (see Section 2.4.3). Care must be taken not to unfairly penalize a system that includes a mention detection step, as it may detect valid event mentions for which no gold annotation exists. As a remedy, we recommend computing mention detection performance only on those sentences that contain gold event mention annotations, and reporting coreference resolution performance using the gold event mention annotations.

Future Work
Having established that evaluation on multiple of the currently available corpora is necessary for a reliable performance assessment of CDCR systems, we consider the development of approaches that show consistent performance in such a scenario to be the next short- to medium-term goal for this task.
A key challenge will be achieving systems that scale to collections of 10k-100k documents without precluding the resolution of cross-subtopic and cross-topic links. A foundation has already been laid by Kenyon-Dean, Cheung, and Precup (2018), who investigated scalable representation learning approaches for CDCR. Since current corpora consist of fewer than 1k documents, this may require the annotation of additional corpora. To keep the costs of annotating corpora of such magnitude manageable, novel semi-automatic annotation techniques would be required. Furthermore, the concept of cross-topic event coreference links has not been investigated yet due to a lack of annotated data. Once sufficient robustness and/or scalability has been achieved, use cases for downstream applications of CDCR could be investigated.

Conclusion
The usefulness of cross-document event coreference resolution for downstream multidocument NLP tasks has not been demonstrated yet. To perform well on unseen data in general, NLP systems need to robustly handle variations in the data they are applied on. For CDCR, multiple corpora with varying properties have been annotated, yet each CDCR system to date was developed, trained, and evaluated on only a single one of them. Besides hurting comparability, this currently allows few conclusions to be drawn on their robustness and generalizability, which contributes to the initially stated problem. We addressed this situation in several ways: We eliminated the remaining hurdles that rendered joint training and evaluation on multiple CDCR corpora difficult by creating FCC-T, a reannotation and extension of the Football Coreference Corpus (FCC) on token level.
To identify the unique properties of each corpus for resolving event coreference in practice, we developed a mention pair CDCR system with a broad set of handcrafted features and applied it on the EventCorefBank+ (ECB+), FCC-T, and Gun Violence Corpus (GVC) corpora. Using this system, we found that only a subset of the components by which events are commonly defined (action, participants, time, location) is required for resolving CDCR in each corpus in practice. In particular, ECB+ focuses almost exclusively on resolving event actions, whereas GVC and FCC-T are more balanced and additionally demand interpretation of event dates and participants. Link-level analysis of this system revealed that mention distance (with respect to the topic-subtopic-document hierarchy of a corpus) positively correlates with difficulty in resolving cross-document event coreference links.
In the first uniform evaluation scenario involving multiple CDCR systems and corpora, we compared the neural ECB+ system of Barhom et al. (2019) to the feature-based system. First, we found that the neural system performs well on GVC but is outperformed by the conceptually simpler mention pair approach on FCC-T. Second, we deduced from these experiments that systems that are developed for ECB+ and that apply document preclustering overfit to the link distribution in this corpus.
In brief experiments with joint training on multiple corpora, we achieve a combined LEA F1 of 40.9 across all three corpora with the feature-based system, over 4.5 pp better than the same system trained on either corpus in isolation.
We offered four recommendations for future research on CDCR. Most importantly, we advocate evaluation on multiple corpora after having provided conclusive evidence that evaluating on a single corpus is and was insufficient.
All in all, with our annotation effort, corpus analyses, experiments, and open source implementation, we have laid a solid foundation for future research on robust and general CDCR systems. Achieving such systems then constitutes a big step forward toward CDCR becoming an integral part of the multidocument NLP pipeline.

Appendix A: Full In-Dataset CDCR Results
String distance. We compare the surface forms and lemmas of two action mentions for identity and compute Levenshtein and MLIPNS (Shannaq and Alexandrov 2010) distances as lexical and phonetic distances.

TF-IDF similarity
• document-similarity
• surrounding-sentence-similarity
• sentence-context-similarity
We fit TF-IDF vectors on all given documents. We then compute the cosine similarity between the TF-IDF vectors of text regions belonging to the two mentions of a pair. As regions we use (1) the full document, (2) the sentence surrounding the event mention, and (3) a window of 5 sentences surrounding the mention (i.e., its context).

Sentence embedding similarity
• surrounding-sentence
• doc-start
We compute the cosine similarity between sentence representations of a sentence pair originating from a mention pair. We compare the sentences surrounding each event mention and the first sentence of each mention's document. Sentence representations are computed with the Sentence-BERT framework (Reimers and Gurevych 2019), using the pretrained distilbert-base-nli-stsb-mean-tokens model.

Action mention embedding similarity

• action-mention
We compute a contextualized span representation of the action mention of each event mention and compute the cosine similarity of these representations for each mention pair. Span representations are created from the pretrained SpanBERT large model (Joshi et al. 2020) using a window of five sentences surrounding each event mention.
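The span-similarity feature boils down to pooling token vectors into a span representation and comparing the pooled vectors. The sketch below substitutes placeholder token vectors and simple mean pooling for the actual SpanBERT representations, so it illustrates the feature computation, not the model:

```python
import math

def mean_pool(token_vectors):
    """Mean-pool a list of token vectors into one span representation."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def span_similarity(span_a, span_b):
    """Mention pair feature: cosine between pooled representations of the
    two action-mention spans. In the actual system, token vectors come
    from SpanBERT over a 5-sentence window; here they are placeholders."""
    return cosine(mean_pool(span_a), mean_pool(span_b))
```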
We obtain a location for each mention in five ways: (1) document-level, where we pick the first entity-linked location in a document; (2) SRL-level, where we use semantic role labeling (SRL) to find the linked location expression attached to the mention action; (3) sentence-level, where we use the location expression closest to the mention action in the same sentence; (4) closest-sentence-level, where we use the closest preceding location expression from all previous sentences; and (5) a combination that applies (2), (3), and (4) in order until a location expression is found. For each location pair, we compute distances in two ways: (1) We compute the geodesic distance between the coordinates of both locations.
(2) For each location, we follow the subdivision and country relations in DBpedia upwards (from more specific to less specific locations) to find a match between the two locations; the earlier a match is found, the more similar the two locations are considered.
We obtain Wikidata QIDs for each DBpedia entity and map these to pretrained embeddings from PyTorch-BigGraph (Lerer et al. 2019). For each mention in a mention pair, we look up the vectors of (1) the linked action mention (where available), (2) all event components, (3) all linked entities in the surrounding sentence, (4) all linked entities in a 5-sentence window around the mention, and (5) all linked entities in the first three document sentences. Between each of these groups, we compute the pairwise cosine similarity between all vectors and retain the mean, variance, minimum, and maximum similarity, respectively.
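The aggregation step at the end of this paragraph can be sketched directly; the input vectors (pretrained PyTorch-BigGraph embeddings in the actual system) are stand-ins here:

```python
import math
import statistics

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def group_similarity_features(group_a, group_b):
    """Aggregate all pairwise cosine similarities between two groups of
    entity embeddings into a (mean, variance, min, max) feature tuple."""
    sims = [cosine(u, v) for u in group_a for v in group_b]
    return (statistics.mean(sims),
            statistics.pvariance(sims),
            min(sims),
            max(sims))
```

Retaining all four statistics rather than a single score lets the mention pair classifier distinguish a group pair with uniformly moderate similarity from one with a single strong match among unrelated entities.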

Appendix F: ECB+ Publication Date Annotation
The document publication date is an important piece of information for grounding temporal expressions, particularly in news text. ECB+ is the only CDCR corpus covered in this work for which document publication dates were not annotated. Source URLs of the corpus articles, from which the publication date could have been extracted automatically, are only provided for the ECB+ subtopics added at a later point by Cybulska and Vossen (2014b); for the initial EventCorefBank (ECB), no URLs are present. We therefore manually annotated the missing dates by inspecting the first few sentences of each document. Table F.1 displays the number of documents for which the publication date could be identified. Unfortunately, no publication dates were found in the ECB half of the corpus. In light of these results, we decided against using these annotations in our experiments, since they may have given systems an unfair advantage in deciding whether a document belongs to one of the ECB or ECB+ subtopics. We nevertheless release our annotations in the hope that they will be useful for future research.

Appendix G: Hyperparameter Optimization Procedure
We describe our approach for optimizing the hyperparameters of the feature-based system.
We apply repeated k-fold cross-validation to obtain reliable results. To define folds, we first partition the documents based on their topic (ECB+) or subtopic (FCC-T, GVC). From these partitioned document sets, folds are created, based on which we generate mention pairs. Compared with the naïve approach of creating folds from all possible mention pairs of a corpus split, this approach guarantees that each mention in the respective test fold is unseen, as opposed to just the mention pair being unseen (with the two constituent mentions likely having been seen at training time), which provides a more faithful testing scenario. By using topics or subtopics for partitioning, we ensure a high number of coreferring pairs in the folds and guide the hyperparameter search toward models that should generalize better across topics or subtopics. The optimization algorithm is shown in Figure G.1. We use the optuna framework (Akiba et al. 2019) to sample increasingly optimal sets of hyperparameters and use a configurable maximum duration as the stopping criterion. Optimization of the agglomerative clustering step is performed similarly, the difference being that line 12 generates a test clustering and line 13 uses the LEA F1 metric instead of F1 for binary classification.
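The topic-grouped fold construction can be sketched as follows. The greedy size balancing (assigning the largest topic groups first to the currently smallest fold) is an assumption of this sketch, not necessarily the exact assignment used:

```python
def grouped_folds(documents, group_of, n_folds=5):
    """Partition documents into folds so that all documents sharing a
    topic (or subtopic) land in the same fold, guaranteeing that every
    mention in a test fold is unseen at training time. `group_of` maps
    a document to its topic/subtopic key."""
    groups = {}
    for doc in documents:
        groups.setdefault(group_of(doc), []).append(doc)
    folds = [[] for _ in range(n_folds)]
    # Greedily place the largest groups first to balance fold sizes.
    for _, docs in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(docs)
    return folds
```

Keeping each topic intact in one fold is what prevents the leakage that occurs when mention pairs are split naïvely: a mention seen in a training pair would otherwise reappear in a test pair.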

Appendix H: Mention Pair Generation at Training Time
Approach. We explain our approach for determining the number of coreferring mention pairs to randomly sample for each event during training. Given a set of events E = {e_1, e_2, . . .} and the function m : E → N providing the number of mentions of an event (i.e., the cluster size), we define pairs_coref : E → N, the number of coreferring pairs to sample for an event, based on the empirical CDF of cluster sizes in the given data split. For the event e_largest with the most mentions in the split, cdf(m(e_largest)) = 1 and therefore pairs_coref(e_largest) = (m(e_largest) − 1) · c. Hence, c controls the number of coreferring mention pairs sampled from large clusters. The number of pairs to sample transitions smoothly from linear to quadratic the smaller a cluster is with respect to the overall distribution of cluster sizes in the dataset.
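A sketch of this sampling scheme follows. The specific interpolation used here, a CDF-weighted blend between the quadratic all-pairs count m(m−1)/2 and the linear count (m−1)·c, is an assumption of this sketch; the text only guarantees the endpoint behavior (the largest cluster yields (m(e_largest) − 1) · c pairs) and a smooth linear-to-quadratic transition.

```python
import random
from bisect import bisect_right
from itertools import combinations

def sample_coref_pairs(cluster_sizes, c=8, seed=0):
    """Sample coreferring mention pairs per event cluster. The target
    count blends all m(m-1)/2 pairs (small clusters) with (m-1)*c pairs
    (largest cluster), weighted by the empirical CDF of cluster sizes.
    NOTE: this blend is a hypothetical instantiation, not the paper's
    exact formula."""
    rng = random.Random(seed)
    sizes = sorted(cluster_sizes)

    def cdf(m):
        return bisect_right(sizes, m) / len(sizes)

    sampled = {}
    for i, m in enumerate(cluster_sizes):
        all_pairs = m * (m - 1) // 2
        linear = (m - 1) * c
        target = min(round((1 - cdf(m)) * all_pairs + cdf(m) * linear),
                     all_pairs)
        pairs = list(combinations(range(m), 2))
        sampled[i] = rng.sample(pairs, target)
    return sampled
```

For the largest cluster the count is exactly (m − 1) · c (capped at the number of available pairs), so large, redundant clusters contribute only linearly many training pairs.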
Having sampled coreferring pairs for each event, we determine the coreference link type of each pair (within-document, within-subtopic, etc.). For each link type, the maximum number of non-coreferring pairs generated is k times the number of generated coreferring pairs. We ensure that the number of non-coreferring pairs increases per link type (so that within-document < within-subtopic < . . .).
For each (c, k) combination, we compute precision, recall, and F1 for each type of coreference link (using the mean of five independent trials to mitigate noise). We further aggregate these results by computing the macro-average over the four link types to produce one precision/recall/F1 value each per (c, k) tuple, shown in Figure H.1.
As visualized by the plots, k controls the precision/recall tradeoff, with higher k (a larger proportion of non-coreferring training pairs) leading to high precision but low recall. The choice of c has a smaller impact on performance. Overall, considering F1 scores, the number of coreferring mention pairs generated from large clusters can be reduced significantly (with c chosen as low as 2^−3) without loss in performance, unless many non-coreferring pairs are used (high k). This indicates that, for ECB+, there is little benefit in generating all possible coreferring mention pairs for training, and that achieving a broad selection of mention pairs from many different events is more important. Based on these results, and taking into account the distribution of cluster sizes in each corpus (see Figure 2), we chose (c = 8, k = 8) for ECB+ and the similarly distributed GVC in the main experiments. For experiments involving FCC-T, we chose (c = 2, k = 8) to reduce the impact of its few large, mostly redundant clusters on training.