Abstract
In contrast to identity anaphors, which indicate coreference between a noun phrase and its antecedent, bridging anaphors link to their antecedent(s) via lexico-semantic, frame, or encyclopedic relations. Bridging resolution involves recognizing bridging anaphors and finding links to antecedents. Unlike most prior work, we tackle both problems. Our work also adopts a broader definition of bridging than most previous work, imposing no restrictions on the type of bridging anaphora or on the relations between anaphor and antecedent.
We create a corpus (ISNotes) annotated for information status (IS), bridging being one of the IS subcategories. The annotations are highly reliable overall, with marginal reliability for the bridging subcategory. We use a two-stage statistical global inference method for bridging resolution. Given all mentions in a document, the first stage, bridging anaphora recognition, recognizes bridging anaphors as a subtask of learning fine-grained IS. We use a cascading collective classification method where (i) collective classification allows us to investigate relations among several mentions and autocorrelation among IS classes and (ii) cascaded classification allows us to tackle class imbalance, which is important for minority classes such as bridging. We show that our method outperforms current methods both for overall IS recognition and for bridging specifically. The second stage, bridging antecedent selection, finds the antecedents for all predicted bridging anaphors. We investigate the phenomenon of semantically or syntactically related bridging anaphors that share the same antecedent, which we call sibling anaphors. We show that taking sibling anaphors into account in a joint inference model improves antecedent selection performance. In addition, we develop semantic and salience features for antecedent selection and propose a novel method for building the candidate antecedent list of an anaphor based on its discourse scope. Our model significantly outperforms previous work.
1. Introduction
An anaphor is an expression whose interpretation depends upon a previous expression in the discourse (the antecedent). Figure 1 shows an excerpt of a news article with three anaphoric references: “its” is a pronominal anaphor referring back to the antecedent “The business,” which itself refers back to “The Bakersfield Supermarket.” Both anaphors refer to the same entity as their antecedents. In contrast, the bridging anaphor “friends” does not refer to the same entity as its antecedent “its owner.” The phenomena illustrated in (1) and (2) have attracted a lot of interest under the heading of coreference resolution (Hobbs 1978; Hirschman and Chinchor 1997; Soon, Ng, and Lim 2001; Lee et al. 2013, 2017, inter alia). This article, however, focuses on the phenomenon illustrated in (3), known as bridging (Clark 1975) or associative anaphora (Hawkins 1978). Bridging anaphors are anaphoric noun phrases that are not coreferent with but instead linked via associative relations to their antecedents.
Bridging plays an important role in establishing entity coherence in a text. Barzilay and Lapata (2008) model local coherence with the entity grid based on coreference only. However, Example (1) does not exhibit any coreferential entity coherence, and therefore entity coherence can only be established when bridging is resolved. Furthermore, text understanding applications such as textual entailment (Mirkin, Dagan, and Padó 2010), context question answering (Voorhees 2001), and opinion mining (Kobayashi, Inui, and Matsumoto 2007) have been shown to benefit from bridging resolution.
The main contributions presented in this article lie in the following aspects:
1. We present an English corpus (ISNotes) annotated for a wide range of information status (IS) categories as well as full anaphoric information for three anaphora types (coreference, bridging, and comparative; Section 3). Importantly, we impose no syntactic or relational restrictions on bridging—that is, bridging anaphora are not limited to definite noun phrases as in most previous work; antecedents can be noun phrases, verb phrases, or even clauses; and bridging relations are not restricted to meronymy. We show that bridging anaphora are very diverse. The overall annotation scheme is highly reliable, with the bridging category reaching marginal reliability.2 The corpus is available as an OntoNotes annotation layer via http://www.h-its.org/en/research/nlp/isnotes-corpus/.
2. We model bridging anaphora recognition as a subtask of learning fine-grained information status (Section 4). We integrate discourse structure, lexico-semantic, and genericity detection features into a cascading collective classification algorithm. Collective classification investigates relational autocorrelation among several IS classes, whereas cascading classification addresses the multi-class imbalance problem, in particular the relative rarity of bridging compared to many other IS classes. Our model combines these two advantages by using binary classifiers for minority categories and a collective classifier for all categories. It beats current models both in overall IS classification accuracy and in bridging anaphora recognition on ISNotes.
3. We explore a joint inference framework for bridging antecedent selection (Section 5). This model expresses an interesting topological property of bridging not used before—namely, that semantically or syntactically related anaphors are likely to share the same antecedent (such as The windows and walls in Example (1)). In addition, we develop semantic, syntactic, and salience features based on linguistic insights and present a novel method for constructing candidate antecedent lists according to the anaphor’s discourse scope. Our model significantly outperforms prior work.
4. Finally, we evaluate bridging resolution as a pipeline consisting of bridging recognition and antecedent selection (Section 6). This is the first full bridging resolution system that tackles the unrestricted phenomenon in a realistic setting.
All our experiments are performed on ISNotes and therefore all our claims hold only for the news genre. Although we believe the benefit of joint optimization to hold across other genres, several of our features are optimized for that particular corpus and therefore our figures indicate the best possible performance of our approach. The adaptation to other corpora will likely need additional fine-tuning.3
Connection to previous conference publications. This article synthesizes Markert, Hou, and Strube (2012) and Hou et al. (2013a, 2013b). It provides more technical details, error analyses, and also includes the following new aspects. For the corpus, we now include a detailed analysis of our bridging cases (Section 3.3). In bridging recognition, we now use Markov Logic Networks instead of iterative collective classification to unify the approaches to the two tasks.4 With regard to antecedent selection, we introduce several new features as well as the notion of using the discourse scope of an anaphor to adjust the set of potential antecedents it can refer back to (Section 5.3). We also now consider different evaluation paradigms dependent on whether one has access to full coreference information prior to bridging antecedent selection (mention-entity model) or not (mention-mention model), whereas before we only considered the mention-entity model (Section 5.4.4). Finally, we include a pipeline of the two models for bridging recognition and antecedent selection to evaluate performance of the full task (Section 6).5
2. Related Work
We first review theoretical studies related to bridging (Section 2.1) before discussing corpus studies in Section 2.2. Section 2.3 reviews automatic algorithms for bridging resolution and Section 2.4 discusses bridging and implicit semantic role labeling.
2.1. Bridging: Theoretical Studies
Theoretical studies on bridging include linguistic (Hawkins 1978; Prince 1981, 1992), psycholinguistic (Clark 1975; Clark and Haviland 1977; Garrod and Sanford 1982), pragmatic and cognitive (Erkü and Gundel 1987; Gundel, Hedberg, and Zacharski 2000; Matsui 2000; Schwarz 2000), as well as formal accounts (Hobbs et al. 1993; Bos, Buitelaar, and Mineur 1995; Asher and Lascarides 1998; Löbner 1998; Cimiano 2006; Irmer 2009).
Our concept of bridging is closest to the notions of associative anaphora in Hawkins (1978) and (noncontained) inferrables in Prince (1981): noun phrases (NPs) that are not coreferent to a previous mention but the referent of which is identifiable via a lexico-semantic, frame, or encyclopedic relation to a previous mention, with this relation not being syntactically expressed.
Relation types used are very diverse and antecedents can be noun phrases, verb phrases, or even whole sentences (Clark 1975; Asher and Lascarides 1998, inter alia). Several studies, such as Hawkins (1978) and Löbner (1998), limit bridging to definite NPs; we, however, believe that there is no clear difference in information status between the windows, on the one hand, and walls, on the other hand, in Example (1).6
Our bridging notion differs from Clark (1975) in that we do not include coreferential cases: We believe coreference is different both from an IS viewpoint (always being discourse-old) as well as from a computational perspective in that coreference needs different methods to resolve than bridging.
2.2. Bridging: Corpus Studies
Fraurud (1990) annotated first-mentioned NPs (which included bridging) versus subsequent mention NPs. Thirty-six percent of first-mentioned definite NPs have interpretations that “appear to involve a relation to contextual elements outside the definite NP itself” (Fraurud 1990, page 406), similar to our bridging definition.
The Vieira/Poesio data set (Poesio and Vieira 1998) contains 150 anaphoric definite NPs without a head match to their antecedents. These cases include what we call bridging as well as coreferential NPs without the same head. We will call this definition of bridging lenient bridging from now on. The corpus was used later to develop computational models (Section 2.3). In a second experiment, the authors delimited bridging proper from coreferential cases with very low agreement (31% per-class agreement).
Similarly, bridging recognition proved difficult for annotators of the GNOME corpus (Poesio 2004), where only 22% of bridging references were annotated in the same way by both annotators, although bridging relations were limited to set membership, subset, and generalized possession (part-of and ownership relations).
Nissim et al. (2004) is the first large-scale annotation study for IS for English. Based on Prince (1992) and Eckert and Strube (2000), they annotated NP types with three main categories: an old entity is known to the hearer and has been mentioned in the conversation; a new entity is unknown to the hearer and has not been previously referred to; a mediated entity is newly mentioned in the dialogue but is inferrable from previously mentioned entities, or generally known to the hearer. Four of the nine subtypes of the mediated category (part, set, situation, and event) include bridging instances. Nissim et al. (2004) reported high agreement for the overall fine-grained IS annotation (with κ = 0.788) on 147 Switchboard dialogues (LDC 1993). The κ scores for the four bridging subtypes are mostly marginally reliable, between 0.594 and 0.794. However, the corpus cannot easily be used for a computational study of bridging anaphora resolution for the following reasons. First, antecedents for bridging NPs are not annotated. Second, the four subcategories used to mark up bridging also contain non-anaphoric cases, such as syntactically linked part-of relations (Example: the house’s door). In addition, any such study would be limited with regard to relation types as several of the bridging cases are only annotated if the relation to the antecedent is part of certain knowledge bases (i.e., part-of relations must be part of WordNet and situation relations part of FrameNet).
The German DIRNDL corpus (Eckart, Riester, and Schweitzer 2012; Björkelund et al. 2014) contains IS annotations for all NPs following the scheme by Riester, Lorenz, and Seemann (2010). Bridging is one IS category but only used for definite expressions. They achieved a kappa score of 0.78 for six top-level categories. However, the confusion matrix in Riester, Lorenz, and Seemann (2010) shows that the anaphoric bridging category is frequently confused with other categories: The two annotators agreed on fewer than a third of bridging anaphors.
These previous corpus studies on bridging differ from ours in several ways. First, the definition of bridging is sometimes extended to include coreferential NPs with lexical variety (Vieira 1998) or non-anaphoric NPs (Nissim et al. 2004). Second, they impose more restrictions on bridging than we do, limiting it to definite NP anaphora (Poesio and Vieira 1998; Gardent and Manuélian 2005; Caselli and Prodanof 2006; Riester, Lorenz, and Seemann 2010), to NP antecedents (all prior work), or to a few relation types between anaphor and antecedent (Poesio 2004). Apart from these differences in the definition of bridging, reliability is often not measured or is low, especially for bridging recognition (Fraurud 1990; Poesio and Vieira 1998; Gardent and Manuélian 2005; Nedoluzhko, Mírovskỳ, and Pajas 2009; Riester, Lorenz, and Seemann 2010).
2.3. Bridging: Computational Approaches
Most computational approaches for resolving bridging focus on antecedent selection. Some handle bridging anaphora recognition when recognizing fine-grained IS. Only a few works tackle full bridging resolution—that is, recognizing bridging anaphors and finding links to antecedents.
Bridging anaphora recognition.
Fine-grained IS classification for Switchboard (Nissim et al. 2004) has been implemented via a combination of rules and a multiclass SVM (Rahman and Ng 2012). F-scores for the four categories that include bridging (part, situation, event, set) ranged from 63.3 to 87.2. These results do not necessarily reflect the real difficulty of the problem, however, because of the restrictions posed on bridging in the underlying annotation and the inclusion of non-anaphoric cases (Section 2.2).
Cahill and Riester (2012) trained a CRF model for fine-grained IS classification on the German DIRNDL radio news corpus (Riester, Lorenz, and Seemann 2010), making use of the assumption that IS classes within sentences tend to follow certain orderings, for example, old > mediated > new. They did not report the result for the bridging subcategory.
An attention-based long short-term memory model with pre-trained word embeddings and simple features achieved competitive results on ISNotes compared to our collective classification approach (Hou 2016).
Bridging antecedent selection.
Based on the Vieira/Poesio data set (Section 2.2), various studies resolved “lenient” definite bridging references. Vieira and Teufel (1997) and Poesio, Vieira, and Teufel (1997) used heuristics for antecedent selection, exploiting WordNet relations such as synonymy/hyponymy/meronymy. Schulte im Walde (1998) used word clustering. The bridging anaphors were resolved to the closest antecedent candidate in a high-dimensional space, the best result being an accuracy of 22.7%.
Poesio et al. (2002) and Markert, Nissim, and Modjeska (2003) acquired mereological knowledge for bridging resolution by using syntactic patterns (such as the NP of NP) on the British National Corpus and the Web, respectively. All of this work was done on small data sets, with test sets numbering only in the tens of bridging cases once coreferential cases are excluded.
Another line of work applied machine learning techniques. The pairwise model in Poesio et al. (2004a) combines lexico-semantic and salience features to resolve mereological bridging in the GNOME corpus. However, their results came from a limited evaluation setting: In the first two experiments they distinguished only between the correct antecedent and one or three false candidates. The more realistic scenario of finding the correct antecedent among all possible candidates was tried for just six bridging anaphors. On the basis of this method, Lassalle and Denis (2011) developed a system that resolves mereological bridging in French, with meronymic information extracted from raw texts using a bootstrapping method. They reported an accuracy of 23% for over 300 meronymy bridging anaphors using the realistic evaluation scenario.
Full bridging resolution.
The rule-based system for processing definite NPs in Vieira and Poesio (2000) includes bridging cases (using the lenient definition of bridging discussed in the previous sections) but they do not report results for the bridging category.
Hahn, Strube, and Markert (1996) distinguish bridging resolution from other anaphora resolution. Their rule-based framework integrates language-independent conceptual criteria and language-dependent functional constraints. Their conceptual criteria were based on a knowledge base from the information technology domain that consists of 449 concepts and 334 relations. They focused on definite bridging anaphora and certain types of relations only (e.g., has-property, has-physical-part). On a small-scale technical domain data set (5 texts in German with 109 bridging anaphors), they achieved a recall of 55.0% and precision of 73.2%. Although the results seem satisfactory, the system is heavily dependent on the domain knowledge resource.
Sasano and Kurohashi (2009) resolved bridging and zero anaphora in Japanese simultaneously, using automatically acquired case frames in a probabilistic model. Although it is not clear how bridging anaphora are distributed in their corpus and whether this approach can be effectively applied to other languages, the lexical knowledge resource constructed is general and can capture diverse bridging relations.
Rösiger and Teufel (2014) extended a coreference resolution system with semantic features from WordNet (e.g., hypernymy, meronymy) to find bridging links in scientific text, considering definite NPs only. They used the CoNLL scorer for evaluation. However, a coreference resolution system and evaluation metric are not suitable for bridging resolution because bridging anaphors, unlike coreferent mentions, do not form equivalence sets.
Discussion.
Our study departs from related work by modeling bridging on the discourse level without limiting it to definite NPs or to certain bridging relations (e.g., part-of). For bridging anaphora recognition, our cascading collective classification model (Section 4) addresses multi-class imbalance while keeping the strength of collective classification. For bridging antecedent selection, our joint inference model (Section 5) integrates bridging resolution with clustering anaphors that share the same antecedent. Furthermore, unlike previous work that uses a sentence window to form the set of antecedent candidates, we propose a method to select antecedent candidates using a flexible notion of discourse scope of an anaphor. The latter makes use of the discourse relation Expansion and models salience.
2.4. Implicit Semantic Role Labeling
Semantic role labeling is the task of assigning semantic roles (such as Agent or Theme) to the semantic arguments associated with a predicate (e.g., a verb or a noun). In frame semantics (Baker, Fillmore, and Lowe 1998), core semantic roles (also called Core Frame Elements) are essential to the meaning of semantic situations while non-core semantic roles (e.g., time, manner) are less central.
Implicit semantic role labeling for nominal predicates tries to fill all implicit core roles of the nominal predicate in question. Yet not every such nominal predicate is a bridging anaphor.
Despite differences between implicit semantic role labeling and bridging resolution, these two tasks can benefit from each other. We explore statistics from NomBank (Meyers et al. 2004) to predict bridging anaphors (Section 4.3.2). Some of our features for bridging antecedent selection are inspired by Laparra and Rigau (2013) (Section 5.2.2).
3. ISNotes: A Corpus for Information Status
ISNotes contains 50 texts from the Wall Street Journal portion of OntoNotes (Weischedel et al. 2011), in which all mentions (10,980 overall) are annotated for IS. The corpus can be downloaded from http://www.h-its.org/en/research/nlp/isnotes-corpus/.
3.1. ISNotes Annotation Scheme
Information status in ISNotes.
Information status describes the degree to which a discourse entity is available to the hearer, with respect to the speaker’s assumptions about the hearer’s knowledge and beliefs (Prince 1992; Nissim et al. 2004). We distinguish eight IS categories, inspired by Nissim et al. (2004), although with some variations.
A mention is old if it is either coreferent with a previous mention (based on the OntoNotes coreference annotation), or if it is a generic or deictic pronoun.
Mediated mentions have not been mentioned before but are inferrable from previously mentioned entities or from general world knowledge. We distinguish six mediated subtypes:

- mediated/worldKnowledge (abbreviated as mediated/WK) mentions are generally known to the hearer. This category includes many proper names, such as Poland.
- mediated/syntactic mentions are syntactically linked via a possessive relation, a proper name premodification, or a prepositional phrase postmodification to other old or mediated mentions, such as:
  - [[their]old liquor store]mediated/syntactic,
  - [the [Federal Reserve]mediated/WK boss]mediated/syntactic, and
  - [the main artery into [San Francisco]mediated/WK]mediated/syntactic
- mediated/comparative mentions are non-coreference anaphors where the anaphor is compared to the antecedent (and where both are therefore often of the same semantic type). They usually include a premodifier or head that makes clear that this entity is compared to a previous one, such as others in Example (3).7,8
- mediated/bridging mentions are non-coreference anaphors where a frame, lexico-semantic, or world knowledge relation holds between anaphor and antecedent, such as the streets in Example (4) and The reason in Example (5).
- mediated/aggregate mentions are coordinated mentions where at least one element in the conjunction is old or mediated, such as [Not only [George Bush]mediated/WK but also [Barack Obama]mediated/WK]mediated/aggregate.
- mediated/function mentions refer to a value of a previously mentioned function (e.g., 3 points in Example (6)). The function needs to be able to rise and fall (e.g., were down in Example (6)).
New mentions have not yet been introduced in the discourse and the entity they refer to cannot be inferred from either previously mentioned entities/events or general world knowledge.
Antecedents for mediated/bridging and mediated/comparative. For these two anaphoric categories, annotators also mark the antecedent(s). Antecedents are not restricted to noun phrases but can also be verb phrases or clauses.
3.2. Agreement Study
An agreement study was carried out among three annotators. Annotator A is the scheme developer and a computational linguist. Annotators B and C have no linguistic training or education. Annotators A and B are fluent English speakers living in English-speaking countries, but are not native speakers. Annotator C is a native speaker of English.
All potential mentions were pre-marked automatically using the WSJ syntactic noun phrase annotation. All non-initial mentions in an OntoNotes coreference chain were pre-marked as old. The annotation task consisted of excluding all non-mentions (such as non-referential it) and marking all mentions for their information status as well as the antecedents for comparative and bridging anaphora. The scheme was developed on nine texts, which were also used for training the annotators. Inter-annotator agreement was measured on 26 new texts, which included 5,905 potential mentions. The annotations of 1,499 of these were carried over from OntoNotes coreference annotation, leaving 4,406 potential mentions for annotation and agreement measurement.
Table 1 (top) shows percentage agreement as well as Cohen’s κ (Artstein and Poesio 2008) between all three possible annotator pairings at the coarse-grained level (four categories: non-mention, old, new, mediated) and the fine-grained level (nine categories: non-mention, old, new, and the six mediated subtypes). As our category distribution is highly unbalanced, reporting Cohen’s kappa is necessary because it corrects for the chance agreement that could be achieved by simply choosing majority categories.11 Table 1 (bottom) shows individual category agreement, computed by merging all categories but one and then computing κ as usual. High reliability is achieved for most individual categories.12 The bridging category is only marginally reliable and more annotator-dependent, although agreement is higher than in previous attempts at bridging annotation (Poesio 2003; Gardent and Manuélian 2005; Riester, Lorenz, and Seemann 2010). The agreement on selecting bridging antecedents is around 80% for all annotator pairings.
| | | A-B | A-C | B-C |
|---|---|---|---|---|
| Overall | Percentage coarse | 87.5 | 86.3 | 86.5 |
| | κ coarse | 77.3 | 75.2 | 74.7 |
| | Percentage fine | 86.6 | 85.3 | 85.7 |
| | κ fine | 80.1 | 77.7 | 77.3 |
| Individual categories | κ non-mention | 81.5 | 78.9 | 86.0 |
| | κ old | 80.5 | 83.2 | 79.3 |
| | κ new | 76.6 | 74.0 | 74.3 |
| | κ mediated/worldKnowledge | 82.1 | 78.4 | 74.1 |
| | κ mediated/syntactic | 88.4 | 87.8 | 87.6 |
| | κ mediated/aggregate | 87.0 | 85.4 | 86.0 |
| | κ mediated/function | 6.0 | 83.2 | 6.9 |
| | κ mediated/comparative | 81.8 | 78.3 | 81.2 |
| | κ mediated/bridging | 70.8 | 60.6 | 62.3 |
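For concreteness, the per-category agreement computation can be sketched as follows; the labels are toy examples, not the actual ISNotes annotations:

```python
from sklearn.metrics import cohen_kappa_score

def category_kappa(labels_a, labels_b, category):
    """Individual-category agreement: merge all categories but one,
    then compute Cohen's kappa on the resulting binary labels."""
    bin_a = [label == category for label in labels_a]
    bin_b = [label == category for label in labels_b]
    return cohen_kappa_score(bin_a, bin_b)

# Toy annotator decisions; the real study uses the 4,406 ISNotes mentions.
ann_a = ["old", "new", "mediated/bridging", "new", "old", "new"]
ann_b = ["old", "mediated/bridging", "mediated/bridging", "new", "old", "new"]
print(category_kappa(ann_a, ann_b, "mediated/bridging"))
```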
We investigated disagreements between Annotators A and B in bridging recognition: Almost all cases are instances where one annotator chose bridging and the other chose new. Particularly frequent were borderline cases where the whole document had one major focus and subsequent NPs with a semantic relation to that focus could be seen either as new (interpretable without the major focus) or as bridging. As an example, consider a document on the company Toyota and a later sentence stating Output had gone down. According to our guidelines, most of these cases are bridging, but they are easily overlooked.
The bridging annotations of the pairing A-B were used to create a consistent gold standard of the 35 texts (9 training, 26 testing) by discussing all items on which the annotators disagreed. Finally, Annotator A annotated a further 15 texts singly.
3.3. Corpus Analysis
IS distribution.
Table 2 shows the class distribution. New mentions are the largest category (36.7%). Syntactic mentions are the largest mediated category.
| | | Count | % |
|---|---|---|---|
| Texts | | 50 | |
| Sentences | | 1,726 | |
| Mentions | | 10,980 | |
| old | | 3,237 | 29.5% |
| | coreferent | 3,143 | 28.6% |
| | generic or deictic pronoun | 94 | 0.9% |
| mediated | | 3,708 | 33.8% |
| | syntactic | 1,592 | 14.5% |
| | world knowledge | 924 | 8.4% |
| | bridging | 663 | 6.0% |
| | comparative | 253 | 2.3% |
| | aggregate | 211 | 1.9% |
| | function | 65 | 0.6% |
| new | | 4,035 | 36.7% |
Bridging anaphora modification.
Table 3 shows the distribution of bridging anaphora with regard to determiners: Only 38.5% of bridging anaphors are modified by the, and 44.9% are not modified by any determiner. This calls into question the strategy of several prior approaches (Vieira and Poesio 2000; Lassalle and Denis 2011; Cahill and Riester 2012) of limiting themselves to bridging anaphors modified by the.
Bridging pair distance.
We define the distance between a bridging anaphor and its antecedent as the distance between the anaphor and its closest preceding antecedent instantiation. The distribution of the distance for all 683 anaphor-antecedent pairs is shown in Figure 2.13 We see that 77% of anaphors have antecedents occurring in the same sentence or up to two sentences prior to the anaphor, although that still leaves a substantial number of instances that require relatively distant antecedents.
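A minimal sketch of this distance measure, assuming sentence indices for the anaphor and for every instantiation of its antecedent entity (the function name and inputs are illustrative):

```python
def bridging_distance(anaphor_sent, antecedent_sents):
    """Sentence distance from a bridging anaphor to the closest preceding
    instantiation of its antecedent entity (0 = same sentence)."""
    preceding = [s for s in antecedent_sents if s <= anaphor_sent]
    return anaphor_sent - max(preceding) if preceding else None

# The antecedent entity is mentioned in sentences 3 and 7; the anaphor
# occurs in sentence 9, so the distance is 2.
print(bridging_distance(9, [3, 7]))
```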
Bridging relations.
The semantic relations between anaphor and antecedent are extremely diverse. Among 683 bridging pairs, only 2.3% correspond to an action, 6.6% to a set membership (see Example (2)) and 13.5% to a part-of/attribute-of relation between anaphor and antecedent (Table 4). A total of 77.6% of bridging relations fall under the category “other,” without further distinction. This includes encyclopedic relations such as restaurant – the waiter as well as context-specific relations such as palms – the thieves. Among all bridging antecedents, only 39 are represented by verbs or clauses.
Sibling anaphors.
We call bridging anaphors “siblings” if they share the same antecedent (entity), and “non-siblings” if they do not share an antecedent with any other anaphor. In Example (1), The windows, The carpets, and walls are sibling anaphors. In ISNotes, 61.4% of the bridging anaphors are siblings, and we put this to good use in our model for bridging antecedent selection.
4. Information Status and Bridging Anaphora Recognition
For IS recognition, each mention is assigned one of the eight classes old, mediated/syntactic, mediated/WK, mediated/bridging, mediated/comparative, mediated/aggregate, mediated/function, and new. We make contributions to bridging recognition as well as for IS recognition in general.
4.1. Motivation for the Task
Clark (1975) distinguishes between bridging via necessary, probable, and inducible parts/roles. He argues that only in the first case does the antecedent trigger the bridging anaphor, in the sense that we already think of the anaphor when we read/hear the antecedent. For instance, walls in Example (1) are necessary parts of the antecedent the Polish center according to common sense knowledge. However, windows and carpets are only probable or inducible parts of a building, yet they still function as bridging anaphors in Example (1).
4.2. Method: Model
4.2.1. Model I: Collective Classification
Motivation.
Two mediated subcategories account for accessibility via syntactic links to another old or mediated mention. Mediated/syntactic is used when at least one child of a mention is mediated or old, with child relations restricted to:
- Possessive pronouns or possessive NPs (e.g., [[his]old father]mediated/syntactic)
- Of-genitives (e.g., [The alcoholism of [his]old father]mediated/syntactic)
- Proper name premodifiers (e.g., [The [Federal Reserve]mediated/WK boss]mediated/syntactic)
- Other prepositional phrases (e.g., [professors at [Cambridge]mediated/WK]mediated/syntactic)
The subcategory mediated/aggregate is for coordinations in which at least one of the children is old or mediated, e.g., Not only George Bush but also Barack Obama is mediated/aggregate as Barack Obama is mediated/WK.
In these two cases, a mention’s IS depends directly on the IS of its children. This is therefore a case of so-called autocorrelation, a characteristic of relational data in which the value of one variable for one instance is highly correlated with the value of the same variable on another instance. By exploiting relational autocorrelation, collective classification (Jensen, Neville, and Gallagher 2004; Macskassy and Provost 2007) can significantly outperform independent supervised classification (Taskar, Segal, and Koller 2001; Neville and Jensen 2003; Domingos and Lowd 2009) and has been applied, for example, in part-of-speech tagging (Lafferty, McCallum, and Pereira 2001), Web page categorization (Taskar, Abbeel, and Koller 2002), opinion mining (Somasundaran et al. 2009; Burfoot, Bird, and Baldwin 2011), and entity linking (Fahrni and Strube 2012).
Detailed model.
Our collective classification model is a log-linear model that can be represented using Markov logic networks (MLNs) (Domingos and Lowd 2009). An MLN is a statistical relational learning framework that combines first-order logic and Markov networks. It provides us with a simple yet flexible language to construct joint models for bridging resolution. Moreover, our task-specific models can benefit from advances in inference and learning algorithms for MLNs.
A Markov logic network is defined as a set of pairs (fi, wi), where fi is a formula in first-order logic and wi is a real number (Domingos and Lowd 2009). In first-order logic, formulas are constructed using four types of symbols: constants, variables, functions, and predicates. Constant symbols represent objects that we are interested in (mentions in our problem, such as his father or his), variable symbols range over objects in the domain, function symbols map objects to objects, and predicate symbols represent relations among objects or attributes of objects (e.g., hasIS in Table 5).
| | |
|---|---|
| Hidden predicates | |
| p1 | hasIS(m, s) |
| Formulas | |
| Hard constraints | |
| f1 | ∀m ∈ M: \|{s ∈ S: hasIS(m, s)}\| = 1 |
| Joint inference formula template | |
| fg | (w) ∀mi, mj ∈ M, ∀s ∈ S: jointInferenceFormula_Constraint(mi, mj) → hasIS(mi, s) ∧ hasIS(mj, s) |
| Non-joint inference formula template | |
| fl | (w) ∀m ∈ M, ∀s ∈ S: non-jointInferenceFormula_Constraint(m, s) → hasIS(m, s) |
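To illustrate the log-linear semantics behind Table 5, the following sketch scores candidate IS assignments by the summed weights of their satisfied soft formula groundings and picks the best one by brute force. The single soft formula, its weight, and the relational facts are placeholders rather than the model’s actual formula set, and thebeast replaces the enumeration with cutting plane inference:

```python
import itertools

MENTIONS = ["m1", "m2", "m3"]                 # constants: mentions in a document
IS = ["old", "new", "mediated/syntactic"]     # subset of the eight IS classes
HAS_CHILD = {("m1", "m2")}                    # placeholder relational facts

def score(assign, w=1.5):
    """Log-linear score of an IS assignment: sum of the weights of satisfied
    soft groundings. The illustrative soft formula is
    hasChild(mi, mj) & hasIS(mj, old) -> hasIS(mi, mediated/syntactic)."""
    total = 0.0
    for mi, mj in HAS_CHILD:
        # An implication grounding is satisfied if its antecedent is false
        # or its consequent is true.
        if assign[mj] != "old" or assign[mi] == "mediated/syntactic":
            total += w
    return total

# Brute-force MAP inference; hard constraint f1 (exactly one class per
# mention) holds by construction of the dict encoding.
best = max(
    (dict(zip(MENTIONS, combo)) for combo in itertools.product(IS, repeat=len(MENTIONS))),
    key=score,
)
print(best)
```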
4.2.2. Model II: Cascading Collective Classification
Motivation for the model.
As shown in Section 3, bridging anaphors occur with varied determiners and have few easily identifiable surface features. In addition, they are relatively rare, making up only 6% of noun phrases in our data. Such multi-class imbalance problems are an open research topic (Abe, Zadrozny, and Langford 2004; Zhou and Liu 2010; Wang and Yao 2012). Classification accuracy may be artificially high in the case of extremely imbalanced data: Majority classes are favored, and minority classes are not recognized. This bias becomes stronger in the multi-class setting. To address this problem while still keeping the strength of collective inference within a multi-class setting, we integrate our collective classification model (Section 4.2.1) into a cascading collective classification system inspired by Omuya, Prabhakaran, and Rambow (2013).
Detailed Model.
Unlike in the multi-class setting, learning from imbalanced data in the binary setting has been well studied (He and Garcia 2009). Therefore, our cascading collective classification system, shown in Figure 3, combines binary classifiers for minority categories and a collective classifier for all categories in a pipeline. Specifically, for the five classes mediated/function, mediated/aggregate, mediated/comparative, mediated/bridging, and mediated/WK, each of which constitutes less than the expected one-eighth of the instances, we develop five binary classifiers with SVMlight (Joachims 1999). These classifiers use only non-joint inference formulae, but have the advantage that we can tune the SVM parameter against data imbalance on the training set. We arrange them from the rarest to the most frequent category. Whenever a minority classifier predicts true, its class is assigned. When all minority classifiers predict false, we back off to multi-class collective inference (Section 4.2.1). Omuya, Prabhakaran, and Rambow (2013, page 805) motivate a rarest-to-most-frequent ordering on the task of dialogue act tagging by “the observation that the less frequent classes are also hard to predict correctly,” and we follow their procedure. Such a framework substantially improves bridging anaphora recognition without jeopardizing performance on other IS classes (Section 4.4.3).
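The control flow of the cascade can be sketched as follows; binary_clf and collective_clf are stand-ins for the tuned SVMlight classifiers and the MLN of Section 4.2.1:

```python
# Minority IS classes ordered from rarest to most frequent (cf. Table 2).
MINORITY = ["mediated/function", "mediated/aggregate", "mediated/comparative",
            "mediated/bridging", "mediated/WK"]

def cascaded_predict(mention, binary_clf, collective_clf):
    """Cascading collective classification (sketch).

    binary_clf[c] stands in for the tuned binary SVM of minority class c;
    collective_clf stands in for the multi-class MLN of Section 4.2.1.
    """
    for c in MINORITY:
        if binary_clf[c](mention):
            return c                    # first positive minority classifier wins
    return collective_clf(mention)      # back off to collective inference

# Example: with stub classifiers that all say "no", the collective model decides.
stub_binary = {c: (lambda m: False) for c in MINORITY}
print(cascaded_predict("the walls", stub_binary, lambda m: "new"))
```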
4.3. Method: Features
Section 4.3.1 details the relational features that instantiate the joint inference formula template fg in Table 5. Section 4.3.2 details non-relational features that instantiate the non-joint inference formula template fl in Table 5. Apart from the ISNotes corpus, for some non-relational features we use additional resources with manual annotation, namely, NomBank (Meyers et al. 2004), WordNet, the General Inquirer (Stone et al. 1966), and the ACE2 annotations for genericity (Mitchell et al. 2002).
4.3.1. Relational Features.
Syntactic hasChild relations.
We link a mention m1 to a mention m2 via a hasChild relation if (i) m2 is a possessive or prepositional modification of m1; or (ii) m2 is a proper name premodifier of m1. For instance, the mention [professors at Cambridge] is linked to the mention [Cambridge] via a hasChild relation.
Syntactic hasChildCoordination relations.
We link a mention m1 to a mention m2 via a hasChildCoordination relation if m1 is a coordination and m2 is one of its children. For example, the mention [Not only George Bush but also Barack Obama] is linked to the mention [Barack Obama] via a hasChildCoordination relation.
Syntactic ConjoinedTo relations.
Conjoined mentions may have the same IS class. We link a mention m1 to a mention m2 via a ConjoinedTo relation if both m1 and m2 are the children of a coordination. For example, [George Bush] is linked to [Barack Obama] via a ConjoinedTo relation as both are the children of the coordination [Not only George Bush but also Barack Obama].
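A sketch of how these three relational links could be read off the mention structure; the mention attributes (.children, .is_coordination) are hypothetical stand-ins for the OntoNotes syntax-based extraction:

```python
def relational_links(mentions):
    """Extract hasChild, hasChildCoordination, and ConjoinedTo links (sketch)."""
    links = []
    for m in mentions:
        if m.is_coordination:
            for child in m.children:
                links.append(("hasChildCoordination", m, child))
            # All pairs of children of a coordination are conjoined.
            for c1 in m.children:
                for c2 in m.children:
                    if c1 is not c2:
                        links.append(("ConjoinedTo", c1, c2))
        else:
            # Embedded mentions via possessives, PPs, or name premodifiers.
            for child in m.children:
                links.append(("hasChild", m, child))
    return links
```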
4.3.2. Non-relational Features.
Table 6 shows all non-relational features for IS recognition.
| Feature | Value |
|---|---|
| Features from previous work (Nissim 2006; Rahman and Ng 2011) | |
| p1 FullPrevMention (n) | {yes, no, NA}1 |
| p2 FullMentionTime (n) | {first, second, more, NA} |
| p3 HeadMatch (n) | {yes, no, NA} |
| p4 NPlength (int) | numeric, e.g., 5 |
| p5 Determiner (n) | {def, indef, dem, poss, bare, NA} |
| p6 GrammaticalRole (n) | {subject, subjpass, object, predicate, pp, other} |
| p7 NPType (n) | {common noun, proper noun, pronoun, other} |
| p8 Unigrams (l) | e.g., his, the, China |
| New features for identifying several IS classes (non-bridging) | |
| g1 HeadMatchTime (n) | {first, second, more, NA} |
| g2 ContentWordPreMention (b) | {yes, no, NA} |
| g3 IsFrequentProperName (b) | {yes, no} |
| g4 PreModByCompMarker (b) | {yes, no} |
| g5 DependOnChangeVerb (b) | {yes, no} |
| New features for recognizing bridging anaphora | |
| Discourse structure | |
| f1 IsCoherenceGap (b) | {yes, no} |
| f2 IsSentFirstMention (b) | {yes, no} |
| f3 IsDocFirstMention (b) | {yes, no} |
| Lexico-semantics | |
| f4 IsArgumentTakingNP (b) | {yes, no} |
| f5 IsWordNetRelationalNoun (b) | {yes, no} |
| f6 IsInquirerRoleNoun (b) | {yes, no} |
| f7 SemanticClass (n) | a list of 16 classes, e.g., location, organization |
| f8 IsBuildingPart (b) | {yes, no} |
| f9 IsSetElement (b) | {yes, no} |
| f10 ModSpatialTemporal (b) | {yes, no} |
| f11 IsYear (b) | {yes, no} |
| f12 PreModifiedByCountry (b) | {yes, no} |
| Identifying generic NPs | |
| f13 AppearInIfClause (b) | {yes, no} |
| f14 NPNumber (n) | {singular, plural, unknown} |
| f15 VerbPosTag (l) | e.g., VBG, MD, VB |
| f16 IsFrequentGenericNP (b) | {yes, no} |
| f17 GeneralWorldKnowledge (l) | e.g., the sun, the wind |
| f18 PreModByGenericQuantifier (b) | {yes, no} |
| Mention syntactic structure | |
| f19 HasChildMention (b) | {yes, no} |
1 We changed the value of p1 FullPrevMention from “numeric” to {yes, no, NA}.
Features p1–p8 from previous work.
We adapt features p1–p8 from Nissim (2006) and Rahman and Ng (2011). A mention with a complete string match to a previous one is likely to be old (p1, p2). The head match feature p3 (from Nissim’s PartialpreMention feature as well as coreference resolution [Vieira and Poesio 2000; Soon, Ng, and Lim 2001]) identifies old and mediated categories such as comparative anaphora. p4 NPlength is motivated by Arnold et al. (2000, page 34): “items that are new to the discourse tend to be complex and items that are given tend to be simple.” Indefinite NPs tend to be new (Hawkins 1978) (p5). Subjects are likely to be old (p6) (Prince 1992). Pronouns tend to be old (p7). Rahman and Ng (2011) explore lexical features (p8); for example, mentions that include the lexical unit his are unlikely to be new.
New features for identifying several IS classes (non-bridging).
The new features (g1–g5) capture the classes old as well as mediated/WK, mediated/comparative, and mediated/function. g1 HeadMatchTime and g2 ContentWordPreMention are string match variations, giving a categorical version of p3 HeadMatch and a partial mention match going beyond the mention’s head, respectively.
Proper names not previously mentioned in the text but appearing in many other documents are likely to be hearer-old (IS class mediated/WK). To approximate this, g3 IsFrequentProperName checks whether the mention is a proper name occurring in at least 100 documents in the Tipster corpus (Harman and Liberman 1993).
Mediated/comparative mentions are often indicated by surface clues such as premodifiers (e.g., other, another). In g4 PreModByCompMarker, we check for such markers16 as well as the presence of adjectives or adverbs in the comparative form.
g5 DependOnChangeVerb determines whether a number mention is the object of an increase/decrease verb and therefore is likely to be the IS class mediated/function.17
New features for recognizing bridging anaphors.
Bridging anaphors are rarely marked by surface features but are often licensed because of discourse structure and/or lexical or world knowledge. Motivated by these observations, we develop discourse structure and lexico-semantic features indicating bridging anaphora. We also design features to separate genericity from bridging anaphora.
Discourse structure features (Table 6, f1–f3).
Bridging is sometimes the only means to establish entity coherence to previous sentences/clauses (Grosz, Joshi, and Weinstein 1995; Poesio et al. 2004b). This is especially true for topic NPs (Halliday and Hasan 1976). We therefore define coherence gap sentences as sentences that have none of the following three coherence elements: (1) entity coreference to previous sentences, as approximated via string match or the presence of pronouns; (2) comparative anaphora approximated by mentions modified via 10 comparative markers, or by the presence of adjectives or adverbs in the comparative (see also g4 PreModByCompMarker); or (3) proper names.18 Bridging Examples (1), (9), (10), (11), (12), (14), and (16) occur in coherence gap sentences under our definition. We approximate the topic of a sentence via the first mention (f2 IsSentFirstMention). f3 IsDocFirstMention models that bridging anaphors do not appear at the beginning of a text.
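A coherence gap check along these lines might look as follows; the mention attributes and the marker list are simplifications (only three of the ten comparative markers are shown):

```python
COMPARATIVE_MARKERS = {"other", "another", "similar"}  # illustrative subset

def is_coherence_gap(sentence_mentions, previous_heads):
    """Feature f1 (sketch): true if the sentence has none of the three
    coherence elements. The mention attributes here are hypothetical."""
    coref = any(m.head in previous_heads or m.is_pronoun
                for m in sentence_mentions)
    comparative = any(set(m.premodifiers) & COMPARATIVE_MARKERS
                      or m.has_comparative_form
                      for m in sentence_mentions)
    proper_name = any(m.is_proper_name for m in sentence_mentions)
    return not (coref or comparative or proper_name)
```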
Lexico-semantic features (Table 6, f4–f12).
Drawing on theories of noun types (Löbner 1985) and bridging sub-classes (Clark 1975; Poesio and Vieira 1998; Lassalle and Denis 2011), we capture lexical properties of head nouns of bridging.
Löbner (1985) distinguishes between relational nouns that take on at least one core semantic role (such as friend) and sortal nouns (such as table or flower). He points out that relational nouns are more frequently used for bridging than sortal nouns (see Examples (8), (9), (13), and (14)). f4 IsArgumentTakingNP and f5 IsWordNetRelationalNoun capture relational nouns. f4 decides whether the argument-taking ratio of a mention’s head exceeds a threshold k. We calculate the argument-taking ratio α for a mention using NomBank (Meyers et al. 2004): For each mention, α is the frequency of its head in the NomBank annotation divided by the head’s total frequency in the WSJ corpus on which the NomBank annotation is based. The value of α reflects how likely an NP is to take arguments. For instance, α is 0.90 for husband but 0.31 for children. We also extract around 4,000 relational nouns from WordNet and determine whether the mention head appears in this list (f5 IsWordNetRelationalNoun). The core semantic role of a relational noun can of course also be filled NP-internally instead of anaphorically. We use the features f12 PreModifiedByCountry (such as the Egyptian president) and f19 HasChildMention (for complex NPs that are likely to fill needed roles NP-internally) to address this.
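A sketch of the argument-taking ratio and the resulting feature f4, with made-up counts standing in for the NomBank and WSJ frequencies:

```python
# Hypothetical frequency tables; in the paper these come from the NomBank
# annotation and the underlying WSJ corpus.
nombank_freq = {"husband": 90, "children": 31}   # head occurs as a predicate
wsj_freq = {"husband": 100, "children": 100}     # total head frequency

def argument_taking_ratio(head):
    """alpha = freq(head in NomBank) / freq(head in WSJ)."""
    return nombank_freq.get(head, 0) / wsj_freq[head]

def is_argument_taking_np(head, k=0.7):
    """Feature f4: fires if alpha exceeds the tuned threshold k."""
    return argument_taking_ratio(head) > k

print(is_argument_taking_np("husband"))   # True  (alpha = 0.90)
print(is_argument_taking_np("children"))  # False (alpha = 0.31)
```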
Role terms (e.g., chairman) and kinship terms (e.g., husband) are also relational nouns. f6 IsInquirerRoleNoun determines whether the mention head appears under the role category in the General Inquirer lexicon (Stone et al. 1966). The feature f7 SemanticClass puts each mention into one of 16 coarse-grained semantic classes: {rolePerson, relativePerson, person*, organization, geopolitical entity (GPE), location, nationality or religious or political group (NORP), event, product, date, time, percent, money, ordinal, cardinal, other}, using the OntoNotes annotation for named entities and WordNet for common nouns. The category rolePerson matches person mentions whose head noun specifies a professional role such as mayor, director, or president, using a list of 100 such nouns from WordNet. The category relativePerson matches person mentions whose head noun specifies a family or friend role such as husband, daughter, or friend, using a list of 100 such nouns from WordNet. The category person* is assigned to all other person mentions.
Because part-of relations are typical bridging relations (see Example (1) and Clark [1975]), f8 IsBuildingPart determines whether the mention head might be a part of a building, using a list of 45 nouns from the General Inquirer under the BldgPt category.
f9 IsSetElement is used to identify set-membership bridging cases (see Example (12)) by checking whether the mention head is a number or an indefinite pronoun (one, some, none, many, most), or is modified by each or one. However, not all numbers are bridging cases, and we use f11 IsYear to exclude some such cases.
Some bridging anaphors are indicated by spatial or temporal modifiers (see Example (11) and also Lassalle and Denis [2011]). We use f10 ModSpatialTemporal to detect these cases by compiling 22 such modifiers from the General Inquirer (Stone et al. 1966).19
Features to detect generic NPs (Table 6, f13–f18).
Generic NPs (Example (15)) are easily confused with bridging. Inspired by Reiter and Frank (2010), we develop features (f13–f18) to exclude generics.
First, hypothetical entities are likely to refer to generic entities (Mitchell et al. 2002). We approximate this by determining whether the NP appears in an if-clause (f13 AppearInIfClause). The NP’s number (singular or plural) and the clause tense/mood may also play a role in deciding genericity (Reiter and Frank 2010). The former is detected on the basis of the POS tag of the mention’s head word (f14 NPNumber). The latter is often reflected in the verb form of the clause in which the mention appears, such as VBG or MD VB VBG; we therefore use the POS tags of the clause verbs as lexical features (f15 VerbPosTag).
The ACE-2 corpus (Mitchell et al. 2002) (distinct from our corpus) contains annotations for genericity. We collect all NPs from ACE-2 that are always used generically (f16 IsFrequentGenericNP). We also try to learn NPs that are uniquely identifiable without further description or anaphoric links, such as the sun or the pope, by extracting common nouns that are annotated as mediated/WK from the training set and using these as lexical features (f17 GeneralWorldKnowledge).
Motivated by the ACE-2 annotation guidelines, f18 PreModByGenericQuantifier identifies six quantifiers that may indicate genericity (all, no, neither, every, any, most).
4.4. Results and Discussion
4.4.1. Experimental Set-up.
Because of the still limited size of our annotated corpus, especially for the rarer IS categories, we conduct experiments via document-wise 10-fold cross-validation. We use the OntoNotes named entity and syntactic annotation for feature extraction. The value of the parameter k in the feature f4 IsArgumentTakingNP (Table 6) is estimated for each fold separately: We first choose ten documents randomly from the training set of each fold as a development set and estimate k via a grid search over k ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; the model is then retrained on the whole training set using the optimized parameter. We use recall, precision, and F-score to measure performance per category. Accuracy measures overall performance across all IS categories. Statistical significance is measured using McNemar’s χ2 test (McNemar 1947).
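The per-fold estimation of k can be sketched as follows; train_and_eval is a placeholder that trains the full model with threshold k and returns its score on the development documents:

```python
import random

def estimate_k(train_docs, train_and_eval,
               grid=(0.5, 0.6, 0.7, 0.8, 0.9), seed=0):
    """Pick k on a random 10-document development split of the training fold."""
    rng = random.Random(seed)
    dev = rng.sample(train_docs, 10)              # development documents
    rest = [d for d in train_docs if d not in dev]
    # After tuning, the model is retrained on the whole training fold.
    return max(grid, key=lambda k: train_and_eval(rest, dev, k))
```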
4.4.2. Evaluation of New Non-relational Features.
We reimplemented the two local IS classifiers in Nissim (2006) and Rahman and Ng (2011) as baselines (henceforth Nissim and RahmanNg), using their feature and algorithm choices. We then add our new features from Table 6 in Section 4.3.2 to the two baselines, yielding the following six systems.
Nissim.
Nissim plus g1–g5.
Features g1–g5 from Table 6 are added to Nissim. These new features are designed for the categories old, mediated/WK, mediated/comparative, and mediated/function.
Nissim plus g1–g5 plus f1–f19.
Features f1–f19 from Table 6, designed for mediated/bridging, are added. As in the Nissim algorithm, we again exclude lexical features (f15 VerbPosTag and f17 GeneralWorldKnowledge).
RahmanNg.
Rahman and Ng (2011) use a binary SVM with a composite kernel (Joachims 1999; Moschitti 2006) on the Switchboard corpus. They use the one-versus-all strategy for multi-class classification and the features p1–p8 from Table 6. In addition, they use a tree kernel feature where the context of a mention is represented by its parent and its sibling nodes (without lexical leaves). Although this feature captures the syntactic context of a mention, it does not capture the internal structure of the mention itself nor the interaction between the IS of a mention and its children or parents.
RahmanNg plus g1–g5.
Features g1–g5 from Table 6 are added.
RahmanNg plus g1–g5 plus f1–f19.
Features f1–f19 (Table 6) are added.
Results for adding the new features to Nissim are shown in Table 7 (top) and to RahmanNg in Table 7 (bottom). The final algorithm improves significantly over all previous models in overall accuracy, showing the effectiveness of our new features. Comparative anaphors are recognized reliably via a small set of comparative markers. Including features g3 IsFrequentProperName and g5 DependOnChangeVerb improves results for mediated/WK and mediated/function, respectively.
Features f1–f19 from Table 6 were specifically designed for bridging: They help Nissim plus g1–g5 plus f1–f19 improve the results for bridging substantially over Nissim plus g1–g5. They also help to delimit several other IS classes better, such as mediated/syntactic for Nissim plus g1–g5 plus f1–f19 and RahmanNg plus g1–g5 plus f1–f19.
The new features f1–f19 have only a limited effect on bridging recognition in RahmanNg plus g1–g5 plus f1–f19 compared with RahmanNg. Unigrams in RahmanNg may already cover the lexical knowledge for bridging anaphora recognition that we model explicitly via features. Also, although the overall IS classification performance of RahmanNg plus g1–g5 plus f1–f19 is significantly better than that of Nissim plus g1–g5 plus f1–f19, the former is worse than the latter with regard to bridging anaphora recognition. The one-versus-all strategy for the multi-class setting in Rahman and Ng (2011) is not suitable for identifying a minority class that, like bridging, lacks strong indicators.
4.4.3. Evaluation of Collective and Cascaded Collective Classification.
We now compare the best local classifier, RahmanNg plus g1–g5 plus f1–f19, to collective and cascaded collective classifiers (Collective and CascadedCollective). The MLN classifier Collective (Section 4.2.1) uses the non-relational features from Table 6 and adds the relational features from Section 4.3.1. We use thebeast20 to learn weights and to perform inference.21 thebeast uses cutting plane inference (Riedel 2008) to improve the accuracy and efficiency of MAP inference for MLNs.
The relational features in Collective lead to significant improvements in accuracy over the local model (Table 8), in particular for mediated/syntactic and mediated/aggregate as well as for their distinction from new. This improvement is in accordance with the linguistic relations among IS categories analyzed in Section 4.2.1.22 Collective also improves the F-score for bridging by 13.5% compared with the local model. This is mainly due to improved recall: The local model in Table 8 is very conservative, with a recall of only 12.4%. Collective doubles recall, but at some loss in precision.
However, the results for the bridging category, including recall, are still low. In a multi-class setting, prediction is biased toward the classes with the highest priors. CascadedCollective classification (Section 4.2.2) addresses this problem by combining a sequence of minority binary classifiers (based on SVMs, using only non-relational features) with a final collective classifier (based on MLNs, using non-relational and relational features). CascadedCollective improves bridging F-score and recall substantially without jeopardizing performance on other IS classes (Table 8, right). One question is whether the cascading algorithm alone is sufficient for improved bridging recognition, with our additional non-relational bridging features f1–f19 being superfluous. We ran CascadedCollective without these features: Results worsened substantially, to 74.4% overall accuracy and 29.2 bridging F-score. Our novel features (addressing linguistic properties of bridging) and the cascaded algorithm (addressing data sparseness) are thus complementary.
4.4.4. Error Analysis.
Our performance on bridging recognition, although outperforming reimplementations of previous work, is still under 50% in all measures. We conducted an error analysis using our best model CascadedCollective. We examine the confusion matrix (Table 9) of the model, concentrating only on the numbers related to bridging.
| G ↓ \ C → | old | new | brid | synt | comp | aggr | func | know |
|---|---|---|---|---|---|---|---|---|
| old | - | - | 175 | - | - | - | - | - |
| new | - | - | 193 | - | - | - | - | - |
| brid | 66 | 251 | 323 | 10 | 2 | 1 | 0 | 10 |
| synt | - | - | 10 | - | - | - | - | - |
| comp | - | - | 2 | - | - | - | - | - |
| aggr | - | - | 0 | - | - | - | - | - |
| func | - | - | 0 | - | - | - | - | - |
| know | - | - | 35 | - | - | - | - | - |
The highest proportion of recall errors is due to 251 bridging anaphors being misclassified as new. This can be explained by three factors: many new instances and bridging anaphors share the same syntactic form, new items are more frequent, and our lexico-semantic features in particular only pick up on certain types of bridging.
Most precision errors are new and old instances being misclassified as mediated/bridging. Many old instances misclassified as bridging are definite NPs with common noun heads, without further modification and without a string match to a previous mention. An example is an NP such as the president, which can easily be coreferent with a previous president named by a proper name (Barack Obama) or a bridging anaphor to a country or company. This coincides with the fact that in coreference resolution, common noun anaphors without head match are also the hardest to detect (Martschat and Strube 2014). Future work on joint bridging and coreference resolution might help here. New items misclassified as bridging are also NPs with common noun heads and no modification (apart from determiners), such as control or the back, and often generics (see Examples (14) and (15)). In the latter cases, how the phrase is embedded in the discourse plays an important role and is only partially modeled by our approach. The lexical semantic knowledge we currently explore only indicates that some NPs are more likely than others to be used as bridging anaphors.
5. Bridging Antecedent Selection
Bridging antecedent selection chooses an antecedent among all possible candidates for a given bridging anaphor. We make contributions in three areas for antecedent selection: (i) using joint modeling to tackle what we call sibling anaphora, (ii) developing a range of semantic and salience features for the problem, and (iii) proposing the novel concept of an anaphor’s discourse scope to delimit the list of possible candidate antecedents.
From now on we assume that the antecedent is an NP mention because, among the 663 bridging anaphors, only 39 have verbs/clauses as antecedents (see Section 3). We do not resolve the latter and count our decisions in these cases as incorrect. The antecedent can be coreferent with prior mentions of the same entity: in Example (1), repeated as Example (20), The windows is the bridging anaphor, the Polish center is the antecedent (mention), and this antecedent is coreferent with the center mentioned previously. We call such a coreference chain of antecedents the antecedent entity.
5.1. Method: A Joint Model
Motivation.
Many of our bridging anaphors are siblings, that is, they share the same antecedent (Section 3). Sibling anaphor clustering tries to identify such siblings; we then use joint inference to model sibling anaphor clustering and bridging antecedent selection together.
Detailed model.
Table 10 shows hard constraints and formula templates for this problem in MLNs.
Hidden predicates |
p1 | isBridging(a1, e)
p2 | hasSameAntecedent(a1, a2)
Hard Constraints |
f1 | ∀a ∈ A: |{e ∈ E: isBridging(a, e)}| ≤ 1
f2 | ∀a ∈ A ∀e ∈ E: hasPairDistance(e, a, d) ∧ d < 0 → ¬isBridging(a, e)
f3 | ∀a1, a2 ∈ A: a1 ≠ a2 ∧ hasSameAntecedent(a1, a2) → hasSameAntecedent(a2, a1)
f4 | ∀a1, a2, a3 ∈ A: a1 ≠ a2 ∧ a1 ≠ a3 ∧ a2 ≠ a3 ∧ hasSameAntecedent(a1, a2) ∧ hasSameAntecedent(a2, a3) → hasSameAntecedent(a1, a3)
f5 | ∀a1, a2 ∈ A ∀e ∈ E: a1 ≠ a2 ∧ hasSameAntecedent(a1, a2) ∧ isBridging(a1, e) → isBridging(a2, e)
f6 | ∀a1, a2 ∈ A ∀e ∈ E: a1 ≠ a2 ∧ isBridging(a1, e) ∧ isBridging(a2, e) → hasSameAntecedent(a1, a2)
Formula template for sibling anaphors clustering |
fc | ∀a1, a2 ∈ A: siblingAnaphorsClusteringFormula_Template(a1, a2) → hasSameAntecedent(a1, a2)
Formula templates for bridging antecedent selection |
fr1 | ∀a ∈ A ∀e ∈ E: bridgingAnaResolutionFormula_Template1(a, e) → isBridging(a, e)
fr2 | ∀a ∈ A ∀e ∈ Ea: bridgingAnaResolutionFormula_Template2(a, e) → isBridging(a, e)
p1 and p2 are the hidden predicates we predict: choosing the antecedent entity for an anaphor a1 and deciding whether a1 and a2 are sibling anaphors. f1 models that each bridging anaphor has at most one antecedent.23 f2 models that a bridging anaphor cannot appear before its antecedent. f3 and f4 model the symmetry and transitivity of sibling anaphor clustering, and f5 and f6 model that sibling anaphors share the same antecedent.
fc is the formula template for sibling anaphor clustering; fr1 and fr2 are the formula templates for bridging antecedent selection. The specific formulas instantiating fc and fr1/fr2 are described in Sections 5.2.1 and 5.2.2. In formulas instantiating fr2, the set of antecedent candidates Ea for a bridging anaphor a is constructed on the basis of the anaphor's discourse scope (i.e., local or non-local; see Section 5.3).
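The hard constraints can also be read procedurally. The following sketch, a plain-Python consistency closure rather than thebeast's actual MAP inference, only illustrates what f3–f6 demand of a joint assignment of sibling pairs and antecedents:

```python
# Consistency closure illustrating hard constraints f3-f6; actual joint
# inference is done by thebeast's MAP inference, not this post-hoc loop.
def close_constraints(siblings, antecedent):
    """siblings: set of (a1, a2) pairs predicted as sibling anaphors.
    antecedent: dict mapping an anaphor to its one antecedent entity (f1)."""
    siblings = set(siblings)
    changed = True
    while changed:
        changed = False
        for a1, a2 in list(siblings):
            if (a2, a1) not in siblings:               # f3: symmetry
                siblings.add((a2, a1)); changed = True
            for b1, b2 in list(siblings):              # f4: transitivity
                if a2 == b1 and a1 != b2 and (a1, b2) not in siblings:
                    siblings.add((a1, b2)); changed = True
            # f5: a resolved anaphor hands its antecedent to its siblings
            if a1 in antecedent and a2 not in antecedent:
                antecedent[a2] = antecedent[a1]; changed = True
    # f6: anaphors resolved to the same entity must be siblings
    for a1 in antecedent:
        for a2 in antecedent:
            if a1 != a2 and antecedent[a1] == antecedent[a2]:
                siblings.add((a1, a2))
    return siblings, antecedent
```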
5.2. Model: Features
We now describe all the features we use. The only additional manually annotated resource we need for feature extraction is WordNet.
5.2.1. Features for Sibling Anaphor Clustering.
Table 11 shows the formulas for predicting sibling anaphors. Each formula is associated with a weight w learned during training. The polarity of the weights is indicated by the leading + or −.
Formulas for sibling anaphors clustering
f1 + (w) ∀a1, a2 ∈ A: ParallelAnas(a1, a2) → hasSameAntecedent(a1, a2)
f2 + (w) ∀a1, a2 ∈ A: sameHead(a1, a2) → hasSameAntecedent(a1, a2)
f3 + (w) ∀a1, a2 ∈ A: relatedTo(a1, a2) → hasSameAntecedent(a1, a2)
In f3, we predict semantically related anaphors that do not share the same head word (such as limited access and one last entry in Example (26)), using the WordNet-based similarity measures of Pedersen, Patwardhan, and Michelizzi (2004) as features in SVMlight.24
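As an illustration, a rough stand-in for the relatedTo predicate can be computed with NLTK's WordNet interface. The paper uses the WordNet::Similarity package of Pedersen, Patwardhan, and Michelizzi (2004); NLTK's wup_similarity and the 0.5 threshold below are our assumptions, not the original configuration.

```python
# Rough stand-in for the relatedTo predicate via NLTK's WordNet
# (requires nltk.download("wordnet")); the similarity measure and the
# 0.5 threshold are illustrative assumptions.
from nltk.corpus import wordnet as wn

def related_to(head1, head2, threshold=0.5):
    """True if some noun-sense pair of the two head words is similar enough."""
    scores = [s1.wup_similarity(s2)
              for s1 in wn.synsets(head1, pos=wn.NOUN)
              for s2 in wn.synsets(head2, pos=wn.NOUN)]
    scores = [s for s in scores if s is not None]
    return max(scores, default=0.0) >= threshold

# e.g., related_to("access", "entry") for the pair in Example (26)
```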
5.2.2. Features for Bridging Antecedent Selection.
Each formula for bridging antecedent selection (Table 12) is associated with a weight w learned during training; the polarity of the weight is indicated by a leading + or −. For some formulas, the final weight is the learned weight w multiplied by a score d (e.g., the inverse distance between antecedent and anaphor). In these cases, the final weight of a ground formula depends not only on the respective formula but also on the specific constants.
Formulas for bridging antecedent selection |
Semantic class features |
f1 + (w) ∀a ∈ A ∀ e ∈ E: hasSemanticClass(a, “gpeRolePerson”) ∧ hasSemanticClass(e, “gpe”) ∧ hasPairDistance(e, a, d) ∧ d > 0 → isBridging(a, e) |
f2 + (w) ∀a ∈ A ∀e ∈ E: hasSemanticClass(a, “otherRolePerson”) ∧ hasSemanticClass(e, “org”) ∧ hasPairDistance(e, a, d) ∧ d > 0 → isBridging(a, e) |
f3 + (w ⋅ d) ∀a ∈ A ∀e ∈ E: hasSemanticClass(a, “relativePerson”) ∧ hasSemanticClass(e, “person ★”) ∧ hasPairDistanceInverse(e, a, d) → isBridging(a, e) |
f4 + (w ⋅ d) ∀a ∈ A ∀e ∈ E: hasSemanticClass(a, “date|time”) ∧ hasSemanticClass(e, “date|time”) ∧ hasPairDistanceInverse(e, a, d) → isBridging(a, e) |
Semantic features |
f5 + (w ⋅ d) ∀a ∈ A ∀e ∈ Ea: relativeRankPrepPattern(a, e, d) → isBridging(a, e) |
f6 + (w) ∀a ∈ A ∀e ∈ Ea: isTopRelativeRankPrepPattern(a, e) → isBridging(a, e) |
f7 + (w ⋅ d) ∀a ∈ A ∀e ∈ Ea: relativeRankVerbPattern(a, e, d) → isBridging(a, e) |
f8 + (w) ∀a ∈ A ∀e ∈ Ea: isTopRelativeRankVerbPattern(a, e) → isBridging(a, e) |
f9 + (w ⋅ d) ∀a ∈ A ∀ e ∈ Ea: isPartOf(a, e) ∧ hasPairDistanceInverse(e, a, d) → isBridging(a, e) |
Salience features |
f10 + (w) ∀a ∈ A ∀e ∈ Ea: predictedGlobalAnte(e) ∧ hasPairDistance(e, a, d) ∧ d > 0 → isBridging(a, e) |
f11 + (w ⋅ d) ∀a ∈ A ∀e ∈ Ea: relativeRankDocSpan(a, e, d) → isBridging(a, e) |
f12 + (w) ∀a ∈ A ∀e ∈ Ea: isTopRelativeRankDocSpan(a, e) → isBridging(a, e) |
Lexical features |
f13 − (w) ∀a ∈ A ∀e ∈ Ea: isSameHead(a, e) → isBridging(a, e) |
f14 + (w) ∀a ∈ A ∀e ∈ Ea: isPremodOverlap(a, e) → isBridging(a, e) |
Syntactic features |
f15 − (w) ∀a ∈ A ∀e ∈ Ea: isCoArgument(a, e) → isBridging(a, e) |
f16 + (w) ∀a ∈ A ∀e ∈ Ea: synParallelStructure(a, e) → isBridging(a, e) |
f17 + (w) ∀a ∈ A ∀e ∈ Ea: isClosestNominalModifer(a, e) → isBridging(a, e) |
f18 + (w) ∀a ∈ A ∀e ∈ Ea: isPredictSetBridging(a, e) → isBridging(a, e) |
Three of the features below (f5, f7, and f11) score each anaphor-antecedent candidate pair by a relative rank d among all candidates for the anaphor. In contrast, their variants (f6, f8, and f12) indicate whether the score of an anaphor-antecedent candidate pair is the highest among all pairs for this anaphor.
Frequent Bridging Relations (Table 12: f1–f4).
For the first two bridging types, we do not penalize antecedent candidates that are far away from the anaphor (f1 and f2), because in news it is common for a globally salient GPE or organization to be introduced at the beginning, with NPs denoting roles associated with it used as bridging anaphors throughout the document. For personal as well as temporal relations (f3 and f4), we prefer close antecedents by including the distance between antecedent and anaphor in the weights, since these two bridging relations are local phenomena. These restrictions might be genre-specific.
Semantic features: preposition pattern (Table 12: f5 and f6).
Corpus-based patterns capture semantic connectivity between a bridging anaphor and its antecedent. The “NP of NP” pattern (Poesio et al. 2004a) is useful for part-of and attribute-of relations (e.g., windows of a room) but cannot cover all bridging relations (such as sanctions against a country). We therefore generalize it to a preposition pattern to capture diverse semantic relations.
First, we extract the three most highly associated prepositions for each anaphor from Gigaword (Parker et al. 2011) and Tipster (Harman and Liberman 1993); this yields, for example, the prepositions {against, on, in} for the anaphor sanctions. Then, for each anaphor-antecedent candidate pair, we query the corpora using their head words in the form “anaphor preposition antecedent” (e.g., “sanction(s) against/on/in countr(y/ies)”), replacing proper names with fine-grained named entity types (using a gazetteer). Raw query hit counts are converted into Dunning root log-likelihood ratio scores25 and then normalized using Equation (27). Table 13 shows some raw hit counts of the preposition pattern queries, the corresponding Dunning root log-likelihood ratio scores, and the normalized scores for the bridging anaphor sanctions and its antecedent candidates.
Anaphor | Antecedent Candidate | RawCount | RootLLR | NormalizedScore
---|---|---|---|---
sanctions | the country | 6,817 | 81.44 | 1.00
sanctions | apartheid | 26 | 4.8 | 0.32
sanctions | further punishment | 9 | −1.88 | 0.26
sanctions | … | … | … | …
Semantic features: verb pattern (Table 12: f7 and f8).
Anaphors whose lexical head is an indefinite pronoun or a number are potential set bridging cases. We extract the verbs on which these potential set bridging anaphors depend (in our example, the verb travel). Finally, for each antecedent candidate, subject-verb, verb-object, or preposition-object queries26 are executed against the Web 1T 5-gram corpus (Brants and Franz 2006). Raw hit counts are transformed into Dunning root log-likelihood ratio scores, then normalized as described in Equation (27).
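A small sketch of the query construction follows; the relation labels and the helper name are hypothetical, and the actual query form follows the syntactic relation between the anaphor and its dependent verb or preposition (note 26).

```python
# Hypothetical sketch of the query construction for the verb pattern
# feature; the relation label decides among the three query forms.
def verb_pattern_queries(verb, antecedent_head, relation):
    """relation: 'subj', 'obj', or 'pobj:<preposition>'."""
    if relation == "subj":                       # subject-verb
        return [f"{antecedent_head} {verb}"]
    if relation == "obj":                        # verb-object
        return [f"{verb} {antecedent_head}"]
    prep = relation.split(":", 1)[1]             # preposition-object
    return [f"{verb} {prep} {antecedent_head}"]

# e.g., verb_pattern_queries("travel", "employees", "subj")
```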
Semantic features: Part-of relation (Table 12: f9).
We use WordNet to decide whether a (possibly inherited) part-of relation holds between an anaphor and antecedent candidate.
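A hedged approximation of this test with NLTK's WordNet interface walks up the hypernym hierarchy so that inherited parts are found; the depth bound is our assumption, not the paper's procedure.

```python
# Hedged approximation of the inherited part-of test via NLTK's WordNet:
# climb the antecedent's hypernym chain and check whether any level lists
# the anaphor head among its part meronyms.
from nltk.corpus import wordnet as wn

def part_of(anaphor_head, antecedent_head, max_depth=5):
    parts = set(wn.synsets(anaphor_head, pos=wn.NOUN))
    frontier = wn.synsets(antecedent_head, pos=wn.NOUN)
    for _ in range(max_depth):
        if any(m in parts for s in frontier for m in s.part_meronyms()):
            return True
        frontier = [h for s in frontier for h in s.hypernyms()]
    return False

# e.g., part_of("window", "building") is True in WordNet
```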
Salience features (Table 12: f10–f12).
Salient entities are preferred as bridging antecedents. In contrast to Poesio et al. (2004a), we find that bridging anaphors with distant antecedents are common if the antecedent is the global focus (Grosz and Sidner 1986).
f10 models global salience by semantic connectivity to all bridging anaphors in the document. For each bridging anaphor a ∈ A and each entity e ∈ E, let score(a, e) be the preposition pattern score (f5 in Table 12). We calculate the global semantic connectivity score of each e ∈ E as esal(e) = ∑_{a ∈ A} score(a, e). If an entity appears in the headline27 and also has the highest global semantic connectivity score among all entities in E, then this entity is predicted as globally salient for this document. Not every document has a globally salient entity.
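A compact sketch of this prediction follows; the prep_pattern_score callable, the head attribute on entities, and the headline word list (via Tipster, note 27) are assumptions made for illustration.

```python
# Sketch of the global salience prediction behind f10.
def predict_global_antecedent(anaphors, entities, headline_words,
                              prep_pattern_score):
    if not entities:
        return None
    e_sal = {e: sum(prep_pattern_score(a, e) for a in anaphors)
             for e in entities}
    best = max(e_sal, key=e_sal.get)  # top global semantic connectivity
    # globally salient only if the entity also appears in the headline;
    # hence not every document has a globally salient entity
    return best if best.head in headline_words else None
```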
f11 and f12 capture salience by computing the span of text (measured in sentences) in which the antecedent candidate entity is mentioned divided by the number of sentences in the document.
Lexical features (Table 12: f13–f14).
As Table 12 indicates by the feature polarities, f13 penalizes antecedent candidates that share their head with the anaphor, since such pairs are more likely coreferent than bridging, whereas f14 rewards candidates whose premodifiers overlap with those of the anaphor.
Syntactic features: CoArgument (Table 12: f15).
The CoArgument feature excludes subjects from being antecedents for the object in the same clause, such as excluding “the Japanese” in Example (31) as antecedent for that equipment market.
Syntactic features: intra-sentential syntactic parallelism (Table 12: f16).
As the positive polarity in Table 12 indicates, f16 rewards antecedent candidates that stand in a syntactically parallel structure with the anaphor within the same sentence.
Syntactic features: inter-sentential syntactic modification (Table 12: f17).
Laparra and Rigau's (2013) work on implicit semantic role labeling assumes that different occurrences of the same predicate in a document likely maintain the same argument fillers. We can therefore identify the antecedent of a bridging anaphor a by analyzing the nominal modifiers in other NPs with the same head word as a.28 Whereas Laparra and Rigau's work is restricted to ten predicates, we consider all bridging anaphors in ISNotes. In f17, we predict antecedents for bridging anaphors by performing the following two steps (sketched in code below):
1. For each bridging anaphor a, we take its head lemma ah and collect all prenominal, possessive, and prepositional modifiers of other occurrences of ah in the document. All realizations of these modifiers that precede a form the antecedent candidate set Antea.
2. We choose the most recent mention from Antea as the predicted antecedent for the bridging anaphor a.
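A minimal sketch of these two steps; the mention attributes (head_lemma, position) and the modifiers helper are hypothetical stand-ins for the parse-based extraction.

```python
# Sketch of the two steps of f17. `modifiers(m)` is a hypothetical helper
# returning the prenominal, possessive, and prepositional modifiers of m.
def closest_nominal_modifier_antecedent(anaphor, mentions, modifiers):
    candidates = []
    for m in mentions:
        # step 1: other occurrences of the anaphor's head lemma
        if m is not anaphor and m.head_lemma == anaphor.head_lemma:
            candidates.extend(mod for mod in modifiers(m)
                              if mod.position < anaphor.position)
    if not candidates:
        return None
    # step 2: the most recent realization is the predicted antecedent
    return max(candidates, key=lambda mod: mod.position)
```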
Syntactic features: hypertheme antecedent prediction for set sibling anaphors (Table 12: f18).
Set bridging anaphors are often siblings (e.g., One man, A food caterer, and None are all elements of the set provided by employees in Example (35)). The information structure pattern we observe here is Hypertheme–theme (Daneš 1974). We heuristically predict the “themes” (set sibling anaphors) and their “Hypertheme” (antecedent): we first predict set sibling anaphors by expanding “typical” set bridging anaphors (e.g., None in Example (35)) to their syntactically parallel neighbors (e.g., One man and A food caterer). We then predict the closest mention among all plural subject mentions in the sentence immediately preceding the first anaphor as the antecedent for all (predicted) set sibling anaphors; if no such mention exists, the closest mention among all plural object mentions in that sentence is predicted instead. In Example (35), employees is predicted to be the antecedent for all (predicted) set sibling anaphors.
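The heuristic can be sketched as follows; the mention attributes are hypothetical, and set_siblings are the anaphors predicted in the expansion step.

```python
# Sketch of the Hypertheme heuristic (f18).
def hypertheme_antecedent(set_siblings, mentions):
    first = min(set_siblings, key=lambda a: a.position)
    prev_sent = first.sentence_index - 1
    for role in ("subject", "object"):  # subjects preferred over objects
        plural = [m for m in mentions
                  if m.sentence_index == prev_sent and m.is_plural
                  and m.grammatical_role == role]
        if plural:
            # the closest such mention is shared by all set siblings
            return max(plural, key=lambda m: m.position)
    return None
```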
5.3. Method: Discourse Scope for Antecedent Candidate Selection
Motivation.
Ranking-based approaches to bridging antecedent selection need to tackle two interacting problems: (1) creating a list of antecedent candidates, and (2) choosing an antecedent from this list. Once implausible candidates are removed from the list in (1), selecting the correct antecedent in (2) becomes easier. Previous work (Markert, Nissim, and Modjeska 2003; Poesio et al. 2004a; Lassalle and Denis 2011) uses a static sentence window to construct the candidate list. However, this approach is problematic: if the window is too small, we miss too many correct antecedents (24% of anaphors in ISNotes would miss their antecedent with a two-sentence window; Section 3); if it is too large, we include too much noise in learning. In addition, whether more distant antecedents should be included may depend both on the salience properties of the antecedent and on the anaphor's place in the discourse.
We address this problem by proposing the notion of an anaphor's discourse scope. Discourse entities have different scopes: some contribute to the main topic and interact with distant entities (globally salient entities), whereas others belong to subtopics and only interact with nearby entities (locally salient entities). In Figure 4, the globally salient entity Marina in s1 has a long forward lifespan, so it can be accessed by both close and distant anaphors, a resident in s2 and residents in s36. In contrast, the locally salient entity buildings with substantial damage in s24 has a short forward lifespan and can only be accessed by nearby subsequent anaphors, residents and limited access in s25. Accordingly, anaphors with non-local discourse scope can access both nearby locally salient and distant globally salient entities, whereas anaphors with local discourse scope can only access nearby locally salient entities. We can therefore add globally or locally salient entities to the antecedent candidate list of a bridging anaphor according to its discourse scope. The challenge is to decide the discourse scopes of bridging anaphors automatically and to model salience.
Salience of Antecedents.
For each bridging anaphor a ∈ A, we define three antecedent candidate sets according to different salience measures: a coreference-based set, a globally salient set, and a locally salient set:
• The coreference-based set includes the top p percent most salient entities in the text, salience being measured by the number of mentions in an entity's coreference chain.
• The globally salient set is determined by the global semantic connectivity score (described under f10 in Table 12). For each document, we rank all entities according to their semantic connectivity to all anaphors; an entity is added to the set if it ranks among the top k in this list and appears in the headline.
• The locally salient set is built per anaphor. We approximate an entity's local salience by the head position of its mention in the parse tree: mentions preceding a in the same sentence or in the previous two sentences are added to the set if the distance from their head to the root of the sentence's dependency parse tree is less than a threshold t.
Anaphors’ discourse scopes.
We postulate that some discourse relations indicate the discourse scope of an anaphor. Here we use the discourse relation Expansion as defined in the Penn Discourse Treebank (Prasad et al. 2008): the second argument elaborates on the first, so most entities in the second argument contribute to local rather than global entity coherence. We therefore define two types of discourse scope for bridging anaphors, local and non-local: if a bridging anaphor appears in argument 2 of an Expansion relation, it has local discourse scope; otherwise, it has non-local discourse scope.
Antecedent candidate list for an anaphor via d-scope-salience.
We select the antecedent candidates for an anaphor according to its discourse scope: for a local anaphor, only locally salient entities from the local window (the locally salient set) are allowed; for a non-local anaphor, the globally salient entities (the coreference-based and globally salient sets) are allowed in addition.
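In code, the candidate construction reduces to a small dispatch on discourse scope; the three set constructors correspond to the salience sets above, and in_expansion_arg2 is a hypothetical PDTB lookup.

```python
# Sketch of antecedent candidate construction via d-scope-salience.
def candidate_list(anaphor, coref_set, global_set, local_set,
                   in_expansion_arg2):
    if in_expansion_arg2(anaphor):      # local scope: local window only
        return local_set(anaphor)
    # non-local scope: globally salient entities are allowed as well
    return local_set(anaphor) | coref_set(anaphor) | global_set(anaphor)
```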
5.4. Results and Discussion
We conduct experiments on the ISNotes corpus via 10-fold cross-validation on documents. We use the OntoNotes named entity and syntactic annotations as well as the Penn Discourse Treebank annotation for feature extraction. In each fold, we first choose ten documents randomly from the training set as a development set to estimate the parameters p, k, and t of the three salience sets,30 and then retrain on the whole training set with the optimized parameters.
5.4.1. Mention-Entity Setting and Mention-Mention Setting.
In the mention-entity setting, entity information is based on the OntoNotes coreference annotation, and we resolve bridging anaphors to entity antecedents. Features are extracted using entity information: for instance, the semantic class of an entity is the majority semantic class of all its mention instantiations, the raw hit count of the preposition pattern query for a bridging anaphor a and its antecedent candidate e (f5 and f6 in Table 12) is the maximum count over all instantiations of e, and the distance between a and e is the distance between a and the closest mention instantiation of e preceding a.
In the mention-mention setting, we resolve bridging anaphors to mention antecedents and do not use any coreference information in the model or in feature extraction. In this setting, we use string match for f11/f12 in Table 12 and for the salience sets of Section 5.3 to measure the salience of the mention antecedent candidates.
5.4.2. Evaluation Metrics.
We measure accuracy over bridging anaphors rather than over all links between bridging anaphors and their antecedent instantiations; that is, we calculate how many bridging anaphors are correctly resolved among all bridging anaphors. In the mention-entity setting, where gold entity information is given, a bridging anaphor counts as correctly resolved if the model links it to its entity antecedent. In the mention-mention setting, where gold entity information is not given, a bridging anaphor counts as correctly resolved if the model links it to one of the preceding instantiations of its antecedent. Statistical significance is measured using McNemar's χ2 test (McNemar 1947).
5.4.3. Evaluation of Our New Local Features and Antecedent Candidate Selection.
To evaluate only the impact of our local features (Table 12) and the new antecedent candidate selection strategy (d-scope-salience, Section 5.3), we compare several pairwise machine learning models that successively build on each other. The pairwise model is widely used in coreference resolution (Soon, Ng, and Lim 2001) and has been applied to bridging by Poesio et al. (2004a). Similarly to the latter, we use it for bridging antecedent selection as follows: given an anaphor mention a and the set of antecedent candidate entities Ea that appear before a, we create a pairwise instance (a, e) for every e ∈ Ea, and a binary decision whether a bridges to e is made for each instance separately. Finally, we apply the best-first strategy (Ng and Cardie 2002) to choose one antecedent for each bridging anaphor (see the sketch below). As we evaluate in the mention-entity setting, full coreference information is used in feature computation for all models.
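The decoding step can be sketched as follows; scikit-learn's decision_function stands in for SVMlight's confidence output, an illustrative substitution.

```python
# Sketch of pairwise classification with best-first decoding: each
# (anaphor, candidate) pair is judged independently, and the positively
# classified pair with the highest confidence wins.
def best_first_antecedent(anaphor, candidates, featurize, clf):
    best, best_score = None, float("-inf")
    for e in candidates:
        score = clf.decision_function([featurize(anaphor, e)])[0]
        if score > 0 and score > best_score:   # positive predictions only
            best, best_score = e, score
    return best  # None if no candidate is classified as an antecedent
```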
baseline1_NB and baseline2_NB.
We reimplement the algorithm of Poesio et al. (2004a) as a baseline: a pairwise naive Bayes classifier that classifies every anaphor-candidate pair as a true antecedent pair or not. We use the standard naive Bayes settings in WEKA (Witten and Frank 2005) with the best-first strategy for choosing the antecedent (as described above).
Because Poesio et al. (2004a) did not specify whether they conducted their experiments in the mention-mention or the mention-entity setting, we assume they treated antecedents as entities. We use a two-sentence (baseline1_NB) and a five-sentence (baseline2_NB) window for antecedent candidate selection.31
Poesio et al. (2004a) capture meronymic bridging relations via Google distance and WordNet distance (see Table 14). The former is the inverse of the Google hit count for the “NP of NP” pattern query (e.g., the windows of the center). Because the Google API is no longer available, we use the Web 1T 5-gram corpus (Brants and Franz 2006) to extract the Google distance feature. We improve it by taking all information about entities via coreference into account and by replacing proper names with fine-grained named entity types (using a gazetteer). WordNet distance is the inverse of the shortest path length between anaphor and antecedent candidate over all synset combinations. The remaining features measure the salience of an antecedent candidate: for instance, local first mention checks whether an antecedent candidate is realized in the first position of a sentence within the five sentences preceding the anaphor, and global first mention checks whether it is realized in the first position of any sentence.
Group | Feature | Value
---|---|---
lexical | Google distance | numeric
 | WordNet distance | numeric
salience | utterance distance | numeric
 | local first mention | boolean
 | global first mention | boolean
baseline3_SVM.
In baseline3_SVM, we use the same features and the same antecedent candidate selection method as in baseline1_NB but replace naive Bayes with SVMlight.32 We stick with the two-sentence window, as it performed on a par with the five-sentence window in the previous baselines.
local1_SVM.
On the basis of baseline3_SVM, we replace the lexical features of Poesio et al. (2004a) with our preposition pattern features (f5 and f6 from Table 12), keeping their salience features (see Table 15).
local2_SVM.
On the basis of local1_SVM, all other features from Table 12 (i.e., f1–f4, f7–f18) are added.
local3_SVM.
On the basis of local2_SVM, we apply our new method (d-scope-salience, Section 5.3) to select antecedent candidates for bridging anaphors.
local1_SVM already outperforms the three baselines (baseline1_NB, baseline2_NB, and baseline3_SVM) by about 10 percentage points (Table 15). This is due to normalizing the preposition pattern feature (Equation (27) in Section 5.2.2) and generalizing it (from the preposition of to appropriate prepositions for each anaphor) to capture more diverse semantic relations. Importantly, our preposition pattern feature needs no more resources than the original Google distance feature of Poesio et al. (2004a), as it only depends on counts from unannotated corpora. The significant improvement of local2_SVM indicates the contribution of our other features; however, some of these features need additional annotation from OntoNotes (such as the syntactic annotation), so this scenario is more idealized. Further improvement is achieved by local3_SVM, which shows the positive impact of our advanced antecedent candidate selection strategy.
Model | Features | Ante. candidate list | Setting | Acc.
---|---|---|---|---
baseline1_NB | Poesio features | 2-sentence-window | mention-entity | 18.9
baseline2_NB | Poesio features | 5-sentence-window | mention-entity | 18.4
baseline3_SVM | Poesio features | 2-sentence-window | mention-entity | 19.8
local1_SVM | Poesio salience features + PrepPattern features (f5 and f6 from Table 12) | 2-sentence-window | mention-entity | 29.1
local2_SVM | Poesio salience features + all features from Table 12 | 2-sentence-window | mention-entity | 39.3
local3_SVM | Poesio salience features + all features from Table 12 | d-scope-salience | mention-entity | 46.0
5.4.4. Evaluation of the Joint Inference Model.
Simply porting our local model to MLNs (without joint modeling and sibling anaphor clustering) does not improve performance (see model local_MLN in Table 16). The model jointme is the joint inference system described in Section 5.1, with all features for sibling anaphor clustering (Section 5.2.1) on top of all features for bridging antecedent selection (Section 5.2.2), in the mention-entity setting. We use thebeast to learn weights for the formulas and to perform inference.33 jointme performs significantly better than the two local models (Table 16), confirming our assumption that the additional information from sibling anaphor clustering helps to resolve bridging anaphora.
 | Setting | Model | Accuracy
---|---|---|---
local | mention-entity | local3_SVM | 46.0
 | mention-entity | local_MLN | 46.4
joint inference | mention-entity | jointme | 50.7
 | mention-mention | jointmm | 39.8
 | mention-entity/mention | jointme_mm | 44.2
The system jointmm includes the same features, sibling clustering, and antecedent selection as jointme but is trained and tested in the mention-mention setting. jointme_mm is trained in the mention-entity setting but tested in the mention-mention setting.
jointme_mm performs significantly better than jointmm. Training the model in the mention-entity setting represents the phenomenon better than training in the noisy mention-mention setting.
5.4.5. Error Analysis.
We conducted an error analysis of our best model, jointme. First, anaphors with long-distance antecedents are harder to resolve (see Table 17).
Sentence distance | #pairs | jointme
---|---|---
0 | 175 | 59.4
1 | 260 | 46.9
2 | 90 | 50.0
≥3 | 158 | 44.3
We now distinguish between sibling anaphors and non-sibling anaphors. The performance of jointme is 62.2% on sibling anaphors but only 34.8% on non-sibling anaphors. Global salience and links between related anaphors do indeed help to capture the behavior of sibling anaphors.
The semantic knowledge we have is still insufficient. Typical problems are:
• Cases with context-specific bridging relations. For example, in one text about the stealing of sago palms in California, we found the anaphor the thieves with the antecedent palms, which is not a very common semantic link.
• More frequently, cases where several antecedents are plausible from a semantic perspective. For example, two laws are discussed, and the subsequent anaphor the veto could be the veto of either bill. Integrating wider context beyond the two noun phrases in question is necessary in these cases. This can include the semantics of modification, whereas we currently consider only head nouns: the anaphor the local council would preferably be interpreted as the council of a village rather than the council of a state due to the occurrence of local.
• Finally, 6% of the anaphors in our corpus have a non-NP antecedent. As we only extract NPs as potential antecedent candidates, we cannot handle these.
6. Unrestricted Bridging Resolution
Unrestricted bridging resolution recognizes bridging anaphors (beyond definite NPs only) and also finds links to antecedents (beyond meronymic relations only).
6.1. Method
We combine the two models from the previous two sections in a pipeline (Figure 5). Given extracted mentions, the system first predicts bridging anaphors by applying cascading collective classification (Section 4). It then predicts antecedents for these bridging anaphors (in the mention-mention setting) by applying joint inference trained in the mention-entity setting (Model jointme_mm in Section 5).34
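The pipeline itself is a thin composition of the two stages. In this sketch, recognize_is and select_antecedent stand in for the trained models of Sections 4 and 5; the function names and the label string are illustrative assumptions.

```python
# Thin sketch of the two-stage pipeline: IS recognition (Section 4) feeds
# predicted bridging anaphors into antecedent selection (Section 5,
# trained mention-entity, applied mention-mention).
def resolve_bridging(mentions, recognize_is, select_antecedent):
    is_labels = recognize_is(mentions)                        # stage 1
    anaphors = [m for m, lab in zip(mentions, is_labels)
                if lab == "mediated/bridging"]
    return {a: select_antecedent(a, mentions) for a in anaphors}  # stage 2
```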
6.2. Experiments and Results
We conduct experiments on ISNotes via 10-fold cross-validation on documents and use an evaluation metric based on the number of bridging anaphors. The system predicts one unique antecedent for each predicted bridging anaphor; a link counts as correct if the system recognizes the bridging anaphor correctly and links it to an instantiation of its antecedent preceding the anaphor. We report recall, precision, and F-score and use the randomization test35 on F-score for statistical significance.
6.2.1. Baseline.
We compare our pipeline model to a learning-based baseline (pairwise model), adapted from the pairwise model widely used in coreference resolution (Soon, Ng, and Lim 2001).36 In the pairwise model, we first create an initial list of possible bridging anaphors Aml, excluding as many obvious non-bridging mentions as possible: a mention is added to Aml if it (1) does not contain any other mentions, (2) is not modified by premodifiers that strongly indicate comparative NPs, and (3) is not a pronoun or a proper name. Then, for each NP a ∈ Aml, a list of antecedent candidates Ca is created by including all mentions preceding a in the same and the previous two sentences.37 We create a pairwise instance (a, c) for every c ∈ Ca. In the decoding stage, the best-first strategy (Ng and Cardie 2002) is used to predict bridging links: for each a ∈ Aml, we predict the bridging link to be the most confident pair (a, cante) among all positively classified instances. We provide this pairwise model with the same non-relational features as our two-stage model (Section 6.1), that is, the features from Table 6 in Section 4.3.2 and Table 12 in Section 5.2.2, and use SVMlight to conduct the experiments.38
6.2.2. Results and Discussion.
Our pipeline model significantly outperforms the baseline (Table 18). Although the baseline models bridging anaphora recognition and antecedent selection together, it suffers from fewer positive training instances for each subtask because of its antecedent candidate selection strategy. In addition, the diverse bridging relations in ISNotes, especially the many context-specific relations such as pachinko – devotees or palms – the thieves, leave few training instances per relation type, so generalization is difficult for the learning-based approach. This is in line with our earlier work (Hou, Markert, and Strube 2014), in which we proposed a rule-based system for full bridging resolution on the same corpus; there, the rule-based system performed better than a learning-based (pairwise) approach with access to the same knowledge resources. Although the two-stage model outperforms our earlier rule-based system by 3.0 F-score points on bridging resolution, the result is still not satisfactory. This is due to the moderate performance in both stages: on bridging anaphora recognition, our best model (CascadedCollective) achieves an F-score of 46.1%, and the errors in this stage are propagated to the second stage, where our best model (jointme) chooses antecedents for gold bridging anaphors with an accuracy of 50.7%. In future work, we would like to provide more training data to check whether the two-stage model benefits from it.
7. Conclusions
We presented the ISNotes corpus, which is annotated for a wide range of information status categories and full anaphoric information for the main anaphora types (i.e., coreference, bridging, and comparative). We developed a two-stage system for full bridging resolution, where bridging anaphors are not limited to definite NPs and bridging relations are not limited to meronymy. We proposed two joint inference models for information status recognition (including bridging recognition) and bridging antecedent selection, respectively. Our system achieves state-of-the-art performance or better for the three tasks (i.e., IS and bridging anaphora recognition, bridging antecedent selection, and bridging resolution) over reimplementations of previous approaches on ISNotes.
There are several open problems to be addressed. First, the results of our system might improve with more annotated data. Given the difficulty of the task itself, we cannot expect a large-scale bridging corpus reliably annotated by linguists to appear any time soon; one option is to harvest potential bridging pairs via semi-supervised or unsupervised learning and combine these with expert/non-expert annotations. Second, our method should be tested in other scenarios, such as on other genres and in less idealized conditions (e.g., automatically parsed corpora). Third, classifying bridging relations into fine-grained categories could be useful for other NLP applications, such as relation extraction across sentence boundaries and machine reading. Finally, bridging resolution, textual entailment, and implicit semantic role labeling are three standard NLP tasks that share some properties and partially overlap. Recently, there have been a few efforts to “bridge” the boundaries between these tasks: Mirkin, Dagan, and Padó (2010) show that textual entailment recognition can benefit from bridging resolution, and Stern and Dagan (2014) improve textual entailment recognition by exploring implicit semantic role labeling. It would be interesting to further explore the interactions between these tasks, for example, whether bridging anaphora recognition can benefit from the rich annotated data in FrameNet (e.g., Null Instantiations) or whether lexico-semantic resources widely used in textual entailment systems can be exploited for bridging resolution.
Acknowledgments
This work has been supported by the Research Training Group Coherence in Language Processing at Heidelberg University. Katja Markert received a Fellowship for Experienced Researchers by the Alexander-von-Humboldt Foundation. We thank HITS gGmbH for hosting Katja Markert and funding the annotation and the anonymous reviewers for their valuable feedback.
Notes
All examples, if not specified otherwise, are from OntoNotes (Weischedel et al. 2011). Bridging anaphors are typed in boldface, antecedents in italics throughout this article.
Unfortunately, as we explain in Section 2.2, no other English corpus that is immediately usable for the full problem of bridging resolution is currently available for us to test our system on.
Quantitative results for bridging recognition are very similar to the previous framework, however.
Prince (1992) also gives examples of indefinite bridging cases, so our observation is not new.
Comparative anaphors are typed in boldface, antecedents in italics.
Nissim et al. (2004) view comparative anaphora as a subset of bridging. We distinguish them as their recognition (via lexical clues) and their resolution (often type matches) differ from other bridging cases.
Antecedents for old mentions are from the OntoNotes coreference annotation.
In ISNotes, only 2.6% of bridging anaphors have at least two antecedents. Our automatic system currently cannot deal with such cases—we leave this for future work.
The κ values for the fine-grained scheme are higher than for the coarse-grained one. The hierarchical scheme is organized such that a category lower down the tree is more often confused with a category higher up in a different branch of the tree than with its direct siblings in the tree (i.e., mediated/bridging mentions are often confused with new mentions whereas some mediated categories such as mediated/syntactic or mediated/comparative are very easy to recognize).
The low reliability of category function, when involving Annotator B, is explained by Annotator B forgetting about this category completely and only using it once. When two annotators remembered the category, it was easy to annotate reliably (κ 83.2 for the pairing A-C).
A small portion of anaphors has more than one antecedent. Therefore, the number of anaphor-antecedent pairs (683) is slightly higher than the number of anaphors (663) (Section 3.1 and Example (7)).
We thank an anonymous reviewer for bringing up this example.
fg and fl correspond to Fg and Fl, respectively, in Equation (2).
The full list is: {other, another, such, different, similar, additional, comparable, same, further, extra}.
We extract increase/decrease verbs from the General Inquirer lexicon (Stone et al. 1966). The list contains the verbs {increase, raise, rise, climb, swell, ascend, jump, leap, scale, stretch, become, double, extend, grow, improve, strengthen, fall, drop, cut, slow, ease, reduce, descend, lower, slip}.
Note that we use the notion of a coherence gap as missing entity coherence to all previous sentences, not just the adjacent one as discussed in Grosz, Joshi, and Weinstein (1995).
The whole list is: {final, first, last, next, prior, succeeding, second, nearby, previous, close, above, adjacent, behind, below, bottom, early, formal, future, before, after, earlier, later}.
In 10-fold cross-validation, we have 45 training documents in each fold. In fold0, the ground Markov network of Collective for the first training instance contains 5,831 variables and it takes around 35 minutes on an 8 CPU core machine to train the model.
The improvement is not due simply to a switch from SVMs to MLNs. If we run the MLN without the novel relational features, we obtain performance comparable but slightly lower than SVMs.
We do not model that bridging anaphors have multiple antecedent entities (Example (7)).
We could also use word embeddings as similarity measures. The focus of this article is not on similarity measures but on the joint optimization of antecedent selection. We therefore leave the investigation of different similarity measures to future work.
A variation of Dunning log-likelihood ratio (Dunning 1993) proposed by Dunning in http://mail-archives.apache.org/mod_mbox/mahout-user/201001.mbox.
The query form (i.e., subject-verb, verb-object, or preposition-object) is decided by the syntactic relation between the anaphor and its dependent verb/preposition.
The texts in OntoNotes are not shown with headlines. However, the same texts are included in the Tipster corpus, from which we can extract the headlines.
Note that the bridging anaphor a is not coreferent to these other NPs with the same head word. Otherwise, its information status would be old and not bridging.
Semantic class constraints (f1–f4 in Table 12) strongly indicate bridging. Hence, the antecedent access scope of an anaphor in these constraints is not strongly connected to the anaphor’s discourse scope.
The parameter is estimated using a grid search over p ∈ {0.05, 0.1, 0.15, 0.2, 0.25, 0.3}, k ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, and t ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
Poesio et al. (2004a) used a five sentence window for antecedent candidate selection, because all antecedents in their corpus are within the previous five sentences of the anaphors.
We replace naive Bayes with SVMlight because it can potentially deal better with imbalanced data. The SVMlight parameter that handles data imbalance is set according to the ratio between positive and negative instances in the training set.
During training, we have 45 training instances in each fold. In fold0, the ground Markov network of jointme for the first training instance contains 2,361 variables, and it takes around 3 minutes on an 8 CPU core machine to train the model.
We use this model for antecedent selection for the pipeline model, as having full entity and coreference information in the test data is unrealistic.
We use the package from https://github.com/smartschat/art.
In Hou, Markert, and Strube (2014), we reimplement a previous rule-based system (Vieira and Poesio 2000) as the baseline. It suffers from a very low recall because it only considers meronymy bridging and compound noun anaphors whose head is prenominally modified by the antecedent head. Therefore, we do not include it in this article as a baseline.
Initial experiments showed that increasing the window size more than two sentences decreases the performance.
To deal with data imbalance, the SVMlight parameter is set according to the ratio between positive and negative instances in the training set.