Text-based NP Enrichment

Understanding the relations between entities denoted by NPs in a text is a critical part of human-like natural language understanding. However, only a fraction of such relations is covered by standard NLP tasks and benchmarks nowadays. In this work, we propose a novel task termed text-based NP enrichment (TNE), in which we aim to enrich each NP in a text with all the preposition-mediated relations -- either explicit or implicit -- that hold between it and other NPs in the text. The relations are represented as triplets, each denoted by two NPs related via a preposition. Humans recover such relations seamlessly, while current state-of-the-art models struggle with them due to the implicit nature of the problem. We build the first large-scale dataset for the problem, provide the formal framing and scope of annotation, analyze the data, and report the results of fine-tuned language models on the task, demonstrating the challenge it poses to current technology. A webpage with a data-exploration UI, a demo, and links to the code, models, and leaderboard, to foster further research into this challenging problem can be found at: yanaiela.github.io/TNE/.


Introduction
A critical part of understanding a text is detecting the entities in the text, denoted by NPs, and determining the different semantic relations that hold between them. Some semantic relations between NPs are explicitly mediated via verbs, as in (1):
(1) Water enters the plant through the roots.
Much work in NLP addresses the recovery of such verb-mediated relations (SRL) (Gildea and Jurafsky, 2002; Palmer et al., 2010), either using pre-specified role ontologies such as PropBank or FrameNet (Palmer et al., 2005; Ruppenhofer et al., 2016), or, more recently, using natural-language-based representations (QA-SRL) (He et al., 2015; FitzGerald et al., 2018). Another well-studied kind of semantic relation between NPs is that of coreference (Vilain et al., 1995; Pradhan et al., 2012), where two (or more) NPs refer to the same entity.

Crown Princess Mary of Denmark gives birth to male child
Her Royal Highness Crown Princess Mary of Denmark has given birth to a healthy baby boy at a Copenhagen hospital at approximately 1:57 am local time this morning, ending many months of waiting for the Royal Family, the Danish public and much of the world. The baby weighed in at 3.5 kilograms and 51 centimeters long.
Figure 1: Preposition-mediated relations between NPs in a text. NPs with the same color designate the same entity (co-refer). Gray boxes show all the preposition-mediated relations for a single NP anchor (some are indicated with "..." for brevity). This figure shows a title and a single short paragraph. The texts in our dataset span 3 paragraphs.
Such NP-NP relations, which are either mediated by verbs (as in SRL or Relation Extraction) or form coreference relations, represent only a subset of the NP-NP relations that are naturally expressed in texts. Consider, for instance, the following sentences:
(2) A person with brown eyes crossed the street.

Students [at Amirkabir University] who protested when Iranian President Mahmoud Ahmadinejad visited Amirkabir University of Technology (also called Amir Kabir University) on December 11, 2006 have been expelled and eligibility notices, allowing the students [at Amirkabir University] to be enlisted into the armed forces, were sent out with the signature [of university chancellor Ali Reza Rahai [of Amirkabir University] ], the "Guardian" reports.
54 students [
Ahmadinejad was cited as saying that dissenting students [at Amirkabir University] would go unpunished. "It is my honour [of Ahmadinejad] to burn for the sake [of the nation's ideals] and defend the system," he said as protesters [

Iranian student protesters face expulsion
Students who protested when Iranian President Mahmoud Ahmadinejad visited Amirkabir University of Technology (also called Amir Kabir University) on December 11, 2006 have been expelled and eligibility notices, allowing the students to be enlisted into the armed forces, were sent out with the signature of university chancellor Ali Reza Rahai, the "Guardian" reports.
54 students were expelled and most of them were a part of the protest, which included chants of "Death to the dictator".
The official reason for the expulsions is that the students failed multiple tests. Activists, however, claim that other students with similarly poor academic records have been allowed to continue their studies.
Ahmadinejad was cited as saying that dissenting students would go unpunished. "It is my honour to burn for the sake of the nation's ideals and defend the system," he said as protesters burned a picture of him.

Input text
Figure 2: NP-enriched document from the dataset. The title appears in a larger font, and the NPs in the document are underlined. The green NP-enrichments appear explicitly in the original text, while the red ones do not, and are typically harder to infer. For brevity, each link in the text mentions only one of the NPs in a coreference cluster. The dataset has additional links to the other NPs in the cluster.
All of the above cases contain examples of NP-NP relations, where the type of relation can be expressed via an English preposition. The preposition may be explicit in the text, as in (2), where the relation A person with brown eyes is explicitly expressed, or it may be implicit and left to the reader to infer, as in (3)-(4); in (3) readers easily infer that the roots are of the plant. Likewise, in (4) readers infer that the window is in the room.¹ Properly understanding the text means knowing that these relations hold, even when they are not explicitly stated in the utterance. Figure 1 shows additional examples. These relations, both explicit and implicit, are indispensable for understanding the text. While human readers infer these relations intuitively and spontaneously while reading, machine readers generally ignore them. In this work, we thus propose a new NLU task in which we aim to recover all the preposition-mediated relations, whether explicit or implicit, between NPs that exist in a text. We call this task Text-based NP Enrichment, or TNE for short.
¹ Here, both "in" and "of" are possible prepositions, but "in" is slightly more specific.
The short examples (2)-(4) that illustrate the phenomenon may not look challenging to infer using current NLP technology. However, when we go beyond sentence level to document level, things become substantially more complicated. As we demonstrate in §6, a typical 3-paragraph text in our dataset has an average of 35.8 NPs, which participate in an average of 186.7 preposition-mediated relations, the majority of which are implicit. Figure 2 shows a complete annotated document from our dataset.
The type of information recovered by the NP Enrichment task complements well-established core NLP tasks such as entity typing, entity linking, coreference resolution, and semantic-role labeling (Jurafsky and Martin, 2009). We believe it serves as an important and much-needed building block for downstream applications that require text understanding, including information retrieval, relation extraction and event extraction, question answering, and so on. In particular, the NP Enrichment task neatly encapsulates a lot of the long-range information that is often required by such applications. Take for example a system that attempts to extract reports on police shooting incidents (Keith et al., 2017), with the following challenging, but not uncommon, passage:²
Police officers spotted the butt of a handgun in Alton Sterling's front pocket and saw him reach for the weapon before opening fire, according to a Baton Rouge Police Department search warrant filed Monday that offers the first police account of the events leading up to his fatal shooting.
Considering this shooting-event passage, an ideal coreference model will resolve his to Alton Sterling's, making the entity being shot local to the shooting event. On top of that, an ideal NP Enrichment model as we propose here will also recover: making the shooter identity local to the shooting event as well, ready for use by a downstream event-argument extractor or machine reader.
Of course, one could hope that a dedicated, end-to-end-trained shooting-events extraction model will learn to recover such information on its own. However, such a model requires pre-defining the frame of shooting events, and it requires a substantial amount of training data to get it right (which often does not happen in practice). Focusing on Text-based NP Enrichment provides an opportunity to learn a core NLU skill that does not focus on a pre-defined set of relations, and is not specific to a particular benchmark. Finally, beyond its potential usefulness for downstream NLP applications, the Text-based NP Enrichment task serves as a challenging benchmark for reading comprehension, as we further elaborate in §3.
² We thank Katherine Keith for this example.
In what follows, we formally define the Text-based NP Enrichment task (§2) and its relation to reading comprehension (§3), and describe a large-scale, high-quality English TNE dataset we collected (§4) and its curation procedure (§5). We analyze the dataset (§6) and experiment with pretrained language model baselines (§7), achieving moderate but far-from-perfect success on this dataset (§8). We also analyze the best model, showcasing its strengths, weaknesses and open challenges (§9). We then discuss the relation of TNE to other linguistic concepts, such as bridging (Clark, 1975), relational nouns (Partee, 1983/1997; Loebner, 1985; Barker, 1995) and implicit arguments (Ruppenhofer et al., 2009; Meyers et al., 2004; Gerber and Chai, 2012; Cheng and Erk, 2019) (§10). We finally conclude that, in contrast to those linguistic tasks, the Text-based NP Enrichment task is more exhaustive, sharply scoped, easier to communicate, and substantially easier for non-experts to annotate consistently and to use.

Text-based NP Enrichment (TNE)
Task Definition The Text-based NP Enrichment task is deceptively simple: for each ordered pair $(n_1, n_2)$ of non-pronominal base-NP³ spans in an input text, determine whether there exists a preposition-mediated relation between $n_1$ and $n_2$, and if there is one, determine the preposition that best describes their relation.⁴ The output is a list of tuples of the form $(n_i, \mathrm{prep}, n_j)$, where $n_i$ is called the anchor and $n_j$ is called the complement of the relation. Figure 2 shows an example of a text where each NP $n_1$ is annotated with its $(\mathrm{prep}, n_2)$ NP-enrichments.
Despite the task's apparent simplicity, the underlying linguistic phenomena are quite complex, and range from simple syntactic relations to relations that require pragmatics, world-knowledge and common-sense reasoning. Performing well on the task suggests a human-like level of understanding. Notably, human readers detect most of the relations almost subconsciously when reading, while some of the relations require an extra conscious inference step.
Example Consider the following text: (5) Adam's father went to meet the teacher at his school.
The preposition-mediated relations to be recovered in this example are:⁵
(i) (father, of, Adam)
(ii) (the teacher, of, Adam)
(iii) (the teacher, at, his school)
(iv) (his school, of, Adam)
In each triplet, the first item is the anchor and the last item is the complement.
Order The order of appearance of NPs within the text does not matter: for a given pair of NPs $n_1$ and $n_2$, we consider both $(n_1, n_2)$ and $(n_2, n_1)$ as potential relation candidates, and it is possible that both relations will hold (likely with different prepositions). The only restriction is that an NP span cannot relate to itself. For a text with $k$ NPs, this results in $k^2 - k$ candidate pairs.
Scope In terms of the annotated relations, we are interested in the set of semantic relations that can be expressed in natural language via the use of a preposition. This identifies a rich, cohesive and well-scoped set of NP-NP relations that are not mediated by a verb and are not coreference relations. Importantly, we restrict ourselves to NPs that are mentioned in the text, excluding relations with NPs that reside in some text-external shared context. For example, consider the sentence: "The president discussed the demonstrations near the border". Here, the NPs "the president", "the border" and "the demonstrations" are all under-determined, and, to be complete, should relate to other NPs using preposition-mediated relations: president [of Country]; border [of CountryX] [with CountryY]; demonstration [by some-group] [about some-topic]. However, as these complement NPs do not appear in the text, we do not consider them to be part of the TNE task.
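The size of the candidate space described under Order can be illustrated with a minimal sketch. The function name `candidate_pairs` and the NP strings are illustrative assumptions, not part of the released code; the sketch only enumerates ordered pairs of distinct mentions and checks the $k^2 - k$ count.

```python
from itertools import permutations


def candidate_pairs(nps):
    """Return every ordered (anchor, complement) pair of distinct NP mentions.

    `nps` is a list of NP mentions in text order. Pairs are built over
    positions, so two mentions with identical strings remain distinct.
    """
    return [(nps[i], nps[j]) for i, j in permutations(range(len(nps)), 2)]


# The four NPs of Example (5); the strings are illustrative stand-ins for spans.
nps = ["father", "Adam", "the teacher", "his school"]
pairs = candidate_pairs(nps)

k = len(nps)
assert len(pairs) == k ** 2 - k  # 4^2 - 4 = 12 ordered pairs

# A document with 36 NPs (close to the dataset average of 35.8) already
# yields 36^2 - 36 = 1260 candidate pairs.
assert len(candidate_pairs(list(range(36)))) == 1260
```

Both directions of each pair are present, so a model must decide separately on (the teacher, his school) and (his school, the teacher).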

The Use of Prepositions as Semantic Labels
While the relations we identify between NPs can be expressed using prepositions, one could argue that using prepositions as semantic labels is not ideal, due to their inherent ambiguity (Schneider et al., 2015, 2016, 2018; Gessler et al., 2021): indeed, a preposition such as for has multiple senses, and can indicate a large set of semantic relations ranging from BENEFICIARY to DURATION.
We chose to use prepositions as relation labels, despite this ambiguity. This follows a line of annotation work that aims to express semantic relations using natural language (FitzGerald et al., 2018; Roit et al., 2020; Klein et al., 2020; Pyatkin et al., 2020), as opposed to works that used formal linguistic terms, traditionally relying on expert-defined taxonomies of semantic roles and discourse relations. The aforementioned works label predicate-argument relations using restricted questions. In the same vein, we label nominal relations using prepositions.
We argue that the preposition-based labels are useful for humans and machines alike: humans can easily understand the task (both as annotators and, perhaps more importantly, as consumers), and current machine learning models are quite effective at implicitly dealing with preposition ambiguity.⁶ Moreover, while the prepositions themselves are ambiguous, the (NP, prep, NP) triplet provides context which is, in many cases, sufficient to disambiguate the coarse-grained preposition sense.
We find that the preposition-based annotation has the following advantages: it clearly scopes the task with respect to the kinds of relations that are contained in it; and it is expressive, capturing a large class of interesting semantic relations. On top of that, the task and the corresponding relation set are easy to explain both to human annotators (which makes it possible to obtain high levels of agreement) and to human consumers of the model (allowing wider adoption, as the task and its output do not require special training to understand). Finally, the output can be easily fed into existing NLP systems, which already deal to a large extent with the inherent ambiguities of prepositions and prepositional phrases.
To conclude, we argue that despite the ambiguities of prepositions, they allow us to obtain a meaningful set of typed semantic links between NPs, which are well understood by people and can be effectively processed by NLP models. While the annotation can be refined to include a fine-grained sense annotation for each link, e.g., via a scheme such as that of Schneider et al. (2018), we leave such an extension to future work.
Coreference Clusters A common relation between NPs is that of identity, a.k.a. a coreference relation, where two or more NPs refer to the same entity. How do coreference relations relate to the NP Enrichment task? While the NP Enrichment task has so far been posed as inferring prepositional relations between NPs, in actuality the prepositional relations hold between an NP and a coreference cluster. Indeed, if there is a prepositional relation $\mathrm{prep}(n_1, n_2)$ and a coreference relation $\mathrm{coref\text{-}to}(n_2, n_3)$, we can immediately infer the link $\mathrm{prep}(n_1, n_3)$.⁷ We make use of this fact in our annotation procedure, and the dataset also includes the coreference information between all NPs in the text. Indeed, for brevity, Figure 2 shows only a subset of the relations, indicating for each anchor NP only a single complement NP from each coreference cluster. Some of the coreference clusters are shown at the bottom of the figure. Note that the coreference clusters are not part of the task's input or expected output.
⁷ Note that the converse does not hold: $\mathrm{prep}(n_1, n_2)$ and $\mathrm{coref\text{-}to}(n_1, n_3)$ do not necessarily entail $\mathrm{prep}(n_3, n_2)$. Consider for example: "The race began. John, the organizer, was pleased". While John and the organizer are coreferring, the relation organizer of the race holds, while John of the race does not. This is because John and the organizer are two different senses for the same reference, and the relation holds only for one of the senses (cf. Frege (1960)). Putting it differently, when John and organizer serve as predicates, their selectional preferences are different despite them coreferring. Such examples are common; consider also "John is Jenny's father, Mary's husband", where father of Jenny holds, while husband of Jenny doesn't. Similarly, husband of Mary holds, while father of Mary doesn't.
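The inference from an NP-to-mention link to NP-to-cluster links can be sketched as follows. This is a minimal illustration under assumptions: the function name `expand_over_coreference`, the integer NP ids and the input layout are hypothetical, and the sketch is not the dataset's actual construction code. It expands only the complement side, since the anchor side does not license the same inference.

```python
def expand_over_coreference(relations, clusters):
    """Propagate each relation to every mention coreferent with its complement.

    relations: set of (anchor_id, prep, complement_id) triplets.
    clusters:  list of lists of NP ids; NPs absent from all lists are singletons.
    """
    cluster_of = {}
    for cluster in clusters:
        for np_id in cluster:
            cluster_of[np_id] = cluster

    expanded = set()
    for anchor, prep, complement in relations:
        # prep(n1, n2) and coref-to(n2, n3) together give prep(n1, n3).
        for mention in cluster_of.get(complement, [complement]):
            if mention != anchor:  # an NP span cannot relate to itself
                expanded.add((anchor, prep, mention))
    return expanded


# NP 0 relates to NP 1 via "of"; NPs 1 and 4 corefer, so (0, "of", 4) is inferred.
assert expand_over_coreference({(0, "of", 1)}, [[1, 4]]) == {(0, "of", 1), (0, "of", 4)}

# "The race" = 0, "John" = 1, "the organizer" = 2: organizer-of-race holds,
# but nothing is propagated to the coreferent anchor "John".
race = expand_over_coreference({(2, "of", 0)}, [[1, 2]])
assert race == {(2, "of", 0)} and (1, "of", 0) not in race
```

Expanding on the complement side only mirrors the asymmetry in the "John, the organizer" example.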
Formal Dataset Description An input text is composed of tokens $w_1, \dots, w_t$ and an ordered set $N = n_1, \dots, n_k$ of base-NP mentions. The underlying text is often arranged into paragraphs, and may also include a title. A base-NP mention, also known as an NP chunk, is the smallest noun phrase unit that does not contain other NPs, prepositional phrases or relative clauses.⁸ It is defined as a contiguous span over the text, indicated by start-token and end-token positions (e.g., (3, 5) "the young boy"). The output is a set $R$ of relations of the form $(n_i, \mathrm{prep}, n_j)$, where $i \neq j$ and $\mathrm{prep}$ is a preposition (or a set-membership symbol). Each text is also associated with a set $C$ of non-overlapping coreference clusters, where each cluster $c \subseteq N$ is a non-empty list of NP mentions. The set of clusters is not provided as input, but for correct sets $R$ it holds that $\forall n_{j'} \in c(n_j):\ (n_i, \mathrm{prep}, n_j) \in R \Rightarrow (n_i, \mathrm{prep}, n_{j'}) \in R$, where $c(n_j)$ denotes the cluster containing $n_j$.

Completeness and Uniformity
The kinds of preposition-mediated relations we cover originate from different linguistic or cognitive phenomena, and some of them can be resolved by employing different linguistic constructs. For example, some within-sentence relations can be extracted deterministically from dependency trees, e.g., by following syntactic prepositional attachment. Other relations can be inferred based on pronominal coreference (e.g., "his school [of Adam]" above can be resolved by first resolving "his" to "Adam's" via a coreference engine, and then normalizing "Adam's school" → "school of Adam"). Many others are substantially more involved. We deliberately chose not to distinguish between the different cases, and expose all the relations to the user (and to the annotators) via the same uniform interface. This approach also contributes to the practical usefulness of the task: instead of running several different processes to recover different kinds of links, the end-user will have to run only one process to obtain them all.
Evaluation Metrics Our main metrics for evaluating NP enrichment tasks are precision, recall, and F1 on the recovered triplets (links) in the document. For analysis, we also report two additional metrics: precision/recall/F1 on unlabeled links (where the preposition identity does not matter), and accuracy of predicting the right preposition when a gold link is provided. We break this last metric into two quantities: accuracy of predicting the preposition for gold links that were recovered by the model, and accuracy of prepositions for gold links that were not recovered.
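The link-level metrics can be sketched as set comparisons. This is a minimal illustration, not the official scorer: the function names `prf` and `evaluate` are hypothetical, and it assumes at most one preposition per ordered NP pair. Preposition accuracy on gold links that were not recovered is left out, because it needs the model's preposition guess for pairs it labeled as unrelated, which a set of predicted triplets does not contain.

```python
def prf(gold, pred):
    """Precision, recall and F1 between two sets of items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def evaluate(gold, pred):
    """gold, pred: sets of (anchor_id, prep, complement_id) triplets."""
    gold_prep = {(a, c): p for a, p, c in gold}
    pred_prep = {(a, c): p for a, p, c in pred}
    # Gold links that the model also predicted, regardless of preposition.
    recovered = set(gold_prep) & set(pred_prep)
    correct = sum(gold_prep[link] == pred_prep[link] for link in recovered)
    return {
        "labeled": prf(gold, pred),                        # preposition must match
        "unlabeled": prf(set(gold_prep), set(pred_prep)),  # preposition ignored
        "prep_acc_recovered": correct / len(recovered) if recovered else 0.0,
    }


gold = {(0, "of", 1), (2, "at", 3), (2, "of", 1)}
pred = {(0, "of", 1), (2, "in", 3), (3, "of", 1)}
scores = evaluate(gold, pred)
# One exact triplet match out of three; two of three links match once the
# preposition is ignored; one of the two recovered links has the right preposition.
assert abs(scores["labeled"][2] - 1 / 3) < 1e-9
assert abs(scores["unlabeled"][2] - 2 / 3) < 1e-9
assert scores["prep_acc_recovered"] == 0.5
```

Reporting labeled and unlabeled scores side by side separates link-detection errors from preposition-selection errors.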

TNE as a Reading Comprehension Benchmark
While reading comprehension (RC) and question answering (QA) are often used interchangeably in the literature, measuring the reading comprehension capacity of models via question answering, as implemented in benchmarks such as SQuAD (Rajpurkar et al., 2016), BoolQ (Clark et al., 2019) and others, has several well-documented problems (Dunietz et al., 2020). We argue that the TNE task we propose herein has properties that make it more appealing than QA for assessing RC. First, benchmarks for extractive (span-marking) QA are sensitive to the span-boundary selection; on the other hand, benchmarks for yes/no, multiple choice or generative questions can in principle be answered in a way which is completely divorced from the text. On a more fundamental level, all QA benchmarks are very sensitive to lexical choices in the question and its similarity to the text. Furthermore, QA benchmarks rely on human-authored questions that are easy to solve based on surface artifacts. Finally, in many cases, the existence of the question itself provides a huge hint towards the answer (Kaushik and Lipton, 2018).
The underlying cause for all of these issues is that QA-based setups do not measure the comprehension of a text, but rather the comprehension of a (text, question) pair, where the question adds a significant amount of information, focuses the model on specific aspects of the text, and exposes the evaluation to biases and artifacts. The reliance on human-authored questions makes QA a bad format for measuring "text understanding": we are likely measuring something else, such as the ability of the model to discern patterns in human question-writing behavior.
The TNE task we define side-steps all the above issues. It is based on the text alone, without revealing additional information not present in the text. The exhaustive nature of the task entails looking both at positive instances (where a relation exists) and negative ones (where it doesn't), making it harder for models to pick up shallow heuristics. We don't reveal information to a model, beyond the information that the two NPs appear in the same text. Finally, the list of NPs to be considered is pre-specified, isolating the problem of understanding the relations between NPs in the text from the much easier yet intervening problem of identifying NPs and agreeing on their exact spans.
Thus, we consider TNE a less biased and less gameable measure of RC than QA-based benchmarks. Of course, the information captured by TNE is limited and does not cover all levels of text understanding. Yet, performing the task correctly entails a non-trivial comprehension of texts, which human readers achieve as a byproduct of reading.

Text-based NP Enrichment Dataset
We collect a large-scale TNE dataset, consisting of 5.5K documents in English (3,988 train, 500 dev, 500 in-domain test, and 509 out-of-domain test). It covers about 200K NPs and over 1 million NP relations. The main domain is WikiNews articles, and the out-of-domain (OOD) texts are split evenly between reviews from IMDB, fiction from Project Gutenberg, and discussions from Reddit.
Each annotated document consists of a title and 3 paragraphs of text, and contains a list of non-pronominal base-NPs (most identified by SpaCy (Honnibal et al., 2020)⁹ but some added manually by the annotators), a list of coreference clusters over the NPs, and a list of NP-relations that hold in the text. Each relation is a triplet consisting of two NPs from the NP list, and a connecting element which is one of 23 prepositions (displayed in Table 1)¹⁰ or a "member(s) of" relation designating set-membership. The list of NP relations is exhaustive, and aims to cover all and only valid NP-NP relations in the document.

of, against, in, by, on, about, with, after, to, from, for, among, under, at, between, during, near, over, before, inside, outside, into, around
Table 1: Prepositions used in TNE.

Annotation Procedure
We propose a manual annotation procedure for collecting a large-scale dataset for the TNE task. Considering all $k^2 - k$ NP pairs (with an average $k$ of 35.8 in our dataset) is tedious, and, in our experience, results in mistakes and inconsistencies. In order to reduce the size of the space and improve annotation speed, quality, and consistency, we opted for a two-stage process, where the first stage includes the annotation of coreference clusters over mentions, and the second stage involves NP Enrichment annotation over the clusters from the first stage. We find that this two-stage process dramatically reduces the number of decisions that need to be taken, and also improves recall and consistency by reducing the cognitive load of the annotators, focusing them on a specific mode at each stage. We describe the two stages below.
Stage 1: Annotating Coreference Clusters We start by collecting coreference clusters, as well as discarding non-referring NPs that are "irrelevant" for the next stage (such as time expressions). We created a dedicated user interface to facilitate this procedure (Figure 3a). The annotators go over the NPs in the text in order, and, for each NP, indicate if it is (a) a new mention (forming a new cluster); (b) "same as" (coreferring to an entity in an existing cluster initiated earlier); (c) a time or measurement expression; (d) an idiomatic expression. At each point, the annotators can click on a previous NP to return to it and revise their decisions.
The OOD documents and the test-set documents were annotated by two annotators in order to measure agreement. The two annotations were then consolidated by one of the paper's authors to ensure high-quality annotations.
Stage 2: Annotating NP-relations The second step is the NP Enrichment relation annotation. The annotators are exposed to a similar interface (Figure 3b). For each NP, they are presented with all the coreference clusters, and must indicate for each cluster if there is a preposition-mediated relation between the NP and the cluster.

Disneyland marks 50th anniversary May 8, 2005
The American theme park Disneyland celebrated its golden anniversary on Thursday, in Anaheim, California.
The theme park, created and opened by Walt Disney in 1955, hosted a special gala opening at 1pm in front of Sleeping Beauty Castle, the world-famous icon of Disneyland.
CEO of the Walt Disney Company Michael Eisner, COO Bob Iger, honorary Disneyland Ambassador Julie Andrews and a host of stars opened the celebration, which is scheduled to run for eighteen months.

For this stage, all documents are annotated by two annotators and undergo a consolidation step. The consolidation over the two annotators is performed by a third annotator, who did not see the document before. This annotator is presented with the interface shown in Figure 3c. The consolidator sees all the relations created by the two preceding annotators, and decides which of them are correct.¹¹,¹²
Table 2: Agreement scores on the different annotation parts. We report both the coreference CoNLL scores, and the metrics of NP Enrichment calculated on the consolidated annotations.

Annotators
We trained and qualified 23 workers on the Amazon Mechanical Turk (AMT) platform to participate in the coreference, NP-relations, and consolidation tasks. We follow the controlled crowdsourcing protocol suggested by Roit et al. (2020) and Pyatkin et al. (2020), giving detailed instructions, training the workers, and providing them with ongoing personalized feedback for each task. We paid $1.5, $2.5, and $1.5 for each HIT in the coreference, NP-relations, and consolidation tasks respectively. The price for the NP-relations task was raised to $2.7 for the test and out-of-domain subsets. We additionally paid bonus payments on multiple occasions. Overall, we aimed at paying at least the minimum wage in the U.S.

Inter-annotator Agreement
We report the agreement scores for the coreference and the consolidated relation annotations. The full results, broken down by split, are reported in Table 2. The IPrep-Acc and UPrep-Acc metrics measure the preposition-only agreement (whether the annotators chose the same preposition for a given identified NP-pair), and are discussed in §9.1.
Coreference We follow Cattan et al. (2021) and evaluate the coreference agreement scores after filtering singleton clusters. We report the standard CoNLL-2012 score (Pradhan et al., 2012), which combines three coreference metric scores. The in-domain test score¹³ is 82.1, while the OOD score is 77.1. For comparison with the most dominant coreference dataset, OntoNotes (Weischedel et al., 2013), which only reported the MUC agreement score (Grishman and Sundheim, 1996), we also measure the MUC score on our dataset. The MUC score on our dataset is 83.6, compared to 78.4-89.4 in OntoNotes, depending on the domain (Pradhan et al., 2012). It is worth noting that on the Newswire domain of OntoNotes (Weischedel et al., 2013), the domain that is most similar to ours, the score is 80.9, which indicates a high quality of annotation in our corpus. We expect the quality of our final coreference data to be even higher due to the consolidation step that was done by an expert on the test set and OOD splits.
... with different prepositions (e.g., Ex. (4)). This may increase the number of possible relations in a given document from $k^2 - k$ possible pairs to $(k^2 - k) \cdot p$, where $p$ is the number of considered prepositions. However, in practice, having more than two prepositions for the same NP pair is not common, and two prepositions occur in 11.6% of the test-set. For simplicity, in this work, we consider a single preposition for each NP pair, but the collected data may contain two prepositions for some pairs.
NP-relations Next, we report agreement scores on the NP-relations consolidation annotation, which were measured on 10% of all the annotations. We use the same metrics as for the NP Enrichment task (§2), treating one of the annotations as gold and the other as the prediction. Thus we only report accuracy and F1 scores (precision and recall are symmetric: they swap when the roles of the two annotations are swapped). The Relation-F1 scores for the train and test are 89.8 and 94.4 respectively, while for the OOD it is 88.9. The preposition scores are almost perfect in all splits, with an average of 99.9 when the annotators agree on the link and 100.0 when they don't. Finally, the F1 scores also differ between splits: 89.6, 94.4, and 88.6 for the train, test, and OOD, respectively, but are overall high.

Dataset Statistics and Analysis
We report statistics of the resulting NP Enrichment dataset, and summarize them in Table 3. Overall, we collected 5,497 documents, with per-document averages of 35.8 NPs, 5.2 non-singleton coreference clusters, and 186.7 NP-relations. The average document has 163.3 tokens, and the longest document has 304 tokens.

Distribution of Prepositions
We analyze the prepositions in the relations we collected. We aggregate the prepositions of the test set from all relations and present their distribution in Figure 4. We only show prepositions that appear in at least 4% of the data, and the rest are aggregated together into the Other label. The most common preposition is of, followed by in, which constitute 23.9% and 19.8% of the prepositions in our data respectively. The rest of the prepositions are used much less frequently, with from and for appearing in 9.7% and 6.3% of the prepositions respectively. The least used preposition is into, which appears in 0.07% of the prepositions.
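The aggregation behind such a distribution can be sketched in a few lines. This is an illustrative assumption about how the figure could be produced, not the authors' plotting code; the function name `preposition_distribution` and the toy relations are hypothetical.

```python
from collections import Counter


def preposition_distribution(relations, min_share=0.04):
    """Share of each preposition; those below `min_share` are merged into "Other".

    relations: iterable of (anchor_id, prep, complement_id) triplets.
    """
    counts = Counter(prep for _, prep, _ in relations)
    total = sum(counts.values())
    distribution, other = {}, 0.0
    for prep, count in counts.items():
        share = count / total
        if share >= min_share:
            distribution[prep] = share
        else:
            other += share
    if other > 0:
        distribution["Other"] = other
    return distribution


# Toy data: 20 "of", 9 "in" and a single "into" relation (1/30 is below 4%).
toy = [(0, "of", 1)] * 20 + [(0, "in", 1)] * 9 + [(0, "into", 1)]
dist = preposition_distribution(toy)
assert set(dist) == {"of", "in", "Other"}
assert abs(sum(dist.values()) - 1.0) < 1e-9
```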

NP-relations
We provide some statistics that shed light on the nature of the preposition-mediated NP-NP relations in the annotated data. First, we measure the surface distance between NPs in the relations, in terms of token counts between the anchor and the complement. We found the average distance to be 53.7 tokens, a large distance on average between the two NPs, which demonstrates the task's difficulty. Backward-relations (as opposed to forward-relations) are relations where the complement appears before the anchor. 56.7% of the relations are backward. Sometimes, the string "anchor preposition complement" appears directly in the text. We call these cases Surface-Form. For instance, in Ex. (5), (the teacher at his school) is a Surface-Form relation. We computed the percentage of such relations in the data and found only 3.9% of them to be of this type. We also relax this definition and search for the preposition following the complement in a window size of 10 from the anchor, which we call Surface-Form+. The percentage of such cases remains low: 6.0% of the links. Symmetric relations are two relations between the same two NPs that differ in direction (and potentially in the preposition). For instance, in Figure 8, the following links are symmetric: (website, of, the owners) and (the owners, of, website). On average, there are 10.9 such symmetric relations in a document. Finally, transitive relations are sets of three NPs, a, b and c, that include relations between (a, b), (b, c) and (a, c) (the preposition identity is not relevant). We found an average of 97.3 transitive relations per document in total.
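Several of these statistics can be computed directly from the relation list and the NP spans, as in the sketch below. The counting conventions are assumptions, since the text does not fix them: distance is taken as the number of tokens strictly between the two spans, a symmetric pair is counted once, and transitive configurations are counted per ordered triple. The function name `relation_stats`, the integer NP ids and the inclusive span format are illustrative.

```python
from itertools import permutations


def relation_stats(relations, spans):
    """relations: iterable of (anchor_id, prep, complement_id) with integer ids.
    spans: dict mapping NP id -> (start_token, end_token), both inclusive.
    """
    links = {(a, c) for a, _, c in relations}  # preposition identity ignored

    def gap(a, c):
        # Number of tokens strictly between the two (non-overlapping) spans.
        (_, first_end), (second_start, _) = sorted([spans[a], spans[c]])
        return max(0, second_start - first_end - 1)

    n = len(links)
    nodes = {np_id for link in links for np_id in link}
    return {
        "mean_distance": sum(gap(a, c) for a, c in links) / n if n else 0.0,
        # Backward: the complement starts before the anchor.
        "backward_share": sum(spans[c][0] < spans[a][0] for a, c in links) / n if n else 0.0,
        # Unordered NP pairs linked in both directions.
        "symmetric_pairs": sum(1 for a, c in links if a < c and (c, a) in links),
        # Ordered triples (a, b, c) with links a->b, b->c and a->c.
        "transitive_triples": sum(
            1 for a, b, c in permutations(nodes, 3)
            if (a, b) in links and (b, c) in links and (a, c) in links
        ),
    }


spans = {0: (0, 1), 1: (4, 5), 2: (9, 9)}
relations = {(1, "of", 0), (0, "of", 1), (2, "in", 1), (2, "in", 0)}
stats = relation_stats(relations, spans)
assert stats["mean_distance"] == 3.5    # gaps of 2, 2, 3 and 7 tokens
assert stats["backward_share"] == 0.75  # three of the four links point backward
assert stats["symmetric_pairs"] == 1    # NPs 0 and 1 are linked both ways
assert stats["transitive_triples"] == 2
```

The triple enumeration is cubic in the number of NPs, which is acceptable for documents with a few dozen NPs.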
Explicit vs. Implicit NP Relations Next, we analyze the composition of the relations in the dataset, as to whether these relations are implicit or explicit. While there is no accepted definition of the explicit-implicit distinction in the literature (Carston, 2009; Jarrah, 2016), here we adapt a definition originally used by Cheng and Erk (2019) for another phenomenon, implicit arguments:14 in an implicit relation the anchor and the complement are not syntactically connected to each other and might not even appear in the same sentence. This implies, e.g., that any inter-sentential relations are implicit,15 while relations within one sentence can be either implicit or explicit. We sample three documents from the test set, containing 590 links in total, and count the number of relations of each type. Our manual analysis reveals that 89.8% of the relations are implicit.
Bridging vs. TNE Bridging has been extensively studied in the past decades, as we discuss in §10. Here, we explore how many of the relations we collected correspond to the definition of bridging. We use the same three documents from the analysis described above, and follow the annotation scheme from ISNotes1.0 (Markert et al., 2012) 16 to annotate them for bridging. We found that 15 out of the 590 links (2.5%) in these documents are bridging links (i.e., meet the criteria for bridging defined in ISNotes). These three documents contain 104 NPs, i.e., the ratio of bridging links per NP is 0.14. While the ratio is small, it is larger than the ratio in ISNotes, which contains 663 bridging links out of 11K annotated NPs (Hou et al., 2013b), i.e., 0.06 bridging links per NP.
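The ratios quoted in the bridging comparison follow directly from the counts given in the text. The short check below only reproduces that arithmetic; the helper name `ratio` is hypothetical, and "11K" is taken to mean 11,000 as an assumption.

```python
def ratio(count, total):
    """Plain ratio of two counts."""
    return count / total


bridging_links, total_links, num_nps = 15, 590, 104
isnotes_links, isnotes_nps = 663, 11_000  # "11K" read as 11,000 (assumption)

share_of_links = ratio(bridging_links, total_links)  # bridging share of all links
tne_per_np = ratio(bridging_links, num_nps)          # bridging links per NP, our sample
isnotes_per_np = ratio(isnotes_links, isnotes_nps)   # bridging links per NP, ISNotes

print(f"{share_of_links:.1%} {tne_per_np:.2f} {isnotes_per_np:.2f}")
```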

Deterministic Baselines
We explore multiple deterministic baselines that should expose regularities in the data that models may exploit (and that would therefore make the dataset easy to solve), and that provide further insights about our data. In these baselines we focus on detecting valid anchor/complement pairs, without considering the preposition's identity.
Title Link This baseline considers one of the title's NPs as the complement for each NP in the text. We experiment with three variants: Title-First, Title-Last and Title-Random, which use the first, last and a random NP in the title, respectively.

Adjacent Link
The second baseline predicts the adjacent NP as a complement. We have two variants: predict the next NP as the complement (Adj-Forward) or the previous NP (Adj-Backward).
Surface Link The third baseline predicts surface links in the text, i.e., links in which the string "anchor preposition complement" appears as-is in the text. For instance, in "Adam's father went to meet the teacher at his school" it will predict the link (the teacher, at, his school). We also experiment with Surface-Expand, a relaxed version which looks for the complement at a distance of up to 10 tokens following the anchor.
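The Title Link and Adjacent Link baselines can be sketched as below. This is an illustrative reading of the descriptions, not the released implementation: NPs are assumed to be given as lists of unique ids in text order, and the function names, the `variant`/`direction` arguments and the fixed random seed are all assumptions.

```python
import random


def title_link(text_nps, title_nps, variant="last", seed=0):
    """Link every text NP (anchor) to a single title NP (complement)."""
    rng = random.Random(seed)
    links = set()
    for anchor in text_nps:
        if variant == "first":        # Title-First
            comp = title_nps[0]
        elif variant == "last":       # Title-Last
            comp = title_nps[-1]
        else:                         # Title-Random
            comp = rng.choice(title_nps)
        links.add((anchor, comp))
    return links


def adjacent_link(nps, direction="forward"):
    """Link each NP to the next NP (Adj-Forward) or the previous NP (Adj-Backward)."""
    step = 1 if direction == "forward" else -1
    return {
        (nps[i], nps[i + step])
        for i in range(len(nps))
        if 0 <= i + step < len(nps)
    }


print(adjacent_link(["a", "b", "c"], "backward"))
```

Both functions return unlabeled (anchor, complement) pairs, matching the statement that these baselines ignore the preposition's identity.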
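A minimal sketch of the Surface Link baseline on the example sentence. The span format (token offsets, end exclusive), the small `PREPOSITIONS` set and the function name `surface_links` are assumptions for illustration. The exact matching rule of Surface-Expand is not spelled out in the text; here it is approximated by allowing the preposition to occur up to `window` tokens after the anchor, immediately followed by the complement.

```python
# Illustrative subset only; the dataset uses a larger preposition inventory.
PREPOSITIONS = {"of", "in", "at", "from", "for", "to", "on", "with", "by", "about"}


def surface_links(tokens, np_spans, window=0):
    """np_spans: list of (start, end) token offsets, end exclusive.
    window=0 is the strict surface form; window=10 approximates Surface-Expand."""
    starts = {}
    for span in np_spans:
        starts.setdefault(span[0], []).append(span)
    links = set()
    for anchor in np_spans:
        for gap in range(window + 1):
            p = anchor[1] + gap  # candidate preposition position
            if p < len(tokens) and tokens[p].lower() in PREPOSITIONS:
                for comp in starts.get(p + 1, []):
                    links.add((anchor, tokens[p].lower(), comp))
    return links


tokens = ["Adam's", "father", "went", "to", "meet", "the", "teacher", "at", "his", "school"]
spans = [(0, 2), (5, 7), (8, 10)]  # Adam's father / the teacher / his school
strict = surface_links(tokens, spans)
expanded = surface_links(tokens, spans, window=10)
print(strict)
print(expanded - strict)
```

In this toy case the relaxed variant additionally links "Adam's father" to "his school", which illustrates how widening the window can trade precision for recall.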
Combined This baseline combines the three others, using the best strategy of each one (determined based on the empirical results), and predicts a link whenever at least one of the used baselines is triggered.Its purpose is to increase the recall.
Combined-Coref This final baseline adds the gold coreference information to the Combined predictions. For each link to an NP that is part of a coreference cluster, we also add links to all other NPs in the same cluster.
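The Combined and Combined-Coref baselines amount to a set union followed by an expansion over coreference clusters. The sketch below is one plausible reading: "a link to an NP" is interpreted as a link whose complement is in a cluster, and self-links are skipped; both choices, as well as the function names, are assumptions.

```python
def combine(*link_sets):
    """Predict a link whenever at least one baseline predicts it (set union)."""
    combined = set()
    for links in link_sets:
        combined |= links
    return combined


def expand_with_coref(links, clusters):
    """For each (anchor, complement) link whose complement belongs to a
    coreference cluster, also link the anchor to the other cluster members."""
    cluster_of = {mention: frozenset(c) for c in clusters for mention in c}
    expanded = set(links)
    for anchor, comp in links:
        for other in cluster_of.get(comp, ()):
            if other != anchor:  # assumption: never link an NP to itself
                expanded.add((anchor, other))
    return expanded


base = combine({("a", "b")}, {("d", "a")})
print(expand_with_coref(base, clusters=[["b", "c"]]))
```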

Results
The deterministic baselines' results are summarized in the first part of Table 4.
In general, the F1 scores of the 'single' baselines are low, ranging between 5.8 and 20.8 points, where the Adj-Backward baseline achieves the lowest score and the Surface-Expand baseline achieves the highest score. The Combined baseline makes use of the best strategy of each previous baseline (based on the F1 score), that is, Title-Last, Adj-Backward and Surface-Expand, and reaches 22.8 F1. Combined-Coref extends the Combined baseline by adding the gold coreference data, and achieves the best performance among the deterministic baselines, with an overall 25.2 F1.
These results demonstrate that (a) the links are spread across different locations in the text, and (b) that the data is unlikely to have clear shortcuts that models might exploit, while there are some strong structural cues.
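For reference, precision, recall and F1 over predicted links can be computed as in the sketch below. It treats links as set elements and scores a single pooled set; whether the reported numbers are pooled over the whole test set or averaged per document is not stated here, so that choice, like the function name `prf1`, is an assumption.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 of a set of predicted links against gold links."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold_links = {("the teacher", "his school"), ("the owners", "website")}
predicted_links = {("the teacher", "his school"), ("website", "the church")}
print(prf1(predicted_links, gold_links))
```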

Neural Models
Next, we experiment with three neural models based on a pre-trained masked language model (MLM), specifically, SpanBERT (Joshi et al., 2020). We also experiment with an additional baseline with uncontextualized word embeddings.
Architecture At a high level, our models take the encoding of two NPs (an anchor and a complement) and predict whether they are connected, and if so, by which preposition. To encode an anchor-complement pair, we first encode the text using the MLM and then encode each NP by concatenating the vectors of its first and last tokens. The resulting anchor and complement vectors are then each fed into a separate MLP, each with a single 500-dimensional hidden layer. The concatenation of the MLP outputs results in the anchor-complement representation. This representation is then fed into the prediction model, which has two variants. The architecture resembles the end-to-end architecture for modeling coreference resolution (Lee et al., 2017). A schematic view of the architecture is presented in Figure 5.
Variants In the decoupled variant, we treat each prediction as a two-step process: one binary prediction head asks "are these two NPs linked?", and in case they are, another multiclass head determines the preposition.17 In the coupled variant, we have a single multiclass head that outputs the connecting preposition or NONE, in case the NPs are not connected. We also experiment with a frozen (or "probing") variant of both models, in which we keep the MLM frozen, and update only the NP encoding and prediction heads. The frozen architecture is intended to quantify the degree to which the pretrained MLM encodes the relevant information, and it is very similar to the edge-probing architecture of Tenney et al. (2018). Finally, the static variant aims to measure how well a model can perform with NPs alone, without considering their context. This model sums all the static embeddings of each span and uses the same modeling as the coupled prediction. This baseline uses the 300-dim word2vec non-contextualized embeddings (Mikolov et al., 2013). We experiment with two versions: decoupled and coupled.
Technical Details All neural models are trained using cross-entropy loss and optimized with Adam (Kingma and Ba, 2015), using the AllenNLP library (Gardner et al., 2018). We train using a 1e-5 learning rate for 40 epochs, with early stopping based on the F1 metric on the development set. We use SpanBERT (Joshi et al., 2020) as the pretrained MLM, in both its base and large variants, as it was found to work well on span-based tasks. The anchor and complement encoding MLPs have one 500-dim hidden layer and output 500-dim representations. The prediction MLPs have one 100-dim hidden layer. All MLPs use the ReLU activation. We used the same hyperparameters for all baselines and did not tune them.

Table 5 (caption, continued): The columns are the same as in Table 4. Results on the in-domain split are also reported for comparison.
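The span encodings and the two prediction schemes can be illustrated without any neural-network library. The sketch below is a schematic, not the trained model: vectors are plain lists, the heads' outputs are passed in as ready-made probabilities, and the 0.5 decision threshold and all function names are assumptions.

```python
NONE = "NONE"


def encode_span(token_vectors, start, end):
    """Contextual NP encoding: concatenate the first and last token vectors
    of the span [start, end)."""
    return list(token_vectors[start]) + list(token_vectors[end - 1])


def encode_static(token_vectors, start, end):
    """Static variant: sum the (non-contextual) embeddings of the span's tokens."""
    return [sum(dim) for dim in zip(*token_vectors[start:end])]


def predict_decoupled(link_prob, prep_probs, threshold=0.5):
    """Two heads: a binary link head, then a multiclass preposition head."""
    if link_prob < threshold:
        return NONE
    return max(prep_probs, key=prep_probs.get)


def predict_coupled(label_probs):
    """One multiclass head over all prepositions plus NONE."""
    return max(label_probs, key=label_probs.get)


vectors = [[1, 2], [3, 4], [5, 6]]
print(encode_span(vectors, 0, 3), encode_static(vectors, 0, 3))
print(predict_decoupled(0.8, {"of": 0.9, "in": 0.1}))
print(predict_coupled({NONE: 0.6, "of": 0.3, "in": 0.1}))
```

In the decoupled scheme the preposition head still has an opinion for pairs the link head rejects, which is what makes a preposition accuracy on unidentified links measurable.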

In-Domain Results
The results of the pretrained models are presented in the second part of Table 4. Overall, the fully-trained transformers in the coupled variant perform significantly better than all other models, achieving 49.2 and 52.4 F1 in the base and large variants, respectively. Interestingly, the static and frozen variants perform similarly: their F1 scores range between 15.1 and 23.2. It is worth noting that the static variant achieves better results than the frozen one. This corroborates our hypothesis that many of the capabilities needed to solve the task are not explicitly covered by the language-modeling objective and that the NPs information alone is not sufficient to solve the task, as was also argued in Hou (2020); Pandit and Hou (2021). Finally, we note an interesting trend: across all models, the decoupled variant favors recall whereas the coupled variant favors precision. In summary, all models perform substantially below human agreement, leaving much room for improvement.

OOD Results
Here we report the best model's results (coupled-large) on the OOD data. The results are summarized in Table 5. We break down the results per domain (and per forum in the case of Reddit), and include the human agreement results for comparison. We observe a substantial drop in performance, with a large difference between domains (e.g., the model achieves an overall 36.9 F1 on the IMDB split, but only 28.2 F1 on Reddit).
While the agreement scores for these domains are also lower than for the in-domain test set (88.6 F1),19 the model's performance decreases more drastically on these splits.

Table 6: Additional metrics of the neural models on the TNE test set. We report five metrics: the precision, recall and F1 of the relation predictions, the preposition accuracy on relations where the model predicted there is a relation (IPrep-Acc), and the preposition accuracy on relations where the model predicted there is no relation (UPrep-Acc). The first row is an estimated human agreement on 10% of the data, thus marked with an asterisk. These results are comparable with the 'Pretrained' part in Table 4.

Analysis
9.1 Quantitative Analysis

Unlabeled Accuracy and Preposition-only Accuracy To disentangle the ability to identify that a link exists between two NPs from the ability to assign the correct preposition to this link, we also report unlabeled scores (ignoring the preposition's identity) and preposition-only scores. IPrep-Acc is the accuracy of predicting the correct preposition over gold relations (NP pairs) where the unlabeled relation was correctly identified by the model. UPrep-Acc is the accuracy of predicting the correct preposition for gold NP pairs that were not identified by the model. The results (Table 6) show that the models are significantly better (yet far from perfect) at choosing the correct preposition when they identify that a relation should exist between two NPs. Overall, the preposition selection accuracy is significantly better than the majority baseline of choosing "of" for all cases (which would yield 23.5%) but substantially worse than the human agreement, which is almost 100%. We also observe that while the unlabeled relation scores are indeed better than their labeled counterparts, the link-identification aspect of the task is significantly more challenging than choosing the correct preposition once the link was identified.
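The two preposition-only scores can be sketched as follows. The record format is an assumption: each gold NP pair is represented by its gold preposition, a flag saying whether the model predicted a link for the pair, and the model's top-scoring preposition ignoring NONE. The function name `prep_accuracies` is likewise hypothetical.

```python
def prep_accuracies(records):
    """records: iterable of (gold_prep, link_predicted, best_prep), one per gold NP pair.
    Returns IPrep-Acc (pairs the model identified as linked) and
    UPrep-Acc (pairs the model did not identify)."""
    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for gold_prep, link_predicted, best_prep in records:
        totals[link_predicted] += 1
        hits[link_predicted] += int(best_prep == gold_prep)

    def accuracy(key):
        return hits[key] / totals[key] if totals[key] else float("nan")

    return {"IPrep-Acc": accuracy(True), "UPrep-Acc": accuracy(False)}


toy_records = [
    ("of", True, "of"),   # identified, correct preposition
    ("in", True, "of"),   # identified, wrong preposition
    ("at", False, "at"),  # missed link, preposition head was right
    ("of", False, "in"),  # missed link, preposition head was wrong
]
print(prep_accuracies(toy_records))
```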

Preposition Analysis
We analyze the errors of the best model on the different classes (the most common prepositions and no-relation). We present a confusion matrix in Figure 6. The most confusing label is in, which is confused (both as a false positive and as a false negative) with all other labels. The preposition of is also confused quite frequently, while about is confused much less.
Accuracy per NP Distance We assess the effect of the linear distance between the two NPs on the ability of the model to accurately predict the link.
For each NP pair at distance x, Figure 7 shows the percentage of correct predictions in the corresponding bin. We observe a trend of improved performance up to 40 tokens, after which accuracy plateaus at about 90% (the results for distances above 180 are noisy due to data sparsity at these distances). Interestingly, the model struggles more with short-distance links than with those farther apart.
We performed the same analysis on precision and recall errors, and found similar trends.
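The per-distance analysis is a simple binning computation, sketched below. The bin width used for Figure 7 is not given in the text, so `bin_size=20` is an arbitrary assumption, as are the function name and the toy data.

```python
from collections import defaultdict


def accuracy_by_distance(examples, bin_size=20):
    """examples: iterable of (distance_in_tokens, is_correct).
    Returns {bin_start: accuracy} for every non-empty bin."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for distance, is_correct in examples:
        bin_start = (distance // bin_size) * bin_size
        totals[bin_start] += 1
        hits[bin_start] += int(is_correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}


toy_examples = [(3, True), (15, False), (25, True), (47, True), (59, False)]
print(accuracy_by_distance(toy_examples))
```

Sparse bins, such as those above 180 tokens in the paper, yield noisy estimates because each accuracy is computed from few examples.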

Qualitative Analysis
To better understand the types of errors, we zoom in on a single document (shown in Figure 8) and manually inspect all errors our best model (Coupled-large) made on it.
Out of the 1980 potential links, the model made errors on 231 (82 precision errors, where the model predicted an incorrect link, and 149 recall errors, where the model failed to identify a link). Out of the 231 disagreements with the gold labels, we found 84.2% to indeed be incorrect, 10.5% to actually be correct, and 5.3% to be ambiguous.
Table 7 breaks down the errors into 9 categories, covering both types of errors and the skills needed to solve them: Preposition Semantics: where the model predicted a link, but used a wrong preposition; Ambiguous: where both gold and predicted answers can be correct, depending on the reading of the text; Wrong Label: where the gold label is incorrect; Missing Label: where the prediction was correct, as well as the original label, but the predicted preposition was missing (i.e., cases where more than one preposition is valid); Generics: cases where the anchor is generic, and thus no link exists from it; Coreference Error: where the model links to a complement that appears to be part of the coreference chain, but is not, or the annotator mistakenly attached an additional, erroneous NP to the chain; World Knowledge: links that require some world knowledge in order to complete; Explicit: where the link appears explicitly in the text, but the model did not predict it accordingly; and Other, for none of the above.
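The number of potential links follows from counting ordered pairs of distinct NPs. The short check below reproduces that count for the 45 NPs reported in the caption of Figure 8, and verifies that the two error types add up to the stated total; the function name is hypothetical.

```python
from itertools import permutations


def count_potential_links(num_nps):
    """Number of ordered (anchor, complement) pairs of distinct NPs: n^2 - n."""
    return sum(1 for _ in permutations(range(num_nps), 2))


potential_links = count_potential_links(45)
precision_errors, recall_errors = 82, 149
total_errors = precision_errors + recall_errors
print(potential_links, total_errors)
```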

Church of Scientology does not see humor in website dedicated to Tom Cruise
September 25, 2005 http://ScienTOMogy.info has apparently received a fax and at least 6 emails in the span of 2 days from Scientology lawyer Ava Paquette of Moxon & Kobrin threatening a lawsuit of up to $100,000 if the domain name ownership is not transferred. This type of letter is often called a cease and desist letter.
The owners of scienTOMogy.info have posted the complaints and their replies, saying that the site simply expresses opinion, does not make any claims, and clearly states that it has no connection to the Church of Scientology. "The site was put up as a single source to view all the recent hype Tom has made about the church - it does nothing but show Tom, so we are at a loss as to why the church is acting so rashly." The Church of Scientology is notorious for pursing legal action against its critics, under the name of the "Religious Technology Center" (RTC). It previously made headlines when it used the US's Digital Millennium Copyright Act to remove xenu.net, a site critical of Scientology, from Google's listings.

Table 7 (caption, continued): The first part of the table presents precision errors, where the model predicted some link considered to be an error. The second part presents recall errors, where the model predicted no link exists.
In Table 4 we observed that the model is better at precision than recall. Here we also observe that recall and precision errors differ in their type distribution. In terms of precision, 17.0% of the links were correct links with a wrong preposition. Such errors seem rather trivial, of a kind that a good language model would not make; using an LM to explicitly quantify the likelihood of links may be a promising direction for future work. An interesting error type that occurs among both the precision and the recall errors is the Ambiguous category. For instance, in the recall category, one interpretation can be read as an opinion being expressed about Tom Cruise, while the other interpretation reads opinion in a more abstract way, thus not connected to Cruise. Finally, the largest category in the precision errors and the most common category in the recall errors is "Other", with varied mistakes that do not single out noticeable phenomena.

Related Tasks and Linguistic Phenomena
From the outset, recovering NP-NP relations appears familiar from many previous linguistic endeavors.While TNE is related to them, it is certainly different, in scope, purpose and definition.
Our departure point for this work has been the notion of an implicit argument of a noun, i.e., nouns such as "brother" or "price" that are incomplete on their own, and require an argument to be complete. In linguistics, these are referred to as relational nouns (Partee, 1983/1997; Loebner, 1985; Barker, 1995; De Bruin and Scha, 1988; Partee et al., 2000; Löbner, 2015; Newell and Cheung, 2018). In contrast, nouns like "plant" or "sofa" are called sortal and are conceived as "complete"; their denotation need not rely on the relation to other nouns, and can be fully determined.

A sensible task, then, could be to identify all the relational nouns in the text and recover their missing noun argument. However, in practical terms, the distinction between sortal and relational is not clear-cut. Specifically, sortal nouns often stand in relations to other nouns, and these relations are useful for understanding the text and for fully determining the reference, as in "the sofa [in the house]", or "the sofa [on the carpet]" (as opposed to that on the floor), and "the house [of a particular owner]".

[Table cells; row and column layout not recoverable:] No | Yes. | On the other hand, the "undersp-rel" category can include any relations. The relations are marked. | No. Any relations that can be expressed with a preposition are included, as well as element-set and subset-set relations.
Secondly, the relation between a bridging expression and its antecedent21 has to be implicit.In NP Enrichment the relations between the anchor and the complement are either implicit or explicit.
Next, in most bridging studies a bridge is a type of anaphora: the bridging expression is not interpretable without the antecedent. In NP Enrichment the anchor can in fact be interpretable on its own; the complement supplements it with additional information ("sofa [on the carpet]") or simply exposes existing information in a uniform way.
Also, bridging expressions are not discourse-old, i.e., they can only refer to entities that are mentioned in the text for the first time. This implies that in a coreference chain only the first mention can have a bridging link. In NP Enrichment there is no such restriction: an anchor can be either old or new. Furthermore, in many bridging works the antecedent does not have to be an NP. It can also be a verb or a clause. In NP Enrichment both the anchor and the complement have to be NPs.
Finally, all the aforementioned studies have been defined by and written for linguists, using linguistic terminology, with a predominantly documentary motivation. As a result, the task definitions are often narrowly scoped, highly technical and non-interpretable for non-experts, making their annotation by crowd-workers essentially impossible. It also makes the consumption of the output by (non-linguist) NLP practitioners doubtful.
In this work we aimed to define a linguistically meaningful yet simple, properly scoped, and easy to communicate task.We want crowd-workers as well as downstream-task designers to be able to properly understand the task, its scope and its output, and we want the data collection procedure to be amenable to high inter-annotator agreement.
A Note on Decontextualization Recently, Choi et al. (2021) introduced the text-decontextualization task, in which the input is a text and an enclosing textual context, and the goal is to produce a standalone text that can be fully interpreted outside of the enclosing context. The decontextualization task involves handling multiple linguistic phenomena, and, in order to perform it well, one must essentially perform a version of the NP Enrichment task. For example, decontextualizing "Prices are expected to rise" based on "Temporary copper shortage. Prices are expected to rise" involves establishing the relation "Prices [of copper] are expected to rise".
Like our NP Enrichment proposal, the decontextualization task bears a strong application-motivated, user-facing perspective. It is useful, well-defined and easy to explain. However, as it is entirely goal-based ("make this sentence standalone"), the scope of covered phenomena is somewhat eclectic. More importantly, the output of the decontextualization task is targeted at human readers rather than machine readers. For example, it does not handle relations between NPs that appear within the decontextualized text itself; it only recovers relations of NPs with the surrounding context. Thus, many implicit NP relations are left untreated.

Conclusions
We propose a new task named Text-based NP Enrichment, or TNE, in which we aim to annotate each NP with all its relations to other NPs in the text. This task covers a lot of implicit relations that are nonetheless crucial for text understanding. We introduce a large-scale dataset enriched with such NP links, containing 5.5K documents and over 1M links (enough for training large neural networks), and provide high-quality test sets, both in and out of domain. We propose several baselines for this task and show that it is challenging, even for state-of-the-art LM-based models, and that there is a big gap from human performance. We release the dataset, code, and models, and hope that the community will adopt this task as a standard component of the NLP pipeline.


Figure 3: Interfaces of the annotation steps.

Figure 4: Distribution of the prepositions in the NP Enrichment test set.

Figure 5: A schematic view of the model's architecture.

Figure 6: A confusion matrix of the predictions of the Coupled-large model over the test set. The numbers are in log2 scale (except for zero values, which are untouched). We show the 10 most common labels for brevity.

Figure 7: Accuracy of the Coupled-large model over the dev set, for every NP-distance bin.

Figure 8: Development-set document used for the qualitative error analysis. All 45 considered NPs are underlined. Out of 45² − 45 = 1980 potential links, this document contains 271 gold links, and 231 erroneously-predicted links, which we analyze.
Figure 3, panel (b): NP Enrichment data collection interface.
Figure 3, panel (c): NP Enrichment consolidation interface.

Table 3: Statistics summary of the NP Enrichment dataset.

Table 4: Results of the deterministic baselines and neural models on the test set. We report three metrics: the precision, recall and F1 of the overall relation predictions. The first row is an estimated human agreement on 10% of the data, and not over the entire test set, thus marked with an asterisk. Note that the first and second parts of the table are not directly comparable, since in the Deterministic results the preposition label is given by an oracle, whereas in the Pretrained results it is predicted by the models.

Table 5: Results of the best model (Coupled-large) on OOD data, broken into the different sub-splits.

Table 7: Error types, and their statistics, based on the text presented in Figure 8.