Abstract
This paper presents a new task: predicting the coverage of a text document for relation extraction (RE). Does the document contain many relational tuples for a given entity? Coverage predictions are useful for selecting the best documents for knowledge base construction from large input corpora. To study this problem, we present a dataset of 31,366 diverse documents for 520 entities. We analyze the correlation of document coverage with features like length, entity mention frequency, Alexa rank, language complexity, and information retrieval scores; each of these features has only moderate predictive power. We therefore employ methods that combine such features with statistical models like TF-IDF and language models like BERT. HERB, the model that combines the features with BERT, achieves an F1-score of up to 46%. We demonstrate the utility of coverage predictions on two use cases: KB construction and claim refutation.
1 Introduction
Motivation and Problem Relation extraction (RE) from text documents is an important NLP task with a range of downstream applications (Han et al., 2020). For these applications, it is vital to understand the quality of RE results. While extractors typically provide confidence (or precision) scores, this paper puts forward the notion of RE coverage (or recall). Given an input document and an RE method, coverage measures the fraction of the complete real-world ground truth that the extracted relations capture. We consider this on a per-subject and per-predicate basis, for example, all memberships of Bill Gates in organizations or all companies founded by Elon Musk.
Document coverage for RE varies widely. Consider the three text snippets about Tesla shown in Figure 1. The first text contains all five founders of Tesla, the second contains only two of them, and the third just one. Accordingly, for the entity Tesla and the relation founded-by, text 1 has coverage 1, text 2 has coverage 0.4, and text 3 has coverage 0.2.
When applying RE at scale, for example to populate or augment a knowledge base (KB), an RE system may have to process a huge number of input documents that differ widely in their coverage. As state-of-the-art extractors are based on heavy-duty neural networks (Lin et al., 2016; Zhang et al., 2017; Soares et al., 2019; Yao et al., 2019), processing all documents in a large corpus may be prohibitively expensive and inefficient. Instead, prioritizing the input by identifying documents with high coverage could be much more effective. Coverage prediction is thus crucial for large-scale RE, yet the problem has not been explored so far.
This problem would be easy if we could first run a neural RE system on each document and then assess the yield, either by comparison to withheld labeled data or by sampling followed by human inspection. However, this is exactly the computational bottleneck that we must avoid. The challenge is to estimate document coverage, for a given entity and relation of interest, with inexpensive and lightweight techniques for document processing.
Approach and Contributions This paper presents the first systematic approach for analyzing and predicting document coverage for relation extraction. To facilitate extensive experimental study on this novel task, we introduce a large Document Coverage (DoCo) dataset of 31,366 web documents for 520 distinct entities spanning 8 relations, along with automated extractions and coverage labels. Table 1 provides samples from DoCo for each relation.
Table 1: Sample entity-relation-document triples for all eight relations in our DoCo dataset.

| Entity | Relation | Content | Coverage |
|---|---|---|---|
| George W. Bush | family | President Bush grew up in Midland, Texas, as the eldest son of Barbara and George H.W. Bush …and met Laura Welch. They were married in 1977 …twin daughters: Barbara, married to Craig Coyne, and Jenna, married to Henry Hager. The Bushes also are the proud grandparents of Margaret Laura “Mila”, Poppy Louise, and Henry Harold “Hal” Hager … | 1 |
| FedEx | partner-org | FedEx Corp. …to acquire ShopRunner, the e-commerce …acquires the International Express business of Flying Cargo Group …acquires Manton Air-Sea Pty Ltd, a leading provider …acquires P2P Mailing Limited, a leading …acquires Northwest Research, a leader in inventory …acquires TNT Express …acquires GENCO …acquires Bongo International …acquires the Supaswift businesses in South Africa …acquires Rapidão Cometa … | 1 |
| Warren Buffett | member-of | He formed Buffett Partnership Ltd. in 1956, and by 1965 he had assumed control of Berkshire Hathaway …Following Berkshire Hathaway’s significant investment in Coca-Cola, Buffett became …director of Citigroup Global Markets Holdings, Graham Holdings Company and The Gillette Company … | 0.8 |
| Indra Nooyi | edu-at | Nooyi was born in Chennai, India, and moved to the US in 1978 when she entered the Yale School of Management …secured her B.S. from Madras Christian College and her M.B.A. from Indian Institute of Management Calcutta … | 0.75 |
| J. K. Rowling | position-held | Rowling is one of the best-selling authors today …job of a researcher and bilingual secretary for Amnesty International …position of a teacher led to her relocating to Portugal … | 0.67 |
| Apple Inc. | founded-by | Steve Jobs, the co-founder of Apple Computers …switched over to managing the Apple “Macintosh” project that was started … | 0.33 |
| Intel | board-member | Andy D. Bryant stepped down as chairman …Dr. Omar Ishrak to succeed …Alyssa Henry was elected to Intel’s board. Her election marks the seventh new independent director … | 0.125 |
| 3M | ceo | The American multinational conglomerate corporation 3M was formerly known as Minnesota Mining and Manufacturing Company. It’s based in the suburbs … | 0 |
We employ a classifier architecture that we call HERB (for Heuristics with BERT), which combines a document's lightweight features with pre-trained language models like BERT, without any costly re-training or fine-tuning. The best configuration of this classifier achieves an F1-score of up to 46%. The classifier provides scores for its predictions and thus also supports ranking documents by their expected yield for the RE task at hand.
We evaluate our approach against a range of state-of-the-art baselines. Our results show that heuristic features like text length, entity mention frequency, language complexity, Alexa rank, or information retrieval scores have only moderate predictive power. However, in combination with pre-trained language models, the proposed classifier gives useful predictions of document coverage.
We further study the role of coverage prediction in two extrinsic use cases: KB construction and claim refutation. For KB construction, we show that coverage estimates by HERB are effective in ranking candidate documents and can substantially reduce the number of web pages one needs to process for building a reasonably complete KB. For the task of claim refutation (e.g., Tim Cook is the CEO of Microsoft), we show that coverage estimates for different documents can provide counter-evidence that can help to invalidate false statements obtained by RE systems.
The salient contributions of this work are:
We introduce the novel task of predicting document information coverage for RE.
To support experimental comparisons, we present a large dataset of annotated web documents.
We propose a set of heuristics for coverage estimation and analyze them both in isolation and in combination with an inexpensive, standard embedding-based document model.
We study the application of our classifier on two important use cases: KB construction and claim refutation. Experiments show that our predictor is useful in both of these tasks.
Our data, models, and code are publicly available.
2 Related Work
Relation Extraction (RE)
RE is the task of identifying the relation types between two entities that are mentioned together in a sentence or in proximity within a document (e.g., in the same paragraph). RE has a long history in NLP research (Mintz et al., 2009; Riedel et al., 2010), with a recent overview given by Han et al. (2020). State-of-the-art methods are based on deep neural networks trained via distant supervision (Lin et al., 2016; Zhang et al., 2017; Soares et al., 2019; Yao et al., 2019). On the practical side, RE is available in several commercial APIs for information extraction from text; in our experiments, we make use of Rosette and Diffbot. Our approach is agnostic to the choice of extractor, though: any RE tool can be plugged in.
Knowledge Base Construction (KBC)
RE plays a crucial part in the more comprehensive KBC task: identifying instances of entity pairs that stand in a given relation in order to construct a knowledge base (Mitchell et al., 2018; Weikum et al., 2021; Hogan et al., 2021).
The input is typically a set of documents, often assumed to be fixed and given upfront. This disregards the critical issue of benefit/cost trade-offs, which mandates identifying high-yield inputs for resource-bounded KBC. Identifying relevant, expressive, and preferable sources for KBC is often referred to as source discovery. Source discovery can be performed via IR-style ranking of documents or via heuristic estimators of the yield of relation extractors (Wang et al., 2019; Razniewski et al., 2019). The former work approaches yield optimization as a set-coverage maximization problem over shared properties of extracted entities. The latter uses textual features in a supervised SVM or LSTM model, a baseline with which we also compare in our experiments.
Document Ranking in IR
Information retrieval (IR) ranks documents by relevance to a query with keywords or telegraphic phrases. Relevance judgments are based on the perception of informativeness concerning the query and its underlying user intent. Standard metrics for assessment, like precision, recall, and nDCG (Järvelin and Kekäläinen, 2002), are not applicable to our setting. The notion of coverage pursued in this paper refers to the yield of structured outputs by RE systems rather than document relevance. For example, a query-topic-wise highly relevant document that contains few extractable facts about named entities would still have low RE coverage.
Relevance of Coverage Estimates
Understanding and incorporating document coverage prediction into NLP-based information extraction is essential for several reasons. For resource-bounded KB construction, it is crucial to know which documents are most promising for extraction with limited budgets for crawling and RE processing and/or human annotation (Ipeirotis et al., 2007; Wang et al., 2019). For claim refutation, coverage estimates can help to assess statements as questionable if documents with high coverage do not support them. So far, claim evaluation systems mostly rely on textual cues about factuality or source credibility (Nakashole and Mitchell, 2014; Rashkin et al., 2017; Thorne et al., 2018; Chen et al., 2019).
For question answering over knowledge bases, it is important to know whether a KB can be relied upon in terms of complete answer sets (Darari et al., 2013; Hopkinson et al., 2018; Arnaout et al., 2021). Current coverage estimation techniques for KBs do this analysis only post-hoc after the KB is fully constructed (Galárraga et al., 2017; Luggen et al., 2019), losing access to valuable information from extraction time.
3 Coverage Prediction
Given a document d, a target entity e, and a relation r, the goal is to estimate the coverage of d, that is, the fraction of the ground-truth objects for (e, r) that an RE system can extract from d. The task thus takes the form of a classical prediction problem, predicting either a numerical coverage value or a binarized class label. In Section 5, we propose several heuristics and methods for predicting the coverage of a given document. To study this novel problem, we require evaluation data; the following section therefore describes the construction of a large and diverse document coverage dataset.
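For concreteness, the notion of coverage used throughout can be written as follows (our formalization, consistent with the description in Section 1; Section 4 details how the ground-truth set GT(e, r) is obtained):

\[ \mathrm{coverage}(d, e, r) \;=\; \frac{\lvert \{\, o \mid (e, r, o) \text{ extracted from } d \,\} \cap \mathrm{GT}(e, r) \rvert}{\lvert \mathrm{GT}(e, r) \rvert} \]

For the Tesla example of Figure 1, a text mentioning two of the five founders yields coverage 2/5 = 0.4.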
4 Dataset Construction
A thorough study of document coverage prediction requires a corpus with two characteristics: (i) relation diversity (i.e., documents containing enough automatically extractable relations) and (ii) content diversity (i.e., multiple documents with varying content per entity). Existing text corpora, like the popular NYT (Sandhaus, 2008) and Newsroom dataset (Grusky et al., 2018), contain ample numbers of articles that mention newsworthy entities; however, the articles are primarily short, mentioning only very few relations. On the other end, machine-translated multilingual versions of Wikipedia articles (Roy et al., 2020) allow extraction of many relations but lack diversity.
For the novel task of predicting document information coverage, we thus built the DoCo (Document Coverage) dataset, consisting of 31,366 web documents for 520 distinct entities, each document annotated with a coverage value. Figure 2 illustrates the dataset construction.
Figure 2: Dataset construction pipeline. There are two main phases: 1) corpus collection to create GTweb, and 2) coverage calculation. Phase 1 involves: i) for each entity ei, n websites are collected using the Bing search API, ii) text is scraped from each website, iii) RE tuples are extracted from the documents via Rosette/Diffbot, and iv) the RE tuples are deduplicated and consolidated to form GTweb. The scraped documents are stored as inputs for phase 2, which consists of: i) for each document di, the previously extracted relations are collected, and ii) based on the choice of GT, coverage is calculated to create the final DoCo dataset.
Entity Selection
First, well-known entities of two types, person (PER) and organization (ORG), were selected from popular ranking lists by Time 100 and Forbes (“Influential people around the globe”, “Most valuable tech companies”). These entities cover 12 diverse sub-domains: politicians, entrepreneurs, singers, sports figures, writers, and actors for PER; and technology, automobile, retail, conglomerate, pharmaceutical, and financial corporations for ORG. To further obtain documents with varying content, we chose both popular and long-tail entities for PER, and companies across demographics and with differing net worth for ORG.
Websites and Content
We aimed to collect 100 diverse URLs per entity by issuing a set of search engine queries per entity, for example, “about PER”, “PER biography”, or “ORG history”. In total, 6 query sets were designed for PER and 10 for ORG. Since the URLs returned across these queries were not always unique, each duplicated URL was retained only once.
Extracting textual content without noisy headers, menus, and comments required a labor-intensive scraping step. We handled the multi-domain content scraping through a combination of libraries like Newspaper3k and Readability, and online scraping services like Import.io and ParseHub. We ensured high-quality scraped content by applying rule-based filters to remove noisy elements such as embedded ads and reference links. The scraped documents cover a range of website domains, including biographical sites, news articles, official company profiles, and newsletters.
Relation Tuples
Each document in DoCo was processed by two relation extraction APIs, Rosette and Diffbot. To annotate each document with coverage, we focused only on the entity originally queried to obtain the document. For our experimental study, we selected the following frequently occurring relations: member-of, family, edu-at, and position-held for PER, and partner-org, founded-by, ceo, and board-member for ORG. For more accurate coverage calculation, the RE tuples were deduplicated via alignment to the Wikidata identifiers returned by the APIs; for example, (Gates, member-of, Microsoft Corp.) becomes (Bill Gates, member-of, Microsoft).
The relations extracted by the APIs are fine-grained, such as person-member-of, person-employee-of, org-acquired-by, and org-subsidiary-of. We merged the first two into the coarse-grained relation member-of for PER, and the last two into partner-org for ORG.
Ground Truth
We considered three ground-truth labels to calculate coverage for each document:
Wikidata (GTwiki): A popular KB providing data for most relations yet having coverage limitations (Galárraga et al., 2017; Luggen et al., 2019). For example, for Bill Gates, Microsoft and other popularly associated companies for the member-of relation are present, but niche entities like Honeywell are missing. Depending on the entity type and sub-domain, we created the ground-truth labels by choosing those Wikidata properties that best matched the semantics of the 8 selected relations. Table 2 provides the complete information.
Web Extractions (GTweb): We used the set of frequent extractions across all documents in DoCo as web-aggregated ground truth. For a given entity-relation pair (e, r), an extraction was considered frequent if it appeared in at least 5% of the documents for e, or if its count was no less than one-fifth of the highest tuple count for (e, r). Determining frequent extractions relative to the total document count and to the frequencies of other tuples for an entity resulted in noise-free ground-truth labels (a sketch of this filter follows the list).
Wikidata and Web Extractions (GTwikiweb): We merged the two previous variants using a set union, matching objects via phrase embeddings with cosine similarity for higher recall.
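A minimal sketch of the frequency filter used for GTweb (function and variable names are ours; the one-fifth relative threshold reflects our reading of the rule above):

```python
from collections import Counter

def build_gt_web(docs_tuples, min_doc_frac=0.05, rel_frac=0.2):
    """Aggregate per-document RE tuples of one entity into GT_web.

    docs_tuples: list with one set of (relation, object) pairs per document.
    A tuple is kept if it occurs in at least 5% of the entity's documents,
    or if its count is at least one-fifth of the highest count for its relation.
    """
    n_docs = len(docs_tuples)
    counts = Counter(t for doc in docs_tuples for t in doc)

    # highest tuple count per relation
    max_per_relation = {}
    for (rel, _), c in counts.items():
        max_per_relation[rel] = max(max_per_relation.get(rel, 0), c)

    return {
        (rel, obj)
        for (rel, obj), c in counts.items()
        if c >= min_doc_frac * n_docs or c >= rel_frac * max_per_relation[rel]
    }
```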
Table 2: Wikidata property names and identifiers used to create GTwiki.

| Relation | Wikidata Property |
|---|---|
| member-of | member of (P463), member of political party (P102), part of (P361), employer (P108), owner of (P1830), record label (P264), member of sports team (P54) |
| family | father (P22), mother (P25), spouse (P26), child (P40), stepparent (P3448), sibling (P3373) |
| edu-at | educated at (P69) |
| position-held | position held (P39), occupation (P106) |
| partner-org | owner of (P1830), owned by (P127), member of (P463), parent organization (P749), subsidiary (P355) |
| founded-by | founded by (P112) |
| ceo | chief executive officer (P169) |
| board-member | board member (P3320) |
Coverage Calculation
Coverage was computed on a per entity-relation-document basis, as defined in Section 3. Even though real-valued coverage values are computed while constructing the dataset, such nuanced predictions are often not possible at test time. Consider the text “…Musk is a co-founder of Tesla ...”. The term co-founder clearly indicates the presence of multiple founders; however, the context gives no clue about the total number of co-founders. There could be one other co-founder (coverage 0.5) or 9 others (coverage 0.1).
Coverage Binarization
To circumvent this problem, we binarized the coverage values, splitting documents into two classes: informative and uninformative. The binarization combines an absolute and a relative threshold: a document was labeled informative (1) if its coverage was greater than 0.5, or greater than the coverage of at least 85% of the documents for the same (e, r); otherwise, it was labeled uninformative (0).
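A sketch of this labeling rule for one (e, r) group of documents (the thresholds follow the text; the use of numpy percentiles is our choice):

```python
import numpy as np

def binarize_coverage(coverages, abs_threshold=0.5, percentile=85):
    """Label documents of one (entity, relation) group: 1 = informative, 0 = uninformative.

    A document is informative if its coverage exceeds the absolute threshold of 0.5,
    or exceeds the coverage of at least 85% of documents in the same group.
    """
    cov = np.asarray(coverages, dtype=float)
    rel_threshold = np.percentile(cov, percentile)
    return ((cov > abs_threshold) | (cov > rel_threshold)).astype(int)
```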
Dataset Characteristics
After filtering duplicates, irrelevant URLs like social media handles, and video-content websites, we obtained a total of 31,366 documents for 520 entities. Table 3 provides an overview of the DoCo dataset. DoCo's labels are imbalanced: only 22.6% of the documents are informative and 77.4% are uninformative. The count of documents with non-zero RE tuples is higher than the count with non-zero coverage because extracted tuples are not always about the subject entity and hence do not contribute to coverage.
Table 3: Characteristics of the DoCo dataset.

| Statistic | Value |
|---|---|
| # PER entities | 250 |
| # ORG entities | 270 |
| # Relations | 8 |
| # Documents | 31,366 |
| Doc. length range (words) | [20, 10906] |
| # Unique website domains | 600 |
| # Doc. with non-zero RE tuples | 26,956 |
| # Doc. with non-zero coverage | 14,086 |
| # Doc. in class informative | 7,103 (22.6%) |
Table 4 gives the average number of objects present in each ground-truth variant. On average across relations, the number of objects in GTweb is higher than in GTwiki by 23.7%, and GTwikiweb is higher than GTwiki by 28.8%. This implies that GTweb and GTwiki can have overlapping objects, and that GTweb contributes additional objects towards GTwikiweb.
Table 4: Average number of objects per entity for each ground-truth variant.

| Relation | GTwiki | GTweb | GTwikiweb |
|---|---|---|---|
| member-of | 3.61 | 6.51 | 7.12 |
| family | 2.21 | 4.0 | 4.41 |
| edu-at | 2.26 | 2.07 | 2.58 |
| position-held | 5.86 | 7.76 | 10.37 |
| partner-org | 6.16 | 4.26 | 3.12 |
| founded-by | 1.07 | 1.06 | 1.66 |
| ceo | 1.03 | 2.77 | 2.86 |
| board-member | 0.47 | 1.44 | 1.75 |
Dataset Quality
We analyzed the quality of the DoCo dataset by comparing automatic relation extractions to extractions given by human annotators. A sample of 400 documents was selected, 50 per relation, with half from the high-coverage range and the rest from the low-coverage range. Each document was annotated with all correct tuples for the document’s main subject entity.
Table 5 shows the observed average counts. The human annotators extracted a substantial number of tuples for all 8 relations, indicating the richness and breadth of the DoCo documents. The two automatic extractors mostly yielded smaller numbers of tuples, with a few exceptions, and some of these exceptions involve spurious tuples. The ground-truth variants consistently suggest higher numbers, but except for the conservative GTwiki, these are usually overestimates due to spurious tuples. The GT variants should thus be seen as upper bounds for the true RE coverage.
Table 5: Average tuple count per relation. For each relation, the RE tool with the higher tuple count (boldfaced) is chosen.

| Relation | Human | Diffbot | Rosette | GTwiki | GTweb | GTwikiweb |
|---|---|---|---|---|---|---|
| member-of | 4.36 | 3.66 | **5.04** | 4.54 | 7.12 | 9.22 |
| family | 4.74 | **3.82** | 0.66 | 1.76 | 5.78 | 6.64 |
| edu-at | 1.72 | 2.5 | **2.52** | 2.94 | 3.08 | 2.18 |
| position-held | 2.9 | **4.26** | – | 6.7 | 6.14 | 9.52 |
| partner-org | 3.7 | 0.72 | **2.26** | 0.8 | 5.04 | 5.92 |
| founded-by | 1.34 | 0.58 | **1.8** | 0.78 | 2.84 | 2.96 |
| ceo | 2.02 | **1.96** | – | 1.68 | 4.32 | 4.2 |
| board-member | 2.62 | **1.54** | – | 2.82 | 3.48 | 2.64 |
We analyzed how well the automatic annotations reflect the coverage of the human annotations by computing Pearson correlation coefficients over the entire sample of 400 documents. For each relation, the RE tool with the higher average tuple count was chosen for our experiments; the resulting (Human, RE) correlation is 0.68. This shows that optimizing for coverage as measured by automatic RE tools is highly correlated with the overarching goal of approximating human-quality outputs.
5 Approach
We aim to predict coverage by processing unstructured document text with inexpensive, lightweight techniques. This is crucial for identifying promising documents before embarking on heavy-duty RE.
Heuristics
We devise several simple heuristics based on textual features as proxies for document coverage; a sketch of how several of these features can be computed follows the list.
- 1.
Document Length: The length of a document is a proxy for the amount of information contained. Longer documents may express more relations.
- 2.
NER Frequency: Length can be misleading when a document is verbose, yet uninformative. The count of named-entity mentions matching the relation domain (e.g., persons for the relation family, or organizations for the relation member-of) could correlate with coverage.
- 3.
Entity Saliency: The more frequently an entity is mentioned in a document, the more likely the document expresses relations for that entity.
- 4.
IR-Relevance Signals: The surface similarity of the entire document to the input query is another cue. We adopt BM25 (Robertson et al., 1995), a classical and still powerful IR ranking model, using ⟨e⟩ + ⟨r⟩ as the query, where e and r are the target entity and relation, respectively. We also consider recent neural rankers (Nogueira and Cho, 2020), following Nogueira et al. (2020) in using the T5 sequence-to-sequence model (Raffel et al., 2020) to rank documents.
- 5.
Website Popularity: Popular websites may be visited often because they are more informative. We use the standard Alexa rank as a measure of popularity.
- 6.
Text Complexity: RE methods work best on simpler text and may fail to extract relations from documents written in complex prose. We use the Flesch score (Flesch and Gould, 1949), a popular measure of text readability.
- 7.
Random: We contrast the predictive power of our proposed methods with two random baselines: a fair coin, and a biased coin that maintains the label imbalance of our test set.
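The sketch below illustrates how several of the above heuristic features could be computed per document (the library choices, spaCy, rank_bm25, and textstat, and all names are ours; website popularity requires an external Alexa lookup and is omitted):

```python
import spacy
import textstat
from rank_bm25 import BM25Okapi

nlp = spacy.load("en_core_web_sm")

def heuristic_features(corpus_texts, entity, relation, ner_label="PERSON"):
    """Return one dict of heuristic feature values per document in corpus_texts."""
    # BM25 relevance of each document for the query "<entity> <relation>"
    bm25 = BM25Okapi([text.lower().split() for text in corpus_texts])
    query = f"{entity} {relation}".lower().split()
    bm25_scores = bm25.get_scores(query)

    features = []
    for text, bm25_score in zip(corpus_texts, bm25_scores):
        ents = nlp(text).ents
        features.append({
            "doc_length": len(text.split()),                        # 1. document length
            "ner_count": sum(e.label_ == ner_label for e in ents),  # 2. NER frequency
            "entity_saliency": text.lower().count(entity.lower()),  # 3. entity saliency
            "bm25": float(bm25_score),                              # 4. IR-relevance signal
            "flesch": textstat.flesch_reading_ease(text),           # 6. text complexity
        })
    return features
```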
Methods
We use several inexpensive statistical models for document representation and feed them to a logistic regression classifier.
- 8.
Latent Topic Modeling: Topics in a document could be a useful indicator of coverage. For example, for relation family, latent topics like ancestry or personal life are relevant. We use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model documents as distributional vectors.
- 9.
BOW+TFIDF: A simple yet effective statistic to measure word importance given a document in a corpus is the product of term frequency and inverse document frequency (TF-IDF). We vectorize a document into a Bag-of-Words (BOW) representation with TF-IDF weights.
- 10.
Ngrams+TFIDF: A document is vectorized using frequent n-grams (n ≤ 3) with TF-IDF weights.
We also employ two neural baselines: an LSTM and a pre-trained language model (BERT).
- 11.
LSTM: Previous work by Razniewski et al. (2019) used textual features to estimate the presence of a complete set of objects in a text segment. We adopt their architecture, representing documents with 100-dimensional GloVe embeddings (Pennington et al., 2014) and processing them with an LSTM (Hochreiter and Schmidhuber, 1997), followed by a feed-forward layer with ReLU activation before the classifier.
- 12.
Language Model (BERT): Without costly re-training or fine-tuning, we utilize pre-trained BERT embeddings (Devlin et al., 2019) in a feature-based approach by extracting activations from the last four hidden layers. As in the original work, these contextual embeddings are fed to a two-layer 768-dimensional BiLSTM before the classifier.
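To make this feature-based use of BERT concrete, the following sketch (using the Hugging Face transformers library; names and the 128-token sentence limit are ours) produces the per-sentence encodings that are then fed to the two-layer BiLSTM classifier:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()  # frozen: no re-training or fine-tuning

@torch.no_grad()
def encode_sentences(sentences):
    """Encode each sentence as the sum of its [CLS] vectors from the last four hidden layers."""
    encodings = []
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
        hidden_states = bert(**inputs).hidden_states                 # embedding layer + 12 encoder layers
        cls_vectors = [layer[0, 0] for layer in hidden_states[-4:]]  # [CLS] of the last 4 layers
        encodings.append(torch.stack(cls_vectors).sum(dim=0))        # (768,)
    return torch.stack(encodings)                                    # (num_sentences, 768)
```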
Our experiments (Section 6.2) reveal that each of our proposed heuristics has only moderate predictive power. We therefore formulate a lightweight classifier that combines the heuristics with the best-performing statistical model (TF-IDF) or language model (BERT).
- 13.
Heuristics with BOW+TFIDF (Heu+TFIDF): We combine TF-IDF with heuristics (one to six) using stacked Logistic Regression (LR) (Figure 3). In level 1, the TF-IDF vector and each individual heuristic are fed to separate LR classifiers. In level 2, all the outputs of level 1 LRs are concatenated and fed to a final LR classifier for coverage prediction. The entire model is jointly trained.
- 14.
Heuristics with BERT (HERB): We combine BERT with heuristics one to six in a two-step process (Figure 4). In the first step, we reuse the BERT model above (with no additional training or fine-tuning) for coverage prediction. This prediction is then concatenated with the heuristic features to form a single vector, which is fed to an LR classifier.
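A minimal sketch of this combination step (names are ours; the frozen BERT-based predictor from above is assumed to output a coverage probability per document):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_herb(bert_scores, heuristics, labels):
    """HERB: concatenate the BERT coverage score with the six heuristic features
    and train a logistic regression classifier on top.

    bert_scores: array of shape (n_docs,), coverage probabilities from the BERT model
    heuristics:  array of shape (n_docs, 6), heuristic feature values
    labels:      array of shape (n_docs,), binary coverage labels
    """
    features = np.column_stack([bert_scores, heuristics])
    return LogisticRegression(max_iter=1000).fit(features, labels)
```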
6 Experiments
6.1 Setup
Dataset
We considered two automatic RE tools (Rosette and Diffbot) and three ground-truth variants (GTwiki, GTweb, GTwikiweb). For each relation, we report results for the combination of RE tool and GT variant that yields the highest count of documents labeled as high-coverage.
Each relation had a separate labeled set of documents, split into 70% train, 10% validation, and 20% test. Information leakage was prevented by splitting along entities, that is, all documents about the same entity were placed exclusively in one of the train, validation, or test sets. The number of training samples per relation varies from 664 (board-member) to 3604 (position-held). Since the label distribution in DoCo is imbalanced, the uninformative (0) class in all training sets was undersampled to obtain a 50:50 distribution, while the validation and test sets were kept unchanged to reflect the real-world imbalance. Named entities and numbers were masked.
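An entity-level split with majority-class undersampling can be sketched as follows (sklearn's GroupShuffleSplit is our choice of tool; the paper's exact procedure may differ):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def entity_level_split(labels, entities, seed=0):
    """70/10/20 train/validation/test split with all documents of an entity in one split,
    followed by undersampling the uninformative (0) class in the training part only."""
    labels, entities = np.asarray(labels), np.asarray(entities)
    idx = np.arange(len(labels))

    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    trainval, test = next(outer.split(idx, groups=entities))
    inner = GroupShuffleSplit(n_splits=1, test_size=0.125, random_state=seed)  # 0.125 * 0.8 = 0.1
    tr, val = next(inner.split(trainval, groups=entities[trainval]))
    train, val = trainval[tr], trainval[val]

    # undersample class 0 to match class 1 in the training split only
    pos, neg = train[labels[train] == 1], train[labels[train] == 0]
    rng = np.random.default_rng(seed)
    neg = rng.choice(neg, size=min(len(neg), len(pos)), replace=False)
    return np.concatenate([pos, neg]), val, test
```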
Models
Each proposed heuristic was turned into a classifier by first ranking documents according to the heuristic and then labeling the top 50% of documents as class 1 (informative). We used the open-source Okapi BM25 and monoT5 implementations for IR ranking. The monoT5 model is generally used for passage ranking; as DoCo documents are much longer and contain multiple passages, we used the MaxP algorithm (Dai and Callan, 2019) to compute document rankings. Since the difference in performance between the T5 and BM25 models is negligible, we chose the simpler yet similarly effective BM25 model as the IR-relevance signal for HERB.
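The MaxP aggregation mentioned above can be sketched as follows (the passage scorer, e.g., monoT5 relevance for the query, is abstracted as a callable; window sizes are our choice):

```python
def split_into_passages(text, passage_len=150, stride=75):
    """Split a long document into overlapping word-window passages."""
    words = text.split()
    return [" ".join(words[i:i + passage_len])
            for i in range(0, max(len(words) - stride, 1), stride)]

def maxp_score(text, score_passage):
    """MaxP (Dai and Callan, 2019): a document's relevance is the maximum
    relevance of any of its passages."""
    return max(score_passage(passage) for passage in split_into_passages(text))
```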
Feature-based representations, including LDA topics, TF-IDF, and n-grams, were fed to a logistic regression classifier. In the LSTM architecture, we used 100-dimensional GloVe embeddings with a vocabulary size of 100,000 and a 100-dimensional LSTM hidden state.
For the pre-trained language model, we used the BERT-base-uncased model (without additional re-training or fine-tuning) to encode sentences, summing the [CLS] token's representation from the last four hidden layers. Input documents were padded or truncated to 650 sentences and represented through these sentence encodings. Coverage classification was performed using the feature-based approach outlined in Devlin et al. (2019).
We used mini-batches of size 32, the Adam optimizer with a constant learning rate of 1e-05 and an epsilon of 1e-09, and trained for 200 epochs. Because our dataset is imbalanced, we monitored validation precision to save the best model and report optimal F1-scores (Lipton et al., 2014) to compare results.
6.2 Results
Our results are shown in Table 6. Each heuristic on its own gives mediocre performance, with T5 IR achieving the highest average F1 of 23.6% among the heuristics. Among the trained models, LDA has the lowest average F1 of 16.9%, while BERT performs best with an average F1 of 36.2%.
Table 6: F1-scores (%) obtained on the coverage prediction task by various heuristics and methods. The first four relations belong to PER, the last four to ORG.

| Method | member-of | family | edu-at | position-held | partner-org | founded-by | ceo | board-member | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Random (biased) | 5.7 | 6.8 | 4.9 | 10.0 | 7.5 | 1.2 | 13.5 | 3.7 | 6.6 |
| Random (fair) | 15.7 | 11.1 | 12.6 | 15.4 | 15.2 | 8.9 | 21.3 | 7.2 | 13.4 |
| Text Complexity | 9.6 | 5.4 | 6.1 | 10.3 | 3.5 | 3.3 | 15 | 5.4 | 7.3 |
| Alexa Ranking | 12.6 | 9.8 | 8.1 | 12.4 | 16.7 | 11.3 | 24.8 | 7.3 | 12.9 |
| Entity Saliency | 17.8 | 14.3 | 11.9 | 18.2 | 14.7 | 8.4 | 24.6 | 7.1 | 14.6 |
| Document Length | 20.5 | 19.0 | 15.5 | 21.9 | 23.9 | 12.8 | 28.8 | 8.5 | 18.9 |
| NER Count | 24.3 | 19.8 | 18.2 | – | 21.1 | 13.7 | 34.5 | 11.8 | 20.5 |
| BM25 IR | 27.1 | 21.1 | 18.8 | 26.3 | 21.8 | 12.9 | 36.6 | 12.1 | 22.1 |
| T5 IR | 26.9 | 23.2 | 20.3 | 29.6 | 19.5 | 15.4 | 41.1 | 13.1 | 23.6 |
| LDA Topic Model | 19.3 | 19.0 | 14.5 | 21.1 | 15.7 | 8.6 | 25.2 | 11.5 | 16.9 |
| GloVe+LSTM | 16.5 | 28.6 | 19.8 | 32.9 | 24.2 | 19.5 | 24.4 | 4.9 | 21.3 |
| Ngrams+TFIDF | 36.2 | 40.0 | 25.6 | 40.2 | 18.6 | 25.5 | 41.8 | 30.2 | 32.3 |
| BOW+TFIDF | 36.0 | 41.0 | 29.2 | 42.1 | 17.2 | 28.3 | 40.6 | 32.1 | 33.3 |
| BERT | 40.4 | 39.7 | 35.7 | 44.4 | 22.0 | 30.8 | 43.0 | 33.8 | 36.2 |
| Heu+TFIDF | 41.9 | 43.5 | 31.3 | 36.5 | 35.1 | 28.2 | 41.4 | 22.0 | 35.0 |
| HERB | 44.2 | 41.7 | 40.5 | 45.6 | 28.8 | 32.5 | 46.2 | 34.8 | 39.3 |
Although each heuristic has only moderate predictive power, combining them with a statistical model like TF-IDF or a pre-trained language model like BERT gives the best performance. Among the combination models, HERB outperforms Heu+TFIDF on a clear majority of relations.
Model Analysis
Statistical models like BOW+TFIDF and Ngrams+TFIDF performed comparably to BERT on a minority of relations. To better understand these models, we analyzed their most positively and negatively weighted features; Table 7 provides noteworthy examples, which contain semantically relevant phrases. We also inspected the weights of HERB's trained LR classifier: across relations, BERT had the highest average weight (5.05), followed by BM25 (2.56), while NER Count had the lowest weight (0.07).
Table 7: Highly weighted phrases given by the trained LR classifiers of Ngrams+TFIDF and BOW+TFIDF.

| Relation | Important Phrases |
|---|---|
| member-of | [org], is part of, ambassador, is associated with, [org] partner |
| family | [person], married, father, wife, children, daughter, parents, [number] |
| edu-at | [org], graduated, degree, studied, [org] in [number], is part of |
| position-held | [person], leader, president, actor, professor, writer, founder, police, portman |
| partner-org | [org], [number] [org], subsidiary, merger, the company, member of |
| founded-by | [person], founder, director, executive, chairman, co founder, head of, chief executive |
| ceo | ceo, [person] director, chief, officer, founders, chief executive officer, president |
| board-member | [org], [person], chairman, executive, board of directors, [number] senior executive, officer in charge, representative director |
Feature Ablations
We further performed an ablation analysis; Table 8 shows the average F1-scores when individual heuristics are removed from HERB. Removing either BM25 or Text Complexity leads to the largest drop in performance, indicating that these signals are not captured well by the other heuristics or by BERT.
Table 8: Average F1 performance with feature ablations. Text Complexity and BM25 are the most important features.

| Configuration | Avg. F1 |
|---|---|
| HERB | 39.3% |
| – Doc. Length | 36.8% (–2.44) |
| – Entity Saliency | 36.4% (–2.85) |
| – Alexa Ranking | 36.3% (–3.03) |
| – NER Count | 36.2% (–3.11) |
| – BM25 | 36.0% (–3.29) |
| – Text Complexity | 35.7% (–3.62) |
Human Performance
Finally, we compare our results against human performance on identifying high-coverage documents. For each relation, 10 randomly sampled test documents were labeled as informative or uninformative for RE solely by reading the document. Averaged over all relations, humans obtained an F1-score of 70.42%, compared with an average F1 of 39.3% for HERB; all baselines were significantly inferior. The large gap between humans and learned predictors shows the hardness of the coverage prediction task and underlines the need for the presented research.
7 Analysis and Discussion
Domain Dependency
To investigate how strongly prediction depends on in-domain training data, we performed a stress test in which the train, validation, and test sets were split along sub-domains (e.g., singers vs. entrepreneurs vs. politicians). Table 9 shows the resulting F1-scores (%). For HERB, the average F1-score is 34.3% on the in-domain test set and 34.2% on the out-of-domain test set; that is, there is no notable drop for the challenging domain-transfer case. We observe a minor drop for the larger relations, while the two smallest relations even show increases. This suggests that HERB learned generalizable features that are beneficial across domains.
Table 9: F1-scores (%) of HERB on the in-domain and out-of-domain test sets.

| Setting | member-of | family | edu-at | position-held | partner-org | founded-by | ceo | board-member | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| HERB (in-domain) | 40.8 | 41.8 | 34.9 | 42.8 | 28.4 | 17.1 | 45.4 | 23.3 | 34.3 |
| HERB (out-of-domain) | 35.7 | 39.7 | 32.5 | 39.1 | 29.4 | 23.8 | 42.3 | 31.1 | 34.2 |
| Training data size | 2194 | 1650 | 1458 | 2940 | 1124 | 828 | 2058 | 608 | – |
Evaluation of Document Ranking
So far, we have evaluated our methods on a binary prediction problem. However, use cases frequently require a ranking capability (see also Sec. 8). We additionally evaluate our methods on a ranking task, where documents are ranked by the score of positive predictions.
We use the mean Normalized Discounted Cumulative Gain (mean nDCG) (Järvelin and Kekäläinen, 2002) as the evaluation metric. We observe a performance trend similar to that under the F1 metric: HERB performs best with an average nDCG of 0.45 across relations, while BERT and Heu+TFIDF reach 0.44 and 0.43, respectively.
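For reference, mean nDCG over per-(e, r) document groups can be computed as sketched below (sklearn's ndcg_score is our choice; relevance here is the binary coverage label, and the paper's exact setup may differ):

```python
import numpy as np
from sklearn.metrics import ndcg_score

def mean_ndcg(groups):
    """groups: iterable of (true_labels, predicted_scores) pairs,
    one pair per (entity, relation) group of documents."""
    scores = [
        ndcg_score(np.asarray([y_true]), np.asarray([y_score]))
        for y_true, y_score in groups
        if len(y_true) > 1  # nDCG needs more than one ranked document
    ]
    return float(np.mean(scores))
```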
RE Limitations
The performance of RE methods significantly impacts the quality of GTweb as well as the measured RE coverage of documents. Although we used state-of-the-art commercial APIs, they nonetheless struggle on open web documents. To illustrate this, we randomly sampled 40 documents from DoCo and compared the count of RE tuples returned by Diffbot/Rosette against the count by a human relation extractor. Diffbot returned 60.6% fewer relational tuples and Rosette 72.3% fewer, suggesting the need for further improvement of RE methods.
Error Analysis
We analyzed HERB's incorrect predictions and categorized the errors. For each relation, we randomly sampled 10 incorrectly predicted documents, 5 false positives and 5 false negatives. Of the 80 samples in total, 63.75% of the documents contained only partial information for the chosen relation; for 15%, the extraction methods failed to extract all the necessary RE tuples; for 3.75%, the ground truth had an incomplete set of objects; 3.75% had noisy content; and 2.5% had incomplete information because the scraping methods failed on complex website layouts.
Multiple documents in the low-information category contained speculative content, for example, considerations about candidates for a new appointment as board member or CEO. In other cases, the document would mention an increased count of board members, but not their names. A few documents also had partial information leading to false positives; for example, a document partially about the footballer Sergio Agüero was incorrectly classified as informative for the family relation because it also contained a complete family history of another footballer, Diego Maradona (Sergio's father-in-law).
Conversely, documents may contain information relevant to a relation without actual mention of the relation, which leads to false negatives. For example, a document on the LinkedIn Corporation stating “ …Weiner stepped down from LinkedIn …He named Ryan Roslansky as his replacement.” was labeled uninformative for the ceo relation. Although Ryan Roslansky and LinkedIn are related through the ceo relation, the implicit statement was not noticed by HERB.
We specifically inspected the IR baselines’ performance to understand better why these are mediocre predictors at best. The IR signals about entire documents merely reflect that a document is on the proper topic given by the query entity, but that does not necessarily imply that the document contains many relational facts about the target entity. For RE coverage, IR-style document-query relevance is a necessary cue but not a sufficient criterion.
Efficiency and Scalability
We measured the run-time of HERB against a state-of-the-art neural model for document-level RE (DocRED) (Yao et al., 2019). Based on the DocRED leaderboard, we selected the currently best open-source method: the Transformer-based Structured Self-Attention Network (SSAN) (Xu et al., 2021).
A sample of 100 documents from DoCo was given to both HERB and SSAN and processed as follows. For HERB, features are computed utilizing BERT, followed by coverage prediction. For SSAN, documents first need to be pre-processed into the necessary DocRED representation; this includes named entity recognition and pair-wise co-reference resolution, using Stanza to properly group occurrences of the same entity.
The measurements show the following. HERB takes about 2 seconds, on average, to process one document, whereas SSAN requires 13.6 seconds, a factor of 6.8 more in time and resource consumption. The difference becomes even more prominent for very long documents with many named-entity mentions: HERB's run-time grows linearly with document length, while SSAN's run-time grows quadratically with the number of entity mentions.
This quadratic complexity of full-fledged neural RE has inherent reasons (as stated in Yao et al., 2019): document-level relation extraction generally requires computations for all possible pairs of entity mentions, since neural RE methods need the positions of candidate entity pairs as input.
8 Applications
To demonstrate the importance of coverage prediction, we evaluated its utility in two use cases, knowledge base construction and claim refutation. For the former, we discuss the importance of ranking documents by RE coverage (Section 8.1) and a practically relevant setting where RE is constrained by resource budgets (Section 8.2).
8.1 Document Ranking for Relation Extraction
Relation extraction plays a pivotal role in KB construction, and we show the relevance of coverage estimates for prioritizing among documents. Entities from our test set serve as subjects for RE. We select the top-k documents from the test corpus using four different techniques and compare the methods by the total number of extracted RE tuples per subject, computing recall w.r.t. the Wikidata ground truth.
Random: A random sample of documents.
IR-Relevance: Using BM25 to identify the most relevant documents.
Coverage Prediction: HERB’s predictions to rank documents.
Coverage Oracle: Selecting documents by their ground-truth labels from DoCo. This ranking gives an upper bound on what an ideal method could achieve.
Setup
Document coverage is calculated on a per (e, r) pair basis. In a single iteration, all methods are given a set of documents partitioned by (e, r) pairs. Each method ranks the documents with its own technique, and the top-k ranked documents are given to the RE API (Rosette or Diffbot) to obtain the set of relational tuples.
Results
Figure 5 (top) compares the total number of RE tuples obtained by the four methods, averaged across test entities and the 8 chosen relations. Notably, BM25 does not perform much better than random, while coverage prediction is not far behind the perfect ranking defined by the coverage oracle. Ordering documents by coverage prediction instead of IR-relevance yields 50% more extractions from the top-10 documents.
Figure 5: Total yield (top) and precision (bottom) of KBC based on different document ranking methods.
Figure 5 (bottom) shows the number of RE tuples that match the Wikidata KB, thus comparing the methods on precision. As expected, the coverage-oracle method wins, since it ranks by the correct coverage values. HERB's coverage prediction performs considerably better than IR-relevance and the other methods, and matches the coverage oracle for k ≥ 4. For k > 15, all methods yield nearly the same sets of tuples and hence similar precision.
8.2 Budget-constrained Relation Extraction
Document coverage predictions are particularly important for massive-scale RE tasks targeted at long-tail entities, such as populating or augmenting a domain-specific knowledge base (e.g., about diabetes or jazz music). Such tasks may require screening a huge number of documents. Therefore, practically viable RE methods need to operate under budget constraints, regarding the monetary cost of computational resources (e.g., using and paying for cloud servers) as well as the cost of energy consumption and environmental impact.
In the experiment described here, we simulate this setting, comparing standard RE by SSAN against HERB-enhanced RE where HERB prioritizes documents for RE by SSAN. We assume a budget of 10 minutes of processing time and give both methods 100 candidate documents. SSAN selects documents randomly and processes them until it runs out of time. HERB+SSAN sorts documents by HERB scores for high coverage and then lets SSAN process them in this order. The time for HERB itself is part of the 10-minute budget for the HERB+SSAN method.
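The prioritization scheme can be sketched as follows (timing logic and names are ours; run_re stands for the SSAN extraction call and herb_score for HERB's coverage score):

```python
import time

def budgeted_extraction(documents, run_re, herb_score=None, budget_seconds=600):
    """Run the RE system on as many documents as the time budget allows.

    With herb_score given, documents are first ranked by predicted coverage
    (the scoring time counts against the same budget); otherwise they are
    processed in their given (random) order.
    """
    start = time.monotonic()
    if herb_score is not None:
        documents = sorted(documents, key=herb_score, reverse=True)
    extracted = []
    for doc in documents:
        if time.monotonic() - start > budget_seconds:
            break
        extracted.extend(run_re(doc))
    return extracted
```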
As a proof-of-concept, we ran this experiment for a sample of 10 different entities (each with a pool of 100 documents).
Table 10 shows the results. Due to the upfront cost of HERB, HERB+SSAN processes fewer documents within the 10-minute budget, but its yield is substantially higher than that of SSAN alone, by a factor of 1.63. This demonstrates the value of document-coverage prediction in realistic usage.
8.3 Claim Refutation
Our second use case is fact-checking, specifically the case of refuting false claims by providing counter-evidence via RE.
Reasoning
Extraction confidence and document coverage are conceptually independent notions. However, when looking at sets of documents, an interesting relation emerges. Consider two documents, d1 with high coverage, and d2 with low coverage, along with two claims c1 and c2 from the respective documents, extracted with the same confidence. Can we use coverage information to make claims about extraction correctness?
We propose the following hypothesis: Given that d1 is asserted to have high coverage, we can conclude that any statement not mentioned in d1 (like c2) is more likely false. In contrast, the low coverage of d2 implies that d2 is unlikely to contain all factual statements. Thus, c1 not being found in d2 is no indication that it could not be true.
Validation
We experimentally validated this reasoning as follows. From the relation extractions over the test-set documents, we randomly sampled 69 pairs of claims about the same entity and relation that had low support (i.e., the extraction was found on only one website). We then ordered the pairs by the coverage of the documents that did not express them, obtaining 69 claims whose non-expressing documents had relatively higher coverage and 69 claims whose non-expressing documents had relatively lower coverage.
We manually verified the correctness of each claim on the Internet, verifying annotator agreement on a sub-sample, where we found a high Fleiss’ Kappa (Fleiss, 1971) inter-annotator agreement of 0.82.
Using these annotations, we found that of the 69 claims absent from lower-coverage documents, 58% (40) were correct, while of those absent from higher-coverage documents, only 36% (25) were correct. In other words, the fraction of correct claims absent from low-coverage documents is 1.6 times higher; coverage can thus be used as a feature for claim refutation.
Table 11 shows examples of claims absent from high-coverage documents.
Table 11: Incorrect claims extracted by the Diffbot RE API from documents predicted as low-coverage.

| Subject | Relation | Object | Document Snippet |
|---|---|---|---|
| Alphabet Inc. | ceo | Susan Wojcicki | Susan Wojcicki is CEO of Alphabet subsidiary YouTube, which has 2 billion monthly users. |
| Oracle Corporation | founded-by | David Agus | Oracle Co-founder Larry Ellison and acclaimed physician and scientist Dr. David Agus formed Sensei Holdings, Inc. |
| PepsiCo | board-member | Joan Crawford | Film actress Joan Crawford, after marrying Pepsi-Cola president Alfred N. Steele became a spokesperson for Pepsi. |
9 Conclusion
This paper introduces the new task of document coverage prediction and a large dataset for experimental study of the task. Our methods show that heuristic features can boost the performance of pre-trained language models without costly fine-tuning. Moreover, we demonstrate the value of coverage estimates for the use cases of knowledge base construction and claim refutation. Our future research includes developing a user-friendly tool to support knowledge engineers.
Acknowledgments
We thank Andrew Yates for his suggestions. Further thanks to the anonymous reviewers, action editor, and fellow researchers at MPI, for their comments towards improving our paper. This work is supported by the German Science Foundation (DFG: Deutsche Forschungsgemeinschaft) by grant 4530095897: “Negative Knowledge at Web Scale”.