Learning Typed Entailment Graphs with Global Soft Constraints

This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-source news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment data sets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.


Introduction
Recognizing textual entailment and paraphrasing is critical to many core natural language processing applications such as question answering and semantic parsing. The surface form of a sentence that answers a question such as "Does Verizon own Yahoo?" frequently does not directly correspond to the form of the question, but is rather a paraphrase or an expression such as "Verizon bought Yahoo," that entails the answer. The lack of a well-established form-independent semantic representation for natural language is the most important single obstacle to bridging the gap between queries and text resources.
This paper seeks to learn meaning postulates (e.g., buying entails owning) that can be used to augment the standard form-dependent semantics. Our immediate goal is to learn entailment rules between typed predicates with two arguments, where the type of each predicate is determined by the types of its arguments. We construct typed entailment graphs, with typed predicates as nodes and entailment rules as edges. Figure 1 shows simple examples of such graphs with arguments of types company,company and person,location.
Entailment relations are detected by computing a similarity score between the typed predicates based on the distributional inclusion hypothesis, which states that a word (predicate) u entails another word (predicate) v if in any context that u can be used, v can be used in its place (Dagan et al., 1999; Geffet and Dagan, 2005; Herbelot and Ganesalingam, 2013; Kartsaklis and Sadrzadeh, 2016). Most previous work has taken a "local learning" approach (Lin, 1998; Weeds and Weir, 2003; Szpektor and Dagan, 2008; Schoenmackers et al., 2010), namely, learning entailment rules independently from each other.
One problem facing local learning approaches is that many correct edges are not identified because of data sparsity and many wrong edges are spuriously identified as valid entailments. A "global learning" approach, where dependencies between entailment rules are taken into account, can improve the local decisions significantly. Berant et al. (2011) imposed transitivity constraints on the entailments, such that the inclusion of rules i→j and j→k implies that of i→k. Although they showed transitivity constraints to be effective in learning entailment graphs, the Integer Linear Programming (ILP) solution of Berant et al. is not scalable beyond a few hundred nodes. In fact, the problem of finding a maximally weighted transitive subgraph of a graph with arbitrary edge weights is NP-hard (Berant et al., 2011).
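To make the transitivity constraint concrete, the following minimal Python sketch lists the triples (i, j, k) for which i→j and j→k are present but i→k is missing. The function name and the edge-set representation are illustrative assumptions, not part of the method described here.

```python
def transitivity_violations(edges):
    """Return sorted triples (i, j, k) with i->j and j->k in edges but i->k missing."""
    edges = set(edges)
    return sorted(
        (i, j, k)
        for (i, j) in edges
        for (j2, k) in edges
        if j == j2 and i != k and (i, k) not in edges
    )


# Hypothetical example: buying entails owning, owning entails having.
example_edges = {("buy", "own"), ("own", "have")}
violations = transitivity_violations(example_edges)
```

Adding the edge ("buy", "have") to the example removes the violation, which is what a hard transitivity constraint would enforce.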
This paper instead proposes a scalable solution that does not rely on transitive closure, but uses two global soft constraints that maintain structural similarity both across and within each typed entailment graph (Figure 2). We introduce an unsupervised framework to learn globally consistent similarity scores given local similarity scores (§4). Our method is highly parallelizable and takes only a few hours to apply to more than 100K predicates. Our experiments (§6) show that the global scores improve significantly over local scores and outperform state-of-the-art entailment graphs on two standard entailment rule data sets (Berant et al., 2011; Holt, 2018). We ultimately intend the typed entailment graphs to provide a resource for entailment and paraphrase rules for use in semantic parsing and open domain question answering, as has been done for similar resources such as the Paraphrase Database (PPDB; Ganitkevitch et al., 2013; Pavlick et al., 2015) in Wang et al. (2015) and Dong et al. (2017). With that end in view, we have included a comparison with PPDB in our evaluation on the entailment data sets. We also show that the learned entailment rules improve performance on a question-answering task (§7) with no tuning or prior knowledge of the task.

Related Work
Our work is closely related to Berant et al. (2011), where entailment graphs are learned by imposing transitivity constraints on the entailment relations. However, the exact solution to the problem is not scalable beyond a few hundred predicates, whereas the number of predicates that we capture is two orders of magnitude larger (§5). Hence, it is necessary to resort to approximate methods based on assumptions concerning the graph structure. Berant et al. propose Tree-Node-Fix (TNF), an approximation method that scales better by additionally assuming the entailment graphs are "forest reducible," where a predicate i cannot entail two (or more) predicates j and k such that neither j→k nor k→j (FRG assumption). However, the FRG assumption is not correct for many real-world domains. For example, a person visiting a place entails both arriving at that place and leaving that place, although the latter do not necessarily entail each other. Our work injects two other types of prior knowledge about the structure of the graph that are less expensive to incorporate and yield better results on entailment rule data sets.
Abend et al. (2014) learn entailment relations over multi-word predicates with different levels of compositionality. Pavlick et al. (2015) add a variety of relations, including entailment, to phrase pairs in PPDB. This includes a broader range of entailment relations such as lexical entailment. In contrast to our method, these works rely on supervised data and take a local learning approach.
Figure 2 (caption, partially recovered): ... (A) across different but related typed entailment graphs and (B) within each graph. 0 ≤ β ≤ 1 determines how much different graphs are related. The dotted edges are missing, but will be recovered by considering relationships shown by across-graph (red) and within-graph (light blue) connections.
Another related strand of research is link prediction (Bordes et al., 2013; Riedel et al., 2013; Socher et al., 2013; Yang et al., 2015; Trouillon et al., 2016; Dettmers et al., 2018), where the source data are extractions from text, facts in knowledge bases, or both. Unlike our work, which directly learns entailment relations between predicates, these methods aim at predicting the source data, that is, whether two entities have a particular relationship. The common wisdom is that entailment relations are a byproduct of these methods (Riedel et al., 2013). However, this assumption has not usually been explicitly evaluated. Explicit entailment rules provide explainable resources that can be used in downstream tasks. Our experiments show that our method significantly outperforms a state-of-the-art link prediction method.

Computing Local Similarity Scores
We first extract binary relations as predicate-argument pairs using a combinatory categorial grammar (CCG; Steedman, 2000) semantic parser (§3.1). We map the arguments to their Wikipedia URLs using a named entity linker (§3.2). We extract types such as person and disease for each argument (§3.2). We then compute local similarity scores between predicate pairs (§3.3).

Relation Extraction
The semantic parser of Reddy et al. (2014), GraphParser, is run on the NewsSpike corpus (Zhang and Weld, 2013) to extract binary relations between a predicate and its arguments from sentences. GraphParser uses CCG syntactic derivations and λ-calculus to convert sentences to neo-Davidsonian semantics, a first-order logic that uses event identifiers (Parsons, 1990). For example, for the sentence "Obama visited Hawaii in 2012", GraphParser produces the logical form ∃e. visit_1(e, Obama) ∧ visit_2(e, Hawaii) ∧ visit_in(e, 2012), where e denotes an event. We consider a relation for each pair of arguments; hence, there will be three relations for the given sentence: visit_{1,2} with arguments (Obama, Hawaii), visit_{1,in} with arguments (Obama, 2012), and visit_{2,in} with arguments (Hawaii, 2012). We currently only use extracted relations that involve two named entities or one named entity and a noun. We constrain the relations to have at least one named entity to reduce ambiguity in finding entailments.
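The pairing step described above can be sketched in a few lines of Python. This is only an illustration of how one event with three arguments yields three binary relations; the function name and the tuple layout are assumptions, and the named-entity filter is omitted.

```python
from itertools import combinations


def binary_relations(predicate, event_args):
    """Pair up the arguments of one event.

    event_args: list of (slot, filler) pairs in slot order.
    Returns tuples (predicate, (slot_a, slot_b), filler_a, filler_b).
    """
    return [
        (predicate, (slot_a, slot_b), arg_a, arg_b)
        for (slot_a, arg_a), (slot_b, arg_b) in combinations(event_args, 2)
    ]


# Arguments of the event in "Obama visited Hawaii in 2012".
relations = binary_relations(
    "visit", [("1", "Obama"), ("2", "Hawaii"), ("in", "2012")]
)
```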
We perform a few automatic post-processing steps on the output of the parser. First, we normalize the predicates by lemmatization of their head words. Passive predicates are mapped to active ones and we extract negations and particle verb predicates. Next, we discard unary relations and relations involving coordination of arguments. Finally, whenever we see a relation between a subject and an object, and a relation between the object and a third argument connected by a prepositional phrase, we add a new relation between the subject and the third argument by concatenating the relation name with the object. For example, for the sentence "China has a border with India", we extract a relation have_border_{1,with} between China and India. We perform a similar process for prepositional phrases attached to verb phrases. Most of the light verbs and multiword predicates will be extracted by the above post-processing (e.g., take_care_{1,of}), which will recover many salient ternary relations.
Although entailments and paraphrasing can benefit from n-ary relations (for example, person visits a location in a time), we currently follow previous work (Lewis and Steedman, 2013a) in confining our attention to binary relations, leaving the construction of n-ary graphs to future work.

Linking and Typing Arguments
Entailment and paraphrasing depend on context. Although using exact context is impractical in forming entailment graphs, many authors have used the type of the arguments to disambiguate polysemous predicates (Berant et al., 2011; Lewis and Steedman, 2013a; Lewis, 2014). Typing also reduces the size of the entailment graphs.
Because named entities can be referred to in many different ways, we use a named entity linking tool to normalize the named entities. In the following experiments, we use AIDALight (Nguyen et al., 2014), a fast and accurate named entity linker, to link named entities to their Wikipedia URLs (if any). We thus type all entities that can be grounded in Wikipedia. We first map the Wikipedia URL of the entities to Freebase (Bollacker et al., 2008). We select the most notable type of the entity from Freebase and map it to FIGER types (Ling and Weld, 2012) such as building, disease, person, and location, using only the first level of the FIGER type hierarchy. For example, instead of event/sports_event, we use event as type. If an entity cannot be grounded in Wikipedia or its Freebase type does not have a mapping to FIGER, we assign the default type thing to it.
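The typing rule above (first level of the FIGER hierarchy, with thing as the fallback) might look as follows. The mapping table is a hypothetical stand-in: its keys are made-up Freebase-style type names, and the real Freebase-to-FIGER mapping is not reproduced here.

```python
# Hypothetical mapping from a Freebase notable type to a FIGER type.
FREEBASE_TO_FIGER = {
    "sports.championship_event": "event/sports_event",
    "medicine.disease": "disease",
    "people.person": "person",
}


def first_level_type(freebase_type):
    """Return the first-level FIGER type, or 'thing' if no mapping exists."""
    figer_type = FREEBASE_TO_FIGER.get(freebase_type)
    if figer_type is None:
        return "thing"  # ungrounded entity or unmapped Freebase type
    return figer_type.split("/")[0]
```

Passing None for an entity that could not be linked also falls through to thing, matching the default described in the text.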

Local Distributional Similarities
For each typed predicate (e.g., visit_{1,2} with types person, location), we extract a feature vector. We use as feature types the set of argument pair strings (e.g., Obama-Hawaii) that instantiate the binary relations of the predicates. The value of each feature is the pointwise mutual information between the predicate and the feature. We use the feature vectors to compute three local similarity scores (both symmetric and directional) between typed predicates: Weeds (Weeds and Weir, 2003), Lin (Lin, 1998), and Balanced Inclusion (BInc; Szpektor and Dagan, 2008) similarities.
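As a rough illustration, the sketch below computes the three scores on sparse feature vectors using their commonly cited definitions: Weeds precision, Lin similarity, and BInc as the geometric mean of the two. These formulas are not spelled out in this section, so treat them as assumptions; in particular, only the directional precision component of Weeds is shown, and feature weights are assumed to be precomputed non-negative PMI values.

```python
import math


def weeds_precision(u, v):
    """Share of u's feature weight that falls on features also seen with v."""
    total = sum(u.values())
    shared = sum(weight for feat, weight in u.items() if feat in v)
    return shared / total if total else 0.0


def lin(u, v):
    """Symmetric overlap: weight on shared features over total weight."""
    total = sum(u.values()) + sum(v.values())
    shared = sum(u[feat] + v[feat] for feat in u if feat in v)
    return shared / total if total else 0.0


def binc(u, v):
    """Balanced Inclusion: geometric mean of Lin and Weeds precision."""
    return math.sqrt(lin(u, v) * weeds_precision(u, v))


# Toy vectors: every argument pair of `narrow` also occurs with `broad`.
narrow = {"Obama-Hawaii": 1.0, "Merkel-Paris": 1.0}
broad = {"Obama-Hawaii": 1.0, "Merkel-Paris": 1.0, "Pope-Rome": 2.0}
```

Because the features of narrow are included in those of broad, the directional scores favor narrow→broad over the reverse, which is the behavior the distributional inclusion hypothesis asks for.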

Learning Globally Consistent Entailment Graphs
We learn globally consistent similarity scores based on local similarity scores. The global scores will be used to form typed entailment graphs.

Problem Formulation
Let T be a set of types and P be a set of predicates. We denote by V(t_1, t_2) the set of typed predicates p(:t_1, :t_2), where t_1, t_2 ∈ T and p ∈ P. Each p(:t_1, :t_2) ∈ V(t_1, t_2) takes as input arguments of types t_1 and t_2. An example of a typed predicate is win_{1,2}(:team, :event), which can be instantiated with win_{1,2}(Seahawks:team, Super Bowl:event).
We define V = ∪_{t_1,t_2} V(t_1, t_2), the set of all typed predicates, and W^0 as a block-diagonal matrix consisting of all the local similarity matrices W^0(t_1, t_2). (For each similarity measure, we define one separate matrix and run the learning algorithm separately, but for simplicity of notation, we do not show the similarity measure names.) Similarly, we define W(t_1, t_2) and W as the matrices consisting of globally consistent similarity scores w_ij we wish to learn. The global similarity scores are used to form entailment graphs by thresholding W. For a δ > 0, we define typed entailment graphs as G_δ(t_1, t_2) = (V(t_1, t_2), E_δ(t_1, t_2)), where V(t_1, t_2) are the nodes and E_δ(t_1, t_2) = {(i, j) | i, j ∈ V(t_1, t_2), w_ij ≥ δ} are the edges of the entailment graphs.
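Forming a graph from the scores is a plain thresholding step. A minimal sketch, assuming the scores of one typed graph are stored as a dictionary keyed by predicate pairs (a representation chosen here for illustration):

```python
def entailment_edges(scores, delta):
    """Keep the pairs (i, j) whose similarity score w_ij is at least delta."""
    return {pair for pair, w in scores.items() if w >= delta}


# Hypothetical global scores for the (person, location) graph.
scores = {
    ("visit", "arrive at"): 0.42,
    ("arrive at", "visit"): 0.05,
    ("visit", "leave"): 0.31,
}
edges = entailment_edges(scores, delta=0.3)
```

Sweeping delta from high to low trades precision for recall, which is how the precision-recall curves in the experiments are produced.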

Learning Algorithm
Existing approaches to learn entailment graphs from text miss many correct edges because of data sparsity, namely, the lack of explicit evidence in the corpus that a predicate i entails another predicate j. The goal of our method is to use evidence from the existing edges that have been assigned high confidence to predict missing ones and remove spurious edges. We propose two global soft constraints that maintain structural similarity both across and within each typed entailment graph. The constraints are based on the following two observations.
First, it is standard to learn a separate typed entailment graph for each (plausible) type-pair because arguments provide necessary disambiguation for predicate meaning (Berant et al., 2011; Lewis and Steedman, 2013a,b). However, many entailment relations for which we have direct evidence only in a few subgraphs may in fact apply over many others (Figure 2A). For example, we may not have found direct evidence that mentions of a living_thing (e.g., a virus) triggering a disease are accompanied by mentions of the living_thing causing that disease (because of data sparsity), whereas we have found that mentions of a government_agency triggering an event are reliably accompanied by mentions of causing that event. While we show that typing is necessary to learning entailments (§6), we propose to learn all typed entailment graphs jointly.
Second, we encourage paraphrase predicates (where i→j and j→i) to have the same patterns of entailment (Figure 2B), that is, to entail and be entailed by the same predicates. We call these global soft constraints paraphrase resolution. Using these soft constraints, a missing entailment (e.g., medicine treats disease → medicine is useful for disease) can be identified by considering the entailments of a paraphrase predicate (e.g., medicine cures disease → medicine is useful for disease).
Figure 3: The objective function to jointly learn global scores W and the compatibility function β, given local scores W^0. L_withinGraph encourages global and local scores to be close; L_crossGraph encourages similarities to be consistent between different typed entailment graphs; L_pResolution encourages paraphrase predicates to have the same pattern of entailment. We use an ℓ1 regularization penalty to remove entailments with low confidence.
Sharing entailments across different typed entailment graphs is only semantically correct for some predicates and types. In order to learn when we can generalize an entailment from one graph to another, we define a compatibility function β : P × (T × T) × (T × T) → [0, 1]. The function is defined for a predicate and two type pairs (Figure 2A). It specifies the extent of compatibility for a single predicate between different typed entailment graphs, with 1 being completely compatible and 0 being irrelevant. In particular, β(p, (t_1, t_2), (t'_1, t'_2)) determines how much we expect the outgoing edges of p(:t_1, :t_2) and p(:t'_1, :t'_2) to be similar. We constrain β to be symmetric between (t_1, t_2) and (t'_1, t'_2), as compatibility of outgoing edges of p(:t_1, :t_2) with p(:t'_1, :t'_2) should be the same as p(:t'_1, :t'_2) with p(:t_1, :t_2). We denote by \vec{β} a vectorization consisting of the values of β for all possible input predicates and types.
Note that the global similarity scores W and the compatibility function β are not known in advance. Given local similarity scores W^0, we learn W and β jointly. We minimize the loss function defined in Equation (1), which consists of three soft constraints defined below and an ℓ1 regularization term (Figure 3).
L_withinGraph. Equation (2) encourages global scores w_ij to be close to local scores w^0_ij, so that the global scores will not stray too far from the original scores.
L_crossGraph. Equation (3) encourages each predicate's entailments to be similar across typed entailment graphs (Figure 2A) if the predicates have similar neighbors. We penalize the difference of entailments in two different graphs when the compatibility function is high. For each pair of typed predicates i, j ∈ V(t_1, t_2), we define a set of neighbors N(i, j) (predicates with different types): the pairs (i', j') in another typed graph V(t'_1, t'_2) that have the same predicates as i and j, respectively, and for which a(i', j') = a(i, j), where a(i, j) is true if the argument orders of i and j match, and false otherwise. For each (i', j') ∈ N(i, j), we penalize the difference of entailments by adding the term β(·)(w_ij − w_i'j')^2. We add a prior term on \vec{β} as λ_2 ‖1 − \vec{β}‖_2^2, where 1 is a vector of the same size as \vec{β} with all 1s. Without the prior term (i.e., λ_2 = 0), all the elements of \vec{β} will become zero. Increasing λ_2 will keep (some of the) elements of \vec{β} non-zero and encourages communication between related graphs.
L_pResolution. Equation (4) denotes the paraphrase resolution global soft constraints that encourage paraphrase predicates to have the same patterns of entailments (Figure 2B). The function I_ε(x) equals x if x > ε and zero otherwise. Unlike L_crossGraph in Equation (3), Equation (4) operates on the edges within each graph. If both w_ij and w_ji are high, their incoming and outgoing edges from/to nodes k are encouraged to be similar. We name this global constraint paraphrase resolution, because it might add missing links (e.g., i→k) if i and j are paraphrases of each other and j→k, or break the paraphrase relation, if the incoming and outgoing edges are very different.
We impose an ℓ1 penalty on the elements of W as λ_1 ‖W‖_1, where λ_1 is a nonnegative tuning hyperparameter that controls the strength of the penalty applied to the elements of W. This term removes entailments with low confidence from the entailment graphs. Note that Equation (1) has W^0 and the average of W^0 across different typed entailment graphs (§5.4) as its special cases. The former is achieved by setting λ_1 = λ_2 = 0 and ε = 1, and the latter by λ_1 = 0, λ_2 = ∞, and ε = 1. We do not explicitly weight the different components of the loss function, as the effect of L_crossGraph and L_pResolution can be controlled by λ_2 and ε, respectively.
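The pieces of the loss can be put together in a toy Python sketch. The exact form of Equations (1) to (4) is not reproduced in this text, so the constants, summation ranges, and data layout below are assumptions that only follow the verbal description: a squared within-graph term, a β-weighted cross-graph term with the prior λ_2 ‖1 − β‖^2, a paraphrase resolution term gated by I_ε, and an ℓ1 penalty.

```python
def i_eps(x, eps):
    """I_eps(x): x if x > eps, else 0."""
    return x if x > eps else 0.0


def objective(w, w0, links, beta, lam1, lam2, eps):
    """Toy loss. w, w0: {(graph, i, j): score}; links: [(key_a, key_b, beta_key)]
    pairs the same predicate pair in two graphs; beta: {beta_key: value}."""
    keys = set(w) | set(w0)
    within = sum((w.get(k, 0.0) - w0.get(k, 0.0)) ** 2 for k in keys)
    cross = sum(
        beta[b] * (w.get(ka, 0.0) - w.get(kb, 0.0)) ** 2 for ka, kb, b in links
    )
    prior = lam2 * sum((1.0 - b) ** 2 for b in beta.values())
    nodes = {}
    for g, i, j in w:
        nodes.setdefault(g, set()).update((i, j))
    resolution = 0.0
    for g, ns in nodes.items():
        for i in ns:
            for j in ns - {i}:
                # Gate is non-zero only if i and j look like paraphrases.
                gate = i_eps(w.get((g, i, j), 0.0), eps) * i_eps(w.get((g, j, i), 0.0), eps)
                if gate == 0.0:
                    continue
                for k in ns - {i, j}:
                    out_diff = w.get((g, i, k), 0.0) - w.get((g, j, k), 0.0)
                    in_diff = w.get((g, k, i), 0.0) - w.get((g, k, j), 0.0)
                    resolution += gate * (out_diff ** 2 + in_diff ** 2)
    l1 = lam1 * sum(abs(v) for v in w.values())
    return within + cross + prior + resolution + l1


# Hypothetical local scores in two typed graphs g1 and g2.
w0 = {
    ("g1", "cure", "treat"): 0.8,
    ("g1", "treat", "cure"): 0.7,
    ("g1", "cure", "be useful for"): 0.6,
    ("g2", "cure", "treat"): 0.2,
}
links = [(("g1", "cure", "treat"), ("g2", "cure", "treat"), "cure")]
```

Under these assumptions, setting λ_1 = λ_2 = 0, ε = 1, and β = 0 makes W = W^0 a zero-loss solution, which mirrors the first special case mentioned above.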
Equation (1) can be interpreted as an inference problem in a Markov random field (MRF) (Kindermann and Snell, 1980), where the nodes of the MRF are the global scores w_ij and the parameters β(p, (t_1, t_2), (t'_1, t'_2)). The MRF will have five log-linear factor types: one unary factor type for L_withinGraph, one three-variable factor type for the first term of L_crossGraph, a unary factor type for the prior on β, one four-variable factor type for L_pResolution, and a unary factor type for the ℓ1 regularization term. Figure 2 shows an example factor graph (unary factors are not shown for simplicity).
We learn W and β jointly using a message passing approach based on the Block Coordinate Descent method (Xu and Yin, 2013). We initialize W = W^0. Assuming that we know the global similarity scores W, we learn how much the entailments are compatible between different types (β) and vice versa. Given W fixed, each w_ij sends messages to the corresponding β(·) elements, which will be used to update β. Given β fixed, we do one iteration of learning for each w_ij. Each β(·) and w_ij element sends messages to the related elements in W, which will in turn be updated. Based on the update rules (Appendix A), we always have w_ij ≤ 1 and β ≤ 1.
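The update rules themselves are in Appendix A and are not reproduced here. As an illustration of what one block update can look like, assume a single β element enters the loss only through β·c + λ_2(1 − β)^2, where c is the sum of squared score differences it weights. Setting the derivative c − 2λ_2(1 − β) to zero gives β = 1 − c / (2λ_2), clipped to [0, 1]. The function below is a sketch under that assumption, not the authors' update.

```python
import math


def update_beta(sq_diff_sum, lam2):
    """Closed-form minimizer of beta * c + lam2 * (1 - beta)**2 over [0, 1]."""
    if lam2 == 0:
        return 0.0  # no prior: nothing rewards a non-zero beta
    if math.isinf(lam2):
        return 1.0  # infinitely strong prior: beta is pinned to 1
    return min(1.0, max(0.0, 1.0 - sq_diff_sum / (2.0 * lam2)))
```

The two boundary cases agree with the text: λ_2 = 0 drives β to zero, and λ_2 = ∞ forces every β to 1, which reduces the model to a uniform average across graphs.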
Each iteration of the learning method takes O(‖W‖_0 |T|^2 + Σ_{i∈V} (‖w_i:‖_0 + ‖w_:i‖_0)^2) time, where ‖W‖_0 is the number of nonzero elements of W (number of edges in the current graph), |T| is the number of types, and ‖w_i:‖_0 (‖w_:i‖_0) is the number of nonzero elements of the ith row (column) of the matrix (out-degree and in-degree of the node i). 7 In practice, learning converges after five iterations of full updates. The method is highly parallelizable, and our efficient implementation does the learning in only a few hours.

Experimental Set-up
We extract binary relations from a multiple-source news corpus (§5.1) and compute local and global scores. We form entailment graphs based on the similarity scores and test our model on two entailment rules data sets (§5.2). We then discuss parameter tuning (§5.3) and baseline systems (§5.4).

Training Corpus: Multiple-Source News
We use the multiple-source NewsSpike corpus of Zhang and Weld (2013). NewsSpike was deliberately built to include different articles from different sources describing identical news stories. They scraped RSS news feeds from January-February 2013 and linked them to full stories collected through a Web search of the RSS titles. The corpus contains 550K news articles (20M sentences). Because this corpus contains multiple sources covering the same events, it is well suited to our purpose of learning entailment and paraphrase relations.
We extracted 29M binary relations using the procedure in §3.1. In our experiments, we used two cut-offs within each typed subgraph to reduce the effect of noise in the corpus: (1) remove any argument-pair that is observed with fewer than C_1 = 3 unique predicates; (2) remove any predicate that is observed with fewer than C_2 = 3 unique argument-pairs. This leaves us with |P| = 101K unique predicates in 346 entailment graphs. The maximum graph size is 53K nodes (there are 4 graphs with more than 20K nodes, 3 graphs with 10K to 20K nodes, and 16 graphs with 1K to 10K nodes), and the total number of non-zero local scores in all graphs is 66M. In the future, we plan to test our method on an even larger corpus, but preliminary experiments suggest that data sparsity will persist regardless of the corpus size, because of the power law distribution of the terms. We compared our extractions qualitatively with Stanford Open IE (Etzioni et al., 2011). Our CCG-based extraction generated noticeably better relations for longer sentences with long-range dependencies such as those involving coordination.
7 In our experiments, the total number of edges is ≈ .01|V|^2 and most predicate pairs are seen in fewer than 20 subgraphs, rather than |T|^2.
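The two cut-offs can be sketched as a filter over the (predicate, argument-pair) observations of one typed subgraph. The sketch assumes the cut-offs are applied once, in the order listed; whether the original pipeline iterates them is not stated.

```python
from collections import defaultdict


def apply_cutoffs(relations, c1=3, c2=3):
    """relations: set of (predicate, arg_pair) observed in one typed subgraph."""
    preds_per_pair = defaultdict(set)
    for pred, pair in relations:
        preds_per_pair[pair].add(pred)
    # Cut-off 1: drop argument pairs seen with fewer than c1 unique predicates.
    kept = {(p, a) for p, a in relations if len(preds_per_pair[a]) >= c1}
    pairs_per_pred = defaultdict(set)
    for pred, pair in kept:
        pairs_per_pred[pred].add(pair)
    # Cut-off 2: drop predicates seen with fewer than c2 unique argument pairs.
    return {(p, a) for p, a in kept if len(pairs_per_pred[p]) >= c2}


# Toy observations; A, B, C stand for argument-pair strings.
toy = {
    ("visit", "A"), ("arrive at", "A"), ("leave", "A"),
    ("visit", "B"), ("arrive at", "B"),
    ("fly to", "C"),
}
filtered = apply_cutoffs(toy, c1=2, c2=2)
```

With the toy thresholds of 2, the pair C and the predicate "leave" are removed, leaving only predicates and argument pairs with enough co-occurrence evidence.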

Evaluation Entailment Data Sets
Levy/Holt's Entailment Data Set. Levy and Dagan (2016) proposed a new annotation method (and a new data set) for collecting relational inference data in context. Their method removes a major bias in other inference data sets such as Zeichner's (Zeichner et al., 2012), where candidate entailments were selected using a directional similarity measure. Levy and Dagan form questions of the type which city (q_type), is located near (q_rel), mountains (q_arg)? and provide possible answers of the form Kyoto (a_answer), is surrounded by (a_rel), mountains (a_arg). Annotators are shown a question with multiple possible answers, where a_answer is masked by q_type to reduce the bias towards world knowledge. If the annotator indicates the answer as True (False), it is interpreted that the predicate in the answer entails (does not entail) the predicate in the question.
Whereas the Levy and Dagan entailment data set removes bias, a recent evaluation identified a high labeling error rate for entailments that hold only in one direction (Holt, 2018). Holt analyzed 150 positive examples and showed that 33% of the claimed entailments are correct only in the opposite direction, and 15% do not entail in any direction. Holt (2018) designed a task to crowd-annotate the data set by a) adding the reverse entailment (q→a) for each original positive entailment (a→q) in Levy and Dagan's data set; and b) directly asking the annotators if a positive example (or its reverse) is an entailment or not (as opposed to relying on a factoid question). We test our method on this re-annotated data set of 18,407 examples (3,916 positive and 14,491 negative), which we refer to as Levy/Holt (www.github.com/xavi-ai/relationalimplication-dataset). We run our CCG-based binary relation extraction on the examples and perform our typing procedure (§3.2) on a_answer (e.g., Kyoto) and a_arg (e.g., mountains) to find the types of the arguments. We split the re-annotated data set into dev (30%) and test (70%) such that all the examples with the same q_type and q_rel are assigned to only one of the sets.
Berant's Entailment Data Set. Berant et al. (2011) annotated all the edges of 10 typed entailment graphs based on the predicates in their corpus. The data set contains 3,427 edges (positive), and 35,585 non-edges (negative). We evaluate our method on all the examples of Berant's entailment data set. The types of this data set do not match with FIGER types, but we perform a simple hand-mapping between their types and FIGER types.
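The grouped dev/test split of the Levy/Holt data can be illustrated as follows. The shuffling, the seed, and the dictionary field names are assumptions; the only property taken from the text is that examples sharing q_type and q_rel never end up in both sets.

```python
import random
from collections import defaultdict


def grouped_split(examples, dev_fraction=0.3, seed=0):
    """Assign whole (q_type, q_rel) groups to dev until it holds ~dev_fraction."""
    groups = defaultdict(list)
    for ex in examples:
        groups[(ex["q_type"], ex["q_rel"])].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    dev, test = [], []
    target = dev_fraction * len(examples)
    for key in keys:
        (dev if len(dev) < target else test).extend(groups[key])
    return dev, test


# Toy examples: three question groups of sizes 3, 3, and 4.
toy_examples = [
    {"q_type": t, "q_rel": r, "id": n}
    for n, (t, r) in enumerate(
        [("city", "is located near")] * 3
        + [("person", "was born in")] * 3
        + [("company", "bought")] * 4
    )
]
dev_set, test_set = grouped_split(toy_examples)
```

Because groups are kept whole, the realized dev share only approximates 30% on small inputs.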

Parameter Tuning
We selected λ_1 = .01 and ε = .3 based on preliminary experiments on the dev set of Levy/Holt's data set. The hyperparameter λ_2 is selected from {0, 0.01, 0.1, 0.5, 1, 1.5, 2, 10, ∞}. We do not tune λ_2 for Berant's data set. We instead use the selected value based on the Levy/Holt dev set. In all our experiments, we remove any local score w^0_ij < .01. We show precision-recall curves by changing the threshold δ on the similarity scores.

Comparison
We test our model by ablation of the global soft constraints L_crossGraph and L_pResolution, testing simple baselines to resolve sparsity and comparing to the state-of-the-art resources. We also compare with two distributional approaches that can be used to predict predicate similarity. We compare the following models and resources.
CG_PR is our novel model with both global soft constraints L_crossGraph and L_pResolution. CG is our model without L_pResolution. Local is the local distributional similarities without any change.
AVG is the average of local scores across all the entailment graphs that contain both predicates in an entailment of interest. We set λ_2 = ∞, which forces all the values of β to be 1, hence resulting in a uniform average of local scores.
Untyped scores are local scores learned without types. We set the cut-offs C_1 = 20 and C_2 = 20 to have a graph with total number of edges similar to the typed entailment graphs.
ConvE scores are cosine similarities of low-dimensional predicate representations learned by ConvE (Dettmers et al., 2018), a state-of-the-art model for link prediction. ConvE is a multilayer convolutional network model that is highly parameter efficient. We learn 200-dimensional vectors for each predicate (and argument) by applying ConvE to the set of extractions of the above untyped graph. We learned embeddings for each predicate and its reverse to handle examples where the argument orders of the two predicates are different. Additionally, we tried TransE (Bordes et al., 2013), another link prediction method that, despite its simplicity, produces very competitive results in knowledge base completion. However, we do not present its full results, as they were worse than ConvE.
PPDB is based on the Paraphrase Database (PPDB) of Pavlick et al. (2015). We accept an example as entailment if it is labeled as a paraphrase or entailment in the PPDB XL lexical or phrasal collections.
Berant_ILP is based on the entailment graphs of Berant et al. (2011). For Berant's data set, we directly compared our results to the ones reported in Berant et al. (2011). For Levy/Holt's data set, we used publicly available entailment rules derived from Berant et al. (2011) that give us one point of precision and recall in the plots. Although the rules are typed and can be applied in a context-sensitive manner, ignoring the types and applying the rules out of context yields much better results (Levy and Dagan, 2016). This is attributable to both the non-standard types used by Berant et al. (2011) and also the general data sparsity issue.
In all our experiments, we first test a set of rule-based constraints introduced by Berant et al. (2011) on the examples before the prediction by our methods. In the experiments on Levy/Holt's data set, in order to maintain compatibility with Levy and Dagan (2016), we also run the lemma-based heuristic process used by them before applying our methods. We do not apply the lemma-based process on Berant's data set in order to compare with Berant et al.'s (2011) reported results directly. In experiments with CG_PR and CG, if the typed entailment graph corresponding to an example does not have one or both predicates, we resort to the average score between all typed entailment graphs.
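The AVG baseline and the fallback rule for missing predicates can be sketched together. The graph representation (a node set plus a score dictionary per type pair) is an assumption made for the example, and a missing edge between two present predicates is treated as a score of zero.

```python
def average_score(graphs, p, q):
    """Average score of p -> q over all typed graphs that contain both predicates."""
    values = [
        g["scores"].get((p, q), 0.0)
        for g in graphs.values()
        if p in g["nodes"] and q in g["nodes"]
    ]
    return sum(values) / len(values) if values else 0.0


def score_with_fallback(graphs, type_pair, p, q):
    """Use the graph for type_pair if it has both predicates, else the average."""
    g = graphs.get(type_pair)
    if g is not None and p in g["nodes"] and q in g["nodes"]:
        return g["scores"].get((p, q), 0.0)
    return average_score(graphs, p, q)


# Hypothetical typed graphs.
toy_graphs = {
    ("person", "location"): {
        "nodes": {"visit", "arrive at"},
        "scores": {("visit", "arrive at"): 0.5},
    },
    ("organization", "location"): {
        "nodes": {"visit", "arrive at"},
        "scores": {("visit", "arrive at"): 0.25},
    },
    ("person", "event"): {"nodes": {"visit"}, "scores": {}},
}
```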

Results and Discussion
To test the efficacy of our globally consistent entailment graphs, we compare them with the baseline systems in Section 6.1. We test the effect of approximating transitivity constraints in Section 6.2. Section 6.3 concerns error analysis.

Globally Consistent Entailment Graphs
We test our method using three distributional similarity measures: Weeds similarity (Weeds and Weir, 2003), Lin similarity (Lin, 1998), and Balanced Inclusion (BInc; Szpektor and Dagan, 2008). The first two similarity measures are symmetric, and BInc is directional. Figures 4A and 4B show precision-recall curves of the different methods on Levy/Holt's and Berant's data sets, respectively, using BInc. We show the full curves for BInc because it is directional and, on the development portion of Levy/Holt's data set, yields better results than Weeds and Lin.
In addition, Table 1 shows the area under the precision-recall curve (AUC) for all variants of the three similarity measures. Note that each method covers a different range of precisions and recalls. We compute AUC for precisions in the range [0.5, 1], because predictions with precision better than random guess are more important for end applications such as question answering and semantic parsing. For each similarity measure, we tested statistical significance between the methods using bootstrap resampling with 10K experiments (Efron and Tibshirani, 1985; Koehn, 2004). In Table 1, the best result for each data set and similarity measure is boldfaced. If the difference between another model and the best result is not statistically significant at p < 0.05, that model is also boldfaced.
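A restricted AUC of this kind can be approximated with a rectangle rule: walk down the ranked predictions and, each time a positive example is retrieved, add its precision times the recall step, but only while precision stays at or above 0.5. The exact integration scheme used for Table 1 is not given in this section, so the sketch below is one plausible reading; it also ignores ties between scores.

```python
def auc_above_precision(scored, min_precision=0.5):
    """scored: list of (score, is_positive). Area under the PR curve,
    counting only operating points with precision >= min_precision."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, gold in ranked if gold)
    area, true_pos = 0.0, 0
    for rank, (_, gold) in enumerate(ranked, start=1):
        if not gold:
            continue
        true_pos += 1
        precision = true_pos / rank
        if precision >= min_precision:
            area += precision / total_pos  # recall grows by 1 / total_pos
    return area


# Toy predictions: positives are ranked 1st, 2nd, and 4th.
toy_scored = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.5, False), (0.4, False), (0.3, False), (0.2, False),
]
```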
Among the distributional similarities based on BInc, BInc_CG_PR outperforms all the other models in both data sets. In comparison with BInc score's AUC, we observe more than 100% improvement on Levy/Holt's data set and about 30% improvement on Berant's. Given the consistent gains, our proposed model appears to alleviate the data sparsity and the noise inherent to local scores. Our method also outperforms PPDB and Berant_ILP on both data sets. The second-best performing model is BInc_CG, which improves the results significantly, especially on Berant's data set, over BInc_AVG (AUC of .177 vs. .144). This confirms that learning which subset of entailments should be generalized across different typed entailment graphs (β) is effective.
The untyped models yield a single large entailment graph. It contains (noisy) edges that are not found in smaller typed entailment graphs. Despite the noise, untyped models for all three similarity measures still perform better than the typed ones in terms of AUC. However, they do worse in the high-precision range. For example, BInc_untyped is worse than BInc for precision > 0.85. The AVG models do surprisingly well (only about 0.5 to 3.5 below CG_PR in terms of AUC), but note that only a subset of the typed entailment graphs might have (untyped) predicates p and q of interest (usually not more than 10 typed entailment graphs out of 367 graphs). Therefore, the AVG models are generally expected to outperform the untyped ones (with only one exception in our experiments), as typing has refined the entailments and averaging just improves the recall.
Comparison of CG_PR with CG models confirms that explicitly encouraging paraphrase predicates to have the same patterns of entailment is effective. It improves the results for BInc score, which is a directional similarity measure. We also tested applying the paraphrase resolution soft constraints alone, but the differences with the local scores were not statistically significant. This suggests that the paraphrase resolution is more helpful when similarities are transferred between graphs, as this can cause inconsistencies around the predicates with transferred similarities, which are then resolved by the paraphrase resolution constraints.
The results of the distributional representations learned by ConvE are worse than most other methods. We attribute this outcome to the fact that a) while entailment relations are directional, these methods are symmetric; b) the learned embeddings are optimized for tasks other than entailment or paraphrase detection; and c) the embeddings are learned regardless of argument types. However, even the BInc_untyped baseline outperforms ConvE, showing that it is important to use a directional measure that directly models entailment. We hypothesize that learning predicate representations based on the distributional inclusion hypothesis, which does not have the above limitations, might yield better results.

Effect of Transitivity Constraints
Our largest graph has 53K nodes; we thus tested approximate methods instead of the ILP to close entailment relations under transitivity (§2). The approximate TNF method of Berant et al. (2011) did not scale to the size of our graphs with moderate sparsity parameters. A heuristic method, High-To-Low Forest Reducible Graph (HTL-FRG), has also been proposed; it gets slightly better results than TNF on its authors' data set, and it scales to graphs of the size we work with.16 We applied the HTL-FRG method to the globally consistent similarity scores (BInc_CG_PR_HTL) and changed the threshold on the scores to get a precision-recall curve. Figures 4C and 4D show the results of this method on Levy/Holt's and Berant's data sets. Our experiments show, in contrast to previously reported results, that the HTL-FRG method leads to worse results when applied to our global scores. This result is caused both by the use of heuristic methods in place of globally optimizing via ILP, and by the removal of many valid edges arising from the fact that the FRG assumption is not correct for many real-world domains.
16 TNF did not converge after two weeks for threshold δ = .04. For δ = .12 (precisions higher than 80%), it converged, but with results slightly worse than HTL-FRG on both data sets.
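To make the thresholding step concrete, the following is a minimal sketch of how a precision-recall curve can be traced by sweeping a threshold over entailment scores. The function name, the data layout, and the strict "greater than" comparison are assumptions made for illustration; this is not the evaluation script used in the experiments.

```python
from typing import List, Tuple


def precision_recall_points(
    scores: List[float],
    labels: List[bool],
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each threshold.

    A candidate pair is predicted as an entailment when its score is
    strictly higher than the threshold (an assumed convention).
    """
    total_positive = sum(labels)
    points = []
    for t in thresholds:
        predicted = [s > t for s in scores]
        n_predicted = sum(predicted)
        if n_predicted == 0:
            continue  # precision is undefined when nothing is predicted
        true_positive = sum(1 for p, y in zip(predicted, labels) if p and y)
        precision = true_positive / n_predicted
        recall = true_positive / total_positive if total_positive else 0.0
        points.append((t, precision, recall))
    return points


# Hypothetical toy data: four candidate entailments with gold labels.
toy_points = precision_recall_points(
    scores=[0.9, 0.8, 0.3, 0.1],
    labels=[True, False, True, False],
    thresholds=[0.0, 0.5, 0.95],
)
```

Raising the threshold trades recall for precision; thresholds that leave no predictions are skipped because precision is undefined there.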

Error Analysis
We analyzed 100 false positive (FP) and 100 false negative (FN) randomly selected examples (using BInc_CG_ST results on Levy/Holt's data set and at the precision level of Berant_ILP, i.e., 0.76). We present our findings in Table 2. Most of the FN errors are due to data sparsity, but a few errors are due to wrong labeling of the data and parsing errors. More than half of the FP errors are because of spurious correlations in the data that are captured by the similarity scores, but are not judged to constitute entailment by the human judges. About one-third of the FP errors are because of the normalization we currently perform on the relations (e.g., we remove modals and auxiliaries). The remaining errors are mostly due to parsing and our use of Levy and Dagan's (2016) lemma-based heuristic process.

Extrinsic Evaluation
To further test the utility of explicit entailment rules, we evaluate the learned rules on an extrinsic task: answer selection for machine reading comprehension on NewsQA, a data set that contains questions about CNN articles (Trischler et al., 2017). Machine reading comprehension is usually evaluated by posing questions about a text passage and then assessing the answers of a system (Trischler et al., 2017). The data sets that are used for this task are often in the form of (document,question,answer) triples, where answer is a short span of the document. Answer selection is an important task, where the goal is to select the sentence(s) that contain the answer. We show improvements by adding knowledge from our learned entailments without changing the graphs or tuning them to this task in any way.
Table 3: Examples where ISFEnt ranks the correct sentence higher than ISF.
Text: The board hailed Romney for his solid credentials.
Question: Who praised Mitt Romney's credentials?
Text: Researchers announced this week that they've found a new gene, ALS6, which is responsible for . . .
Question: Which gene did the ALS association discover?
Text: One out of every 17 children under 3 years old in America has a food allergy, and some will outgrow their sensitivities.
Text: The report said opium has accounted for more than half of Afghanistan's gross domestic product in 2007.
Question: What makes up half of Afghanistans GDP?
Inverse sentence frequency (ISF) is a strong baseline for answer selection (Trischler et al., 2017). The ISF score between a sentence S_i and a question Q is defined as $\mathrm{ISF}(S_i, Q) = \sum_{w \in S_i \cap Q} \mathrm{IDF}(w)$, where IDF(w) is the inverse document frequency of the word w, computed by treating each sentence in the whole corpus as one document. The state-of-the-art methods for answer selection use ISF, and by itself it already does quite well (Trischler et al., 2017; Narayan et al., 2018). We propose to extend the ISF score with entailment rules. We define a new score, ISFEnt, in which α ∈ [0, 1] is a hyper-parameter and r_1 and r_2 denote relations in the sentence and the question, respectively. The intuition is that if a sentence such as "Luka Modric sustained a fracture to his right fibula" is a paraphrase of or entails the answer of a question such as "What does Luka Modric suffer from?", it will contain the answer span. We consider an entailment decision to hold between two typed predicates if their global similarity BInc_CG_PR is higher than a threshold δ.
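As an illustration, the sketch below computes the ISF score as defined above and then combines it with a count of entailed relation pairs. The IDF variant (log of N over sentence frequency) and, in particular, the interpolation α·ISF + (1 − α)·(number of entailed pairs) are assumptions made for this example; the exact form of the extended score is not reproduced here. All function names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def idf_table(sentences: List[List[str]]) -> Dict[str, float]:
    """IDF of each word, treating every sentence as one document.

    Assumes IDF(w) = log(N / df(w)); other smoothing variants exist.
    """
    n = len(sentences)
    df: Dict[str, int] = {}
    for sentence in sentences:
        for word in set(sentence):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in df.items()}


def isf(sentence: List[str], question: List[str], idf: Dict[str, float]) -> float:
    """Sum of IDF over the words shared by the sentence and the question."""
    return sum(idf.get(w, 0.0) for w in set(sentence) & set(question))


def isf_ent(
    sentence: List[str],
    question: List[str],
    sentence_relations: List[str],
    question_relations: List[str],
    entailments: Set[Tuple[str, str]],
    idf: Dict[str, float],
    alpha: float,
) -> float:
    """ASSUMED combination: interpolate ISF with the number of pairs
    (r1, r2) such that a sentence relation r1 entails a question relation r2."""
    n_entailed = sum(
        1
        for r1 in sentence_relations
        for r2 in question_relations
        if (r1, r2) in entailments
    )
    return alpha * isf(sentence, question, idf) + (1 - alpha) * n_entailed


# Hypothetical toy corpus of two tokenized sentences.
corpus = [["verizon", "bought", "yahoo"], ["verizon", "sells", "phones"]]
idf = idf_table(corpus)
question = ["does", "verizon", "own", "yahoo"]
base = isf(corpus[0], question, idf)
extended = isf_ent(
    corpus[0], question, ["buy"], ["own"], {("buy", "own")}, idf, alpha=0.5
)
```

In this sketch, `entailments` would hold the typed predicate pairs whose global similarity exceeds the threshold δ.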
We also considered entailments between unary relations (one argument) by leveraging our learned binary entailments. We split each binary entailment into two potential unary entailments. For example, the entailment visit_{1,2}(:person,:location) → arrive_{1,in}(:person,:location) is split into visit_1(:person) → arrive_1(:person) and visit_2(:location) → arrive_in(:location). We computed unary similarity scores by averaging over all related binary scores. This is particularly helpful when one argument is not present (e.g., adjuncts or Wh questions) or does not exactly match between the question and the answer.
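The splitting and averaging described above can be sketched as follows. The tuple layout for predicates and the assumption that the two predicates' arguments are aligned in the same order are choices made for this illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (predicate lemma, (slot of first argument, slot of second argument))
Binary = Tuple[str, Tuple[str, str]]
# (predicate lemma, slot, argument type)
Unary = Tuple[str, str, str]


def unary_scores(
    binary_entailments: List[Tuple[Binary, Binary, Tuple[str, str], float]],
) -> Dict[Tuple[Unary, Unary], float]:
    """Split each binary entailment into two unary ones and average the
    binary scores that map to the same unary entailment.

    Assumes the k-th argument of the premise aligns with the k-th
    argument of the hypothesis.
    """
    sums: Dict[Tuple[Unary, Unary], float] = defaultdict(float)
    counts: Dict[Tuple[Unary, Unary], int] = defaultdict(int)
    for (p, p_slots), (q, q_slots), types, score in binary_entailments:
        for k in range(2):
            key = ((p, p_slots[k], types[k]), (q, q_slots[k], types[k]))
            sums[key] += score
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical scores for two binary entailments that share a premise.
example = unary_scores(
    [
        (("visit", ("1", "2")), ("arrive", ("1", "in")), ("person", "location"), 0.8),
        (("visit", ("1", "2")), ("arrive", ("1", "at")), ("person", "location"), 0.4),
    ]
)
```

Here visit_1(:person) → arrive_1(:person) receives the average of both binary scores, while each location-side unary entailment keeps the score of the single binary entailment it came from.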
We test the proposed answer selection score on NewsQA, a data set that contains questions about CNN articles (Trischler et al., 2017). The data set is collected in a way that encourages lexical and syntactic divergence between questions and documents. The crowdworkers who wrote questions saw only a news article headline and its summary points, but not the full article. This process encourages curiosity about the contents of the full article and prevents questions that are simple reformulations of article sentences (Trischler et al., 2017). This is a more realistic and suitable setting to test paraphrasing and entailment capabilities.
Figure 5 (update rules of the learning algorithm, Equations 6 to 9):
$$ w_{ij} = \mathbb{1}(c_{ij} > \lambda_1)\,(c_{ij} - \lambda_1)/\tau_{ij} \tag{6} $$
$$ c_{ij} = w^0_{ij} + \sum_{(i',j') \in N(i,j)} \beta(\cdot)\, w_{i'j'} - \mathbb{1}(w_{ij} > \epsilon)\, I_\epsilon(w_{ji}) \sum_{k \in V(\tau_1(i),\tau_2(i))} \left[ (w_{ik} - w_{jk})^2 + (w_{ki} - w_{kj})^2 \right] + 2 \sum_{k \in V(\tau_1(i),\tau_2(i))} \left[ I_\epsilon(w_{jk})\, I_\epsilon(w_{kj})\, w_{ik} + I_\epsilon(w_{ik})\, I_\epsilon(w_{ki})\, w_{kj} \right] \tag{7} $$
$$ \tau_{ij} = 1 + \sum_{(i',j') \in N(i,j)} \beta(\cdot) + 2 \sum_{k \in V(\tau_1(i),\tau_2(i))} \left[ I_\epsilon(w_{jk})\, I_\epsilon(w_{kj}) + I_\epsilon(w_{ik})\, I_\epsilon(w_{ki}) \right] \tag{8} $$
$$ \beta(\cdot) = I_0\!\left( 1 - \sum_{j \in V(\tau_1(i),\tau_2(i))} \sum_{(i',j') \in N(i,j)} (w_{ij} - w_{i'j'})^2 / \lambda_2 \right) \tag{9} $$
We use the development set of the data set (5,165 samples) to tune α and δ and report results on the test set (5,124 examples) in Table 4. We observe about 1.4% improvement in accuracy (ACC) and 1% improvement in mean reciprocal rank (MRR) and mean average precision (MAP), confirming that entailment rules are helpful for answer selection.17 Table 3 shows some of the examples where ISFEnt ranks the correct sentences higher than ISF. These examples are very challenging for methods that do not have entailment and paraphrasing knowledge, and illustrate the semantic interpretability of the entailment graphs. We also performed a similar evaluation on the Stanford Natural Language Inference data set (SNLI; Bowman et al., 2015) and obtained 1% improvement over a basic neural network architecture that models sentences with an n-layered LSTM (Conneau et al., 2017). However, we did not obtain improvements over the state-of-the-art results, because only a few of the SNLI examples require external knowledge of predicate entailments. Most examples require reasoning capabilities such as A ∧ B → B and simple lexical entailments such as boy → person, which are often present in the training set.
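For reference, the three ranking metrics reported for answer selection (ACC, MRR, MAP) can be computed as in the sketch below, using their standard definitions. The input layout (one list of relevance flags per question, ordered by the system's ranking) is an assumption for illustration, and tie-breaking between equal scores is not modeled.

```python
from typing import Dict, List


def reciprocal_rank(ranked_relevance: List[bool]) -> float:
    """1 / rank of the first correct sentence, or 0 if none is correct."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance: List[bool]) -> float:
    """Mean of precision@k over the ranks k that hold a correct sentence."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(rankings: List[List[bool]]) -> Dict[str, float]:
    """Accuracy (top-ranked sentence is correct), MRR, and MAP."""
    n = len(rankings)
    return {
        "ACC": sum(1 for r in rankings if r and r[0]) / n,
        "MRR": sum(reciprocal_rank(r) for r in rankings) / n,
        "MAP": sum(average_precision(r) for r in rankings) / n,
    }


# Hypothetical rankings for two questions.
metrics = evaluate([[True, False], [False, True, True]])
```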

Conclusions and Future Work
We have introduced a scalable framework to learn typed entailment graphs directly from text. We use global soft constraints to learn globally consistent entailment scores for entailment relations. Our experiments show that generalizing in this way across different but related typed entailment graphs significantly improves performance over local similarity scores on two standard text-entailment data sets. We show around 100% increase in AUC on Levy/Holt's data set and 30% on Berant's data set. The method also outperforms PPDB and the prior state-of-the-art entailment graph-building approach due to Berant et al. (2011). Paraphrase resolution further improves the results. We have in addition shown the utility of entailment rules on answer selection for machine reading comprehension.
17 The accuracy results of Narayan et al. (2018) are not consistent with their own MRR and MAP (ACC>MRR in some cases), as they break ties between ISF scores differently when computing ACC compared to MRR and MAP. See also http://homepages.inf.ed.ac.uk/scohen/acl18external-errata.pdf.
In the future, we plan to show that the global soft constraints developed in this paper can be extended to other structural properties of entailment graphs such as transitivity. Future work might also look at entailment relation learning and link prediction tasks jointly. The entailment graphs can be used to improve relation extraction, similar to Eichler et al. (2017), but covering more relations. In addition, we intend to collapse cliques in the entailment graphs to paraphrase clusters with a single relation identifier, to replace the form-dependent lexical semantics of the CCG parser with these form-independent relations (Lewis and Steedman, 2013a), and to use the entailment graphs to derive meaning postulates for use in tasks such as question-answering and construction of knowledge-graphs from text (Lewis and Steedman, 2014).
Figure 5 shows the update rules of the learning algorithm. The global similarity scores w_ij are updated using Equation (6), where c_ij and τ_ij are defined in Equation (7) and Equation (8), respectively. 1(x) equals 1 if the condition x is satisfied, and zero otherwise. The compatibility functions β(·) are updated using Equation (9).
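A minimal sketch of the update in Equation (6) is given below. To keep it short, the helper that builds c_ij and τ_ij keeps only the local score and the cross-graph terms of Equations (7) and (8); the paraphrase-resolution terms are deliberately omitted, so this is a simplification rather than the full update. Function and parameter names are hypothetical.

```python
from typing import List, Tuple


def update_weight(c_ij: float, tau_ij: float, lambda_1: float) -> float:
    """Equation (6): shrink c_ij by lambda_1, zero it out if it does not
    exceed lambda_1, and normalize by tau_ij."""
    if c_ij > lambda_1:
        return (c_ij - lambda_1) / tau_ij
    return 0.0


def c_and_tau_cross_graph_only(
    w0_ij: float,
    neighbor_weights: List[float],
    betas: List[float],
) -> Tuple[float, float]:
    """SIMPLIFIED Equations (7) and (8): local score plus beta-weighted
    scores of the same predicate pair in neighboring typed graphs.
    The paraphrase-resolution terms are left out of this sketch."""
    c_ij = w0_ij + sum(b * w for b, w in zip(betas, neighbor_weights))
    tau_ij = 1.0 + sum(betas)
    return c_ij, tau_ij


# Hypothetical values: one compatible neighbor graph (beta = 1) and one
# incompatible neighbor graph (beta = 0).
c, tau = c_and_tau_cross_graph_only(0.5, [0.7, 0.3], [1.0, 0.0])
w_new = update_weight(c, tau, lambda_1=0.2)
```

With these toy values the new score is roughly the average of the local score and the compatible neighbor's score, reduced by the sparsity penalty λ_1.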