Abstract
Entailment rules between predicates are fundamental to many semantic-inference applications. Consequently, learning such rules has been an active field of research in recent years. Methods for learning entailment rules between predicates that take into account dependencies between different rules (e.g., entailment is a transitive relation) have been shown to improve rule quality, but suffer from scalability issues; that is, the number of predicates handled is often quite small. In this article, we present methods for learning transitive graphs that contain tens of thousands of nodes, where nodes represent predicates and edges correspond to entailment rules (termed entailment graphs). Our methods are able to scale to a large number of predicates by exploiting structural properties of entailment graphs, such as the fact that they exhibit a “tree-like” property. We apply our methods to two data sets and demonstrate that they find high-quality solutions faster than previously proposed methods; moreover, they scale for the first time to large graphs containing 20,000 nodes and more than 100,000 edges.
1. Introduction
Performing textual inference is at the heart of many semantic inference applications, such as Question Answering (QA) and Information Extraction (IE). A prominent generic paradigm for textual inference is Textual Entailment (Dagan et al. 2013). In textual entailment, the goal is to recognize, given two text fragments termed text and hypothesis, whether the hypothesis can be inferred from the text. For example, the text Cyprus was invaded by the Ottoman Empire in 1571 implies the hypothesis The Ottomans attacked Cyprus.
Semantic inference applications such as QA and IE crucially rely on entailment rules (Ravichandran and Hovy 2002; Shinyama and Sekine 2006; Berant and Liang 2014) or equivalently, inference rules—that is, rules that describe a directional inference relation between two fragments of text. An important type of entailment rule specifies the entailment relation between natural language predicates, for example, the entailment rule X invade Y ⇒ X attack Y can be helpful in inferring the aforementioned hypothesis. Consequently, substantial effort has been made to learn such rules (Lin and Pantel 2001; Sekine 2005; Szpektor and Dagan 2008; Schoenmackers et al. 2010; Melamud et al. 2013).
Textual entailment is inherently a transitive relation, that is, the rules x ⇒ y and y ⇒ z imply the rule x ⇒ z. For example, from the rules X reduce nausea ⇒ X help with nausea and X help with nausea ⇒ X associated with nausea we can infer the rule X reduce nausea ⇒ X associated with nausea (Figure 1). Accordingly, Berant, Dagan, and Goldberger (2012) proposed taking advantage of transitivity to improve learning of entailment rules. They modeled learning entailment rules as a graph optimization problem, where nodes are predicates and edges represent entailment rules that respect transitivity. To solve this optimization problem, they formulated it as an Integer Linear Program (ILP) and used an off-the-shelf ILP solver to find an exact solution. Indeed, they showed that applying global transitivity constraints results in more accurate graphs (known as entailment graphs) than methods that ignore the property of transitivity.
A fragment of an entailment graph about the concept nausea from the data set used by Berant, Dagan, and Goldberger. Edges that can be inferred by transitivity are omitted for clarity.
Although using an off-the-shelf ILP solver is straightforward, finding the optimal set of edges respecting transitivity is NP-hard, and practically, transitivity constraints impose substantial restrictions on the scalability of the methods. In fact, in some cases finding the optimal set of edges for entailment graphs with even ~50 nodes was quite slow.
In this article, we develop algorithms for learning entailment graphs that take advantage of structural properties such as transitivity, but substantially reduce the computational cost of inference, thus allowing us to solve the aforementioned optimization problem on very large graphs.
Our method contains two main steps. The first step is based on a sparsity assumption that even if an entailment graph contains many predicates, most of them do not entail one another, and thus we can decompose the graph into smaller components that can be solved more efficiently. For example, the predicates X parent of Y, X child of Y and X relative of Y are independent from the predicates X works with Y, X boss of Y, and X manages Y, and thus we can consider each one of these two sets separately. We prove that finding the optimal solution for each of the smaller components results in a global optimal solution for our optimization problem.
The second step proposes a polynomial heuristic approximation algorithm for finding a transitive set of edges in each one of the smaller components. It is based on a novel modeling assumption that entailment graphs exhibit a “tree-like” property, which we term forest reducible. For example, the graph in Figure 1 is not a tree, because the predicates X related to nausea and X associated with nausea form a cycle. However, these two predicates are synonymous, and if they were merged into a single node, then the graph would become a tree. We propose a simple iterative approximation algorithm, where in each iteration a single node is deleted from the graph and then inserted back in a way that improves the objective function value. We prove that if we impose a constraint that entailment graphs must be forest reducible, then each iteration can be performed in linear time. This results in an algorithm that can scale to entailment graphs containing tens of thousands of nodes.
We apply our algorithm on two data sets. The first data set includes medium-sized entailment graphs where predicates are typed, that is, the arguments are restricted to belong to a particular semantic type (for instance, Xperson parent of Yperson). We show that using our algorithm, we can substantially improve runtime, suffering from only a slight reduction in the quality of learned entailment graphs. The second data set includes a much larger graph containing 20,000 untyped nodes, where applying state-of-the-art methods that use an ILP solver is completely impossible. We run our algorithm on this data set and demonstrate that we can learn knowledge bases with more than 100,000 entailment rules at a higher precision compared with local learning methods.
The article is organized as follows. In Section 2 we survey prior work on the learning of entailment rules. We first focus on local methods (Section 2.1), that is, methods that handle each pair of predicates independently of other predicates, and then describe global methods (Section 2.2), that is, methods that take into account multiple predicates simultaneously. Section 3 is the core of the article and describes our main algorithmic contributions. After formalizing entailment rule learning as a graph optimization problem, we present the two steps of our algorithm. Section 3.1 describes the first step, in which a large entailment graph is decomposed into smaller components. Section 3.2 describes the second step, in which we develop an efficient heuristic approximation based on the assumption that entailment graphs are forest reducible. Section 4 describes experiments on the first data set containing medium-sized entailment graphs with typed predicates. Section 5 presents an empirical evaluation on a large graph containing 20,000 untyped predicates. We also perform a qualitative analysis in Section 5.4 to further elucidate the behavior of our algorithm. Section 6 concludes the article.
This article is based on previous work (Berant, Dagan, and Goldberger 2011; Berant et al. 2012), but expands on it in multiple directions. Empirically, we present results on a novel data set (Section 5) that is orders of magnitude larger than those used in the past. Algorithmically, we present the Tree-Node-And-Component-Fix algorithm, which is an extension of the Tree-Node-Fix algorithm presented in Berant et al. (2012) and achieves the best results in our experimental evaluation. Theoretically, we provide an NP-hardness proof for the Max-Trans-Forest optimization problem presented in Berant et al. (2012) and an ILP formulation for it. Last, we perform in Section 5.4 an extensive qualitative analysis of the graphs learned by our local and global algorithms, from which we draw conclusions for future research directions.
2. Background
In this section we describe prior work relevant for entailment rule learning. First, we describe local methods that estimate entailment for a pair of predicates, focusing on methods that we employ in this article. Then, we describe global methods that perform inference over a larger set of predicates. Specifically, we provide details on the method and optimization problem developed by Berant, Dagan, and Goldberger (2012) for which we propose scalable inference algorithms in this article. Last, we survey some other related work in NLP that uses global inference.
2.1 Local Learning
In local learning, given a pair of predicates (i, j) we would like to determine whether i ⇒ j. The main sources of information utilized in the past for local learning were (1) lexicographic resources, (2) co-occurrence, and (3) distributional similarity. We briefly describe the first two and then expand more on distributional similarity, which is the most commonly used source of information.
Lexicographic resources are manually built knowledge bases from which semantic information may be extracted. For example, the hyponymy, troponymy, and synonymy relations in WordNet (Fellbaum 1998) can be used to detect entailment between nouns and verbs. Although WordNet is the most popular lexicographic resource, other resources such as CatVar, Nomlex, and FrameNet have also been utilized to extract inference rules (Meyers et al. 2004; Budanitsky and Hirst 2006; Szpektor and Dagan 2009; Coyne and Rambow 2009; Ben Aharon, Szpektor, and Dagan 2010).
Pattern-based methods attempt to identify the semantic relation between a pair of predicates by examining their co-occurrence in a large corpus. For example, the sentence people snore while they sleep provides evidence that snore ⇒ sleep. While most pattern-based methods focused on identifying semantic relations between nouns (for example, Hearst patterns [Hearst 1992]), several works (Pekar 2008; Chambers and Jurafsky 2011; Weisman et al. 2012) attempted to extract relations between predicates as well. Chklovski and Pantel (2004) used pattern-based methods to generate the commonly used VerbOcean resource.
Both lexicographic and pattern-based methods suffer from limited coverage. Distributional similarity is therefore used to automatically construct broad-scale resources. Distributional similarity methods are based on the “distributional hypothesis” (Harris 1954), according to which semantically similar predicates occur with similar arguments. Quite a few methods have been suggested (Lin and Pantel 2001; Szpektor et al. 2004; Bhagat, Pantel, and Hovy 2007; Szpektor and Dagan 2008; Yates and Etzioni 2009; Schoenmackers et al. 2010), which differ in the way predicates are represented, the features that are extracted, and the function used to compute feature vector similarity. Next, we elaborate on the methods that we use in this article.

The measure is computed analogously for the variable Y, and the final distributional similarity score, termed DIRT, is the geometric average of the scores for the two variables:

DIRT(u, v) = √(simX(u, v) · simY(u, v))

If DIRT(u, v) is high, this means that the templates u and v share many “informative” arguments and so it is possible that u ⇒ v.
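To make the computation concrete, here is a small sketch of a DIRT-style score built from per-slot Lin similarities. The data layout (a feature-weight dict per argument slot) and the toy weights are our own illustration; in Lin and Pantel's implementation the weights are pointwise mutual information values computed from a large corpus.

```python
import math

def lin_similarity(u_feats, v_feats):
    """Lin similarity over two sparse feature-weight maps: the total
    weight of shared features, normalized by the total weight of all
    features in both maps."""
    shared = set(u_feats) & set(v_feats)
    numer = sum(u_feats[f] + v_feats[f] for f in shared)
    denom = sum(u_feats.values()) + sum(v_feats.values())
    return numer / denom if denom > 0 else 0.0

def dirt(u_slots, v_slots):
    """DIRT(u, v): geometric average of the per-slot Lin similarities.
    u_slots and v_slots map each slot ('X', 'Y') to a feature-weight
    dict of the arguments observed in that slot."""
    sim_x = lin_similarity(u_slots['X'], v_slots['X'])
    sim_y = lin_similarity(u_slots['Y'], v_slots['Y'])
    return math.sqrt(sim_x * sim_y)

# Toy example: templates "X invade Y" and "X attack Y".
invade = {'X': {'germany': 2.1, 'rome': 1.3}, 'Y': {'poland': 2.5}}
attack = {'X': {'germany': 1.8, 'england': 0.9}, 'Y': {'poland': 1.7}}
print(dirt(invade, attack))
```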
Both Lin and Pantel as well as Szpektor and Dagan compute a similarity score using a single argument. However, it is clear that although this alleviates sparsity problems, considering pairs of arguments jointly provides more information. Yates and Etzioni (2009), Schoenmackers et al. (2010), and even earlier, Szpektor et al. (2004), presented methods that compute semantic similarity based on pairs of arguments.
A problem common to all local methods presented above is predicate ambiguity—predicates may have different meanings and different entailment relations in different contexts. Some works resolved the problem of ambiguity by representing predicates with argument variables that are typed (Pantel et al. 2007; Schoenmackers et al. 2010). For example, argument variables in the work of Schoenmackers et al. were restricted to belong to one of 156 types, such as country or profession. A different solution that has attracted substantial attention recently is to represent the various contexts in which a predicate can appear in a low-dimensional latent space (for example, using Latent Dirichlet Allocation [Blei, Ng, and Jordan 2003]) and infer entailment relations between predicates based on the contexts in which they appear (Ritter, Mausam, and Etzioni 2010; Ó Séaghdha 2010; Dinu and Lapata 2010; Melamud et al. 2013). In the experiments presented in this article we will use the representation of Schoenmackers et al. in one experiment, and ignore the problem of predicate ambiguity in the other.
2.2 Global Learning
The idea of global learning is that, by jointly learning semantic relations between a large number of natural language phrases, one can use the dependencies between the relations to improve accuracy. A natural way to model that is with a graph where nodes are phrases and edges represent semantic similarity. Snow, Jurafsky, and Ng (2006) presented one of the early examples of global learning in the context of learning noun taxonomies. In their work, they enforced a transitivity constraint over the taxonomy using a greedy inference procedure and demonstrated that this improves the quality of the taxonomy. Transitivity constraints were also enforced by Yates and Etzioni (2009), who proposed a clustering algorithm for learning undirected synonymy relations. Nakashole, Weikum, and Suchanek (2012) learned a taxonomy of binary predicates and also enforced transitivity with a greedy algorithm.
Berant, Dagan, and Goldberger (2012) developed a global method for learning entailment relations between predicates. In this article we present scalable approximation algorithms for the optimization problem they propose, and so we provide further detail on their method. The input to their algorithm is a large corpus and a lexicographic resource, such as WordNet, and the output is a set of entailment rules that respect the constraint of transitivity. The algorithm is composed of two main steps. In the first step, a set of predicates is extracted from the corpus and a local entailment classifier is trained based on examples automatically generated from the lexicographic resource. At this point, we can derive for each pair of predicates (i, j) a score wij ∈ ℝ that estimates whether i ⇒ j.
In the second step of their algorithm, they construct a graph, where predicates are nodes and edges represent entailment rules. Using the local scores they look for the set of edges E that maximizes the objective function ∑(i,j)∈Ewij under the constraint that edges respect transitivity. They show that this optimization problem is NP-hard and find an exact solution using an ILP solver over small graphs. They also avoid problems of predicate ambiguity by partially contextualizing the predicates. In this article, we present efficient and scalable heuristic approximation algorithms for the optimization problem they propose.
In recent years, there has been substantial work on approximation algorithms for global inference problems. Do and Roth (2010) suggested a method for the related task of learning taxonomic relations between terms. Given a pair of terms, they construct a small graph that contains the two terms and a few other related terms, and then impose constraints on the graph structure. They construct these small graphs because their work is geared towards scenarios where relations are determined on-the-fly for a given pair of terms and no global knowledge base is ever explicitly constructed. Because they independently construct a graph for each pair of terms, their method easily produces solutions where global constraints, such as transitivity, are violated.
Another approximation method that violates transitivity constraints is LP relaxation (Martins, Smith, and Xing 2009). In LP relaxation, binary variables are replaced by continuous variables, transforming the problem from an ILP to a Linear Program (LP), which is polynomial. An LP solver is then applied, and variables that are assigned a fractional value are rounded to their nearest integer, so many violations of transitivity may occur. The solution when applying LP relaxation is not a transitive graph; we will show in Section 4 that our approximation method is substantially faster.
Global inference has gained popularity in recent years in NLP, and a common approximation method that has been extensively utilized is dual decomposition (Sontag, Globerson, and Jaakkola 2011). Dual decomposition has been successfully applied in tasks such as Information Extraction (Reichart and Barzilay 2012), Machine Translation (Chang and Collins 2011), Parsing (Rush et al. 2012), and Named Entity Recognition (Wang, Che, and Manning 2013). To the best of our knowledge, it has not yet been applied for the task of learning entailment relations. The graph decomposition method we present in Section 3.1 can be viewed as an ideal case of dual decomposition, where we can decompose the problem into disjoint components in a way that we do not need to ensure consistency of the results obtained on each component separately.
3. Efficient Inference
Our goal is to learn a large knowledge base of entailment rules between natural language predicates. Following Berant, Dagan, and Goldberger (2012), we formulate this task as a graph-learning problem, where, given the nodes of an entailment graph, we would like to find the best set of edges that respect a global transitivity constraint.
Our main modeling assumption is that textual entailment is a transitive relation. Berant, Dagan, and Goldberger (2012) have demonstrated that this modeling assumption holds for focused entailment graphs, that is, graphs in which predicates are disambiguated by one of their arguments. For example, a graph that focuses on the concept nausea might contain an entailment rule between predicates such as X prevent nausea ⇒ X affect nausea, where the predicate X prevent Y is disambiguated by the instantiation of the argument nausea. In this work, we examine this modeling assumption for more general graphs. In Section 4 we show that transitivity holds in typed entailment graphs, that is, graphs where each predicate specifies the semantic type of its arguments (for example, Xdrug prevent Ysymptom). In Section 5.4, we examine the assumption of transitivity when predicates do not carry any typing information (for example, X prevent Y); we observe that in this setting the assumption of transitivity is often violated because of predicate ambiguity. Nevertheless, we show that even in this setup we can empirically take advantage of the transitivity assumption.
Let V be a set of predicates, which are the nodes of the entailment graph, and let w : V × V → ℝ be an entailment weighting function. Given a pair of predicates (i, j), a positive wij indicates a local tendency to decide that i entails j, whereas a negative wij indicates a local tendency to decide that i does not entail j. We want to find a global transitive entailment graph that is most consistent with the local cues. Formally, our goal is to find the directed graph that maximizes the sum of edge weights ∑(i,j)∈E wij, under the constraint that the graph is transitive—that is, for every triple of nodes (i, j, k), if (i, j) ∈ E and (j, k) ∈ E, then (i, k) ∈ E.

This problem, termed Max-Trans-Graph, can be expressed as an ILP over binary variables xij, where xij = 1 if and only if the edge (i, j) is in the graph:

maximize ∑i≠j wij · xij
subject to xij + xjk − xik ≤ 1 for all distinct i, j, k
xij ∈ {0, 1} for all i ≠ j   (3)
Berant, Dagan, and Goldberger solved this optimization problem with an ILP solver, but because ILP is NP-hard, this does not scale well; the number of variables is O(|V|²) and the number of constraints is O(|V|³). Thus, even a graph with 80 nodes (predicates) has more than half a million constraints. In this section, we describe a heuristic algorithm that empirically provides high-quality solutions for this optimization problem in graphs with tens of thousands of nodes.
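For intuition before turning to the algorithm, problem (3) can be solved by brute force on toy graphs. The sketch below (our own, exponential in the number of node pairs, and so an illustration only) enumerates all edge subsets and keeps the best transitive one; the toy weights loosely mirror the nausea example of Figure 1.

```python
from itertools import combinations, permutations

def is_transitive(edges):
    """Check the transitivity constraint of problem (3)."""
    return all((i, k) in edges
               for (i, j) in edges for (j2, k) in edges
               if j == j2 and i != k)

def max_trans_graph_bruteforce(nodes, w):
    """Exhaustively find the transitive edge set maximizing the total
    weight. Exponential -- usable only on toy graphs."""
    candidates = list(permutations(nodes, 2))
    best_edges, best_score = set(), 0.0
    for r in range(len(candidates) + 1):
        for subset in combinations(candidates, r):
            edges = set(subset)
            if is_transitive(edges):
                score = sum(w[e] for e in edges)
                if score > best_score:
                    best_edges, best_score = edges, score
    return best_edges

# Toy graph: "reduce" => "help with" => "associated with".
w = {('reduce', 'help'): 1.0, ('help', 'assoc'): 0.8,
     ('reduce', 'assoc'): -0.2, ('help', 'reduce'): -1.0,
     ('assoc', 'help'): -1.0, ('assoc', 'reduce'): -1.0}
print(max_trans_graph_bruteforce(['reduce', 'help', 'assoc'], w))
```

Note that the optimal solution includes the edge (reduce, assoc) despite its negative weight, because transitivity forces it once the two positive edges are selected.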
Our algorithm contains two main steps. The first step (Section 3.1) is based on a structural assumption that entailment graphs are relatively sparse—that is, most predicates do not entail one another. This allows us to decompose the graph into smaller components in a way that guarantees that an exact solution for each one of the components results in an exact solution for Max-Trans-Graph. However, often even after decomposition, components are too large and finding an exact solution is still intractable.
The second step (Section 3.2) proposes a heuristic algorithm for finding a good solution for each one of the components. This step is based on an observation that entailment graphs exhibit a “tree-like” property and are very similar to a novel type of graph, which we term forest-reducible graph. We utilize this property to develop an iterative efficient approximation algorithm for learning the graph edges, where each iteration takes linear time. We also prove that finding the optimal forest-reducible graph is NP-hard.
In Sections 4 and 5 we apply our algorithm on two different data sets and show that using transitivity substantially improves performance and that our methods dramatically improve scalability, allowing us to increase the scope of global learning of entailment graphs.
3.1 Entailment Graph Decomposition
The first step of our algorithm takes advantage of graph sparsity: Most predicates in language do not entail one another. Thus, it might be possible to decompose entailment graphs into small components and solve each component separately.
Let V1,V2 be a partitioning of the nodes V into two disjoint non-empty subsets. We term any directed edge, whose two nodes belong to different subsets, a crossing edge.
Proposition 1
If we can partition a set of nodes V into disjoint sets V1,V2 such that no crossing edge has a positive weight, then the optimal set of edges that is the solution of the ILP problem (3) does not contain any crossing edge.
Proof. Assume by contradiction that the edge set of the optimal solution Eopt contains a non-empty set of crossing edges Ecross. We can construct Enew = Eopt \ Ecross. Clearly ∑(i,j)∈Enew wij ≥ ∑(i,j)∈Eopt wij, as wij ≤ 0 for any crossing edge.
Next, we show that Enew does not violate transitivity constraints. Assume it does; then the violation must be caused by omitting the edges in Ecross. Thus, there must be, without loss of generality, nodes i ∈ V1 and k ∈ V2 such that for some node j, (i, j) and (j, k) are in Enew but (i, k) is not. However, this means that either (i, j) or (j, k) is a crossing edge, which is impossible because we omitted all crossing edges. Thus, Enew is a transitive solution at least as good as Eopt, contradicting the optimality assumption. Hence, Ecross is empty.
This proposition suggests a simple algorithm (see Algorithm 1): Construct an undirected graph with the node set V and with an edge connecting i and j if either wij > 0 or wji > 0, then find its connected components, and finally solve each component separately. If an ILP solver is used to find the edges of each component separately, then we obtain an optimal (not approximate) solution to the optimization problem (3) for the whole graph. Finding the undirected edges (Line 1) and computing connected components (Line 2) can be performed in O(|V|²). Thus, in this case the efficiency of the algorithm is dominated by the application of an ILP solver (Line 4).
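A minimal sketch of this decomposition (our own rendering of Algorithm 1, assuming w is a dict from ordered node pairs to weights, with missing pairs treated as non-positive):

```python
def decompose(nodes, w):
    """Split the graph into connected components of the undirected
    graph that links i and j whenever w[(i, j)] > 0 or w[(j, i)] > 0."""
    # Build the undirected adjacency (Line 1 of Algorithm 1).
    adj = {v: set() for v in nodes}
    for i in nodes:
        for j in nodes:
            if i != j and (w.get((i, j), 0) > 0 or w.get((j, i), 0) > 0):
                adj[i].add(j)
                adj[j].add(i)
    # Find connected components with an iterative DFS (Line 2).
    components, seen = [], set()
    for v in nodes:
        if v in seen:
            continue
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    return components  # each component is then solved separately
```

By Proposition 1, solving each returned component exactly and taking the union of the edge sets yields an exact solution for the whole graph.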
If the entailment graph decomposes into small components, one could obtain an exact solution with an ILP solver, applied on each component separately, without resorting to any approximation. To further extend scalability in this setting we use a cutting-plane method (Kelley 1960). Cutting-plane methods have been used in the past, for example, in dependency parsing (Riedel and Clarke 2006). The idea is that even if we omit all transitivity constraints, we still expect most transitivity constraints to be satisfied, given a good weighting function w. Thus, it makes sense to avoid specifying the constraints ahead of time, but rather add them when they are violated. This is formalized in Algorithm 2.
Line 1 initializes an active set of constraints (ACT). Line 3 applies the ILP solver with the active constraints. Lines 4 and 5 find the violated constraints and add them to the active constraints. The algorithm halts when no constraints are violated. The solution is clearly optimal because we obtain a maximal solution for a less-constrained problem that does not violate any transitivity constraint.
A pre-condition for using cutting-plane methods is that computing the violated constraints (Line 4) is efficient, as it occurs in every iteration. We do that in a straightforward manner: For all edges (i, j) and (j, k) that are in the current solution E*, if (i, k) ∉ E* we add xij + xjk − xik ≤ 1 to the violated constraints. This is cubic in the worst case but runs very fast in practice.
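This check can be sketched as follows (our own rendering), assuming the current solution E* is represented as a set of directed edges; each returned triple (i, j, k) corresponds to a constraint xij + xjk − xik ≤ 1 to be added to the active set:

```python
def violated_constraints(edges):
    """Find all transitivity violations in the current solution:
    (i, j) and (j, k) are present, but (i, k) is absent."""
    out = {}
    for i, j in edges:
        out.setdefault(i, set()).add(j)
    violations = []
    for i, j in edges:
        for k in out.get(j, ()):
            if i != k and k not in out.get(i, ()):
                violations.append((i, j, k))
    return violations
```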
To conclude, an exact algorithm for Max-Trans-Graph decomposes the graph into components and uses Algorithm 2 to find an exact solution for each component. However, because the problem is NP-hard, this will still fail once components become large. Next, we describe the second step of our method, which replaces Algorithm 2 with an efficient approximation that can scale to much larger graphs. We will show in Section 4 that the solutions obtained by the approximation algorithm are almost as good as the exact solution on a data set where finding an exact solution is feasible.
3.2 Efficient Tree-Based Approximation
The approximation we present in this section is based on a conjecture that entailment graphs exhibit a “tree-like” property, that is, they can be reduced into a structure similar to a directed forest, which we term forest-reducible graph (FRG). Although FRGs are a more constrained class of directed graphs, we prove that restricting our optimization problem to FRGs does not make the problem fundamentally easier, that is, the problem remains NP-hard (see Appendix). Then, we present in Section 3.2.2 our iterative approximation algorithm, where in each iteration a node is removed and re-attached back to the graph in a locally optimal way. Combining this scheme with our conjecture about the graph structure yields a linear algorithm for node re-attachment.
Thus, the contribution of this section is twofold: First, we define a novel modeling assumption about the tree-like structure of entailment graphs and empirically demonstrate its validity. Second, we exploit this assumption to develop a polynomial approximation algorithm for learning entailment graphs that can scale to much larger graphs than in the past.
3.2.1 Forest-Reducible Graph
The predicate entailment relation, described by our entailment graphs, is typically from a “semantically specific” predicate to a more “general” one. Thus, intuitively, the topology of an entailment graph is expected to be “tree-like.” In this section we first formalize this intuition and then empirically analyze its validity. This property of entailment graphs is an interesting topological observation on its own, but also enables the efficient approximation algorithm of Section 3.2.2.
For a directed edge i ⇒ j in a directed acyclic graph (DAG), we term the node i a child of node j, and j a parent of i. A directed forest is a DAG where all nodes have no more than one parent. A strongly connected component in a directed graph is a set of nodes such that there is a path from every node in the set to any other node. In entailment graphs, strongly connected components correspond to semantically equivalent predicates.
The entailment graph in Figure 2a (a subgraph from the data set described in Section 4) is clearly not a directed forest—it contains a cycle of size two comprising the nodes X common in Y and X frequent in Y, and in addition the node X be epidemic in Y has three parents. However, we can convert it to a directed forest by applying the following operations. Any directed graph G can be converted into a Strongly-Connected-Component (SCC) graph in the following way: Every strongly connected component is contracted into a single node, and an edge is added from SCC S1 to SCC S2 if there is an edge in G from some node in S1 to some node in S2. The SCC graph is always a DAG (Cormen et al. 2002), and if G is transitive, then the SCC graph is also transitive. The graph in Figure 2b is the SCC graph of the one in Figure 2a, but it is still not a directed forest because the node X be epidemic in Y has two parents.
A fragment of an entailment graph (a), its SCC graph (b), and its reduced graph (c). Nodes are predicates with typed variables, which are omitted in (b) and (c) for compactness.
The transitive closure of a directed graph G is obtained by adding an edge from node i to node j if there is a path in G from i to j. The transitive reduction of G is obtained by removing all edges whose absence does not affect its transitive closure. In DAGs, the result of transitive reduction is unique (Aho, Garey, and Ullman 1972). We thus define the reduced graph Gred of a directed graph G as the transitive reduction of its SCC graph. The graph in Figure 2c is the reduced graph of the one in Figure 2a and is a directed forest. We say a graph is a forest-reducible graph if its reduced graph is a directed forest.
We now hypothesize that entailment graphs are approximately FRGs. The intuition is that the predicate on the left-hand side of an entailment rule has a more specific meaning than the one on the right-hand side. For instance, in Figure 2a, X be epidemic in Y (where X is a type of disease and Y is a country) is more specific than X common in Y and X frequent in Y, which are equivalent, while X occur in Y is even more general. Accordingly, the reduced graph in Figure 2c is an FRG. We note that this is not always the case: For example, the entailment graph in Figure 3 is not an FRG, because X annex Y entails both Y be part of X and X invade Y, while the latter two do not entail one another. However, we hypothesize that this scenario is rather uncommon. Consequently, a natural variant of the Max-Trans-Graph problem is to restrict the required output graph of the optimization problem (3) to an FRG. We term this problem Max-Trans-Forest.
To test whether our hypothesis holds empirically, we performed the following analysis. We sampled seven gold standard typed entailment graphs (that is, graphs where predicates specify the semantic type of their arguments, such as Xdrug prevents Ysymptom) from the data set described in Section 4, manually transformed them into FRGs by deleting a minimal number of edges, and measured recall over the set of edges in each graph (precision is naturally 1.0, as we only delete gold-standard edges). The lowest recall value obtained was 0.95, illustrating that deleting a very small proportion of edges converts a typed entailment graph into an FRG. Further support for the practical validity of this hypothesis is obtained from our experiments on typed entailment graphs in Section 4. In these experiments we show that exactly solving Max-Trans-Graph and Max-Trans-Forest (with an ILP solver) results in nearly identical performance. In Section 5.4 we qualitatively analyze the validity of the FRG assumption in untyped entailment graphs (where predicates are of the form X prevents Y) and find that this assumption does not always hold, because of predicate ambiguity.
3.2.2 Approximation Algorithm
In this section we present Tree-Node-Fix and Tree-Node-And-Component-Fix, which are efficient approximation algorithms for Max-Trans-Forest, as well as Graph-Node-Fix, an approximation for Max-Trans-Graph.
Tree-Node-Fix. The scheme of Tree-Node-Fix (TNF) is the following. First, an initial FRG is constructed, using some initialization procedure. Then, at each iteration a single node v is re-attached (see below) to the FRG in a way that improves the objective function. This is repeated until the value of the objective function can no longer be improved by re-attaching a node.
Re-attaching a node v is performed by removing v from the graph (deleting v and all its adjacent edges) and connecting it back with a better set of edges, while maintaining the constraint that the graph is an FRG. This is done by considering all possible edges from/to the other graph nodes and choosing the optimal subset, while the rest of the graph edges remain fixed. For example, in Figure 2, one way of re-attaching the node X common in Y is to add it as a direct child of X occur in Y, no longer a synonym of X frequent in Y. This will result in deletion of the edges X common in Y ⇒ X frequent in Y, X frequent in Y ⇒ X common in Y, and also X be epidemic in Y ⇒ X common in Y, because otherwise the resulting graph would not be an FRG. We will show that re-attachment can be performed efficiently, in linear time, using dynamic programming.





There are several cases for re-attaching v that we need to consider:

- 1.

Inserting v into an existing component c ∈ Vred (case 1, see Figure 4a).

- 2.

Inserting v as a new component, which breaks into two subcases: inserting v as a child of some component c and a parent of a subset of c's children in the reduced graph (case 2a, see Figure 4b), or inserting v as a new root (case 2b, see Figure 4c). Note that in case 2a, a node d that is a descendant but not a child of c cannot choose v as a parent, as v would become its second parent (see Figure 4b').

(a) Inserting v into a component c ∈ Vred. (b) Inserting v as a child of c and a parent of a subset of c's children in Gred. (b') A node d that is a descendant but not a child of c cannot choose v as a parent, as v becomes its second parent. (c) Inserting v as a new root.
To find the optimal way of re-attaching v, we need to compute the score of Equation (9) for each of the three cases and choose the best one. We now describe how this score is computed in each case.











Re-attachment is summarized in Algorithm 3. Computing an SCC graph is linear (Cormen et al. 2002), and it is easy to verify that transitive reduction in FRGs is also linear (Line 1). Computing Sv−in(c) and Sv−out(c) (Lines 2–3) is also linear, as explained. Cases 1 and 2b are trivially linear and in case 2a we go over the children of all nodes in Vred. As the reduced graph is a forest, this simply means going over all nodes of Vred, and so the entire algorithm is linear.
Because re-attachment is linear, re-attaching all nodes is quadratic. Thus if we bound the number of iterations over all nodes, the overall complexity is quadratic. This is dramatically more efficient and scalable than applying an ILP solver. Empirically, we found that TNF converges after 5–10 iterations.
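At a high level, the TNF loop can be sketched as follows; initial_frg, best_reattachment, and score stand in for the initialization procedure, the linear-time dynamic program of Algorithm 3, and the objective of problem (3), none of which are reproduced here (all three names are ours):

```python
def tree_node_fix(nodes, w, initial_frg, best_reattachment, score):
    """Sketch of the TNF loop. `initial_frg` builds a starting FRG,
    `best_reattachment(graph, v, w)` returns the best FRG obtained by
    deleting v and reconnecting it (Algorithm 3 in the text), and
    `score` evaluates the objective of problem (3)."""
    graph = initial_frg(nodes, w)
    improved = True
    while improved:                      # empirically 5-10 passes suffice
        improved = False
        for v in nodes:
            candidate = best_reattachment(graph, v, w)
            if score(candidate, w) > score(graph, w):
                graph = candidate
                improved = True
    return graph
```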
Tree-Node-and-Component-Fix. Assuming that entailment graphs are FRGs allows us to use the node re-attachment operation in linear time. However, this assumption also enables performing other graph operations efficiently. We now suggest a natural extension to the TNF algorithm that better explores the space of FRGs.
A disadvantage of TNF is that it re-attaches a single node in every iteration. Consider, for instance, the reduced graph in Figure 5. In this graph, nodes l, m, and n are direct children of node k, but suppose that in the optimal solution they are all children of node j. Reaching the optimal solution would require three independent re-attachment operations, and it is not clear that each of the three alone would improve the objective function value. However, if we allow re-attachment operations over components in the SCC graph, then we would be able to re-attach the strongly connected component containing the nodes l, m, and n in a single operation. Thus, the idea of our extended TNF algorithm is to allow re-attachment of both nodes and components. We term this algorithm Tree-Node-and-Component-Fix (TNCF).
An example graph where single node re-attachment is a disadvantage.
There are many ways in which this intuition can be implemented and our TNCF algorithm uses one possible variant. In TNCF we first perform node re-attachment until convergence as in TNF (after initialization), but then compute the SCC graph and perform component re-attachment until convergence. Component re-attachment is identical to node re-attachment, except that we are guaranteed that the reduced graph is a forest. After performing component re-attachment until convergence, we again perform node re-attachments and then component re-attachments and so on, until the entire process converges.
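A sketch of this alternation, under the same conventions as the TNF sketch above (the pass functions are stand-ins for node and component re-attachment, each run to convergence):

```python
def tncf(graph, w, node_pass, component_pass, score):
    """Sketch of TNCF: alternate node re-attachment passes with
    component re-attachment passes until the objective stops
    improving. Each pass runs its operation to convergence."""
    while True:
        before = score(graph, w)
        graph = node_pass(graph, w)        # re-attach single nodes (TNF)
        graph = component_pass(graph, w)   # re-attach whole SCCs
        if score(graph, w) <= before:
            return graph
```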
Three types of transitivity constraint violations, when re-attaching the node v to a graph containing the nodes i and k.
Graph-Node-Fix (GNF) applies the same re-attachment scheme without assuming that the graph is an FRG: each node re-attachment is solved exactly with an ILP whose constraints rule out the three types of transitivity violations shown in the figure above. Complexity is exponential because of the ILP solver; however, the ILP size is reduced by an order of magnitude to O(|V|) variables and O(|V|²) constraints. To summarize, the complexity of GNF is higher than the complexity of TNF, but it does not require the assumption that entailment graphs are FRGs.
4. Experimental Evaluation: Typed Entailment Graphs
In this and the next section we empirically evaluate the algorithms presented in Section 3 on two different data sets. Our first data set, presented in this section, comprises medium-sized graphs for which obtaining an exact solution is possible, and we demonstrate that our approximation methods substantially improve runtime while causing only a small degradation in performance compared with the optimal solution. The graphs are also particularly suited for global optimization because graph predicates are typed, which substantially reduces their ambiguity. The second data set (Section 5) contains a graph with tens of thousands of untyped nodes, where exact inference is completely impossible. We show that our methods scale to this graph and that transitivity improves performance even when predicates are untyped. The resulting graph contains more than 100,000 entailment rules that can be utilized in downstream semantic applications.
The input to the algorithms presented in Section 3 is a set of nodes V and a weighting function w : V × V → ℝ. We describe how those are constructed before presenting an experimental evaluation.
4.1 Typed Entailment Graphs
As mentioned in Section 2, one of the challenges in global optimization is that transitivity does not always hold when predicates are ambiguous. Schoenmackers et al. (2010) proposed an algorithm for learning inference rules between typed predicates, a representation that substantially reduces ambiguity. We adopt their representation and learn typed entailment graphs. A typed entailment graph (see Figure 7) is a directed graph where the nodes are typed predicates. A typed predicate is a triple (t1, p, t2), or simply p(t1, t2), representing a predicate in natural language. p is the lexical realization of the predicate and the typed variables t1, t2 indicate that the arguments of the predicate belong to the semantic types t1, t2. Semantic types are taken from a set of types T, where each type t ∈ T is a bag of natural language words or phrases. Examples for typed predicates are: conquer(country,city) and contain(product,material). An instance of a typed predicate is a triple (a1, p, a2), or simply p(a1, a2), where a1 ∈ t1 and a2 ∈ t2 are termed arguments. For example, be common in(asthma,australia) is an instance of be common in(disease,place).
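As a data structure, typed predicates and their instances are straightforward; the following sketch (class and field names are ours) mirrors the definitions above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedPredicate:
    """A typed predicate p(t1, t2)."""
    p: str    # lexical realization, e.g., "conquer"
    t1: str   # semantic type of the first argument, e.g., "country"
    t2: str   # semantic type of the second argument, e.g., "city"

@dataclass(frozen=True)
class Instance:
    """An instance p(a1, a2), where a1 belongs to t1 and a2 to t2."""
    pred: TypedPredicate
    a1: str
    a2: str

# be common in(asthma, australia) instantiates be common in(disease, place):
common_in = TypedPredicate("be common in", "disease", "place")
example = Instance(common_in, "asthma", "australia")
```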
Given a set of typed predicates, entailment rules can only exist between predicates that share the same (unordered) pair of types (such as place and country), as otherwise, the rule would contain unbound variables. Hence, given a set of typed predicates we can immediately decompose them into disjoint subsets—all typed predicates sharing the same pair of types define a separate graph that describes the entailment relations between those predicates (see Figure 7).
Edges in typed entailment graphs represent entailment rules in the usual way. If the type t1 is different from the type t2, mapping of arguments is straightforward, as in the rule be find in(material,product) ⇒ contain(product,material). If t1 and t2 are equal we need to specify how arguments are mapped. This is done by splitting each node into two: For example, the node beat(team,team) is split into two typed predicates beat(Xteam,Yteam) and beat(Yteam,Xteam). This allows us to specify a rule where argument order is reversed such as beat(Xteam,Yteam) ⇒ lose to(Yteam,Xteam). This also accommodates rules such as play(Xteam,Yteam) ⇒ play(Yteam,Xteam): If team A plays team B, then team B plays team A.
To create typed entailment graphs, we used a data set that was generously provided by Schoenmackers et al. (2010). Schoenmackers et al. produced a mapping of 1.1 million arguments into 156 types (examples of (argument, type) pairs are (exodus, book), (china, country), and (asthma, disease)), and then utilized the types, the mapped arguments, and 1 million TextRunner tuples (Banko et al. 2007) to generate a set of 10,672 typed predicates. Because entailment can only occur between predicates that share the same types, we decomposed the 10,672 typed predicates into 2,303 typed entailment graphs. The largest graph contains 188 nodes, and the total number of potential rules is 263,756. The advantage of typing predicates is that it substantially reduces ambiguity, while still maintaining rules of wide applicability.
4.2 Training a Local Entailment Classifier

Training is similar to the method proposed by Berant, Dagan, and Goldberger (2012), and we briefly describe it here. The input for training is a lexicographic resource, for which we use WordNet, and a set of tuples, for which we use the 1 million typed TextRunner tuples provided by Schoenmackers et al. We perform the following steps:
- 1.
Training set generation Positive examples are generated using WordNet synonyms and hypernyms. Negative examples are generated using WordNet direct co-hyponyms (sister terms), but we also utilize WordNet hyponyms at distance 2. In addition, we generate negative examples by randomly sampling pairs of typed predicates that share the same types. Table 1 provides an example for each type of automatically generated training example. It has been noted in the past that the WordNet verb hierarchy contains a certain amount of noise (Richens 2008; Roth and Frank 2012). However, we use WordNet only to generate examples for training a statistical classifier, and thus we can tolerate some noise in the generated examples. In fact, we have noticed that simple variants in training set generation do not result in substantial differences in classifier performance.
- 2.
Feature representation Each example pair of typed predicates (p1, p2) is represented by a feature vector, where each feature is a distributional similarity score estimating whether p1 entails p2.
We compute 11 distributional similarity scores for each pair of typed predicates, based on their arguments in the input set of tuples. The first six scores are computed by trying all combinations of two similarity functions Lin and BInc with three types of feature representations (see Section 2):
- (a)
A feature is a pair of arguments. For example, a feature for the typed predicate invade(country,city) might be (germany,leningrad) or (england,paris).
- (b)
Predicate representation is binary, and each typed predicate has two feature vectors, one for the X slot and one for the Y slot. Similarities are computed for each vector separately and are then combined by a geometric average (as in DIRT [Lin and Pantel 2001]). For example, the predicate invade(country,city) will have a feature vector for its X slot with features such as germany and england, and a feature vector for its Y slot with features such as leningrad and paris.
- (c)
Binary typed predicates are decomposed into two unary typed predicates, and similarity is computed separately for each unary predicate. Then, similarity scores are combined by a geometric average. For example, the binary typed predicate invade(country, city) will be decomposed into two unary predicates, one where the first argument of invade has the type country (and the type of the second argument is unspecified), and another where the second argument of invade has the type city (and the first argument is unspecified).
- 3.
Training We sub-sample negative examples and train over an equal number of positive and negative examples. We used SVMperf (Joachims 2005) to train a Gaussian kernel classifier that provides an output score, sij. We tuned the two SVM parameters using 5-fold cross validation on a development set of two typed entailment graphs.
- 4.
Prior knowledge For a small number of pairs of predicates, we might have prior knowledge of whether one entails the other. Berant, Dagan, and Goldberger (2012) integrated prior knowledge by adding hard constraints to the ILP. Because not all of our algorithms use an ILP solver, we integrate prior knowledge by modifying the local classifier score. For pairs of predicates i, j for which we have prior knowledge that i entails j (termed positive local constraints), we set sij = ∞. For pairs of predicates i,j for which we have prior knowledge that i does not entail j (termed negative local constraints), we set sij = −∞.
To generate prior knowledge constraints, we normalized each predicate by omitting the first word if it is a modal and by turning passives into actives. If two normalized predicates are equal, they form a positive local constraint. Negative local constraints were constructed from three sources: (1) predicates differing by a single pair of words that are WordNet antonyms, (2) predicates differing by a single word of negation, and (3) predicates p(t1, t2) and p(t2, t1) where p is a transitive verb (for example, beat) in VerbNet (Kipper, Dang, and Palmer 2000). A simplified sketch of the first two sources is given below.
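The sketch illustrates the positive case and the first two negative-constraint sources (the VerbNet-based source and passive-to-active rewriting are omitted); the modal and negation word lists and the helper names are ours, and the antonym set would be harvested from WordNet in practice.

```python
MODALS = {"can", "could", "may", "might", "must", "should", "will", "would"}
NEGATIONS = {"not", "never", "no"}

def normalize(pred):
    """Drop a leading modal (passive-to-active rewriting omitted here)."""
    words = pred.lower().split()
    return words[1:] if words and words[0] in MODALS else words

def local_constraint(p1, p2, antonyms):
    """Return '+' for a positive local constraint, '-' for a negative
    one, or None. `antonyms` is a set of frozensets of word pairs."""
    w1, w2 = normalize(p1), normalize(p2)
    if w1 == w2:
        return "+"                           # equal after normalization
    if len(w1) == len(w2):
        diff = [(a, b) for a, b in zip(w1, w2) if a != b]
        if len(diff) == 1 and frozenset(diff[0]) in antonyms:
            return "-"                       # single antonym substitution
    if abs(len(w1) - len(w2)) == 1:
        extra = set(w1) ^ set(w2)
        if len(extra) == 1 and extra <= NEGATIONS:
            return "-"                       # single word of negation
    return None

antonyms = {frozenset(("increase", "decrease"))}
print(local_constraint("would increase", "increase", antonyms))  # '+'
print(local_constraint("increase the risk of",
                       "decrease the risk of", antonyms))        # '-'
```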
As reported by Berant, Dagan, and Goldberger (2012), we find that injecting prior knowledge into the graph improves performance, as it provides a good starting point for the inference procedure.
Automatically generated training set examples.

| Type | Example |
|---|---|
| direct hypernym | beat(team,team) ⇒ play(team,team) |
| direct synonym | reach(team,game) ⇒ arrive at(team,game) |
| direct cohyponym | invade(country,city) ⇏ bomb(country,city) |
| hyponym (distance=2) | defeat(city,city) ⇏ eliminate(city,city) |
| random | hold(place,event) ⇏ win(place,event) |
4.3 Experimental Evaluation
To evaluate performance, we manually annotated all edges in 10 typed entailment graphs containing 14, 14, 22, 30, 53, 62, 76, 86, 118, and 118 nodes. This annotation yielded 3,427 edges and 35,585 non-edges, resulting in an empirical edge density of 9%. The data set is publicly available and can be downloaded from http://www-nlp.stanford.edu/joberant/homepage_files/resources/Acl2011Exp.rar.
We implemented the following algorithms for learning graph edges, where in all of them the graph is first decomposed into components as described in Section 3.1.
No-trans Use local scores without transitivity constraints—an edge (i, j) is inserted iff wij > 0, or in other words iff sij > λ.
Exact-graph Find the optimal solution for Max-Trans-Graph in each component by applying an ILP solver in a cutting-plane method, as described in Section 3.1.
Exact-forest Find the exact solution for Max-Trans-Forest (see Equation 8) in each component by applying an ILP solver in a cutting-plane method.
LP-relax Solve Max-Trans-Graph approximately by applying an LP-relaxation (see Section 2) on each graph component. We apply the LP solver within the same cutting-plane method to allow for a direct comparison. As mentioned, our goal is to present a method for learning transitive graphs, whereas LP-relax produces solutions that violate transitivity. However, we run it on our data set to obtain empirical results, and to compare runtimes against TNF.
Graph-Node-Fix (GNF) Initialize each component in the following way: If the graph is very sparse, that is, λ ≥ C for some constant C (set to 1 in our experiments), then solving the graph exactly is not an issue and we use Exact-graph. Otherwise, we initialize by applying Exact-graph in a sparse configuration, that is, λ = C.
Tree-Node-Fix (TNF) Initialize as in GNF, except that if it generates a graph that is not an FRG, it is corrected by a simple heuristic: For every node in the reduced graph that has more than one parent, we choose from its current parents the single one whose SCC is composed of the largest number of nodes in G.
We do not present the results of TNCF in this experiment, because for medium-sized graphs it provides results that are almost identical to TNF.
The No-trans baseline is a state-of-the-art local learning algorithm. It uses state-of-the-art local scores such as DIRT (Lin and Pantel 2001), BInc (Szpektor and Dagan 2008), Cover (Weeds and Weir 2003), and SR (Schoenmackers et al. 2010) and trains a classifier using these scores. Berant, Dagan, and Goldberger (2010) have demonstrated empirically that training a classifier over multiple similarity scores improves performance compared with using just a single similarity score.
The Exact-graph algorithm is the state-of-the-art global method presented by Berant, Dagan, and Goldberger (2012). It finds the optimal solution for Max-Trans-Graph, and our goal is to show that we can obtain good performance more efficiently using our approximation methods. Last, LP-relax is a standard approximation method for solving ILPs (Martins, Smith, and Xing 2009).
We note that a trivial baseline would be to initialize edges based on whether sij > λ and then compute the transitive closure in each component. We empirically found that this adds edges too aggressively and results in very low precision.
We use the Gurobi optimization package as our ILP solver in all experiments. The experiments were run on a multi-core 2.5GHz server with 32GB of RAM.
We evaluate algorithms by comparing the set of gold-standard edges with the set of edges learned by each algorithm. We measure recall, precision, and F1 for various values of the sparseness parameter λ, and compute the area under the precision-recall curve (AUC) generated. Efficiency is evaluated by comparing runtimes.
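Edge-level evaluation reduces to set comparisons between gold-standard and learned edges; a minimal sketch:

```python
def edge_scores(gold_edges, learned_edges):
    """Precision, recall, and F1 of a learned edge set against the
    gold-standard edge set (both are sets of directed node pairs)."""
    tp = len(gold_edges & learned_edges)
    precision = tp / len(learned_edges) if learned_edges else 0.0
    recall = tp / len(gold_edges) if gold_edges else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```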
We first focus on runtimes and show that TNF is substantially more efficient than other baselines that use transitivity. Figure 8 compares runtimes of Exact-graph, GNF, TNF, and LP-relax as −λ increases and the graph becomes denser. Note that the y-axis is in logarithmic scale. Clearly, Exact-graph is extremely slow and runtime increases quickly. For λ = 0.3, runtime was already 12 hours and we were unable to obtain results for λ < 0.3, whereas in TNF we easily got a solution for any λ. When λ = 0.6, where both Exact-graph and TNF achieve best F1, TNF is 10 times faster than Exact-graph. When λ = 0.5, TNF is 50 times faster than Exact-graph, and so on. Most importantly, runtime for GNF and TNF increases much more slowly than for Exact-graph. Comparing runtimes for TNF and GNF, we see that the gap between the algorithms decreases as −λ increases. However, for reasonable values of λ, TNF is about four to seven times faster than GNF, and we were unable to run GNF on large graphs, as we report in Section 5.
The runtime of LP-relax is also high compared with TNF and GNF: it increases more slowly than that of Exact-graph, but is still much higher than that of TNF. When λ = 0.6, LP-relax is almost 10 times slower than TNF, and when λ = −0.1, LP-relax is 200 times slower than TNF. This points to the difficulty of scaling LP-relax to large graphs. Last, Exact-forest is the slowest algorithm, and because it is an approximation of Exact-graph we omit it from the figure for clarity.
We now examine the quality of the learned graphs and validity of our modeling assumptions. Figure 9 (left) shows a pair-wise comparison of Exact-graph and Exact-forest. As is evident, the two curves are very similar, and the maximal F1 on the curve and AUC are almost identical. This provides further support for our modeling assumption that entailment graphs are roughly forest-reducible. Figure 9 (right) shows a similar comparison for TNF and GNF. We observe again that the curves are similar and performance is almost identical (maximal F1: 0.41, AUC: 0.31), illustrating that using the FRG assumption when using our approximation algorithm is empirically effective.
Pair-wise comparison of the precision-recall curves of Exact-graph vs. Exact-forest, and GNF vs. TNF.
In Figure 10, we compare the performance of Exact-graph, which exactly solves Max-Trans-Graph, to our most efficient approximation algorithm, TNF, and to No-trans, a baseline that does not use transitivity at all (GNF and LP-relax are omitted from the figure to improve readability). We observe that both Exact-graph and TNF substantially outperform No-trans and that TNF's graph quality is only slightly lower than Exact-graph (which is extremely slow). We report in the caption the maximal F1 on the curve and AUC in the recall range 0–0.5 (the widest range for which we have results for all algorithms). Note that compared with Exact-graph, TNF reduces AUC by merely a point and the maximal F1 score by two points only.
Precision (y-axis) vs. recall (x-axis) curve. Maximal F1 on the curve is 0.43 for Exact-graph, 0.41 for TNF, and 0.34 for No-trans. AUC in the recall range 0–0.5 is 0.32 for Exact-graph, 0.31 for TNF, and 0.26 for No-trans.
As for LP-relax, results are just slightly lower than Exact-graph (maximal F1: 0.43, AUC: 0.32), but its output is not a transitive graph, and as shown above runtime is quite slow.
To summarize, in this section we have empirically evaluated the algorithms presented in Section 3 on medium-sized graphs for which we can find an optimal solution. Our main findings are as follows:
- 1.
Finding the optimal solution for Exact-graph and Exact-forest results in very similar graphs. This supports our assumption that entailment graphs are approximately forest-reducible.
- 2.
Running GNF and TNF also yields very similar graphs, illustrating that using the FRG assumption when using our re-attachment approximation scheme is effective.
- 3.
Our efficient approximation algorithm, TNF, is substantially faster compared with running an ILP solver that exactly solves Max-Trans-Graph.
- 4.
TNF results in only a slight decrease in quality of learned graphs compared with Exact-graph.
- 5.
TNF, which respects the transitivity constraint, learns graphs of higher quality compared with the local baseline, No-trans.
These findings lead us to believe that the algorithms presented in Section 3 can scale to large entailment graphs. In the next section, we empirically evaluate them on a much larger data set for which using ILP solvers is impractical.
5. Experimental Evaluation: Untyped Entailment Graphs
In this section we evaluate our methods on a graph with 20,000 nodes. Again, we describe how the nodes V and the weighting function w are constructed.
5.1 Untyped Entailment Graphs
The nodes of the entailment graph we learn in this section are not typed. Although ambiguity is a problem in this setting, we will show that nevertheless transitivity constraints can improve results compared with a state-of-the-art local entailment classifier.
The nodes of the graph were generated from a set of two billion tuples of the form (arg1, predicate, arg2) that were extracted by the Reverb open information extraction system (Fader, Soderland, and Etzioni 2011) over the ClueWeb09 data set and were generously provided by the authors of Reverb.
We performed some simple preprocessing over the extracted tuples. Each tuple is accompanied by a confidence value, and we discarded tuples with confidence smaller than 0.5. Next, we normalized predicates using a procedure that omits unnecessary content such as modals and adverbs. The entire normalization procedure is publicly available as part of the Reverb project. Last, we normalized arguments by replacing pronouns and determiners by the tokens PRONOUN and DET. We also ran the BIU number normalizer and replaced numbers larger than one by the token NUM. After these three steps, we are left with a total of 960 million tuples, of which 291 million are distinct.
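As a minimal sketch of these preprocessing steps, the code below filters a stream of extracted tuples. The token lists and helper names are illustrative placeholders; in particular, the predicate normalization (stripping modals and adverbs) is performed by the Reverb normalizer and is not reproduced here.

```python
import re

# Illustrative token lists; the actual pipeline uses the Reverb and BIU tools.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "we", "you"}
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}

def normalize_arg(arg: str) -> str:
    tokens = []
    for tok in arg.lower().split():
        if tok in PRONOUNS:
            tokens.append("PRONOUN")
        elif tok in DETERMINERS:
            tokens.append("DET")
        elif re.fullmatch(r"\d+(\.\d+)?", tok) and float(tok) > 1:
            tokens.append("NUM")  # replace numbers larger than one
        else:
            tokens.append(tok)
    return " ".join(tokens)

def preprocess(tuples):
    """tuples: iterable of (arg1, predicate, arg2, confidence) extractions."""
    for arg1, pred, arg2, conf in tuples:
        if conf < 0.5:  # discard low-confidence extractions
            continue
        # the real pipeline also normalizes pred (modals, adverbs) here
        yield normalize_arg(arg1), pred, normalize_arg(arg2)
```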
The number of distinct predicates in the set of extracted tuples is 103,315. Because this is still a large number, we restrict the predicate set to the 10,000 predicates that appear with the highest number of pairs of arguments. As previously explained, each predicate is split into two (for example, defeat is split into X defeat Y and Y defeat X), and so the final number of entailment graph nodes is 20,000.
5.2 Training a Local Entailment Classifier
As with typed entailment graphs, the weighting function w is obtained by training a classifier that provides a score sij for all pairs of predicates and defining wij = sij − λ. Computing the scores sij involves the following steps:
1. Training set generation. We use the data set released by Zeichner, Berant, and Dagan (2012), which contains 6,567 entailment rule applications annotated for their validity by crowdsourcing. For example, the data set marks that The exercises alleviate pain ⇒ The exercises help ease pain is a valid rule application, whereas Obama want to boost the defense budget ⇒ Obama increase the defense budget is an invalid rule application. We extract a single rule from each rule application; for example, from the rule application The exercises alleviate pain ⇒ The exercises help ease pain we extract the rule X alleviate Y ⇒ X help ease Y. We use half of the data set for training, resulting in 1,224 positive examples and 2,060 negative examples. Two further training examples are X unable to pay Y ⇒ X owe Y and X own Y ⇏ Y be sold to X.
2. Feature representation. Each pair of predicates (p1, p2) is represented by a feature vector whose first six features are the distributional similarity features described in Section 4.2. In addition, for pairs of predicates for which at least one distributional similarity feature is non-zero, we add lexicographic features computed from WordNet (Fellbaum 1998), VerbOcean (Chklovski and Pantel 2004), and CatVar (Habash and Dorr 2003), as well as string-similarity features. Table 2 provides the exact details of these features (a sketch of one such feature follows the table). A feature is computed for a pair of predicates (p1, p2), where in this context a predicate is a pair (pred, rev): pred is the lexical realization of the predicate, and rev is a Boolean indicating whether arg1 is X and arg2 is Y, or vice versa. Overall, each pair of predicates is represented by 27 features.
3. Training. After obtaining a feature representation for every pair of predicates, we train a Gaussian kernel SVM classifier that optimizes F1 (SVMperf implementation [Joachims 2005]), and tune the parameters C and γ by a grid search combined with 5-fold cross-validation. We use the trained local classifier to compute a score sij for all pairs of predicates.
4. Prior knowledge. We automatically generate local constraints for pairs of predicates for which we know with high certainty whether the first entails the second. We define and compute constraints over pairs of predicates (p1, p2), where again a predicate p is a pair (pred, rev). We start with negative local constraints (examples in Table 3; a code sketch of two such string-based checks appears after the constraint lists):
(a) cousin: (p1, p2) are cousins if rev1 = rev2 and pred1 = pred2, except for a single pair of words w1 and w2, which are cousins in WordNet—that is, they have a common hypernym at distance 2.
(b) indirect hypernym: (p1, p2) are indirect hypernyms if rev1 = rev2 and pred1 = pred2, except for a single pair of words w1 and w2, and w1 is a hypernym at distance 2 of w2 in WordNet.
(c) antonym: (p1, p2) are antonyms if rev1 = rev2 and pred1 = pred2, except for a single pair of words w1 and w2, and w1 is an antonym of w2 in WordNet.
(d) uncertainty implication: holds for (p1, p2) if rev1 = rev2 and prepending the words want to to pred2 results in pred1.
(e) negation: holds for (p1, p2) and (p2, p1) if rev1 = rev2 and removing a single negation word (not, no, never, or n't) from pred1 results in pred2. Negation also holds for (p1, p2) and (p2, p1) if rev1 = rev2 and replacing the word no in pred1 with the word a results in pred2.
(f) transitive opposites: (p1, p2) are transitive opposites if pred1 = pred2 and rev1 ≠ rev2 and pred1 is a transitive verb in VerbNet.
We continue with positive local constraints (examples in Table 4):
(a) determiner: holds for (p1, p2) and (p2, p1) if rev1 = rev2 and omitting a determiner (a or the) from pred1 results in pred2.
(b) positive implication: holds for (p1, p2) if rev1 = rev2 and prepending the words manage to, start to, start, decide to, or begin to to pred2 results in pred1.
(c) passive-active: holds for (p1, p2) and (p2, p1) if p2 is the passive form of p1.
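To make the string-based constraints concrete, here is a minimal sketch of two of the checks above (negation and positive implication). A predicate is represented as a (pred, rev) pair as in the text; the function names and token lists are our own, not the authors' implementation.

```python
NEGATION_WORDS = {"not", "no", "never", "n't"}
POSITIVE_PREFIXES = ("manage to", "start to", "start", "decide to", "begin to")

def negation_constraint(p1, p2):
    """True if removing one negation word from pred1 yields pred2."""
    (pred1, rev1), (pred2, rev2) = p1, p2
    if rev1 != rev2:
        return False
    words = pred1.split()
    return any(w in NEGATION_WORDS
               and " ".join(words[:i] + words[i + 1:]) == pred2
               for i, w in enumerate(words))

def positive_implication(p1, p2):
    """True if prepending one of the phrases above to pred2 yields pred1."""
    (pred1, rev1), (pred2, rev2) = p1, p2
    return rev1 == rev2 and any(pred1 == f"{prefix} {pred2}"
                                for prefix in POSITIVE_PREFIXES)

# e.g., ("do no harm to", False) and ("do harm to", False): negation holds,
# so entailment is blocked in both directions.
```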
Definition of lexicographic and string-similarity features. We denote by the string pred the lexical realization of the predicate, and by the Boolean indicator rev whether arg1 is X and arg2 is Y, or vice versa. Normalized edit-distance is edit-distance divided by the sum of the lengths of pred1 and pred2.
| Name | Type | Source | Description |
|---|---|---|---|
| synonym | binary | WordNet | rev1 = rev2 ∧ pred1 is a synonym of pred2. |
| loose synonym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a synonym of w2. |
| hypernym | binary | WordNet | rev1 = rev2 ∧ pred1 is a hypernym of pred2 at distance ≤ 2. |
| loose hypernym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a hypernym of w2 at distance ≤ 2. |
| hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 is a hyponym of pred2 at distance ≤ 2. |
| loose hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a hyponym of w2 at distance ≤ 2. |
| co-hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 is a co-hyponym of pred2 at distance ≤ 2. |
| loose co-hyponym | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is a co-hyponym of w2 at distance ≤ 2. |
| entailment | binary | WordNet | rev1 = rev2 ∧ pred1 verb-entails pred2 (distance ≤ 1). |
| loose entailment | binary | WordNet | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 verb-entails w2 (distance ≤ 1). |
| stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 is stronger-than pred2. |
| loose stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w1 is stronger-than w2. |
| rev-stronger | binary | VerbOcean | rev1 = rev2 ∧ pred2 is stronger-than pred1. |
| loose rev-stronger | binary | VerbOcean | rev1 = rev2 ∧ pred1 and pred2 are identical except for a pair of words (w1, w2) where w2 is stronger-than w1. |
| CatVar | binary | CatVar | pred1 and pred2 contain a pair of content words (w1, w2) that are either identical or derivationally-related in CatVar. |
| remove word | binary | String | rev1 = rev2 ∧ removing a single word from pred1 will result in pred2. |
| add word | binary | String | rev1 = rev2 ∧ adding a single word to pred1 will result in pred2. |
| remove adj. | binary | String | rev1 = rev2 ∧ removing a single adjective from pred1 will result in pred2. |
| add adj. | binary | String | rev1 = rev2 ∧ adding a single adjective to pred1 will result in pred2. |
| Edit | real | String | if rev1 = rev2, the normalized edit-distance between pred1 and pred2, otherwise 1. |
| Reverse | binary | String | rev1 = rev2. |
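As an illustration of the features in Table 2, the following sketch computes the loose synonym feature. We use NLTK's WordNet interface for concreteness; the authors do not specify their implementation.

```python
from nltk.corpus import wordnet as wn

def are_synonyms(w1: str, w2: str) -> bool:
    # two words count as synonyms here if they share at least one synset
    return bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def loose_synonym(p1, p2) -> bool:
    """pred1 and pred2 identical except for one pair of synonymous words."""
    (pred1, rev1), (pred2, rev2) = p1, p2
    if rev1 != rev2:
        return False
    t1, t2 = pred1.split(), pred2.split()
    if len(t1) != len(t2):
        return False
    diff = [(a, b) for a, b in zip(t1, t2) if a != b]
    return len(diff) == 1 and are_synonyms(*diff[0])
```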
Examples for negative local constraints. The column ⇒/⇔ indicates whether the constraint is directional or symmetric.
| Name | ⇒/⇔ | Example |
|---|---|---|
| cousin | ⇔ | X beat Y ⇏ X fix Y |
| indirect hypernym | ⇒ | X give all kind of Y ⇏ X sell all kind of Y |
| antonym | ⇔ | X announce death of Y ⇏ X announce birth of Y |
| uncertainty implication | ⇒ | X want to analyze Y ⇏ X analyze Y |
| negation | ⇔ | X do no harm to Y ⇏ X do harm to Y |
| transitive opposite | ⇔ | X abandon Y ⇏ Y abandon X |
Examples for positive local constraints. The column ⇒/⇔ indicates whether the constraint is directional or symmetric.
| Name | ⇒/⇔ | Example |
|---|---|---|
| determiner | ⇔ | X be the answer to Y ⇒ X be answer to Y |
| positive implication | ⇒ | X start to doubt Y ⇒ X doubt Y |
| passive-active | ⇔ | X command Y ⇒ Y be commanded by X |
5.3 Experimental Evaluation
To evaluate performance we use the second half of the data set released by Zeichner, Berant, and Dagan (2012) as a test set. This test set contains 3,283 annotated examples, where 1,734 are covered by the 10,000 nodes of our graph (649 positive examples and 1,085 negative examples). As usual, for each algorithm we compute recall and precision with respect to the gold standard at various points by varying the sparseness parameter λ.
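A minimal sketch of this evaluation loop is given below. Here learn_graph is a stand-in for any of the algorithms compared in this section (it is passed in as a parameter), and gold_pos/gold_neg are the covered positive and negative test pairs.

```python
def pr_curve(learn_graph, nodes, scores, gold_pos, gold_neg, lambdas):
    """Sweep the sparseness parameter and collect (recall, precision) points."""
    points = []
    for lam in lambdas:  # e.g., values in [0.05, 2.5]
        edges = learn_graph(nodes, scores, lam)  # uses w_ij = s_ij - lam
        tp = sum(1 for pair in gold_pos if pair in edges)
        fp = sum(1 for pair in gold_neg if pair in edges)
        recall = tp / len(gold_pos)
        precision = tp / (tp + fp) if tp + fp else 1.0
        points.append((recall, precision))
    return points
```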
Most methods used over typed graphs in Section 4 (Exact-graph, Exact-forest, LP-relax, and GNF) are intractable because they require an ILP solver and our graph does not decompose into components that are small enough. Therefore, we compare the No-trans local baseline, where transitivity is not used, with the efficient global methods TNF and TNCF. Recall that in Section 4 we initialized TNF by applying an ILP solver in a sparse configuration on each graph component. In this experiment the graph is too large to use an ILP solver and so we initialize TNF and TNCF with a simple heuristic we describe next.
We begin with an empty graph and first sort all pairs of predicates (i, j) for which wij > 0, according to their weight w. Then, we go over predicate pairs one by one and perform two operations. First, we verify that inserting (i, j) into the graph does not violate the FRG assumption. This is done by going over all edges (i, k) ∈ E and checking that for every k either (j, k) ∈ E or (k, j) ∈ E. If this is the case, then (i, j) is a valid candidate edge, as the resulting reduced graph will remain a directed forest; otherwise, adding it would result in i having two parents in the reduced graph, and we do not insert this edge. Second, we compute the transitive closure Tij, which contains all node pairs that must become edges if we insert (i, j), due to transitivity. We compute the change in the value of the objective function if we were to insert Tij into the graph. If this change improves the objective function value, we insert Tij into the graph (and so the value of the objective function increases monotonically). We keep going over the candidate edges from highest score to lowest until the objective function can no longer be improved by inserting a candidate edge. We term this initialization HTL-FRG, because we scan edges from high score to low score and maintain an FRG. We add this initialization procedure as another baseline; a sketch of it follows.
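The following is a minimal sketch of HTL-FRG. It performs a single high-to-low pass for brevity (the procedure described above keeps scanning until no candidate improves the objective), and all helper names are our own.

```python
def htl_frg(w):
    """w: dict mapping an ordered pair (i, j) to its weight w_ij = s_ij - lam."""
    E = set()

    def frg_ok(i, j):
        # inserting i => j: every k with (i, k) in E must be comparable
        # with j, i.e., (j, k) in E or (k, j) in E
        return all((j, k) in E or (k, j) in E
                   for (i2, k) in E if i2 == i)

    def closure_edges(i, j):
        # node pairs forced to become edges by transitivity if (i, j) is added
        into_i = {u for (u, v) in E if v == i} | {i}
        from_j = {v for (u, v) in E if u == j} | {j}
        return {(u, v) for u in into_i for v in from_j if u != v} - E

    candidates = sorted((p for p in w if w[p] > 0), key=w.get, reverse=True)
    for (i, j) in candidates:
        if (i, j) in E or not frg_ok(i, j):
            continue
        T = closure_edges(i, j)
        if sum(w.get(e, 0.0) for e in T) > 0:  # objective strictly improves
            E |= T
    return E
```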
Figure 11 presents the precision-recall curve of all global algorithms compared with the local classifier for 0.05 ≤ λ ≤ 2.5. One evident property is that No-trans reaches higher recall values than the global algorithms: adding a global transitivity constraint prevents the addition of correct edges that have positive weight, because inserting them would force the addition of many other edges that have negative weight. A possible reason for this is predicate ambiguity, where positive-weight edges connect connected components through an ambiguous predicate. This suggests that a natural direction for future research is to develop algorithms that do not impose transitivity constraints for rules i ⇒ j and j ⇒ k if each rule refers to a different meaning of j. One possible direction is to learn latent “sense” or “topic” variables for each rule (as suggested by Melamud et al. [2013]). Then, we can use the methods presented in this article to impose transitivity constraints for each sense separately, and easily allow violations of transitivity across different senses. We provide concrete examples for such cases of ambiguity in our qualitative analysis in Section 5.4.
Precision-recall curve comparing global algorithms to the local algorithm.
Although global algorithms are limited in their recall, for recall values of 0.1–0.2 they substantially improve precision over No-trans. To better observe this result, Table 5 presents the results for the recall range 0.1–0.25. Comparing TNCF and No-trans shows that the TNCF algorithm improves over No-trans by 5–10 precision points: In the recall range of 0.1–0.2, the precision of No-trans is 0.68–0.75, whereas the precision of TNCF is 0.74–0.8. The number of rules learned by TNCF in this recall range is about 100,000.
Recall, precision, F1, and the number of learned rules for No-trans and several global algorithms, for values of λ for which recall is in the range 0.1–0.25.
| method | λ | recall | precision | F1 | # of rules |
|---|---|---|---|---|---|
| No-trans | 1 | 0.1 | 0.75 | 0.18 | 39,872 |
| No-trans | 0.8 | 0.14 | 0.7 | 0.24 | 57,276 |
| No-trans | 0.6 | 0.17 | 0.68 | 0.27 | 65,232 |
| No-trans | 0.4 | 0.19 | 0.7 | 0.3 | 77,936 |
| No-trans | 0.2 | 0.24 | 0.65 | 0.35 | 116,832 |
| HTL-FRG | 0.2 | 0.09 | 0.78 | 0.17 | 48,986 |
| HTL-FRG | 0.15 | 0.1 | 0.78 | 0.18 | 55,756 |
| HTL-FRG | 0.1 | 0.11 | 0.77 | 0.19 | 65,802 |
| HTL-FRG | 0.05 | 0.12 | 0.76 | 0.21 | 84,380 |
| TNF | 0.2 | 0.11 | 0.77 | 0.19 | 61,186 |
| TNF | 0.15 | 0.11 | 0.72 | 0.19 | 71,674 |
| TNF | 0.1 | 0.14 | 0.72 | 0.23 | 88,304 |
| TNF | 0.05 | 0.15 | 0.72 | 0.25 | 127,826 |
| TNCF | 0.2 | 0.12 | 0.79 | 0.21 | 66,240 |
| TNCF | 0.15 | 0.13 | 0.75 | 0.23 | 80,450 |
| TNCF | 0.1 | 0.16 | 0.8 | 0.27 | 102,565 |
| TNCF | 0.05 | 0.18 | 0.74 | 0.29 | 156,208 |
We use HTL-FRG as an initialization method for TNCF; on its own, however, HTL-FRG is not a very good approximation algorithm. Comparing HTL-FRG with No-trans, we observe that it is unable to reach high recall values and only marginally improves precision. Applying TNF over HTL-FRG also yields disappointing results: it increases recall compared with HTL-FRG, but precision is no better than No-trans. Indeed, it seems that although performing node re-attachments monotonically improves the objective function value, it is not powerful enough to learn a graph that is better with respect to our gold standard. Adding component re-attachments (TNCF), on the other hand, allows the algorithm to better explore the space of graphs and learn a graph with higher recall and precision than HTL-FRG.
Figure 12 presents the runtime for HTL-FRG, TNF, and TNCF (y-axis in logarithmic scale), all of which yield an FRG and can be run on a large untyped graph. Again, we were unable to obtain results with Exact-graph, Exact-forest, LP-relax, and GNF, because using an ILP solver on a graph with 20,000 nodes was impossible. As expected, HTL-FRG is much faster than both TNF and TNCF, and TNCF is somewhat slower than TNF. However, as mentioned earlier, TNCF is able to improve both precision and recall compared with TNF and HTL-FRG. Note that when λ = 1.2, runtime increases suddenly for both TNF and TNCF. This is because at this point a large connected component is formed, as we explain next.
Figure 13 presents the results of graph decomposition as described in Section 3.1. Evidently, when λ = 0 the largest component constitutes almost all of the graph, which contains 20,000 nodes. Note that when λ = 1.2, the size of the largest component increases suddenly to more than half of the graph nodes. Additionally, TNCF obtained recall of 0.1–0.2 when λ ≤ 0.2, and in this case the number of nodes in the largest component is 90% of the total number of graph nodes. Thus, in contrast to the typed graphs, untyped graphs that contain ambiguous predicates do not decompose well into small components.
The number of nodes in the largest component as a function of the sparseness prior −λ.
To summarize, we demonstrated that our efficient global methods scale to a graph with 20,000 nodes and observed that even when predicates are ambiguous we are able to improve precision for moderate values of recall. Nevertheless, there are indications that our modeling assumptions are violated when applied over a large graph with ambiguous predicates. Transitivity and FRG constraints limit the recall we are able to reach, and the graph does not decompose very well into components. Thus, in future work we would like to apply our algorithms over graphs where predicate ambiguity is modeled and so transitivity constraints can be properly applied.
5.4 Qualitative Analysis
In the previous section we quantitatively evaluated global methods for learning entailment graphs over a large set of untyped predicates. Our results revealed interesting issues that we would like to further investigate:
1. Why do methods that assume entailment graphs are transitive and forest-reducible have limited recall?
2. How much does node re-attachment affect graph structure?
In this section, we try to answer these questions by performing a qualitative analysis that allows us to gain better insight on the behavior of our algorithms.
Question 1: Transitivity and FRG assumptions. We would like to understand why using global algorithms such as TNCF results in limited recall. Our hypothesis is that because predicates in the graph are untyped and potentially ambiguous, violations of the assumptions are possible, which results in recall reduction. To examine this hypothesis, we perform two qualitative analyses:
1. We analyze graphs containing only gold-standard edges, and manually identify cases where the modeling assumptions are violated.
2. We compare the local algorithm, No-trans, to the global algorithm, TNCF. If our hypothesis is correct, then we expect TNCF to remove correct edges that are inserted by No-trans due to ambiguous predicates.
We start by constructing a graph with all edges from our gold-standard data set and search for transitivity and FRG violations caused by ambiguous predicates. Note that the manually annotated gold-standard edges are only a subset of the full set of correct edges, which is actually much larger. Thus, we cannot automatically identify violations, because the graph is missing many true edges. Instead, we manually go over candidate violations and check whether each one indeed results from a violation of our modeling assumptions; if so, our global algorithm cannot predict the correct set of edges.
One example is the pair of rules seek ⇒ apply for and apply for ⇒ file for. The first rule was annotated in the rule application Others seek medical care ⇒ Others apply for medical care (recall that crowdsourcing was used to annotate rule applications, and that rules were extracted from these applications). The second rule was extracted from the rule application Students apply for more than one award ⇒ Students file for more than one award. This causes a transitivity violation because Students seek more than one award ⇏ Students file for more than one award. Evidently, the meaning of the predicate apply for is context-dependent. Another example is the pair of rules contain ⇒ supply and supply ⇒ serve, annotated in the applications The page contains links ⇒ The page supplies links and Users supply information or material ⇒ Users serve information or material. Clearly, contain ⇏ serve, and the transitivity violation is caused by the fact that the predicate supply is context-dependent—in the first context the subject is inanimate, whereas in the second context the subject is human.
A good example for an FRG violation is the predicate come from. The following three entailment rules are annotated by the turkers: come from ⇒ be raised in, come from ⇒ be derived from, and come from ⇒ come out of. These correspond to the following rule applications: The Messiah comes from Judah ⇒ The Messiah was raised in Judah, The colors come from the sun light ⇒ The colors are derived from the sun light, and The truth comes from the book ⇒ The truth comes out of the book. Clearly, be raised in ⇏ be derived from and be derived from ⇏ be raised in, so this is an FRG violation. Indeed, the three applications correspond to different meanings of the predicate come from, which depend on context. A second example is the pair of rules be dedicated to ⇒ be committed to and be dedicated to ⇒ contain information on. Again, this is an FRG violation because be committed to ⇏ contain information on and contain information on ⇏ be committed to. This violation occurs because of the ambiguity of the predicate be dedicated to, which can be resolved by knowing whether the subject is human or is an object that carries information (for example, The web site is dedicated to the life of lizards).
This first analysis revealed particular cases where the transitivity and FRG assumptions do not hold. In the next analysis, we would like to directly compare cases where global and local methods disagree with one another. We hypothesize that because predicates are not disambiguated, global algorithms will delete correct edges that involve an ambiguous predicate. We set the sparseness parameter λ = 0.2 and compare the set of edges learned by No-trans to the set of edges learned by TNCF.
Examining the disagreements, we observe that, as expected, in 90% of the cases TNCF deletes an edge that was inserted by No-trans. We divide these deleted edges into two categories in the following manner: We compute the weakly connected components of the graph learned by TNCF, and then for each edge inserted by No-trans and deleted by TNCF we check whether it connects two different weakly connected components or not. There are two reasons for focusing on this property of deleted edges. First, we hypothesize that most deleted edges will connect two different connected components, because such edges result in many violations of transitivity. Second, weakly connected components usually correspond to separate semantic meanings, and thus a correct edge that connects two weakly connected components probably involves an ambiguous predicate.
Indeed, out of all edges where No-trans inserts an edge and TNCF does not, 76.4% connect weakly connected components in the global graph. This shows that deleting such edges is the main cause for disagreement between TNCF and No-trans. Now, we can take a closer look at these edges, and check whether indeed when a correct edge is deleted, it is because of an ambiguous predicate.
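To make this analysis concrete, here is a minimal sketch that computes the fraction of deleted edges bridging weakly connected components. We use networkx for the component computation; the tooling is our choice, not necessarily the authors'.

```python
import networkx as nx

def bridging_fraction(tncf_edges, notrans_edges, nodes):
    """Fraction of edges in No-trans but not TNCF that bridge two WCCs of TNCF."""
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    G.add_edges_from(tncf_edges)
    comp_of = {}  # node -> id of its weakly connected component
    for cid, comp in enumerate(nx.weakly_connected_components(G)):
        for node in comp:
            comp_of[node] = cid
    deleted = set(notrans_edges) - set(tncf_edges)
    bridging = sum(1 for (u, v) in deleted if comp_of[u] != comp_of[v])
    return bridging / len(deleted) if deleted else 0.0
```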
We randomly sample five cases where TNCF erroneously deleted an edge connecting weakly connected components—these are cases where ambiguity is likely to occur. For comparison, we also sample five cases where TNCF was right and No-trans erred. Table 6 shows the samples where No-trans is correct. The first column describes the rule and the second column specifies the score sij provided by the local classifier No-trans. The last two columns give two example predicates from the weakly connected components that contain the rule's LHS and RHS in the graph learned by TNCF; these columns provide a flavor of the overall meaning of the predicates that belong to each component. Table 7 is structured identically and shows the samples where TNCF is correct.
Correct edges inserted by No-trans and omitted by TNCF that connect weakly connected components in TNCF. An explanation for each column is provided in the body of the section.
| Example | Rule | sij | LHS component | RHS component |
|---|---|---|---|---|
| 1 | X turn off Y ⇒ X cut off Y | 1.32 | X shut off Y, X cut on Y | X chop Y, X slice Y |
| 2 | X be arrested in Y ⇒ X be captured in Y | 0.27 | X be killed on Y, X be arrested on Y | X be delivered in Y, X be provided in Y |
| 3 | X die of Y ⇒ X be diagnosed with Y | 0.88 | X gain on Y, X run on Y | X be treated for Y, X have symptom of Y |
| 4 | X be available to Y ⇒ X be open to Y | 0.27 | X be offered to Y, X be provided to Y | X open for Y, X open up for Y |
| 5 | X be put by Y ⇒ X be run by Y | 0.41 | Y help put X, X be placed by Y | Y operate X, Y control X |
Erroneous edges inserted by No-trans and omitted by TNCF that connect weakly connected components in TNCF. An explanation for each column is provided in the body of the section.
| Example | Rule | sij | LHS component | RHS component |
|---|---|---|---|---|
| 1 | X want to see Y ⇒ X want to help Y | 0.83 | X need to visit Y, X need to see Y | X be a supporter of Y, X need to support Y |
| 2 | X cut Y ⇒ X fix Y | 0.88 | X chop Y, X slice Y | X cure Y, X mend Y |
| 3 | X share Y ⇒ X understand Y | 0.62 | X partake in Y, Y be shared by X | Y be recognized by X, X realize Y |
| 4 | X build Y ⇒ X rebuild Y | 1.42 | X construct Y, Y be manufactured by X | X regenerate Y, X restore Y |
| 5 | X measure Y ⇒ X weigh Y | 0.65 | X quantify Y, X be a measure of Y | X count for Y, X appear Y |
Example 1 in Table 6 already demonstrates the problem of ambiguity. The LHS for this rule application in the crowdsourcing experiment was your parents turn off comments and indeed in this case the RHS your parents cut off comments is inferred. However, we can see that the LHS component revolves around the turning off or shutting off of appliances, for example, whereas the RHS component deals with a more explicit and physical meaning of cutting. This is because the predicate X cut off Y is ambiguous and consequently TNCF omits this correct edge. Example 5 also demonstrates well the problem of ambiguity—the LHS of this rule application is This site was put by fans, which in this context entails the RHS This site was run by fans. However, the more common sense of be put by does not entail the predicate be run by; this sense is captured by the predicates in the LHS component Y help put X and X be placed by Y. Example 4 is another good example where the ambiguous predicate is X be open to Y—the RHS component is concerned with the physical state of being opened, whereas the LHS component has a more abstract sense of availability. The LHS of the rule application in this case was The services are available to museums. Examples 2 and 3 are cases where the problem is not in predicate ambiguity but in the connected component itself. Thus, out of five randomly sampled examples where TNCF deleted an edge that connects two weakly connected components, three can be attributed to predicate ambiguity.
Table 7 demonstrates cases where TNCF correctly omitted an edge and consequently decomposed a weakly connected component into two. These are classical cases where the global constraint of transitivity helps the algorithm avoid errors. In Example 1, although sij = 0.83 (which is higher than λ = 0.2), the edge is not inserted because this would cause the addition of wrong edges from the LHS component that deals with seeing and visiting to the RHS component, which is concerned with help and support. Example 2 is a case of a pair of predicates that are co-hyponyms or even perhaps antonyms. Transitivity helps TNCF avoid this erroneous rule.
To conclude, our analysis reveals that applying structural constraints encourages global algorithms to omit edges. Although this is often desirable, it can also prevent correct rules from being discovered because of problems of ambiguity. Thus, as discussed in Section 5.3, we believe that an important direction for future research is to apply our global graph learning methods over context-sensitive representations (such as the one presented by Melamud et al. [2013]).
Question 2: Node re-attachment. We would like to understand the effect of applying TNF over the entailment graph, or more precisely, check whether TNF substantially alters the set of graph edges compared with an initialization with HTL-FRG. First, we run both HTL-FRG and TNF with sparseness parameter λ = 0.2 and compute the proportion of edges that is in their intersection. HTL-FRG learns 48,986 edges and TNF learns 61,186 edges. The number of edges in the intersection is 35,636. This means that 27% of the edges learned by HTL-FRG were removed (13,350 edges) and in addition 25,550 edges were added into the graph. Clearly, this indicates that TNF changes the learned graph substantially.
We exemplify this with a few fragments of graphs learned by HTL-FRG and TNF. Figure 14 shows 11 predicates, where the black solid edges correspond to HTL-FRG edges and red dashed edges correspond to TNF edges. Nodes that contain more than a single predicate represent an agreement between TNF and HTL-FRG that all predicates in the node entail one another. Clearly, TNF substantially changes the initialization generated by HTL-FRG. HTL-FRG creates an erroneous component in which give a lot of and generate a lot of are synonymous and mean a lot of entails them. TNF disassembles this component and determines that give a lot of is equivalent to offer plenty of, while generate a lot of entails cause of. Moreover, TNF disconnects the predicate mean a lot of completely. TNF also identifies that the predicates offer a wealth of, provide a wealth of, provide a lot of, and provide plenty of entail offer plenty of. Of course, TNF sometimes causes errors—for example, HTL-FRG decided that cause and cause of are equivalent, but TNF deleted the correct entailment rule cause of ⇒ cause.
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red dashed edges).
Figure 15 provides another interesting example for an improvement in the graph due to the application of TNF. In the initialization, HTL-FRG determined that the predicate encrypt entails the predicates convert and convince rather than the predicates code and encode. This is because the local score provided by the classifier for (encrypt, convert) is 0.997—slightly higher than the local score for (encrypt, encode), which is 0.995. Therefore, HTL-FRG inserts the wrong edge encrypt ⇒ convert and then it is unable to insert the correct rule encrypt ⇒ encode because of the FRG assumption. After HTL-FRG terminates, TNF goes over the nodes one by one and is able to determine that overall the predicate encrypt fits better as a synonym of the predicates code and encode and re-attaches it in the correct location. As an aside, notice that both HTL-FRG and TNF are able to identify in this case the correct directionality of the rule compress ⇒ encode. The local score given by the classifier for (compress, encode) is 0.65, whereas the local score for (encode, compress) is –0.11.
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red dashed edges).
As a last example, Figure 16 demonstrates again the problem of predicate ambiguity. The predicate depress can occur in two contexts that, though related, instigate different entailment relations. The first context is when the object of the predicate (the Y argument) is a person and then the meaning of depress is similar to discourage. A second context is when the object is some phenomenon, for example, The serious economical situation depresses consumption levels. In initialization, HTL-FRG generates a set of edges that is compatible with the latter context, determining that the predicate depress entails the predicates lower and bring down. TNF re-attaches depress as a synonym of discourage and deter (because this improves the objective function), resulting in a set of edges that corresponds to the first meaning of depress mentioned above. The methods we use in this article do not allow learning the rules for both contexts because this would result in violations of the transitivity and FRG assumptions. This again emphasizes the importance of learning graphs where predicates are marked by their various senses, which will result in a model that can directly benefit from the methods suggested in this article.
A fragment from the graphs learned by HTL-FRG (black solid edges) and TNF (red dashed edges).
6. Conclusions
The problem of language variability is at the heart of many semantic applications such as Information Extraction, Question Answering, Semantic Parsing, and more. Consequently, learning broad coverage knowledge bases of entailment rules and paraphrases (Ganitkevitch, Van Durme, and Callison-Burch 2013) has proved crucial for systems that perform inference over textual representations (Stern and Dagan 2012; Angeli and Manning 2014).
In this article, we have presented algorithms for learning entailment rules between predicates that can scale to large sets of predicates. Our work builds on prior work by Berant, Dagan, and Goldberger (2012), who defined the concept of entailment graphs, and formulated entailment rule learning as a graph optimization problem, where the graph has to obey the structural constraint of transitivity.
Our main contribution is a heuristic polynomial approximation algorithm that can learn entailment graphs containing tens of thousands of nodes. The algorithm is based on two main observations: first, that entailment graphs are sparse, and second, that entailment graphs are forest-reducible. This leads to an efficient algorithm in which the graph is first decomposed into smaller components, and a set of edges is then found for each component. The second step is an iterative approximation algorithm in which at each iteration a node or component of the graph is deleted and then re-attached in a way that improves the overall objective function.
We performed a thorough empirical evaluation, in which we ran our algorithm on two separate data sets. On the first data set we witnessed a substantial improvement in runtime at a relatively small cost in performance. On the second data set, we show that our method scales to an untyped entailment graph containing 20,000 nodes, much larger than was previously possible using state-of-the-art global methods. We see that using a global algorithm improves precision for modest values of recall, but that local methods can achieve higher recall than global methods. We perform an analysis and conclude that the main reason for this limitation is predicate ambiguity, which results in graphs that do not adhere to the structural constraints of transitivity and forest reducibility.
In future work, we hope to address the problem of predicate ambiguity. A natural direction is to combine methods that model the context-sensitivity of predicates with global structural assumptions. In this way, one can apply structural constraints for a set of predicates, given that they appear in similar contexts. A second important direction for future work is to demonstrate that using entailment rules learned by our method improves performance in downstream semantic applications such as QA, IE, and Recognizing Textual Entailment.
Appendix A. Max-Trans-Forest Is NP-hard
In this appendix we show that the Max-Trans-Forest problem (8) is NP-hard. We prove that the following decision problem (which is clearly easier than Max-Trans-Forest) is NP-hard: Given a set of nodes V, a function w : V × V → ℝ, and an integer α, is there a transitive forest-reducible (FRG) subgraph G = (V, E) such that ∑e∈E w(e) ≥ α? In this appendix we use the standard terminology that for a directed edge i ⇒ j, the node i is termed the parent of node j, and the node j is termed the child of i. We start with a simple polynomial reduction from the following decision problem.
Max-Sub-FRG: Given a directed graph G = (V, E), a function w : E → ℤ+, and a positive integer α, is there a transitive FRG subgraph G′ = (V, E′) such that ∑e∈E′ w(e) ≥ α?
Max-Sub-FRG ≤p Max-Trans-Forest: Given an instance (G = (V, E), w, α) of Max-Sub-FRG, we construct the instance (V, w′, α) of Max-Trans-Forest such that w′(u, v) = w(u, v) if (u, v) ∈ E and −∞ otherwise. We need to show that (G = (V, E), w, α) ∈ Max-Sub-FRG iff (V, w′, α) ∈ Max-Trans-Forest. This is trivial: If there is a transitive FRG subgraph of G whose sum of edge weights is ≥ α, then choosing the same edges yields a transitive FRG over V whose sum of edge weights is ≥ α. Similarly, any transitive FRG over V whose sum of edge weights is ≥ α cannot use any −∞ edges; therefore its edges are in E and define a subgraph of G whose sum of edge weights is ≥ α.
Note that for Max-Sub-FRG, maximizing the sum of weights of edges in the subgraph is equivalent to minimizing the sum of weights of edges not in the subgraph. We now denote by z the sum of weights of the edges deleted from the graph.
Next we show a polynomial reduction to Max-Sub-FRG from the NP-hard Exact Cover by 3-Sets (X3C) problem (Garey and Johnson 1979). X3C is defined as follows: Given a set X of size 3n and m subsets S1, S2, .., Sm of X, each of size 3, decide whether there is a collection of n Si's whose union covers X.
X3C ≤p Max-Sub-FRG: Given an instance (X, S) of the X3C problem, we construct an instance (G = (V, E), w, z) of the Max-Sub-FRG problem as follows (an illustration of the construction is given in Figure A.1). First, we construct the vertices V: 3n vertices x1, .., x3n corresponding to the points of X, m vertices s1, .., sm corresponding to the subsets in S, m additional vertices t1, t2, .., tm, and one more vertex a. Next we construct the edges E and the weight function w : E → ℤ+.
1. For each 1 ≤ i ≤ m, we add an edge (ti, si) of weight 2.
2. For each 1 ≤ i ≤ m, we add an edge (a, si) of weight 1.
3. For each 1 ≤ j ≤ 3n, we add an edge (a, xj) of weight M = 4m.
4. For each 1 ≤ i ≤ m, if si = {xp, xq, xr}, we add 3 edges of weight 1: (si, xp), (si, xq), and (si, xr).
The graph constructed given an input X of size 3n and S of size m. Each s ∈ S is a set of size 3. In this example, s1 = {x1, x2, x3}, s2 = {x3, x5, x3n − 1}, sm = {x4, x3n − 1, x3n}.
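To make the construction concrete, the following sketch builds the gadget graph. We use networkx and assume the sets in S are given as triples of indices into X; the tooling and names are illustrative only.

```python
import networkx as nx

def build_gadget(n, S):
    """n: X has 3n points indexed 0..3n-1; S: list of m triples of indices."""
    m = len(S)
    M = 4 * m  # weight of the (a, x_j) edges, too large to delete
    G = nx.DiGraph()
    for i in range(m):
        G.add_edge(f"t{i}", f"s{i}", weight=2)  # edge (t_i, s_i) of weight 2
        G.add_edge("a", f"s{i}", weight=1)      # edge (a, s_i) of weight 1
        for j in S[i]:                          # s_i = {x_p, x_q, x_r}
            G.add_edge(f"s{i}", f"x{j}", weight=1)
    for j in range(3 * n):
        G.add_edge("a", f"x{j}", weight=M)
    return G
```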
⇒: Assume there is an exact 3-cover of X by S. The FRG subgraph consists of: n edges (a, si) for the n vertices si that cover X, the 3n edges (si, xj), and m − n edges (tf, sf) for the m − n vertices sf that do not cover X. The transitive closure contains all edges (a, xi), and the rest of the edges are deleted: for all sf's that are not part of the cover, the three edges (sf, xj) and the edge (a, sf) are deleted. In addition, the weight-2 edges (ti, si) for the si's that cover X are deleted. The total weight deleted is thus 3(m − n) + m − n + 2n = 4m − 2n. It is easy to verify that the subgraph is a transitive FRG: there are no connected components of size > 1, and in the transitive reduction of G there is no node with more than one parent.
⇐: Assume there is no exact 3-cover of X by S; we will show that any FRG subgraph must delete edges of total weight more than 4m − 2n. We cannot omit any edge (a, xi), as the weight of each such edge is too large. Thus all these edges are in the FRG, and each is either deleted during transitive reduction or not.
Assume first that all these edges are deleted in the transitive reduction, that is, for every xj there exists an si such that (a, si) and (si, xj) are in the forest. Since there is no collection of n subsets Si that cover X, there must be at least k > n such si's. Consequently, for these si's, the forest must not contain the edges (ti, si) (otherwise, si would have two parents and violate both the forest and the transitivity properties). For the m − k nodes with no edge (sf, xj) we can either add an edge (a, sf) or (tf, sf), but not both (otherwise, sf would have more than one parent). Since w(tf, sf) > w(a, sf) it is better to delete the (a, sf) edges. Hence, the total weight of deleted edges is 3m − 3n for the edges between si's and xj's, 2k in the edges (ti, si), for si's that cover the xj's, and m − k for the edges (a, sf). Total weight deleted is 4m − 3n + k > 4m − 2n since k > n.
Assume now that r > 0 edges (a, xj) are not deleted in the transitive reduction. This means that for these xj's there is no edge (si, xj) for any i (otherwise, xj would have more than one parent after transitive reduction), and hence 3n − r of the xj's are covered by si's. To cover these xj's we need at least k ≥ (3n − r)/3 = n − r/3 si's. As before, for these si's we also have the edges (a, si) and we delete the edges (ti, si), and for the m − k nodes sf that do not cover any xj it is best to add the edges (tf, sf) and to delete the edges (a, sf). So the weight deleted is 3m − (3n − r) for edges between si and xj, 2k in the edges (ti, si), and m − k for the edges (a, sf). Thus, the weight deleted is 4m − 3n + k + r ≥ 4m − 3n + (n − r/3) + r = 4m − 2n + 2r/3 > 4m − 2n, since r > 0.
Because it is known that X3C is NP-hard, this polynomial reduction shows that Max-Trans-Forest is also NP-hard.
Notes
We assume that pij(1), pij(0) ∈ (0, 1) and thus log p(xij) is well defined.
In standard graph terminology an edge is from a parent to a child. We choose the opposite definition to conflate edge direction with the direction of the entailment operator ‘⇒’.
Readers are referred to their paper for details on mapping of tuples to typed predicates. The mapping of arguments into types can be downloaded from http://www.cs.washington.edu/research/sherlock-hornclauses/.
In more detail, 1,714 graphs have less than 5 nodes, 326 graphs have 5–10 nodes, 145 graphs have 10–20 nodes, 92 graphs have 20–50 nodes, 18 graphs have 50–100 nodes, and 7 graphs have at least 100 nodes.
We do not need to run the classifier on 10,000² pairs of predicates, because for the majority of predicate pairs the features are all zeros, and for this set of pairs we can run the classifier only once.
A weakly connected component in a directed graph is a set of nodes S such that for every pair of nodes u, v ∈ S there is a path from u to v when edge directions are ignored.
We do not explicitly write the X and Y variables for brevity, and thus we focus only on predicates where the X variable is a subject.
References
Author notes
Computer Science Department, Stanford University, Stanford, CA 94305. Web: http://www.stanford.edu/∼joberant.
Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, 6997801, Israel. Web: http://www.tau.ac.il/∼nogaa.
Computer Science Department, Bar-Ilan University, Ramat-Gan, 52900, Israel. Web: http://www.cs.biu.ac.il/∼dagan.
Engineering Department, Bar-Ilan University, Ramat-Gan, 52900, Israel. Web: http://www.eng.biu.ac.il/goldbej.