Abstract
Weighted deduction systems provide a framework for describing parsing algorithms that can be used with a variety of operations for combining the values of partial derivations. For some operations, inside values can be computed efficiently, but outside values cannot. We view outside values as functions from inside values to the total value of all derivations, and we analyze outside computation in terms of function composition. This viewpoint helps explain why efficient outside computation is possible in many settings, despite the lack of a general outside algorithm for semiring operations.
1. Introduction
In weighted deduction systems such as those used for parsing with context-free grammars, the inside–outside algorithm provides an efficient way of finding the total weight of all derivations passing through a specific item. Weighted deduction systems can be used with different semirings, or even more generally, with other classes of functions for computing the values of items bottom–up in the inside pass. In some cases, efficient inside computation is possible, but efficient outside computation is not. How can these cases be characterized?
We give a very general characterization of the conditions for efficient outside computation in terms of function composition, as well as three more specific examples of sufficient conditions. The first of these conditions, commutative semirings, is discussed by Goodman (1999), while we believe the other two, extremal semirings and the sum of linear functions, to be novel formulations. We discuss general superior functions as a case where efficient outside computation is not possible. We conclude that, despite the emphasis in the literature on describing weighted deduction in terms of semirings, semirings are not the best abstraction for describing the requirements of the general inside–outside algorithm.
2. Weighted Deduction
Weighted deduction systems provide a general framework for expressing and reasoning about dynamic programming algorithms, and in particular about parsing algorithms (Shieber, Schabes, and Pereira 1995; Sikkel 1997; Nederhof 2003). The deduction rule for the basic combination step of CYK parsing of a context-free grammar (CFG) is shown in Figure 1(a), where i, j, and k range over positions in a string. The goal item for CFG parsing with start symbol S and sentence length n is [S; 0; n]. In order to simplify our definition of weighted deduction systems, we include the CFG rule S → AB as an antecedent of the rule, although it is sometimes also represented as a “side condition” for the rule, as in Nederhof (2003), in which case the weight w1 of the rule can be incorporated into the function FR. Weighted deduction systems can be used to express other parsing algorithms, including Earley parsing and dependency parsing (Eisner and Satta 1999). Beyond CFG, weighted deduction systems are used for parsing for tree adjoining grammars (Alonso et al. 1999), combinatory categorial grammars (Kuhlmann and Satta 2014), and general linear context-free rewriting systems (Burden and Ljunglöf 2005), as well as for machine translation (Melamed, Satta, and Wellington 2004; Lopez 2009). In all of these applications, a set of general deduction rules is instantiated into a hypergraph for a specific input string. For example, given a sentence of length n, the general rule shown in Figure 1(a) is instantiated into a specific rule for each combination of i, j, k ∈ {0, …, n}. Each instantiated item is a vertex in the hypergraph, and each instantiated rule is a hyperedge from the antecedent vertices to the consequent vertex. The resulting hypergraphs are also known as parse forests. In this article, we will deal exclusively with deduction systems that are already instantiated into hypergraphs.
We will refer to hyperedges simply as edges. We use E to refer to the set of edges (instantiated rules), and |E| to refer to the number of edges. For CYK parsing of a string of length n with a set of CFG productions P, |E| ∈ O(|P|n³). However, our discussion will apply equally to the various other applications of weighted deduction systems just mentioned. To simplify the presentation, we will assume at first that our deduction system does not have cycles, that is, an item cannot appear as the consequent of any derivation in which it also appears as an antecedent. For parsing CFGs, this is true whenever the grammar is in Chomsky Normal Form. We return to discuss systems with cycles in Section 4.
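The cubic growth of |E| can be illustrated with a minimal Python sketch that instantiates the CYK combination rule into hyperedges. The function name `instantiate_cyk` and the tuple encoding of items and productions are hypothetical, chosen only for illustration. The sketch also assumes that only combinations with i < k < j are instantiated, and it omits the CFG rule itself from the antecedents.

```python
def instantiate_cyk(productions, n):
    """Instantiate the CYK combination rule for a sentence of length n.

    productions: list of (parent, left, right) binary rules (hypothetical encoding).
    Returns a list of hyperedges (consequent, (left_antecedent, right_antecedent)),
    where each item is a triple (nonterminal, start, end).
    """
    edges = []
    for parent, left, right in productions:
        for i in range(n + 1):
            for k in range(i + 1, n + 1):
                for j in range(k + 1, n + 1):
                    # [left; i; k] and [right; k; j] combine into [parent; i; j].
                    edges.append(((parent, i, j), ((left, i, k), (right, k, j))))
    return edges


productions = [("S", "A", "B"), ("A", "A", "A")]
for n in (2, 4, 8):
    # The edge count is |P| times the number of triples i < k < j.
    print(n, len(instantiate_cyk(productions, n)))
```

Each production contributes one edge per triple i < k < j, so the number of edges grows cubically in n for a fixed grammar.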
A semiring is a set 𝕂 with operations ⊕ and ⊗ such that:
- ⊕ is associative and commutative, and has an identity element 0,
- ⊗ is associative and has an identity element 1,
- ⊗ distributes over ⊕, and
- for all x ∈ 𝕂, 0 ⊗ x = x ⊗ 0 = 0.
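The semiring axioms listed above can be spot-checked on a finite sample of values. The following Python sketch is illustrative only: the `Semiring` class and the `satisfies_axioms` helper are hypothetical names, and passing the check on a sample does not prove that the axioms hold in general.

```python
from dataclasses import dataclass
from itertools import product
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # generalized addition
    times: Callable[[Any, Any], Any]  # generalized multiplication
    zero: Any                         # identity for plus
    one: Any                          # identity for times


# Sum-product over numbers, and max-product over nonnegative numbers.
SUM_PRODUCT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)
MAX_PRODUCT = Semiring(max, lambda a, b: a * b, 0, 1)


def satisfies_axioms(sr, samples):
    """Check the semiring axioms on all triples drawn from samples."""
    for a, b, c in product(samples, repeat=3):
        checks = (
            sr.plus(a, b) == sr.plus(b, a),
            sr.plus(sr.plus(a, b), c) == sr.plus(a, sr.plus(b, c)),
            sr.plus(a, sr.zero) == a,
            sr.times(sr.times(a, b), c) == sr.times(a, sr.times(b, c)),
            sr.times(a, sr.one) == a and sr.times(sr.one, a) == a,
            sr.times(a, sr.plus(b, c)) == sr.plus(sr.times(a, b), sr.times(a, c)),
            sr.times(sr.plus(b, c), a) == sr.plus(sr.times(b, a), sr.times(c, a)),
            sr.times(sr.zero, a) == sr.zero and sr.times(a, sr.zero) == sr.zero,
        )
        if not all(checks):
            return False
    return True


samples = [0, 1, 2, 3]
print(satisfies_axioms(SUM_PRODUCT, samples))
print(satisfies_axioms(MAX_PRODUCT, samples))
```

Integer samples are used so that the equality tests are exact. With floating-point values, associativity and distributivity can fail by rounding error even for a true semiring.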
In general, one can allow items to have weights of different types, for example, vectors of various dimensions. Dynamic programming is possible as long as each type has a generalized sum operation, and as long as Equation (3) holds for each rule, with the first sum interpreted as the sum operator for the type of the rule’s consequent, and the second sum interpreted as the sum operator for the type of the ith antecedent.
3. Outside Computation
Outside values can be efficiently computed with a top–down or outside pass through the deduction system after first performing a bottom–up pass to compute the inside values for each item.
We depend on the fact that the ⊗ operation is commutative, because we re-order the product V(A1) ⊗ ⋯ ⊗ V(An) by removing V(Ai), in order to later multiply it in from the right in Equation (6).
For non-commutative semirings, the situation is more complex, because one must combine values in the correct order. Goodman (1998, Section 2-C) constructs, from an arbitrary inside semiring, a new semiring for outside computation. The values of this new outside semiring are sets of pairs of values from the inside semiring. Although this approach shows that there is a semiring that can be used for outside computation, Goodman does not give a general, efficient algorithm for computing outside values. The values in the outside semiring may grow exponentially large (because they are sets of pairs), making the general inside–outside algorithm exponential even when operations on the inside semiring are efficient.
We wish to give a general set of conditions under which efficient outside computation is possible, and to specify the general algorithm. Let us first state the problem by giving a precise definition of efficient outside computation. We will use |γ(X)| to indicate the size of the representation (in memory) of γ(X).
Definition 1
Given a weighted deduction system, let g be a function such that, as the size |E| of the system’s instantiated hypergraphs grows, maxX |γ(X)| ∈ O(g(|E|)). Efficient outside computation refers to any algorithm that computes the total weight γ(X) of all items X in time O(|E| g(|E|)).
We include the term g(|E|) in our definition in order to cover situations such as the derivation semiring, where the size required for the goal item G, |γ(G)|, is exponential in |E|, and |γ(G)| provides an upper bound on |γ(X)| for all items X. However, in most cases, and in all the examples discussed in this article, g(|E|) can be treated as a constant. In this case, efficient outside computation is equivalent to time linear in the size of the hypergraph.
Our definition of efficient outside computation does not explicitly require a top–down or outside pass through the deduction system. It is possible in some settings to compute the total weight of an item without an outside pass. For example, in CYK parsing, one can first eliminate all items not consistent with a fixed item denoting a particular pair of nonterminal and span, and one can then compute the total weight of all remaining derivations bottom–up (Pereira and Schabes 1992), as shown in Algorithm 1.
For CYK parsing, |E| ∈ O(n³) with respect to the sentence length n. The outer loop of Algorithm 1 has O(n²) iterations, and each inside pass is O(n³), for a total runtime of O(n⁵). Thus, using this method to compute the total weight for all items in the system takes time greater than O(|E| g(|E|)). Computing the total weight of all items is necessary for the EM algorithm, perhaps the most common use case for outside computation. As another use case, one may also wish to precompute the best derivation passing through each item, using the Viterbi semiring, in order to be able to later look up in constant time the best derivation for any desired item. Our definition of efficient outside computation is chosen so as not to predetermine any specific algorithm, but also to rule out less efficient procedures such as repeated bottom–up computation.
The formulation of Equation (15) leads to a simple general condition for efficient outside computation.
Theorem 1
Let out(X) be the set of items B such that some rule has X as an antecedent and B as a consequent, and define g(|E|) as in Definition 1. Efficient outside computation is possible if a representation of F¬X can be computed with |out(X)| operations of time O(g(|E|)), given F¬B for each B ∈ out(X), and if the representation can be evaluated in time O(g(|E|)).
Proof. Procedure Outside (Algorithm 2) computes F¬X for all items X using time ∑X |out(X)| O(g(|E|)). The sum ∑X |out(X)| is bounded by summing the number of antecedents for each edge in E, so ∑X|out(X)| ∈ O(|E|), yielding total time O(|E| g(|E|)) for Algorithm 2. We then compute γ(X) = F¬X(V(X)) in time O(|E| g(|E|)), satisfying the conditions of Definition 1.
We will give examples of settings that do and that do not meet this general criterion for efficient outside computation.
3.1 Commutative Semirings
For any commutative semiring, the representation of the outside function F¬X(x) consists of the outside value Z(X). If semiring operations take time O(g(|E|)), this value can be computed for all items X in time O(|E| g(|E|)) using Equation (7). The outside function can be evaluated with a single semiring multiplication using Equation (8). Therefore the conditions of Theorem 1 are met, yielding the following corollary:
Corollary 1
Efficient outside computation is possible for any commutative semiring whose operations can be computed in time O(g(|E|)).
In particular, efficient outside computation is possible whenever semiring operations take constant time. The general outside pass of Algorithm 2 takes the following form for commutative semirings.
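As a minimal Python sketch of inside and outside passes for a commutative semiring, consider the code below. It is not a reproduction of the article's algorithms. It assumes the standard commutative form of the recurrences: the outside value of an antecedent accumulates the outside value of the consequent times the rule weight times the inside values of the other antecedents, and the total weight of an item is its outside value times its inside value. The edge encoding `(consequent, antecedents, weight)` and all function names are hypothetical. Edges are assumed to be listed bottom-up, so that every edge deriving an item precedes every edge that uses it.

```python
from collections import defaultdict
import operator


def inside(edges, plus, times, zero):
    """Bottom-up pass: V[item] sums, over incoming edges, weight times antecedent values."""
    V = defaultdict(lambda: zero)
    for head, tails, w in edges:
        val = w
        for t in tails:
            val = times(val, V[t])
        V[head] = plus(V[head], val)
    return V


def outside(edges, V, goal, plus, times, zero, one):
    """Top-down pass over the reversed edge list; relies on times being commutative."""
    Z = defaultdict(lambda: zero)
    Z[goal] = one
    for head, tails, w in reversed(edges):
        for i, t in enumerate(tails):
            val = times(Z[head], w)
            for j, u in enumerate(tails):
                if j != i:  # skip the antecedent whose outside value is being updated
                    val = times(val, V[u])
            Z[t] = plus(Z[t], val)
    return Z


def total_weights(edges, goal, plus, times, zero, one):
    """Total weight of all derivations through each item: outside times inside."""
    V = inside(edges, plus, times, zero)
    Z = outside(edges, V, goal, plus, times, zero, one)
    return {x: times(Z[x], V[x]) for x in V}


# A toy hypergraph with two derivations of the goal; only one of them uses item "b".
edges = [
    ("a", (), 2), ("b", (), 3), ("d", (), 4),
    ("c", ("a", "b"), 1),
    ("c", ("a",), 5),
    ("goal", ("c", "d"), 1),
]
sum_product = total_weights(edges, "goal", operator.add, operator.mul, 0, 1)
max_product = total_weights(edges, "goal", max, operator.mul, 0, 1)
print(sum_product)
print(max_product)
```

Each edge with n antecedents is visited once in each pass, so for rules with a bounded number of antecedents the number of semiring operations is linear in |E|. The inner loop over the other antecedents is the step that depends on commutativity.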
Commutative semirings include the sum-product semiring used for finding the total probability of all parses of a string, as well as the max-product and max-sum (Viterbi) semirings used for finding the score of the best parse. Other examples include: the k-best semiring used to find the scores for the k best parses (Mohri 2002), the expectation semiring used to compute expected feature values for EM or for training log-linear models (Eisner 2002), the variance semiring used in minimum risk training of log-linear models (Li and Eisner 2009), the entropy semiring used to compute the entropy of the distribution over parses (Hwa 2004; Cortes et al. 2006), the generalized entropy semiring used to compute the relative entropy between two grammars (Cohen, Simmons, and Smith 2011), and the k-best + residual semiring used to find the k best scores and total score simultaneously (Gimpel and Smith 2009). Gimpel and Smith (2009) also define “generalized” semirings for approximate inference that do not meet all the criteria that define a semiring, but that have a commutative ⊗ operator and thus admit outside computation with Algorithm 3.
3.2 Extremal Semirings
A semiring is extremal if for all a, b, either a ⊕ b = a or a ⊕ b = b (Vorobev 1963). The max-product semiring is extremal, as is any semiring over real numbers having max as the generalized addition operator. An extremal semiring is always idempotent, meaning that a ⊕ a = a.
Another example of an extremal semiring is the Viterbi derivation semiring of Goodman (1999). Values in this semiring consist of a pair whose first item is a real number, and whose second item is a record of a partial derivation. This semiring is used to find a maximum scoring derivation, rather than merely computing the maximum score as a real number. The record of the partial derivation can be implemented with backpointers; this semiring is a mathematical formalization of the standard use of backpointers in dynamic programming algorithms. The semiring operation a ⊕ b returns whichever of a and b has the higher value as the first element (score) of the pair. The operation a ⊗ b multiplies the scores of a and b and concatenates the derivations into a new derivation. This semiring is non-commutative, because the concatenation in the ⊗ operator is non-commutative. The Viterbi derivation semiring is extremal.¹
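The two operations of the Viterbi derivation semiring can be sketched in Python as follows. This is an illustrative encoding rather than Goodman's definition: values are assumed to be pairs `(score, derivation)` with the derivation stored as a tuple of rule names, and the names `vd_plus`, `vd_times`, `VD_ZERO`, and `VD_ONE` are hypothetical. Ties in score are broken by lexicographic order on the derivation, which Python's tuple comparison provides.

```python
def vd_plus(a, b):
    """Return whichever pair has the higher score (ties: larger derivation tuple)."""
    return max(a, b)  # tuples compare by score first, then by derivation


def vd_times(a, b):
    """Multiply the scores and concatenate the derivations, in order."""
    return (a[0] * b[0], a[1] + b[1])


VD_ZERO = (0.0, ())  # identity for vd_plus when all scores are nonnegative
VD_ONE = (1.0, ())   # identity for vd_times

a = (0.5, ("r1",))
b = (0.25, ("r2",))

print(vd_plus(a, b))   # always one of its two arguments, so the semiring is extremal
print(vd_times(a, b))  # derivation ("r1", "r2")
print(vd_times(b, a))  # derivation ("r2", "r1"): same score, different value
```

Because `vd_times(a, b)` and `vd_times(b, a)` differ in their derivation component, the commutative outside recurrence cannot be applied directly to this semiring.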
The representation of F¬X(x) that we have derived results in the following corollary of Theorem 1.
Corollary 2
Efficient outside computation is possible for any extremal semiring whose operations can be computed in time O(g(|E|)).
3.3 Sum of Linear Functions
As an example of a setting where efficient outside computation is possible even though the inside functions are not semiring operations, we consider the case of vectors as item weights. Components of these vectors correspond to latent variables or refined nonterminals in the latent variable parsing models of Matsuzaki, Miyao, and Tsujii (2005), Petrov et al. (2006), and Cohen et al. (2012).
The computation of the matrix Z(¬X) takes time constant in |E|, giving the following corollary of Theorem 1.
Corollary 3
Efficient outside computation is possible for any inside function consisting of a sum of linear functions.
This example does not fall into the semiring framework. The inside function cannot be expressed as a semiring product because the rule tensor and the vectors for inside values do not have the same type. It is also possible to allow different items to have inside values consisting of vectors of different dimensionality, which therefore do not belong to a single semiring. Thus, the sum of linear functions provides a case where efficient outside computation is possible, despite the fact that the inside functions are not semiring operations, much less commutative or extremal semirings.
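The following Python sketch illustrates the idea for binary rules with vector-valued items. It is written under stated assumptions and is not the article's formulation. Each rule is assumed to carry a tensor T, with the consequent's inside vector given by V(B)[b] = Σ T[b][l][r] · V(left)[l] · V(right)[r], summed over incoming edges. Each outside function is represented by a matrix Z such that the total value is Z · V(X). The edge encoding and all names are hypothetical. Edges are assumed to be listed bottom-up, and every item is assumed to be reachable from the goal.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def matadd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def inside(leaves, edges):
    """Bottom-up pass; leaves maps axiom items to their inside vectors."""
    V = {item: list(vec) for item, vec in leaves.items()}
    for head, (left, right), T in edges:
        contrib = [sum(T[b][l][r] * V[left][l] * V[right][r]
                       for l in range(len(V[left])) for r in range(len(V[right])))
                   for b in range(len(T))]
        V[head] = [u + v for u, v in zip(V.get(head, [0] * len(T)), contrib)]
    return V


def outside(edges, V, goal):
    """Top-down pass; Z[X] is the matrix of the linear map from V(X) to the goal value."""
    d = len(V[goal])
    Z = {goal: [[int(i == j) for j in range(d)] for i in range(d)]}
    for head, (left, right), T in reversed(edges):
        # With one antecedent held fixed, the rule is linear in the other one.
        M_left = [[sum(T[b][l][r] * V[right][r] for r in range(len(V[right])))
                   for l in range(len(V[left]))] for b in range(len(T))]
        M_right = [[sum(T[b][l][r] * V[left][l] for l in range(len(V[left])))
                    for r in range(len(V[right]))] for b in range(len(T))]
        for item, M in ((left, M_left), (right, M_right)):
            step = matmul(Z[head], M)  # composition of linear maps
            Z[item] = matadd(Z[item], step) if item in Z else step
    return Z


# Items of different dimensionality: A, B, C are 2-dimensional; D and G are 1-dimensional.
leaves = {"A": [1, 2], "B": [3, 4], "D": [5]}
T1 = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]  # C[0] = A[0]*B[0], C[1] = A[1]*B[1]
T2 = [[[1], [1]]]                          # G[0] = (C[0] + C[1]) * D[0]
edges = [("C", ("A", "B"), T1), ("G", ("C", "D"), T2)]

V = inside(leaves, edges)
Z = outside(edges, V, "G")
gamma = {x: matvec(Z[x], V[x]) for x in V}
print(V["G"], gamma)
```

Composing two outside functions reduces to one matrix product, whose cost depends on the vector dimensions but not on |E|. In this toy hypergraph every item lies on the single derivation, so every total equals the goal value.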
Matrix multiplication.
The operations of matrix addition and matrix multiplication over d × d matrices of real numbers form a semiring in which efficient inside computation is possible. However, this semiring is non-commutative, and is also not extremal. Nevertheless, because the outside functions are linear, the sum of linear functions technique allows efficient outside computation.
We emphasize that, although standard implementations of matrix multiplication are O(d³) time in the matrix dimension, the time is constant with respect to the size of the hypergraph |E|. Thus the function g(|E|) in the statement of Theorem 1 is constant, and efficient outside computation is equivalent to time linear in |E|.
3.4 Superior Functions
We now give an example where efficient outside computation is not possible.
This implies that the conditions for efficient outside computation neither subsume nor are subsumed by the conditions for best-first search, as summarized in Table 1.
| | Efficient Inside Possible | Best-first Possible |
|---|---|---|
| Efficient Outside Possible | commutative semirings; sum of linear functions | extremal semirings |
| Efficient Outside Not Possible | | general superior functions |
4. Cycles
For commutative semirings, computing outside values once inside values are known involves solving a similar set of equations. The outside equations are always linear, because they have only one outside value on the right-hand side. For extremal semirings, derivations with cycles can always be discarded, as they have weight less than the same derivation with the cycle removed, assuming that the inside value is well-defined. For the sum of linear functions, outside values can again be computed by solving a set of linear equations.
To summarize, for all cases discussed in this article where efficient outside computation is possible, outside computation with cycles is no more difficult than inside computation with cycles.
5. Conclusion
This article has aimed to provide a deeper understanding of the conditions under which efficient outside computation is possible by making three observations.
First, we give a very general condition for efficient outside computation stated in terms of function composition. Despite the emphasis in the literature on describing weighted deduction in terms of semirings, our general condition does not apply to all semirings, and can apply in situations that do not fall into the semiring framework.
Second, we identify a few more specific situations in which efficient outside computation is possible. Extremal semirings help explain why efficient outside computation is possible for the specific non-commutative semirings described by Goodman (1999), despite the fact that the general outside algorithm given by Goodman is not efficient. The sum of linear functions is a setting that is not a semiring but does allow efficient outside computation.
Third, we show that the conditions for efficient outside computation are incomparable to the conditions for efficient best-first search.
The bottom left cell of Table 1 is empty. It is an interesting open problem to consider whether this is an accident, which is to say, whether efficient outside computation is possible for all semirings. Resolving this problem would require either providing a general efficient algorithm that applies to all semirings, or providing a counterexample by means of a semiring such that outside computation can be used to solve a problem that is NP-complete or otherwise considered to be intractable.
Acknowledgments
We are grateful for feedback from Giorgio Satta, Daniel Štefankovič, Parker Riley, Shay Cohen, Esma Balkır, and the anonymous reviewers. This work was supported by National Science Foundation award IIS-1813823.
Note
1. In order to break ties between derivations with the same score, one can use an arbitrary ordering over the partial derivations, for example, lexicographic order.