## Abstract

We give a general framework for inference in spanning tree models. We propose
unified algorithms for the important cases of first-order expectations and
second-order expectations in edge-factored, non-projective spanning-tree models.
Our algorithms exploit a fundamental connection between gradients and
expectations, which allows us to derive efficient algorithms. These algorithms
are easy to implement with or without automatic differentiation software. We
motivate the development of our framework with several *cautionary
tales* of previous research, which has developed numerous
inefficient algorithms for computing expectations and their gradients. We
demonstrate how our framework efficiently computes several quantities with known
algorithms, including the expected attachment score, entropy, and generalized
expectation criteria. As a bonus, we give algorithms for quantities that are
missing in the literature, including the KL divergence. In all cases, our
approach matches the efficiency of existing algorithms and, in several cases,
reduces the runtime complexity by a factor of the sentence length. We validate
the implementation of our framework through runtime experiments. We find our
algorithms are up to 15 and 9 times faster than previous algorithms for
computing the Shannon entropy and the gradient of the generalized expectation
objective, respectively.

## 1 Introduction

Dependency trees are a fundamental combinatorial structure in natural language processing. It follows that probability models over dependency trees are an important object of study. In terms of graph theory, one can view a (non-projective) dependency tree as an arborescence (commonly known as a spanning tree) of a graph. To build a dependency parser, we define a graph where the nodes are the tokens of the sentence, and the edges are possible dependency relations between the tokens. The edge weights are defined by a model, which is learned from data. In this paper, we focus on edge-factored models where the probability of a dependency tree is proportional to the product of the weights of its edges. As there are exponentially many trees in the length of the sentence, we require clever algorithms for finding the normalization constant. Fortunately, the normalization constant for edge-factored models is efficient to compute via the celebrated matrix–tree theorem.

The matrix–tree theorem (Kirchhoff, 1847)— more specifically, its counterpart for directed graphs (Tutte, 1984)—appeared before the NLP community in an onslaught of contemporaneous papers (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007) that leverage the classic result to efficiently compute the normalization constant of a distribution over trees. The result is still used in more recent work (Ma and Hovy, 2017; Liu and Lapata, 2018). We build upon this tradition through a framework for computing expectations of a rich family of functions under a distribution over trees. Expectations appear in all aspects of the probabilistic modeling process: training, model validation, and prediction. Therefore, developing such a framework is key to accelerating progress in probabilistic modeling of trees.

Our framework is motivated by the lack of a unified approach for computing
expectations over spanning trees in the literature. We believe this gap has resulted
in the publication of numerous inefficient algorithms. We motivate the importance of
developing such a framework by highlighting the following *cautionary
tales*.

- •
McDonald and Satta (2007) proposed an inefficient $O(N^5)$ algorithm for computing feature expectations, which was much slower than the $O(N^3)$ algorithm obtained by Koo et al. (2007) and Smith and Smith (2007). The authors subsequently revised their paper.

- •
Smith and Eisner (2007) proposed an $O(N^4)$ algorithm for computing entropy. Later, Martins et al. (2010) gave an $O(N^3)$ method for entropy, but not its gradient. Our framework recovers Martins et al.’s (2010) algorithm, and additionally provides the gradient of entropy in $O(N^3)$.

- •
Druck et al. (2009) proposed an $O(N^5)$ algorithm for evaluating the gradient of the generalized expectation (GE) criterion (McCallum et al., 2007). The runtime bottleneck of their approach is the evaluation of a covariance matrix, which Druck and Smith (2009) later improved to $O(N^4)$. We show that the gradient of the GE criterion can be evaluated in $O(N^3)$.

We summarize our main results below:

- •
**Unified Framework**: We develop an algorithmic framework for calculating expectations over spanning arborescences. We give precise mathematical assumptions on the types of functions that are supported. We provide efficient algorithms that piggyback on automatic differentiation techniques, as our framework is rooted in a deep connection between expectations and gradients (Darwiche, 2003; Li and Eisner, 2009).

- •
**Improvements to existing approaches**: We give asymptotically faster algorithms where several prior algorithms were known.

- •
**Efficient algorithms for new quantities**: We demonstrate how our framework calculates several new quantities, such as the Kullback–Leibler divergence, which (to our knowledge) had no prior algorithm in the literature.

- •
**Practicality**: We present practical speed-ups in the calculation of entropy compared to Smith and Eisner (2007). We observe speed-ups between 4.1 and 15.1 times on five languages, depending on the typical sentence length. We also demonstrate a 9-times speed-up for evaluating the gradient of the GE objective compared to Druck and Smith (2009).

- •
**Simplicity**: Our algorithms are simple to implement—requiring only a few lines of PyTorch code (Paszke et al., 2019). We have released a reference implementation at the following URL: https://github.com/rycolab/tree_expectations.

## 2 Distributions over Trees

We consider distributions over trees in weighted directed graphs with a designated root node. A (rooted, weighted, and directed) **graph** is given by $G = (\mathcal{N}, E, \rho)$. $\mathcal{N} = \{1, \dots, N\} \cup \{\rho\}$ is a set of *N* + 1 nodes where *ρ* is a designated root node. $E$ is a set of weighted edges where each edge $(i \xrightarrow{w_{ij}} j) \in E$ is a pair of *distinct* nodes such that the source node $i \in \mathcal{N}$ points to a destination node $j \in \mathcal{N}$ with an edge weight $w_{ij} \in \mathbb{R}$. We assume—without loss of generality—that the root node *ρ* has no incoming edges. Furthermore, we assume only one edge can exist between two nodes. We consider the multi-graph case in §2.2.

In natural language processing applications, these weights are typically parametric functions, such as log-linear models (McDonald et al., 2005b) or neural networks (Dozat and Manning, 2017; Ma and Hovy, 2017), which are learned from data.

A **tree**^{1} *d* of a graph $G$ is a set of *N* edges such that every non-root node *j* has exactly one incoming edge and the root node *ρ* has at least one outgoing edge. Furthermore, a tree does not contain any cycles. We denote the set of all trees in a graph by $\mathcal{D}$ and assume that $|\mathcal{D}| > 0$ (this is not necessarily true for all graphs).

The **weight of a tree** $d \in \mathcal{D}$ is defined as:

$$w(d) \overset{\mathrm{def}}{=} \prod_{(i \to j) \in d} w_{ij}$$

These weights give rise to a **probability distribution** over trees:

$$p(d) \overset{\mathrm{def}}{=} \frac{w(d)}{Z}$$

where the **normalization constant** is defined as

$$Z \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} w(d)$$

For *p* to be a *proper* distribution, we require that $w_{ij} \geq 0$ for all $(i \to j) \in E$, and that Z > 0.

### 2.1 The Matrix–Tree Theorem

Naïvely, computing Z requires a summation over the exponentially many trees in $\mathcal{D}$. Fortunately, there is sufficient structure in the computation of Z that it can be evaluated in $O(N^3)$ time. The Matrix–Tree Theorem (MTT) (Tutte, 1984; Kirchhoff, 1847) establishes a connection between Z and the determinant of the **Laplacian matrix**, $\mathbf{L} \in \mathbb{R}^{N \times N}$. For all $i, j \in \mathcal{N} \setminus \{\rho\}$,

$$L_{ij} \overset{\mathrm{def}}{=} \begin{cases} \displaystyle\sum_{i' \in \mathcal{N} \setminus \{j\}} w_{i'j} & \text{if } i = j \\ -w_{ij} & \text{otherwise} \end{cases}$$

**Theorem 1.** *For any graph, $Z = |\mathbf{L}|$. Furthermore, the normalization constant can be computed in $O(N^3)$ time.*^{2}
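Theorem 1 can be checked numerically on a small graph. The sketch below is our own minimal NumPy illustration (not the paper's released PyTorch implementation): it brute-forces every tree of a complete graph with three non-root tokens (node `0` plays the role of *ρ*; the helper names and random weights are ours) and compares $\sum_{d} w(d)$ against $|\mathbf{L}|$:

```python
import itertools

import numpy as np

N = 3                                       # non-root nodes 1..N; node 0 plays the role of rho
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # W[i, j] = w_ij; entries into the root are unused

def spanning_trees(N):
    """Brute-force enumeration of all (multi-root) trees rooted at node 0."""
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False            # following heads found a cycle
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    """Directed (Tutte) Laplacian restricted to the non-root rows and columns."""
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]  # diagonal: total incoming weight of j
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

Z_brute = sum(np.prod([W[e] for e in d]) for d in spanning_trees(N))
Z_mtt = np.linalg.det(laplacian(W))
assert np.isclose(Z_brute, Z_mtt)
```

With unit weights the determinant counts the trees: for four nodes there are $4^{4-2} = 16$ arborescences, which the enumeration reproduces.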

### 2.2 Dependency Parsing and the Laplacian Zoo

Graph-based dependency parsing can be encoded as follows. For each sentence of length *N*, we create a graph $G = (\mathcal{N}, E, \rho)$ where each non-root node represents a token of the sentence, and *ρ* represents a special root symbol of the sentence. Each edge $(i \to j)$ in the graph represents a *possible* dependency relation between head word *i* and modifier word *j*. Fig. 1 gives an example dependency tree. In the remainder of this section, we give several variations on the Laplacian matrix that encode different sets of valid trees.^{3}

In many cases of dependency parsing, we want *ρ* to have
exactly one outgoing edge. This is motivated by linguistic theory, where the
root of a sentence should be a token in the sentence rather than a special root
symbol (Tesnière, 1959). There
are exceptions to this, such as parsing Twitter (Kong et al., 2014) and parsing specific languages (e.g., The
Prague Treebank [Bejček et al., 2013]). We call these **multi-root trees**,^{4} and they are represented by the set $\mathcal{D}$ described earlier. Therefore, the normalization constant over all multi-root trees can be computed by a direct application of Theorem 1.

When we require *ρ* to have exactly one outgoing edge, we obtain the set of **single-rooted trees**, denoted $\mathcal{D}^{(1)}$. Koo et al. (2007) adapt Theorem 1 to efficiently compute Z for the set $\mathcal{D}^{(1)}$ with the **root-weighted Laplacian**,^{5} $\hat{\mathbf{L}} \in \mathbb{R}^{N \times N}$:

$$\hat{L}_{ij} \overset{\mathrm{def}}{=} \begin{cases} w_{\rho j} & \text{if } i = 1 \\[4pt] \displaystyle\sum_{i' \in \mathcal{N} \setminus \{j, \rho\}} w_{i'j} & \text{if } i = j \\[4pt] -w_{ij} & \text{otherwise} \end{cases}$$

**Proposition 1** (Koo et al., 2007, Prop. 1). *For any graph, the normalization constant over all single-rooted trees is given by the determinant of the root-weighted Laplacian, $Z^{(1)} = |\hat{\mathbf{L}}|$. Furthermore, the normalization constant for single-rooted trees can be computed in $O(N^3)$ time.*
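The root-weighted construction can also be verified by brute force. In the sketch below (our own NumPy scaffolding; node `0` stands for *ρ* and the weights are random), the first row of the Laplacian is replaced by the root-edge weights, and the determinant is compared against the total weight of only those trees with exactly one root edge:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # W[i, j] = w_ij; node 0 is the root rho

def spanning_trees(N):
    """Brute-force enumeration of all (multi-root) trees rooted at node 0."""
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

def root_weighted_laplacian(W):
    W0 = W.copy()
    W0[0, :] = 0.0                  # exclude root edges from the Laplacian proper...
    Lhat = laplacian(W0)
    Lhat[0, :] = W[0, 1:]           # ...and place them in the first row instead
    return Lhat

Z1_brute = sum(np.prod([W[e] for e in d])
               for d in spanning_trees(N)
               if sum(1 for (i, _) in d if i == 0) == 1)   # exactly one root edge
Z1_mtt = np.linalg.det(root_weighted_laplacian(W))
assert np.isclose(Z1_brute, Z1_mtt)
```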

#### Labeled Trees.

To encode *labeled* dependency relations in our set of trees, we simply augment edges with labels—resulting in a **multi-graph** in which multiple edges may exist between pairs of nodes. Now, edges take the form $(i \xrightarrow{w_{ijy} / y} j)$ where *i* and *j* are the source and destination nodes as before, $y \in \mathcal{Y}$ is the label, and $w_{ijy}$ is the edge's weight.

**Proposition 2.** *For any multi-graph, the normalization constant for multi-root or single-rooted trees can be calculated using Theorem 1 or Proposition 1 (respectively) with the edge weights*

$$w_{ij} \overset{\mathrm{def}}{=} \sum_{y \in \mathcal{Y}} w_{ijy}$$

*Furthermore, the normalization constant can be computed in $O(N^3 + |\mathcal{Y}| N^2)$ time.*^{6}

#### Summary.

We give common settings in which the MTT can be adapted to efficiently compute Z for different sets of trees. The choice depends upon the task of interest, and one must be careful to choose the correct Laplacian configuration. The results we present in this paper are modular in the specific choice of Laplacian. For the remainder of this paper, we assume the unlabeled tree setting and will refer to the set of trees simply as $\mathcal{D}$ and to our choice of Laplacian as **L**.

## 3 Expectations

In this section, we characterize the family of expectations that our framework
supports. Our framework is an extension of Li and Eisner (2009) to distributions over spanning trees. Their framework considers expectations over distributions that can be factored as B-hypergraphs (Gallo et al., 1993); in contrast, our distributions over trees cannot be cast as polynomial-size B-hypergraphs. Another important distinction between our framework and that of Li and Eisner (2009) is that we do not use the semiring abstraction, as it is algebraically too weak to compute the determinant efficiently.^{7}

The **expected value** of a function $f: \mathcal{D} \mapsto \mathbb{R}^F$ is defined as follows:

$$\mathbb{E}_{d \sim p}\left[f(d)\right] \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} p(d)\, f(d) \quad (9)$$

For general *f*, computing (9) is intractable.^{8} In the remainder of this section, we will characterize a class of functions *f* whose expectations can be efficiently computed.

First-order expectations are expectations of functions that are **additively decomposable** along the edges of the tree. Formally, a function $r: \mathcal{D} \mapsto \mathbb{R}^R$ is additively decomposable if it can be written as

$$r(d) = \sum_{(i \to j) \in d} r_{ij}$$

where we refer to $r_{ij}$ as a vector of edge values. An example of an additively decomposable function is $r(d) = -\log p(d)$, whose expectation gives the Shannon entropy.^{9} Other first-order expectations include the expected attachment score and the Kullback–Leibler divergence. We demonstrate how to compute these in our framework in §6.1 and §6.3, respectively.

Second-order expectations are expectations of functions that are **second-order additively decomposable** along the edges of the tree. Formally, a function $t: \mathcal{D} \mapsto \mathbb{R}^{R \times S}$ is second-order additively decomposable if it can be written as the outer product of two additively decomposable functions, $r: \mathcal{D} \mapsto \mathbb{R}^R$ and $s: \mathcal{D} \mapsto \mathbb{R}^S$:

$$t(d) = r(d)\, s(d)^{\top}$$

Note that $t(d) \in \mathbb{R}^{R \times S}$ is generally a matrix.

An example of such a function is the gradient of entropy (see §6.2) or of the GE objective (McCallum et al., 2007) (see §6.4) with respect to the edge weights. Another example of a second-order additively decomposable function is the covariance matrix. Given two feature functions $r: \mathcal{D} \mapsto \mathbb{R}^R$ and $s: \mathcal{D} \mapsto \mathbb{R}^S$, their covariance matrix is $\mathbb{E}_d\left[r(d)\, s(d)^{\top}\right] - \mathbb{E}_d\left[r(d)\right] \mathbb{E}_d\left[s(d)\right]^{\top}$. Thus, it is a second-order additively decomposable function as long as *r*(*d*) and *s*(*d*) are additively decomposable.

Lastly, we consider functions that are **multiplicatively decomposable** over the edges. A function $q: \mathcal{D} \mapsto \mathbb{R}^Q$ is multiplicatively decomposable if it can be written as

$$q(d) = \bigodot_{(i \to j) \in d} q_{ij}$$

where the product over the $q_{ij}$ is an element-wise vector product. These functions form a family that we will call zeroth-order expectations, and they can be computed with a constant number of calls to the MTT (usually two or three). Examples of these include the Rényi entropy and $\ell_p$-norms.^{10}
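The zeroth-order recipe is short enough to demonstrate directly. The sketch below (our own NumPy scaffolding with a toy three-token graph; helper names and weights are ours) computes the Rényi entropy of order α = 2 from two determinants, $Z$ and $Z^{(\alpha)}$, where $Z^{(\alpha)}$ is obtained by running the MTT on the element-wise powered weights:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # W[i, j] = w_ij; node 0 is the root rho

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

alpha = 2.0
Z = np.linalg.det(laplacian(W))
Z_alpha = np.linalg.det(laplacian(W ** alpha))    # MTT on the powered weights gives sum_d w(d)^alpha
H_alpha = np.log(Z_alpha / Z ** alpha) / (1.0 - alpha)

# brute-force check against the definition of Renyi entropy
ws = np.array([np.prod([W[e] for e in d]) for d in spanning_trees(N)])
p = ws / ws.sum()
H_brute = np.log(np.sum(p ** alpha)) / (1.0 - alpha)
assert np.isclose(H_alpha, H_brute)
```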

## 4 Connecting Gradients and Expectations

In this section, we build upon a fundamental connection between gradients and
expectations (Darwiche, 2003; Li and Eisner, 2009). This connection allows us to
build on work in automatic differentiation to obtain efficient gradient algorithms.
While the propositions in this section are inspired by past work, we believe they have not previously been presented this clearly.^{11} We find it convenient to work with unnormalized expectations, or totals (for short). We denote the **total** of a function *f* by $\bar{f} \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} w(d)\, f(d)$. We recover the expectation with $\mathbb{E}_p\left[f\right] = \bar{f} / Z$. We note that totals (on their own) may be of interest in some applications (Vieira and Eisner, 2017, Section 5.3).

### The First-Order Case.

We define the **total weight** of the trees that include the edge $(i \to j)$:^{12}

$$\bar{Z}_{ij} \overset{\mathrm{def}}{=} \sum_{\substack{d \in \mathcal{D} \\ (i \to j) \in d}} w(d)$$

**Proposition 3.** *For any edge* $(i \to j)$,

$$\frac{\partial Z}{\partial w_{ij}} = \frac{\bar{Z}_{ij}}{w_{ij}}$$

*Proof.* Z is a multilinear polynomial in the edge weights: each tree *d* containing $(i \to j)$ contributes a term $w(d)$ in which the factor $w_{ij}$ appears exactly once, and all other trees are constant in $w_{ij}$. ∎

Proposition 4 establishes a connection between the unnormalized expectation $\bar{r}$ and ∇Z.

**Proposition 4.** *For any additively decomposable function* $r: \mathcal{D} \mapsto \mathbb{R}^R$, *the total* $\bar{r}$ *can be computed using a gradient–vector product:*

$$\bar{r} = \sum_{(i \to j) \in E} \frac{\partial Z}{\partial w_{ij}}\, w_{ij}\, r_{ij} \quad (11)$$

*Proof.* Exchange the order of summation: $\bar{r} = \sum_{d \in \mathcal{D}} w(d) \sum_{(i \to j) \in d} r_{ij} = \sum_{(i \to j) \in E} \bar{Z}_{ij}\, r_{ij}$, and apply Proposition 3. ∎
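Proposition 4 can be checked numerically. In the sketch below (our own NumPy scaffolding on a toy graph; node `0` is the root and the edge values are random), $\partial Z / \partial w_{ij}$ is obtained by a forward difference, which is *exact* here because Z is linear in each individual edge weight, and the resulting gradient–vector product is compared against the brute-force total $\bar{r}$:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))    # W[i, j] = w_ij; node 0 is the root rho
r = rng.uniform(-1.0, 1.0, (N + 1, N + 1))   # arbitrary edge values r_ij (scalar case, R = 1)

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

def Z_of(M):
    return np.linalg.det(laplacian(M))

Z = Z_of(W)
# brute-force total: r_bar = sum_d w(d) r(d)
r_bar_brute = sum(np.prod([W[e] for e in d]) * sum(r[e] for e in d)
                  for d in spanning_trees(N))
# gradient route (Proposition 4): r_bar = sum_ij (dZ/dw_ij) w_ij r_ij
r_bar = 0.0
for j in range(1, N + 1):
    for i in range(N + 1):
        if i != j:
            Wp = W.copy()
            Wp[i, j] += 1.0
            dZ = Z_of(Wp) - Z            # exact dZ/dw_ij, since Z is linear in w_ij
            r_bar += dZ * W[i, j] * r[i, j]
assert np.isclose(r_bar, r_bar_brute)
```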

### The Second-Order Case.

We define the **total weight** of the trees that include both edges $(i \to j)$ and $(k \to l)$, denoted $\bar{Z}_{ij,kl}$, analogously to $\bar{Z}_{ij}$.^{13}

**Proposition 5.** *For any pair of edges* $(i \to j)$ *and* $(k \to l)$ *such that* $(i \to j) \neq (k \to l)$,

$$\frac{\partial^2 Z}{\partial w_{ij}\, \partial w_{kl}} = \frac{\bar{Z}_{ij,kl}}{w_{ij}\, w_{kl}}$$

*Proof.* Analogous to Proposition 3: the trees containing both edges are precisely those contributing a term with the factor $w_{ij}\, w_{kl}$ to Z. ∎

Proposition 6 will relate ∇^{2} Z to $\nabla \bar{r}$. This will be used in Proposition 7 to establish a connection between the total $\bar{t}$ and ∇^{2} Z, and additionally establishes a connection between $\bar{t}$ and $\nabla \bar{r}$.

**Proposition 6.** *For any additively decomposable function* $r: \mathcal{D} \mapsto \mathbb{R}^R$ *that does not depend on* w*,*^{14} *and any edge* $(i \to j) \in E$,

$$\frac{\partial \bar{r}}{\partial w_{ij}} = \sum_{(k \to l) \in E} \frac{\partial^2 Z}{\partial w_{ij}\, \partial w_{kl}}\, w_{kl}\, r_{kl} + \frac{\partial Z}{\partial w_{ij}}\, r_{ij}$$

*Proof.* Differentiate (11) with respect to $w_{ij}$ and apply the product rule; the term $\frac{\partial^2 Z}{\partial w_{ij}^2}\, w_{ij}\, r_{ij}$ vanishes because $\frac{\partial^2 Z}{\partial w_{ij}^2} = 0$. ∎

**Proposition 7.** *For any second-order additively decomposable function* $t: \mathcal{D} \mapsto \mathbb{R}^{R \times S}$, *which is expressed as the outer product of additively decomposable functions* $r: \mathcal{D} \mapsto \mathbb{R}^R$ *and* $s: \mathcal{D} \mapsto \mathbb{R}^S$, $t(d) = r(d)\, s(d)^{\top}$, *where r does not depend on w, the total* $\bar{t}$ *can be computed using a Jacobian–matrix product:*

$$\bar{t} = \sum_{(i \to j) \in E} \frac{\partial \bar{r}}{\partial w_{ij}}\, w_{ij}\, s_{ij}^{\top} \quad (19)$$

*Proof.* By Proposition 6 and Propositions 3 and 5, $\frac{\partial \bar{r}}{\partial w_{ij}}\, w_{ij} = \sum_{d \ni (i \to j)} w(d)\, r(d)$. Therefore, $\sum_{(i \to j) \in E} \frac{\partial \bar{r}}{\partial w_{ij}}\, w_{ij}\, s_{ij}^{\top} = \sum_{d \in \mathcal{D}} w(d)\, r(d) \sum_{(i \to j) \in d} s_{ij}^{\top} = \bar{t}$. ∎
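Proposition 7 can likewise be verified on a toy graph. In the sketch below (our own NumPy scaffolding; scalar *r* and *s* with random edge values, both independent of *w*), $\partial \bar{r} / \partial w_{ij}$ is again obtained by an exact forward difference of the brute-force total, and the resulting Jacobian–matrix product is compared against the brute-force $\bar{t}$:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))    # W[i, j] = w_ij; node 0 is the root rho
r = rng.uniform(-1.0, 1.0, (N + 1, N + 1))   # edge values of r (scalar, independent of w)
s = rng.uniform(-1.0, 1.0, (N + 1, N + 1))   # edge values of s (scalar)

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def r_bar_of(M):
    """Total r_bar as a function of the edge weights, by brute force."""
    return sum(np.prod([M[e] for e in d]) * sum(r[e] for e in d)
               for d in spanning_trees(N))

# brute-force second-order total: t_bar = sum_d w(d) r(d) s(d)
t_bar_brute = sum(np.prod([W[e] for e in d])
                  * sum(r[e] for e in d) * sum(s[e] for e in d)
                  for d in spanning_trees(N))

# Proposition 7: t_bar = sum_ij (d r_bar / d w_ij) w_ij s_ij.
# r_bar is linear in each w_ij (r is fixed), so a forward difference is exact.
base = r_bar_of(W)
t_bar = 0.0
for j in range(1, N + 1):
    for i in range(N + 1):
        if i != j:
            Wp = W.copy()
            Wp[i, j] += 1.0
            t_bar += (r_bar_of(Wp) - base) * W[i, j] * s[i, j]
assert np.isclose(t_bar, t_bar_brute)
```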

### Remark.

Observe that (19) yields the Jacobian $\nabla \bar{r}$ itself under a suitable choice of *s*, one coordinate *n* = 1,…,*R* at a time. First, some notation; let $\mathbb{1}_{i \to j}$ be a vector over $E$ with a 1 in dimension $(i \to j)$, and zeros elsewhere. By plugging $\left[r_{ij}\right]_n$ and $s_{ij} = \frac{1}{w_{ij}} \mathbb{1}_{i \to j}$ into (19), we can compute $\bar{t}_n = \nabla \bar{r}_n$.^{15} However, if *r* depends on *w*, we must add the following first-order term, which is due to the product rule:

$$\nabla \bar{r} = \bar{t} + \overline{\nabla r} \quad (21)$$

where $\overline{\nabla r} = \sum_{d \in \mathcal{D}} w(d)\, \nabla r(d)$ is itself a first-order total with edge values $\nabla r_{ij}$.

## 5 Algorithms

Having reduced the computation of $\bar{r}$ and $\bar{t}$ to finding derivatives of Z in §4, we now describe efficient algorithms that exploit this connection. The main algorithmic ideas used in this section are based on automatic differentiation (AD) techniques (Griewank and Walther, 2008). These are general-purpose techniques for efficiently evaluating gradients given algorithms that evaluate the functions. In our setting, the algorithm in question is an efficient procedure for evaluating Z, such as the procedure we described in §2.1. While we provide the derivatives of Z in §5.1, these can also be evaluated using any AD library, such as JAX (Bradbury et al., 2018), PyTorch (our choice) (Paszke et al., 2019), or TensorFlow (Abadi et al., 2015).

### 5.1 Derivatives of Z

Since $Z = |\mathbf{L}|$, we can express its gradient via Jacobi's formula and an application of the chain rule:^{16}

$$\frac{\partial Z}{\partial w_{ij}} = Z \sum_{(i', j') \in \mathcal{L}_{ij}} B_{j'i'}\, L'_{i'j',ij} \quad (22)$$

where $\mathbf{B} \overset{\mathrm{def}}{=} \mathbf{L}^{-1}$, $L'_{i'j',ij} \overset{\mathrm{def}}{=} \frac{\partial L_{i'j'}}{\partial w_{ij}}$, and $\mathcal{L}_{ij}$ is the set of pairs where $(i', j') \in \mathcal{L}_{ij}$ means that $L'_{i'j',ij} \neq 0$. We define $B_{\rho j'} \overset{\mathrm{def}}{=} 0$ for any $j' \in \mathcal{N}$. Koo et al. (2007) show that for any *i* and *j*, $|\mathcal{L}_{ij}| \leq 2$ in the unlabeled case;^{17} indeed, for the Laplacian of §2.1 (the root-weighted Laplacian differs only in its first row), $L'_{i'j',ij}$ is given by

$$L'_{i'j',ij} = \begin{cases} 1 & \text{if } i' = j' = j \\ -1 & \text{if } i' = i,\ j' = j \\ 0 & \text{otherwise} \end{cases}$$

Differentiating (22) once more yields the Hessian:^{18}

$$\frac{\partial^2 Z}{\partial w_{ij}\, \partial w_{kl}} = \frac{1}{Z} \frac{\partial Z}{\partial w_{ij}} \frac{\partial Z}{\partial w_{kl}} - Z \sum_{(i', j') \in \mathcal{L}_{ij}} \sum_{(k', l') \in \mathcal{L}_{kl}} B_{j'k'}\, B_{l'i'}\, L'_{i'j',ij}\, L'_{k'l',kl} \quad (25)$$

One might expect an additional term involving the second derivative of **L**, as (25) is derived from the product rule. Because **L** is a linear function of the edge weights, its second derivative is zero and so we can drop this term.
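The gradient formula above is easy to exercise numerically. The sketch below (our own NumPy scaffolding on a toy graph; node `0` is the root) evaluates $\partial Z / \partial w_{ij} = Z\left(B_{jj} - B_{ji}\right)$ (with the $B_{ji}$ term dropped for root edges) from a single matrix inverse, and compares every entry against a forward difference, which is exact because Z is linear in each edge weight:

```python
import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # W[i, j] = w_ij; node 0 is the root rho

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]   # dL/dw_ij has a +1 at (j, j)...
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]   # ...and a -1 at (i, j) for non-root i
    return L

L = laplacian(W)
Z = np.linalg.det(L)
B = np.linalg.inv(L)

grads, fdiff = {}, {}
for j in range(1, N + 1):
    for i in range(N + 1):
        if i != j:
            # (22) with the sparse dL/dw_ij plugged in
            grads[i, j] = Z * (B[j - 1, j - 1] - (B[j - 1, i - 1] if i != 0 else 0.0))
            Wp = W.copy()
            Wp[i, j] += 1.0                  # forward difference is exact here
            fdiff[i, j] = np.linalg.det(laplacian(Wp)) - Z
assert all(np.isclose(grads[e], fdiff[e]) for e in grads)
```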

### 5.2 Complexity Analysis

The efficiency of our approach is rooted in the following result from automatic differentiation, which relates the cost of gradient evaluation to the cost of function evaluation. Given a function *f*, we denote the number of differentiable elementary operations (e.g., +, ∗, /, −, cos, pow) of *f* by $\mathrm{Cost}\{f\}$.

**Theorem 2.** *For any function* $f: \mathbb{R}^K \mapsto \mathbb{R}^M$ *and any vector* $v \in \mathbb{R}^M$, *we can evaluate* $(\nabla f(x))^{\top} v \in \mathbb{R}^K$ *with cost satisfying the following bound via reverse-mode AD* (Griewank and Walther, 2008, page 44):

$$\mathrm{Cost}\left\{(\nabla f(x))^{\top} v\right\} \leq c \cdot \mathrm{Cost}\{f\}$$

*for a small constant c. Thus,* $\mathcal{O}\left\{(\nabla f(x))^{\top} v\right\} = \mathcal{O}\{f\}$.

As a special (and common) case, Theorem 2 implies a *cheap gradient principle*: the cost of evaluating the gradient of a function with one output (*M* = 1) is within a constant factor of the cost of evaluating the function itself.

#### Algorithm T_{1}.

The cheap gradient principle tells us that ∇Z can be evaluated as quickly as Z itself, and that numerically accurate procedures for Z give rise to similarly accurate procedures for ∇Z. Additionally, many widely used software libraries can do this work for us, such as JAX, PyTorch, and TensorFlow. The runtime of evaluating Z is dominated by evaluating the determinant of the Laplacian matrix. Therefore, we can find both Z and ∇Z in the same complexity: $O(N^3)$. Line 4 of Fig. 2 is a sum over *N*^{2} scalar–vector multiplications of size *R*, which suggests a runtime of $O(N^2 R)$. However, in many applications, *r* is a sparse function. Therefore, we find it useful to consider the complexities of our algorithms in terms of the size *R* and the maximum density *R′* of each $r_{ij}$. We can then evaluate Line 4 in $O(N^2 R')$, leading to an overall runtime for T_{1} of $O(N^3 + N^2 R')$. The call to Z uses $O(N^2)$ space to store the Laplacian matrix. Computing the gradient of Z similarly takes $O(N^2)$ space. Since storing $\bar{r}$ takes $O(R)$ space, T_{1} has a space complexity of $O(N^2 + R)$.
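To make the T₁ pattern concrete, the sketch below is our own analytic stand-in for the algorithm of Fig. 2: instead of calling an AD library, it plugs the closed-form ∇Z of §5.1 into Proposition 4 (the function names, toy graph, and random *r* are ours, not the paper's released code), and checks the resulting expectation against brute-force enumeration:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))    # W[i, j] = w_ij; node 0 is the root rho
r = rng.uniform(-1.0, 1.0, (N + 1, N + 1))   # edge values r_ij (scalar case, R = 1)

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

def T1(W, r):
    """First-order total r_bar = sum_ij (dZ/dw_ij) w_ij r_ij, via one matrix inverse."""
    n = W.shape[0] - 1
    L = laplacian(W)
    Z = np.linalg.det(L)
    B = np.linalg.inv(L)
    r_bar = 0.0
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                dZ = Z * (B[j - 1, j - 1] - (B[j - 1, i - 1] if i != 0 else 0.0))
                r_bar += dZ * W[i, j] * r[i, j]
    return Z, r_bar

Z, r_bar = T1(W, r)
E_r = r_bar / Z                               # the expectation E_p[r(d)]

ws = [np.prod([W[e] for e in d]) for d in spanning_trees(N)]
rs = [sum(r[e] for e in d) for d in spanning_trees(N)]
E_brute = sum(w * v for w, v in zip(ws, rs)) / sum(ws)
assert np.isclose(E_r, E_brute)
```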

#### Algorithm $T_2^{v}$.

Second-order quantities ($\bar{t}$) appear to require ∇^{2} Z and so do not directly fit the conditions of the cheap gradient principle: the Hessian (∇^{2} Z) is the Jacobian of the gradient. The approach of $T_2^{v}$ to work around this is to make several calls to Theorem 2, one for each element of $\bar{r}$. In this case, the function in question is (11), which has output dimensionality *R*. Computing $\nabla \bar{r}$ can thus be evaluated with *R* calls to reverse-mode AD, requiring $O(R (N^3 + N^2 R'))$ time. We can somewhat support fast accumulation of an *S′*-sparse *s* in the summation of $T_2^{v}$ (Line 6). Unfortunately, $\frac{\partial \bar{r}}{\partial w_{ij}}$ will generally be dense, so the cost of the outer product on Line 6 is $O(R S')$. Thus, $T_2^{v}$ has an overall runtime of $O(R (N^3 + N^2 R') + N^2 R S')$.^{19} Additionally, $T_2^{v}$ requires $O(N^2 R + R S)$ space because $O(N^2 R)$ is needed to compute and store the Jacobian of $\bar{r}$, and $\bar{t}$ has size $O(R S)$.

#### Algorithm $T_2^{h}$.

The downside of $T_2^{v}$ is that no work is shared between the *R* evaluations of the loop on Line 3. For our computation of Z, it turns out that substantial work can be shared among evaluations. Specifically, ∇^{2} Z only relies on the inverse of the Laplacian matrix, as seen in (26), leading to an alternative algorithm for second-order quantities, $T_2^{h}$. This is essentially the same observation made in Druck and Smith (2009). Exploiting this allows us to compute ∇^{2} Z in $O(N^4)$ time. Note that this runtime is only achievable due to the sparsity of ∇**L**. The accumulation component (Line 12) of $T_2^{h}$ can be done in $O(N^4 R' S')$. Considering space complexity, while not prevalent in our pseudocode, a benefit of $T_2^{h}$ is that we do not need to materialize the Hessian of Z, as it only makes use of the inverse of the Laplacian matrix. Therefore, we only need $O(N^2)$ space for the Laplacian inverse and $O(R S)$ space for $\bar{t}$. Consequently, $T_2^{h}$ requires $O(N^2 + R S)$ space.

#### Algorithm T_{2}.

Note that when *R* is small, $T_2^{v}$ can be much faster than $T_2^{h}$. On the other hand, when *R* is large and *R′* ≪ *R*, $T_2^{h}$ can be much faster than $T_2^{v}$. Can we get the best of $T_2^{v}$ and $T_2^{h}$? Our unified algorithm, T_{2} in Fig. 3, does just that. To derive it, we refactor the bottleneck of $T_2^{h}$ using (25) and the distributive property.^{20}

Now, we can compute $\bar{r}$ and $\bar{s}$ using T_{1} in $O(N^3 + N^2 (R' + S'))$ and their outer product in $O(R S)$. Additionally, we can compute all $\hat{r}_{j'l'}$ and $\hat{s}_{j'l'}$ values in $O(N^3 R')$ and $O(N^3 S')$, respectively. If *r* is *R′*-sparse, then each $\hat{r}_{j'l'}$ is $\bar{R} \overset{\mathrm{def}}{=} \min(R, N R')$-sparse. We can compute the sum over all $\hat{r}_{j'l'}\, \hat{s}_{j'l'}^{\top}$ in $O(N^2 \bar{R} \bar{S})$ time. Combining these runtimes, we have that T_{2} runs in $O(N^3 (R' + S') + R S + N^2 \bar{R} \bar{S})$. T_{2} requires a total of $O(R S + N^2 (\bar{R} + \bar{S}))$ space: $O(R S)$ space for $\bar{t}$, and $O(N^2 (\bar{R} + \bar{S}))$ space for the $\hat{r}$ and $\hat{s}$ values.

We return to our original question: Can we get the best of $T_2^{v}$ and $T_2^{h}$? In the case when *R* is small, T_{2} matches the runtime of $T_2^{v}$. Furthermore, in the case when *R* is large and *R′* ≪ *R*, T_{2} matches the runtime of $T_2^{h}$. Therefore, T_{2} achieves the best runtime regardless of the functions *r* and *s*.

## 6 Applications and Prior Work

In this section, we apply our framework to compute a number of important quantities that are used when working with probabilistic models. We relate our approach to existing algorithms in the literature (where applicable), and mention existing and potential applications. Many of our quantities were covered in Li and Eisner (2009) for B-hypergraphs; we extend their results to spanning trees.

In most applications that involve training a probabilistic model, the edge weights in the model will be parameterized in some fashion. Traditional approaches (Koo et al., 2007; Smith and Smith, 2007; McDonald et al., 2005a; Druck, 2011) use log-linear parameterizations, whereas more recent work (Dozat and Manning, 2017; Liu and Lapata, 2018; Ma and Xia, 2014) uses neural-network parameterizations. Our algorithms are agnostic as to how edges are parameterized.

### 6.1 Risk

The **risk** is the expectation of a cost (or reward) function *r*(*d*) measured with respect to a gold-standard tree *d*^{*}. In the context of dependency parsing, *r*(*d*) can be the labeled or unlabeled attachment score (LAS and UAS, respectively), both of which are additively decomposable. The unlabeled case decomposes as follows:

$$r_{ij} = \frac{1}{N}\, \mathbb{1}\left\{(i \to j) \in d^{*}\right\}$$

where *d*^{*} is the gold tree and *N* is the length of the sentence. Note that the use of $\frac{1}{N}$ ensures that *r*(*d*) will be a score between 0 and 1. We can then obtain the expected attachment score using T_{1}, and we can evaluate its gradient in the same runtime using reverse-mode AD or T_{2}. In this case, $s: \mathcal{D} \mapsto \mathbb{R}^S$ is the one-hot representation of the edges; thus, we have *S* = *N*^{2}. However, because *s* is 1-sparse, we have *S′* = 1. Additionally, as *r* does not depend on *w*, we do not need to add a first-order term to find the gradient. Therefore, the runtime for the gradient is also $O(N^3)$.
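The expected UAS computation can be sketched end to end. Below, our own analytic stand-in for T₁ (one matrix inverse in place of an AD call; the toy graph, gold tree, and helper names are ours) scores a hypothetical gold tree and is checked against brute-force enumeration:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # W[i, j] = w_ij; node 0 is the root rho

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

def T1(W, r):
    n = W.shape[0] - 1
    L = laplacian(W)
    Z = np.linalg.det(L)
    B = np.linalg.inv(L)
    r_bar = 0.0
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                dZ = Z * (B[j - 1, j - 1] - (B[j - 1, i - 1] if i != 0 else 0.0))
                r_bar += dZ * W[i, j] * r[i, j]
    return Z, r_bar

gold = {(0, 1), (1, 2), (2, 3)}             # a hypothetical gold tree d*
r = np.zeros((N + 1, N + 1))
for e in gold:
    r[e] = 1.0 / N                          # r(d) = |d ∩ d*| / N, i.e., UAS

Z, r_bar = T1(W, r)
uas = r_bar / Z                             # expected attachment score

ws = [np.prod([W[e] for e in d]) for d in spanning_trees(N)]
overlap = [len(d & gold) / N for d in spanning_trees(N)]
uas_brute = sum(w * v for w, v in zip(ws, overlap)) / sum(ws)
assert np.isclose(uas, uas_brute)
```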

### 6.2 Shannon Entropy

The Shannon entropy of the distribution *p* is $\mathrm{H}(p) \overset{\mathrm{def}}{=} -\sum_{d \in \mathcal{D}} p(d) \log p(d) = \mathbb{E}_d\left[-\log p(d)\right]$. Smith and Eisner (2007) gave an $O(N^4)$ algorithm for computing it.^{21} Recall from §3 that $-\log p(d)$ is additively decomposable; thus, running T_{1} with $r_{ij} = \frac{1}{N} \log Z - \log w_{ij}$ computes H(*p*) in $O(N^3)$. Martins et al.'s (2010) algorithm for computing H(*p*) is precisely the same as ours. However, they do not describe how to compute its gradient. As with risk, we can find the gradient of entropy using T_{2} or using reverse-mode AD. When using T_{2}, since the gradient of *r* with respect to *w* is not 0, we add the first-order quantity $T_1(w, \nabla r)$ as in (21). For entropy, we have that $\nabla r_{ij} = \frac{1}{N Z} \nabla Z - \frac{1}{w_{ij}} \mathbb{1}_{i \to j}$.

#### Experiment.

We briefly demonstrate the practical speed-up over Smith and Eisner's (2007) $O(N^4)$ algorithm. We compare the average runtime per sentence on five different UD corpora.^{22} The languages have different average sentence lengths to demonstrate the extra speed-up gained when calculating the entropy of longer sentences (that is, $\mathcal{D}$ would be a larger set). Tab. 1 shows that even for a corpus of short sentences (Finnish), we achieve a 4-times speed-up. This increases to 15 times as we move to corpora with longer sentences (Arabic).

| Language | Sentence length | Entropy (nats/word) | T_{1} (Fig. 2) runtime (ms) | Past approach runtime (ms) | Speed-up |
|---|---|---|---|---|---|
| Finnish | 9.23 | 0.6092 | 0.4623 | 1.882 | 4.1 |
| English | 12.45 | 0.8264 | 0.5102 | 2.778 | 5.4 |
| German | 17.56 | 0.8933 | 0.5583 | 4.104 | 7.3 |
| French | 24.65 | 0.8923 | 0.5635 | 5.742 | 10.2 |
| Arabic | 36.05 | 0.7163 | 0.6220 | 9.368 | 15.1 |

### 6.3 Kullback–Leibler Divergence

The Kullback–Leibler (KL) divergence between two tree distributions *p* and *q*, $\mathrm{KL}(p \,\|\, q) \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} p(d) \log \frac{p(d)}{q(d)}$, is a first-order expectation under *p*: writing *w′* and *Z′* for the edge weights and normalization constant of *q*, the function $\log \frac{p(d)}{q(d)}$ is additively decomposable with edge values $r_{ij} = \log \frac{w_{ij}}{w'_{ij}} + \frac{1}{N} \log \frac{Z'}{Z}$. Running T_{1} with these weights computes the KL divergence in $O(N^3)$ time. To find the gradient of the KL divergence, we return the sum of $T_2(w, r, s)$, where we choose $s_{ij} = \frac{1}{w_{ij}} \mathbb{1}_{i \to j}$, and $T_1(w, \nabla r)$. For the KL divergence, we have that $\nabla r_{ij} = \frac{1}{w_{ij}} \mathbb{1}_{i \to j} - \frac{1}{N Z} \nabla Z$.
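The KL divergence between two edge-factored models reduces to the same T₁ call. The sketch below (our own NumPy stand-in for T₁; the second model's weights `Wq` and the toy graph are ours) computes it and checks against brute-force enumeration:

```python
import itertools

import numpy as np

N = 3
rng = np.random.default_rng(0)
W = rng.uniform(0.1, 1.0, (N + 1, N + 1))    # weights of p
Wq = rng.uniform(0.1, 1.0, (N + 1, N + 1))   # weights of a second model q

def spanning_trees(N):
    for heads in itertools.product(*[[i for i in range(N + 1) if i != j]
                                     for j in range(1, N + 1)]):
        parent = dict(zip(range(1, N + 1), heads))
        def reaches_root(j):
            seen = set()
            while j != 0:
                if j in seen:
                    return False
                seen.add(j)
                j = parent[j]
            return True
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield frozenset((parent[j], j) for j in range(1, N + 1))

def laplacian(W):
    n = W.shape[0] - 1
    L = np.zeros((n, n))
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                L[j - 1, j - 1] += W[i, j]
                if i != 0:
                    L[i - 1, j - 1] -= W[i, j]
    return L

def T1(W, r):
    n = W.shape[0] - 1
    L = laplacian(W)
    Z = np.linalg.det(L)
    B = np.linalg.inv(L)
    r_bar = 0.0
    for j in range(1, n + 1):
        for i in range(n + 1):
            if i != j:
                dZ = Z * (B[j - 1, j - 1] - (B[j - 1, i - 1] if i != 0 else 0.0))
                r_bar += dZ * W[i, j] * r[i, j]
    return Z, r_bar

Zp = np.linalg.det(laplacian(W))
Zq = np.linalg.det(laplacian(Wq))
r = np.zeros((N + 1, N + 1))
for j in range(1, N + 1):
    for i in range(N + 1):
        if i != j:
            r[i, j] = np.log(W[i, j] / Wq[i, j]) + np.log(Zq / Zp) / N

_, r_bar = T1(W, r)
kl = r_bar / Zp                               # KL(p || q)

ds = list(spanning_trees(N))
wp = np.array([np.prod([W[e] for e in d]) for d in ds])
wq = np.array([np.prod([Wq[e] for e in d]) for d in ds])
pp, qq = wp / wp.sum(), wq / wq.sum()
kl_brute = np.sum(pp * np.log(pp / qq))
assert np.isclose(kl, kl_brute)
```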

### 6.4 Gradient of the GE Objective

The generalized expectation (GE) criterion (McCallum et al., 2007; Druck et al., 2009) is a method for semi-supervised training using weakly labeled data. GE fits model parameters by encouraging models to match certain expectation constraints, such as marginal-label distributions, on the unlabeled data. More formally, let *f* be a feature function with $f(d) \in \mathbb{R}^F$, and let $f^{*} \in \mathbb{R}^F$ be a target value that has been specified using domain knowledge. For example, given an English part-of-speech-tagged sentence, we can provide the following light supervision to our model: determiners should attach to the nearest noun on their right. This is an example of a simple heuristic for English dependency parsing that has high precision.

We note that by application of the chain rule, the gradient of the GE objective is a second-order quantity, and so we can use T_{2} to compute it. As we discussed in §1, the gradient of the GE has led to confusion in the literature (Druck et al., 2009; Druck and Smith, 2009; Druck, 2011). The best runtime bound prior to our work is Druck et al. (2009)'s $O(N^4 F')$ algorithm. T_{2} is strictly better at $O(N^3 + N^2 F')$ time.^{23} Alternatively, as the GE objective is a scalar, we can compute its gradient in $O(N^3 + N^2 F')$ using reverse-mode AD. Druck (2011) acknowledges that AD can be used, but questions its practicality and numerical accuracy. We hope to dispel this misconception in the following experiment.

#### Experiment.

We compute the GE objective and its gradient for almost 1500 sentences of the
English UD Treebank^{24} (Nivre et al., 2018) using 20
features extracted using the methodology of Druck et al. (2009). We note that T_{2} obtains a
speed-up of 9 times over Druck and Smith (2009)’s strategy of materializing the covariance matrix
(i.e., $T2h$).
Additionally, the gradients from both approaches agree to within an absolute tolerance of 10^{−16}.

## 7 Conclusion

We presented a general framework for computing first- and second-order expectations
for additively decomposable functions. We did this by exploiting a key connection
between gradients and expectations that allows us to solve our problems using
automatic differentiation. The algorithms we provide are simple, efficient, and
extendable to many expectations. The automatic differentiation principle has been
applied in other settings, such as weighted context-free grammars (Eisner, 2016) and chain-structured models (Vieira et
al., 2016). We hope that this paper will
also serve as a tutorial on how to compute expectations over trees so that the list
of *cautionary tales* does not grow further. Particularly, we hope
that this will allow for the KL divergence to be used in semi-supervised training of
dependency parsers. Our aim is for our approach for computing expectations to be
extended to other structured prediction models.

## Acknowledgments

We would like to thank action editor Dan Gildea and the three anonymous reviewers for their valuable feedback and suggestions. The first author is supported by the University of Cambridge School of Technology Vice-Chancellor’s Scholarship as well as by the University of Cambridge Department of Computer Science and Technology’s EPSRC.

## Notes

^{1}

The more precise graph-theoretic term is *arborescence*.

^{2}

For simplicity, we assume that the runtime of matrix determinants is $O(N^3)$. However, we would be remiss if we did not mention that algorithms exist to compute the determinant more efficiently (Dumas and Pan, 2016).

^{3}

The reader may want to skip this section on their first reading.

^{4}

We follow the conventions of Koo et al. (2007) and say “single-root” and
“multi-root” when we *technically* mean the
number of outgoing edges from the root *ρ*, and *not* the number of root nodes in a tree, which is always
one.

^{5}

The choice to replace row 1 by the root edges is made by convention; we can replace *any* row in the construction of $\hat{\mathbf{L}}$.

^{6}

The algorithms given in later sections will not provide full details for the labeled case due to space constraints, but we assure the reader that our algorithms can be straightforwardly generalized to the labeled setting.

^{7}

In fact, Jerrum and Snir (1982) proved that the partition function for spanning trees requires an exponential number of additions and multiplications in the semiring model of computation (i.e., assuming that subtraction is not allowed). Interestingly, division is not required, but algorithms for division-free determinant computation run in $O(N^4)$ (Kaltofen, 1992). An excellent overview of *the power of subtraction* in the context of dynamic programming is given in Miklós (2019, Ch. 3). It would appear that commutative rings make a good level of abstraction, as they admit efficient determinant computation. Interestingly, this means that we cannot use the MTT in the max-product semiring to (efficiently) find the maximum-weight tree, because max does not have an inverse. Fortunately, there exist $O(N^2)$ algorithms to find the maximum-weight tree for both the single-root and multi-root settings (Zmigrod et al., 2020; Gabow and Tarjan, 1984).

^{8}

Of course, one could use sampling methods, such as Monte Carlo, to
approximate (9) . Sampling
methods may be efficient if the variance of *f* under *p* is not too large.

^{9}

Proof: $-\log p(d) = -\log \left(\frac{1}{Z} \prod_{(i \to j) \in d} w_{ij}\right) = \log Z - \sum_{(i \to j) \in d} \log w_{ij}$. Since every tree has exactly *N* edges, we may take $r_{ij} = \frac{1}{N} \log Z - \log w_{ij}$.

^{10}

The $\ell_k$ norm of the distribution *p* is often denoted $\|p\|_k \overset{\mathrm{def}}{=} \left(\sum_{d \in \mathcal{D}} p(d)^k\right)^{1/k}$ for $k \geq 0$. It is computable from a zeroth-order expectation because it can be written as $\left(\frac{Z^{(k)}}{Z^k}\right)^{1/k}$ where $Z^{(k)} = \sum_{d \in \mathcal{D}} w(d)^k$ and $w(d)^k = \prod_{(i \to j) \in d} w_{ij}^k$, which is clearly a zeroth-order expectation. Similarly, the Rényi entropy of order $\alpha \geq 0$ with $\alpha \neq 1$ is $\mathrm{H}_{\alpha}(p) \overset{\mathrm{def}}{=} \frac{1}{1 - \alpha} \log \sum_{d \in \mathcal{D}} p(d)^{\alpha} = \frac{1}{1 - \alpha} \log \frac{Z^{(\alpha)}}{Z^{\alpha}}$.

^{11}

Li and Eisner (2009, Section 5.1) provide a similar derivation to Proposition 3 and Proposition 4 for hypergraphs.

^{12}

Some authors (e.g., Wainwright and Jordan, 2008) prefer to work with an exponentiated representation $w_{ij} = \exp(\theta_{ij})$ so that $\frac{\partial \log Z}{\partial \theta_{ij}} = p\left((i \to j) \in d\right)$. This avoids an explicit division by Z and multiplication by $w_{ij}$, as these operations happen by virtue of the chain rule.

^{13}

As each edge can only appear once in a tree, $\frac{\partial^2 Z}{\partial w_{ij}^2} = 0$.

^{14}

More precisely, $\frac{\partial r(d)}{\partial w_{ij}} = 0$ for all $d \in \mathcal{D}$ and $(i \to j) \in E$.

^{15}

Note that when $w_{ij} = 0$, we can set $s_{ij} = \mathbf{0}$.

^{16}

The derivative of |**L**| can also be given using the matrix
adjugate, ∇Z = adj(**L**)^{⊤}. There are
benefits to using the adjugate as it is more numerically stable and equally
efficient (Stewart, 1998). In fact,
any algorithm that computes the determinant can be algorithmically
differentiated to obtain an algorithm for the adjugate.

^{17}

We have that $|Lij|\u22642|Y|$ in the labeled case.

^{18}

We provide a derivation in Appendix A. Druck and Smith (2009) give a similar derivation for the Hessian, which we have generalized to any second-order quantity.

^{19}

If *S* < *R*, we can change the order of $T_2^{v}$ to compute $\bar{t}^{\top}$ in $O(S (N^3 + N^2 S') + N^2 R' S)$.

^{21}

Their algorithm calls the MTT *N* times, where the *i*^{th} call multiplies the weights of the edges incoming to the *i*^{th} non-root node by their log weight.

^{22}

Times were measured using an Intel(R) Core(TM) i7-7500U processor with 16GB RAM.

^{23}

We must apply the chain rule in order to use T_{2}. To do this, we first run T_{1} to obtain $\bar{f}$ in $O(N^3 + N^2 F')$. We then run T_{2} with the dot product of *f* and $\bar{f} - f^{*}$, which has a dimensionality of 1, and the sparse one-hot vectors as before. The execution of T_{2} then takes $O(N^3)$, giving us the desired runtime. Full detail is available in our code.

^{24}

We used all sentences in the test set, which were between 5 and 150 words.

^{25}

Note that we do not have to take the derivative of $L'_{k'l',kl}$, as it is either 1 or −1.

## References

#### A Derivation of ∇^{2} Z

We derive the expression for ∇^{2} Z given in (25). We begin by taking the derivative of (22) with respect to $w_{kl}$ and applying the product rule:^{25}

$$\frac{\partial^2 Z}{\partial w_{ij}\, \partial w_{kl}} = \sum_{(i', j') \in \mathcal{L}_{ij}} \left( \frac{\partial Z}{\partial w_{kl}}\, B_{j'i'} + Z\, \frac{\partial B_{j'i'}}{\partial w_{kl}} \right) L'_{i'j',ij}$$

The first term of the product rule is $\frac{\partial Z}{\partial w_{kl}} \sum_{(i', j') \in \mathcal{L}_{ij}} B_{j'i'}\, L'_{i'j',ij} = \frac{1}{Z} \frac{\partial Z}{\partial w_{ij}} \frac{\partial Z}{\partial w_{kl}}$ by (22). For the second term, the derivative of the matrix inverse gives $\frac{\partial B_{j'i'}}{\partial w_{kl}} = -\sum_{(k', l') \in \mathcal{L}_{kl}} B_{j'k'}\, B_{l'i'}\, L'_{k'l',kl}$, which yields the second summand of (25).

#### B Proof of T_{2}

We now prove the correctness of T_{2}. First, recall from Proposition 7 that we may find $\bar{t}$ by

$$\bar{t} = \sum_{(i \to j) \in E} \frac{\partial \bar{r}}{\partial w_{ij}}\, w_{ij}\, s_{ij}^{\top}$$

Substituting Proposition 6 splits this into two summands. The first summand is accumulated from the outer products $r_{ij}\, s_{ij}^{\top}$ (computed as a first-order total in T_{2}). We can write a sum over all edges as a sum over pairs of nodes in $\mathcal{N}$. Similarly, elements in $\mathcal{L}_{ij}$ can be considered as pairs of nodes. Therefore, unless specified otherwise, we assume all variables in the base of a summation are scoped to $\mathcal{N}$. The second summand can then be rewritten so that it no longer couples *i′*, *j′*, *k′*, and *l′*, which suggests that we can compute all $\hat{r}_{j'l'}$ and $\hat{s}_{j'l'}$ in $O(N^5 (R' + S'))$. However, we can exploit the sparsity of ∇**L** to improve this: by looping only over the nonzero entries of ∇**L**, we can compute $\hat{r}_{j'l'}$ for all $j', l' \in \mathcal{N}$. Each $\hat{r}_{ij}$ has at most *N* *R′*-sparse vectors added to it (by the inner loop). Hence, $\hat{r}_{ij}$ is $O(\bar{R})$-sparse where $\bar{R} \overset{\mathrm{def}}{=} \min(R, N R')$. This means that computing the sum of the outer products of all $\hat{r}_{ij}$ and $\hat{s}_{ij}$ can be done in $O(N^2 \bar{R} \bar{S})$.

## Author notes

Equal contribution.