## Abstract

We give a general framework for inference in spanning tree models. We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models. Our algorithms exploit a fundamental connection between gradients and expectations, which allows us to derive efficient algorithms. These algorithms are easy to implement with or without automatic differentiation software. We motivate the development of our framework with several *cautionary tales* of previous research, which has developed numerous inefficient algorithms for computing expectations and their gradients. We demonstrate how our framework efficiently computes several quantities with known algorithms, including the expected attachment score, entropy, and generalized expectation criteria. As a bonus, we give algorithms for quantities that are missing in the literature, including the KL divergence. In all cases, our approach matches the efficiency of existing algorithms and, in several cases, reduces the runtime complexity by a factor of the sentence length. We validate the implementation of our framework through runtime experiments. We find our algorithms are up to 15 and 9 times faster than previous algorithms for computing the Shannon entropy and the gradient of the generalized expectation objective, respectively.

## 1 Introduction

Dependency trees are a fundamental combinatorial structure in natural language processing. It follows that probability models over dependency trees are an important object of study. In terms of graph theory, one can view a (non-projective) dependency tree as an arborescence (commonly known as a spanning tree) of a graph. To build a dependency parser, we define a graph where the nodes are the tokens of the sentence, and the edges are possible dependency relations between the tokens. The edge weights are defined by a model, which is learned from data. In this paper, we focus on edge-factored models where the probability of a dependency tree is proportional to the product of the weights of its edges. As the number of trees is exponential in the length of the sentence, we require clever algorithms for finding the normalization constant. Fortunately, the normalization constant for edge-factored models is efficient to compute via the celebrated matrix–tree theorem.

The matrix–tree theorem (Kirchhoff, 1847), or more specifically its counterpart for directed graphs (Tutte, 1984), appeared before the NLP community in an onslaught of contemporaneous papers (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007) that leverage the classic result to efficiently compute the normalization constant of a distribution over trees. The result is still used in more recent work (Ma and Hovy, 2017; Liu and Lapata, 2018). We build upon this tradition through a framework for computing expectations of a rich family of functions under a distribution over trees. Expectations appear in all aspects of the probabilistic modeling process: training, model validation, and prediction. Therefore, developing such a framework is key to accelerating progress in probabilistic modeling of trees.

Our framework is motivated by the lack of a unified approach for computing expectations over spanning trees in the literature. We believe this gap has resulted in the publication of numerous inefficient algorithms. We motivate the importance of developing such a framework by highlighting the following *cautionary tales*.

- McDonald and Satta (2007) proposed an inefficient $\mathcal{O}(N^5)$ algorithm for computing feature expectations, which was much slower than the $\mathcal{O}(N^3)$ algorithm obtained by Koo et al. (2007) and Smith and Smith (2007). The authors subsequently revised their paper.

- Smith and Eisner (2007) proposed an $\mathcal{O}(N^4)$ algorithm for computing entropy. Later, Martins et al. (2010) gave an $\mathcal{O}(N^3)$ method for entropy, but not its gradient. Our framework recovers Martins et al.’s (2010) algorithm, and additionally provides the gradient of entropy in $\mathcal{O}(N^3)$.

- Druck et al. (2009) proposed an $\mathcal{O}(N^5)$ algorithm for evaluating the gradient of the generalized expectation (GE) criterion (McCallum et al., 2007). The runtime bottleneck of their approach is the evaluation of a covariance matrix, which Druck and Smith (2009) later improved to $\mathcal{O}(N^4)$. We show that the gradient of the GE criterion can be evaluated in $\mathcal{O}(N^3)$.

We summarize our main results below:

- **Unified Framework**: We develop an algorithmic framework for calculating expectations over spanning arborescences. We give precise mathematical assumptions on the types of functions that are supported. We provide efficient algorithms that piggyback on automatic differentiation techniques, as our framework is rooted in a deep connection between expectations and gradients (Darwiche, 2003; Li and Eisner, 2009).

- **Improvements to existing approaches**: We give asymptotically faster algorithms where several prior algorithms were known.

- **Efficient algorithms for new quantities**: We demonstrate how our framework calculates several new quantities, such as the Kullback–Leibler divergence, which (to our knowledge) had no prior algorithm in the literature.

- **Practicality**: We present practical speed-ups in the calculation of entropy compared to Smith and Eisner (2007). We observe speed-ups between 4.1 and 15.1 times across five languages, depending on the typical sentence length. We also demonstrate a 9 times speed-up for evaluating the gradient of the GE objective compared to Druck and Smith (2009).

- **Simplicity**: Our algorithms are simple to implement, requiring only a few lines of PyTorch code (Paszke et al., 2019). We have released a reference implementation at the following URL: https://github.com/rycolab/tree_expectations.

## 2 Distributions over Trees

We consider the distribution over trees in weighted directed graphs with a designated root node. A (rooted, weighted, and directed) **graph** is given by $G = (\mathcal{N}, \mathcal{E}, \rho)$. $\mathcal{N} = \{1, \ldots, N\} \cup \{\rho\}$ is a set of *N* + 1 nodes where *ρ* is a designated root node. $\mathcal{E}$ is a set of weighted edges where each edge $(i \xrightarrow{w_{ij}} j) \in \mathcal{E}$ is a pair of *distinct* nodes such that the source node $i \in \mathcal{N}$ points to a destination node $j \in \mathcal{N}$ with an edge weight $w_{ij} \in \mathbb{R}$. We assume, without loss of generality, that the root node *ρ* has no incoming edges. Furthermore, we assume that at most one edge can exist from a given source node to a given destination node. We consider the multi-graph case in §2.2.

In natural language processing applications, these weights are typically parametric functions, such as log-linear models (McDonald et al., 2005b) or neural networks (Dozat and Manning, 2017; Ma and Hovy, 2017), which are learned from data.

A **tree**^{1} *d* of a graph $G$ is a set of *N* edges such that all non-root nodes *j* have exactly one incoming edge and the root node *ρ* has at least one outgoing edge. Furthermore, a tree does not contain any cycles. We denote the set of all trees in a graph by $\mathcal{D}$ and assume that $|\mathcal{D}| > 0$ (this is not necessarily true for all graphs).

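The **weight of a tree** $d \in \mathcal{D}$ is defined as:

$$w(d) \overset{\mathrm{def}}{=} \prod_{(i \to j) \in d} w_{ij}$$

Normalizing the weight of each tree yields a **probability distribution**:

$$p(d) \overset{\mathrm{def}}{=} \frac{w(d)}{\mathrm{Z}}$$

where the **normalization constant** is defined as

$$\mathrm{Z} \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} w(d) = \sum_{d \in \mathcal{D}} \prod_{(i \to j) \in d} w_{ij}$$

For *p* to be a *proper* distribution, we require $w_{ij} \ge 0$ for all $(i \to j) \in \mathcal{E}$, and Z > 0.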
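To make these definitions concrete, the following sketch enumerates every tree of a small complete graph by brute force and evaluates *w*(*d*), Z, and *p*(*d*) directly. The node numbering (0 stands for the root *ρ*), the helper names `trees` and `tree_weight`, and the example weights are illustrative assumptions rather than anything taken from the released implementation.

```python
from fractions import Fraction
from itertools import product
from math import prod

def trees(N):
    """Yield each tree rooted at node 0 as a list of (head, child) edges."""
    for heads in product(range(N + 1), repeat=N):  # heads[j-1] is the head of node j
        def reaches_root(j):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            return j == 0  # False when the walk hits a cycle or a self-loop
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield [(h, j) for j, h in enumerate(heads, start=1)]

def tree_weight(w, d):
    """w(d): the product of the edge weights of tree d."""
    return prod(w[i][j] for i, j in d)

# Hypothetical weights for N = 3 tokens; w[i][j] is the weight of edge i -> j.
# Column 0 and the diagonal are unused (no edges into the root, no self-loops).
w = [[0, 1, 2, 3],
     [0, 0, 4, 5],
     [0, 6, 0, 7],
     [0, 8, 9, 0]]
N = len(w) - 1

all_trees = list(trees(N))
Z = sum(tree_weight(w, d) for d in all_trees)            # normalization constant
p = [Fraction(tree_weight(w, d), Z) for d in all_trees]  # probability of each tree
```

Enumeration takes time exponential in *N*, so this is only useful as a reference against which faster algorithms can be checked.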

### 2.1 The Matrix–Tree Theorem

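A naive evaluation of Z sums over all trees in $\mathcal{D}$, whose number can grow exponentially in *N*. Fortunately, there is sufficient structure in the computation of Z that it can be evaluated in $\mathcal{O}(N^3)$ time. The Matrix–Tree Theorem (MTT) (Tutte, 1984; Kirchhoff, 1847) establishes a connection between Z and the determinant of the **Laplacian matrix**, $\mathbf{L} \in \mathbb{R}^{N \times N}$. For all $i, j \in \mathcal{N} \setminus \{\rho\}$,

$$L_{ij} \overset{\mathrm{def}}{=} \begin{cases} \sum_{i' \in \mathcal{N} \setminus \{j\}} w_{i'j} & \text{if } i = j \\ -w_{ij} & \text{otherwise} \end{cases}$$

**Theorem 1.** *For any graph,*

$$\mathrm{Z} = |\mathbf{L}|$$

*Furthermore, the normalization constant can be computed in* $\mathcal{O}(N^3)$ *time.*^{2}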
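As a sanity check of the theorem, the sketch below builds the Laplacian for a small graph, takes its determinant, and compares the result against brute-force enumeration. It assumes node 0 plays the role of *ρ*; the helper names, the exact-fraction determinant routine, and the example weights are all illustrative choices.

```python
from fractions import Fraction
from itertools import product
from math import prod

def det(M):
    """Determinant by Gaussian elimination over exact fractions, O(n^3)."""
    A = [[Fraction(x) for x in row] for row in M]
    n, result = len(A), Fraction(1)
    for c in range(n):
        p = next((a for a in range(c, n) if A[a][c] != 0), None)
        if p is None:
            return Fraction(0)       # no pivot in this column: singular matrix
        if p != c:
            A[c], A[p] = A[p], A[c]  # a row swap flips the sign
            result = -result
        result *= A[c][c]
        for a in range(c + 1, n):
            f = A[a][c] / A[c][c]
            A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return result

def laplacian(w):
    """Laplacian over the non-root nodes 1..N (node 0 is the root)."""
    N = len(w) - 1
    L = [[Fraction(0)] * N for _ in range(N)]
    for j in range(1, N + 1):
        for i in range(N + 1):
            if i != j:
                L[j - 1][j - 1] += w[i][j]      # diagonal: total incoming weight of j
                if i != 0:
                    L[i - 1][j - 1] -= w[i][j]  # off-diagonal: -w_ij
    return L

def brute_force_Z(w):
    """Sum of tree weights by explicit enumeration (exponential time)."""
    N, total = len(w) - 1, 0
    for heads in product(range(N + 1), repeat=N):  # heads[j-1] is the head of node j
        ok = True
        for j in range(1, N + 1):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            ok = ok and j == 0                     # every node must reach the root
        if ok:
            total += prod(w[h][j] for j, h in enumerate(heads, start=1))
    return total

w = [[0, 1, 2, 3],   # hypothetical weights, w[i][j] for edge i -> j
     [0, 0, 4, 5],
     [0, 6, 0, 7],
     [0, 8, 9, 0]]
Z_mtt = det(laplacian(w))
Z_enum = brute_force_Z(w)
```

Exact rational arithmetic is used only so that the two values can be compared for equality; a practical implementation would use floating-point linear algebra.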

### 2.2 Dependency Parsing and the Laplacian Zoo

Graph-based dependency parsing can be encoded as follows. For each sentence of length *N*, we create a graph $G = (\mathcal{N}, \mathcal{E}, \rho)$ where each non-root node represents a token of the sentence, and *ρ* represents a special root symbol of the sentence. Each edge $(i \to j)$ in the graph represents a *possible* dependency relation between head word *i* and modifier word *j*. Fig. 1 gives an example dependency tree. In the remainder of this section, we give several variations on the Laplacian matrix that encode different sets of valid trees.^{3}

In many cases of dependency parsing, we want *ρ* to have exactly one outgoing edge. This is motivated by linguistic theory, where the root of a sentence should be a token in the sentence rather than a special root symbol (Tesnière, 1959). There are exceptions to this, such as parsing Twitter (Kong et al., 2014) and parsing specific languages (e.g., The Prague Treebank [Bejček et al., 2013]). We call these **multi-root trees**^{4} and these are represented by the set $D$, as described earlier. Therefore, the normalization constant over all multi-root trees can be computed by a direct application of Theorem 1.

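Trees in which *ρ* has exactly one outgoing edge are called **single-rooted trees**, denoted $\mathcal{D}^{(1)}$. Koo et al. (2007) adapt Theorem 1 to efficiently compute Z for the set $\mathcal{D}^{(1)}$ with the **root-weighted Laplacian**,^{5} $\widehat{\mathbf{L}} \in \mathbb{R}^{N \times N}$

$$\widehat{L}_{ij} \overset{\mathrm{def}}{=} \begin{cases} w_{\rho j} & \text{if } i = 1 \\ \sum_{i' \in \mathcal{N} \setminus \{\rho, j\}} w_{i'j} & \text{if } i = j \\ -w_{ij} & \text{otherwise} \end{cases}$$

**Proposition 1.** *For any graph, the normalization constant over all single-rooted trees is given by the determinant of the root-weighted Laplacian* (Koo et al., 2007, Prop. 1)

$$\mathrm{Z} = |\widehat{\mathbf{L}}|$$

*Furthermore, the normalization constant for single-rooted trees can be computed in* $\mathcal{O}(N^3)$ *time*.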
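The single-rooted case can be checked in the same way. The sketch below assumes the usual form of the root-weighted Laplacian (root weights in the first row, diagonals that exclude the root) and compares its determinant with an enumeration restricted to trees in which the root has exactly one child. Helper names and weights are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import prod

def det(M):
    """Determinant by Gaussian elimination over exact fractions."""
    A = [[Fraction(x) for x in row] for row in M]
    n, result = len(A), Fraction(1)
    for c in range(n):
        p = next((a for a in range(c, n) if A[a][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            A[c], A[p] = A[p], A[c]
            result = -result
        result *= A[c][c]
        for a in range(c + 1, n):
            f = A[a][c] / A[c][c]
            A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return result

def root_weighted_laplacian(w):
    """Root-weighted Laplacian; node 0 is the root, rows/columns index nodes 1..N."""
    N = len(w) - 1
    L = [[Fraction(0)] * N for _ in range(N)]
    for j in range(1, N + 1):
        for i in range(1, N + 1):
            if i != j:
                L[j - 1][j - 1] += w[i][j]  # diagonal: incoming weight from non-root nodes
                L[i - 1][j - 1] -= w[i][j]  # off-diagonal: -w_ij
    L[0] = [Fraction(w[0][j]) for j in range(1, N + 1)]  # first row: root edge weights
    return L

def brute_force_single_root_Z(w):
    """Sum of weights of trees whose root has exactly one outgoing edge."""
    N, total = len(w) - 1, 0
    for heads in product(range(N + 1), repeat=N):  # heads[j-1] is the head of node j
        if heads.count(0) != 1:
            continue
        ok = True
        for j in range(1, N + 1):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            ok = ok and j == 0
        if ok:
            total += prod(w[h][j] for j, h in enumerate(heads, start=1))
    return total

w = [[0, 1, 2, 3],   # hypothetical weights, w[i][j] for edge i -> j
     [0, 0, 4, 5],
     [0, 6, 0, 7],
     [0, 8, 9, 0]]
Z1_det = det(root_weighted_laplacian(w))
Z1_enum = brute_force_single_root_Z(w)
```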

#### Labeled Trees.

To encode *labeled* dependency relations in our set of trees, we simply augment edges with labels, resulting in a **multi-graph** in which multiple edges may exist between pairs of nodes. Now, edges take the form $(i \xrightarrow{y / w_{ijy}} j)$ where *i* and *j* are the source and destination nodes as before, $y \in \mathcal{Y}$ is the label, and $w_{ijy}$ is their weight.

**Proposition 2.** *For any multi-graph, the normalization constant for multi-root or single-rooted trees can be calculated using Theorem 1 or Proposition 1 (respectively) with the edge weights,*

$$w_{ij} = \sum_{y \in \mathcal{Y}} w_{ijy}$$

*Furthermore, the normalization constant can be computed in* $\mathcal{O}(N^3 + |\mathcal{Y}| N^2)$ *time.*^{6}
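One way to see why summing over labels suffices is the distributive property: a labeled tree is an unlabeled tree together with an independent choice of label for each of its edges. Writing $\mathcal{D}_{\mathcal{Y}}$ for the set of labeled trees (notation introduced here only for illustration), we have

$$\sum_{d \in \mathcal{D}_{\mathcal{Y}}} \; \prod_{(i \xrightarrow{y} j) \in d} w_{ijy} \;=\; \sum_{d \in \mathcal{D}} \; \prod_{(i \to j) \in d} \; \sum_{y \in \mathcal{Y}} w_{ijy} \;=\; \sum_{d \in \mathcal{D}} \; \prod_{(i \to j) \in d} w_{ij}$$

The additive $|\mathcal{Y}| N^2$ term in the runtime is then the cost of collapsing the labels for each of the $\mathcal{O}(N^2)$ node pairs before the determinant is taken.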

#### Summary.

We give common settings in which the MTT can be adapted to efficiently compute Z for different sets of trees. The choice is dependent upon the task of interest, and one must be careful to choose the correct Laplacian configuration. The results we present in this paper are modular in the specific choice of Laplacian. For the remainder of this paper, we assume the unlabeled tree setting and will refer to the set of trees as simply $D$ and our choice of Laplacian as **L**.

## 3 Expectations

In this section, we characterize the family of expectations that our framework supports. Our framework is an extension of Li and Eisner (2009) to distributions over spanning trees. In contrast, their framework considers expectations over distributions that can be factored as B-hypergraphs (Gallo et al., 1993). Our distributions over trees cannot be cast as polynomial-size B-hypergraphs. Another important distinction between our framework and that of Li and Eisner (2009) is that we do not use the semiring abstraction as it is algebraically too weak to compute the determinant efficiently.^{7}

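The **expected value** of a function $f : \mathcal{D} \mapsto \mathbb{R}^F$ is defined as follows

$$\mathbb{E}_p[f] \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} p(d)\, f(d) \tag{9}$$

Without any assumptions on *f*, computing (9) is intractable.^{8} In the remainder of this section, we will characterize a class of functions *f* whose expectations can be efficiently computed.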

The first family of functions we consider, which gives rise to first-order expectations, are those that are **additively decomposable** along the edges of the tree. Formally, a function $r : \mathcal{D} \mapsto \mathbb{R}^R$ is additively decomposable if it can be written as

$$r(d) = \sum_{(i \to j) \in d} r_{ij}$$

where we treat $r_{ij} \in \mathbb{R}^R$ as a vector of edge values. An example of an additively decomposable function is $r(d) = -\log p(d)$, whose expectation gives the Shannon entropy.^{9} Other first-order expectations include the expected attachment score and the Kullback–Leibler divergence. We demonstrate how to compute these in our framework in §6.1 and §6.3, respectively.

The second family of functions we consider are those that are **second-order additively decomposable** along the edges of the tree. Formally, a function $t : \mathcal{D} \mapsto \mathbb{R}^{R \times S}$ is second-order additively decomposable if it can be written as the outer product of two additively decomposable functions, $r : \mathcal{D} \mapsto \mathbb{R}^R$ and $s : \mathcal{D} \mapsto \mathbb{R}^S$

$$t(d) = r(d)\, s(d)^{\top}$$

Thus, $t(d) \in \mathbb{R}^{R \times S}$ is generally a matrix.

An example of such a function is the gradient of entropy (see §6.2) or the GE objective (McCallum et al., 2007) (see §6.4) with respect to the edge weights. Another example of a second-order additively decomposable function is the covariance matrix. Given two feature functions $r : \mathcal{D} \mapsto \mathbb{R}^R$ and $s : \mathcal{D} \mapsto \mathbb{R}^S$, their covariance matrix is $\mathbb{E}_d\left[r(d)\, s(d)^{\top}\right] - \mathbb{E}_d\left[r(d)\right] \mathbb{E}_d\left[s(d)\right]^{\top}$. Thus, it is a second-order additively decomposable function as long as *r*(*d*) and *s*(*d*) are additively decomposable.

The final family of functions we consider are those that are **multiplicatively decomposable** over the edges. A function $q : \mathcal{D} \mapsto \mathbb{R}^Q$ is multiplicatively decomposable if it can be written as

$$q(d) = \prod_{(i \to j) \in d} q_{ij}$$

where the product of the $q_{ij}$ is an element-wise vector product. These functions form a family that we will call zero^{th}-order expectations and can be computed with a constant number of calls to MTT (usually two or three). Examples of these include the Rényi entropy and $\ell_p$-norms.^{10}

## 4 Connecting Gradients and Expectations

In this section, we build upon a fundamental connection between gradients and expectations (Darwiche, 2003; Li and Eisner, 2009). This connection allows us to build on work in automatic differentiation to obtain efficient gradient algorithms. While the propositions in this section are inspired by past work, we believe that these propositions and their proofs have not previously been presented clearly.^{11} We find it convenient to work with unnormalized expectations, or totals (for short). We denote the **total** of a function *f* as $\overline{f} \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} w(d)\, f(d)$. We recover the expectation with $\mathbb{E}_p[f] = \overline{f} / \mathrm{Z}$. We note that totals (on their own) may be of interest in some applications (Vieira and Eisner, 2017, Section 5.3).

### The First-Order Case.

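We begin by defining $\widetilde{w}_{ij}$, the **total weight** of trees which include the edge $(i \to j)$,^{12}

$$\widetilde{w}_{ij} \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D} :\, (i \to j) \in d} w(d)$$

**Proposition 3.** *For any edge* $(i \to j)$,

$$\widetilde{w}_{ij} = \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij}$$

*Proof.* Each edge appears at most once in a tree, so $w(d)$ is linear in $w_{ij}$ when $(i \to j) \in d$ and does not depend on $w_{ij}$ otherwise. Hence

$$\frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij} = \sum_{d \in \mathcal{D}} \frac{\partial w(d)}{\partial w_{ij}}\, w_{ij} = \sum_{d \in \mathcal{D} :\, (i \to j) \in d} w(d) = \widetilde{w}_{ij}$$

Proposition 4 will establish a connection between the unnormalized expectation $\overline{r}$ and ∇Z.

**Proposition 4.** *For any additively decomposable function* $r : \mathcal{D} \mapsto \mathbb{R}^R$, *the total* $\overline{r}$ *can be computed using a gradient–vector product*

$$\overline{r} = \sum_{(i \to j) \in \mathcal{E}} \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij}\, r_{ij}$$

*Proof.* Expanding $r(d)$ over the edges of $d$, exchanging the order of summation, and applying Proposition 3,

$$\overline{r} = \sum_{d \in \mathcal{D}} w(d) \sum_{(i \to j) \in d} r_{ij} = \sum_{(i \to j) \in \mathcal{E}} \widetilde{w}_{ij}\, r_{ij} = \sum_{(i \to j) \in \mathcal{E}} \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij}\, r_{ij}$$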
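The first-order connection can be checked numerically on a small graph. In the sketch below the gradient of Z is obtained without AD: because Z is linear in each individual edge weight, a unit forward difference in one weight is exact. This is a swapped-in technique chosen to keep the example dependency-free, not the efficient method discussed later. The function names, the gold tree, and the weights are hypothetical; the edge values correspond to the unlabeled attachment score.

```python
from fractions import Fraction
from itertools import product
from math import prod

def det(M):
    """Determinant by Gaussian elimination over exact fractions."""
    A = [[Fraction(x) for x in row] for row in M]
    n, result = len(A), Fraction(1)
    for c in range(n):
        p = next((a for a in range(c, n) if A[a][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            A[c], A[p] = A[p], A[c]
            result = -result
        result *= A[c][c]
        for a in range(c + 1, n):
            f = A[a][c] / A[c][c]
            A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return result

def Z_of(w):
    """Normalization constant via the determinant of the Laplacian (node 0 is the root)."""
    N = len(w) - 1
    L = [[Fraction(0)] * N for _ in range(N)]
    for j in range(1, N + 1):
        for i in range(N + 1):
            if i != j:
                L[j - 1][j - 1] += w[i][j]
                if i != 0:
                    L[i - 1][j - 1] -= w[i][j]
    return det(L)

def grad_Z(w):
    """dZ/dw_ij for every edge, exact because Z is linear in each single w_ij."""
    N, Z = len(w) - 1, Z_of(w)
    g = {}
    for i in range(N + 1):
        for j in range(1, N + 1):
            if i != j:
                bumped = [row[:] for row in w]
                bumped[i][j] += 1
                g[i, j] = Z_of(bumped) - Z
    return g

def total(w, r):
    """Gradient-vector product: sum_ij dZ/dw_ij * w_ij * r_ij, for sparse scalar r."""
    g = grad_Z(w)
    return sum(g[i, j] * w[i][j] * r.get((i, j), 0) for (i, j) in g)

def trees(N):
    """Yield each tree rooted at node 0 as a list of (head, child) edges."""
    for heads in product(range(N + 1), repeat=N):
        def reaches_root(j):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            return j == 0
        if all(reaches_root(j) for j in range(1, N + 1)):
            yield [(h, j) for j, h in enumerate(heads, start=1)]

w = [[0, 1, 2, 3], [0, 0, 4, 5], [0, 6, 0, 7], [0, 8, 9, 0]]  # hypothetical weights
N = len(w) - 1
gold = [(0, 1), (1, 2), (2, 3)]                # hypothetical gold tree
r = {e: Fraction(1, N) for e in gold}          # attachment-score edge values

r_bar = total(w, r)
r_bar_enum = sum(prod(w[i][j] for i, j in d) * sum(r.get(e, 0) for e in d)
                 for d in trees(N))
expected_score = r_bar / Z_of(w)
```

Computing the gradient this way costs one determinant per edge; the point of the section that follows is that the whole gradient can be had for the cost of a single determinant.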

### The Second-Order Case.

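We can similarly define $\widetilde{w}_{ij,kl}$, for $(i \to j) \ne (k \to l)$, as the total weight of trees which include both of the edges $(i \to j)$ and $(k \to l)$,^{13}

$$\widetilde{w}_{ij,kl} \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D} :\, (i \to j) \in d,\, (k \to l) \in d} w(d)$$

**Proposition 5.** *For any pair of edges* $(i \to j)$ *and* $(k \to l)$ *such that* $(i \to j) \ne (k \to l)$,

$$\widetilde{w}_{ij,kl} = \frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}}\, w_{ij}\, w_{kl}$$

*Proof.* Since the two edges are distinct, $w(d)$ is linear in each of $w_{ij}$ and $w_{kl}$ when $d$ contains both edges, and its mixed second derivative vanishes otherwise. Hence

$$\frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}}\, w_{ij}\, w_{kl} = \sum_{d \in \mathcal{D}} \frac{\partial^2 w(d)}{\partial w_{ij}\, \partial w_{kl}}\, w_{ij}\, w_{kl} = \sum_{d \in \mathcal{D} :\, (i \to j) \in d,\, (k \to l) \in d} w(d) = \widetilde{w}_{ij,kl}$$

Proposition 6 will relate ∇^{2} Z to $\nabla \overline{r}$. This will be used in Proposition 7 to establish a connection between the total $\overline{t}$ and ∇^{2} Z, and additionally establishes a connection between $\overline{t}$ and $\nabla \overline{r}$.

**Proposition 6.** *For any additively decomposable function* $r : \mathcal{D} \mapsto \mathbb{R}^R$ *that does not depend on w*,^{14} *and edge* $(i \to j) \in \mathcal{E}$,

$$\frac{\partial \overline{r}}{\partial w_{ij}} = \frac{1}{w_{ij}} \left( \widetilde{w}_{ij}\, r_{ij} + \sum_{(k \to l) \in \mathcal{E}} \widetilde{w}_{ij,kl}\, r_{kl} \right)$$

*Proof.* By Proposition 4, $\overline{r} = \sum_{(k \to l) \in \mathcal{E}} \frac{\partial \mathrm{Z}}{\partial w_{kl}}\, w_{kl}\, r_{kl}$. Differentiating with respect to $w_{ij}$, and using that *r* does not depend on *w* and that $\partial^2 \mathrm{Z} / \partial w_{ij}^2 = 0$ (each edge appears at most once in a tree),

$$\frac{\partial \overline{r}}{\partial w_{ij}} = \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, r_{ij} + \sum_{(k \to l) \ne (i \to j)} \frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}}\, w_{kl}\, r_{kl} = \frac{1}{w_{ij}} \left( \widetilde{w}_{ij}\, r_{ij} + \sum_{(k \to l) \in \mathcal{E}} \widetilde{w}_{ij,kl}\, r_{kl} \right)$$

where the last step applies Propositions 3 and 5.

**Proposition 7.** *For any second-order additively decomposable function* $t : \mathcal{D} \mapsto \mathbb{R}^{R \times S}$, *which is expressed as the outer product of additively decomposable functions,* $r : \mathcal{D} \mapsto \mathbb{R}^R$ *and* $s : \mathcal{D} \mapsto \mathbb{R}^S$, $t(d) = r(d)\, s(d)^{\top}$, *where r does not depend on w*, *the total* $\overline{t}$ *can be computed using a Jacobian–matrix product*

$$\overline{t} = \sum_{(i \to j) \in \mathcal{E}} \widetilde{w}_{ij}\, r_{ij}\, s_{ij}^{\top} + \sum_{(i \to j) \in \mathcal{E}} \sum_{(k \to l) \in \mathcal{E}} \widetilde{w}_{ij,kl}\, r_{ij}\, s_{kl}^{\top} = \sum_{(i \to j) \in \mathcal{E}} \frac{\partial \overline{r}}{\partial w_{ij}}\, w_{ij}\, s_{ij}^{\top} \tag{19}$$

*Proof.* Expanding $r(d)$ and $s(d)$ over the edges of $d$ in $\overline{t} = \sum_{d \in \mathcal{D}} w(d)\, r(d)\, s(d)^{\top}$ and grouping the resulting pairs of edges by whether they coincide gives the first equality. For the second equality, multiply the expression in Proposition 6 by $w_{ij}\, s_{ij}^{\top}$, sum over all edges, and use the symmetry $\widetilde{w}_{ij,kl} = \widetilde{w}_{kl,ij}$ to exchange the roles of the two edges.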

### Remark.

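Proposition 7 can also be used to compute the gradient of each component of a first-order total, $\nabla \overline{r}_n$ for *n* = 1, …, *R*. First, some notation: let $\mathbf{1}_{i \to j}$ be a vector over $\mathcal{E}$ with a 1 in dimension $(i \to j)$, and zeros elsewhere. By plugging $[r_{ij}]_n$ and $s_{ij} = \frac{1}{w_{ij}} \mathbf{1}_{i \to j}$ into (19), we can compute $\overline{t}_n = \nabla \overline{r}_n$.^{15} However, if *r* depends on *w*, we must add the following first-order term, which is due to the product rule

$$\sum_{d \in \mathcal{D}} w(d)\, \nabla r_n(d) = \sum_{(i \to j) \in \mathcal{E}} \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij}\, \nabla [r_{ij}]_n \tag{21}$$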

## 5 Algorithms

Having reduced the computation of $\overline{r}$ and $\overline{t}$ to finding derivatives of Z in §4, we now describe efficient algorithms that exploit this connection. The main algorithmic ideas used in this section are based on automatic differentiation (AD) techniques (Griewank and Walther, 2008). These are general-purpose techniques for efficiently evaluating gradients given algorithms that evaluate the functions. In our setting, the algorithm in question is an efficient procedure for evaluating Z, such as the procedure we described in §2.1. While we provide the derivatives used by our algorithms in §5.1, these can also be evaluated using any AD library, such as JAX (Bradbury et al., 2018), PyTorch (our choice) (Paszke et al., 2019), or TensorFlow (Abadi et al., 2015).

### 5.1 Derivatives of Z

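Recalling from Theorem 1 that Z = |**L**|, we can express its gradient via Jacobi’s formula and an application of the chain rule^{16}

$$\frac{\partial \mathrm{Z}}{\partial w_{ij}} = |\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} B_{i'j'}\, L'_{i'j',ij} \tag{22}$$

where $\mathbf{B}$ is the transpose of $\mathbf{L}^{-1}$, $L'_{i'j',ij} = \frac{\partial L_{i'j'}}{\partial w_{ij}}$, and $\mathcal{L}_{ij}$ is the set of pairs where $(i', j') \in \mathcal{L}_{ij}$ means that $L'_{i'j',ij} \ne 0$. We define $B_{\rho j'} \overset{\mathrm{def}}{=} 0$ for any $j' \in \mathcal{N}$. Koo et al. (2007) show that for any *i* and *j*, $|\mathcal{L}_{ij}| \le 2$ in the unlabeled case; indeed, for the Laplacian of §2.1, $L'_{i'j',ij}$ is given by^{17}

$$L'_{i'j',ij} = \begin{cases} 1 & \text{if } i' = j' = j \\ -1 & \text{if } i' = i \text{ and } j' = j \\ 0 & \text{otherwise} \end{cases}$$

The second derivative of Z can be evaluated as follows^{18}

$$\frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}} = |\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} \; \sum_{(k', l') \in \mathcal{L}_{kl}} L'_{i'j',ij}\, L'_{k'l',kl} \left( B_{i'j'}\, B_{k'l'} - B_{i'l'}\, B_{k'j'} \right) \tag{25}$$

Note that, in general, the second derivative would include a term involving ∇^{2}**L**, as it is derived from the product rule. Because **L** is a linear construction, its second derivative is zero and so we can drop this term.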
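The gradient expression above can be exercised directly. The following sketch assumes the Laplacian of §2.1, for which the gradient reduces to $\partial \mathrm{Z} / \partial w_{ij} = \mathrm{Z}\,(B_{jj} - B_{ij})$ with $B_{\rho j} = 0$; node 0 stands for the root, and the helper names and weights are illustrative. A single matrix inversion yields the derivative with respect to every edge weight.

```python
from fractions import Fraction

def inverse_and_det(M):
    """Gauss-Jordan elimination over exact fractions; returns (M^{-1}, det M).
    Assumes M is invertible (otherwise the pivot search raises StopIteration)."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(a == b)) for b in range(n)]
         for a, row in enumerate(M)]
    d = Fraction(1)
    for c in range(n):
        p = next(a for a in range(c, n) if A[a][c] != 0)
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        pivot = A[c][c]
        d *= pivot
        A[c] = [x / pivot for x in A[c]]
        for a in range(n):
            if a != c and A[a][c] != 0:
                f = A[a][c]
                A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return [row[n:] for row in A], d

def laplacian(w):
    """Laplacian over the non-root nodes 1..N (node 0 is the root)."""
    N = len(w) - 1
    L = [[Fraction(0)] * N for _ in range(N)]
    for j in range(1, N + 1):
        for i in range(N + 1):
            if i != j:
                L[j - 1][j - 1] += w[i][j]
                if i != 0:
                    L[i - 1][j - 1] -= w[i][j]
    return L

def Z_and_grad(w):
    """Return Z and dZ/dw_ij for every edge using one inverse of the Laplacian."""
    N = len(w) - 1
    Linv, Z = inverse_and_det(laplacian(w))
    # B is the transpose of L^{-1}; row 0 stands for the root and stays zero.
    B = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for a in range(1, N + 1):
        for b in range(1, N + 1):
            B[a][b] = Linv[b - 1][a - 1]
    grad = {(i, j): Z * (B[j][j] - B[i][j])
            for i in range(N + 1) for j in range(1, N + 1) if i != j}
    return Z, grad

w = [[0, 1, 2, 3], [0, 0, 4, 5], [0, 6, 0, 7], [0, 8, 9, 0]]  # hypothetical weights
N = len(w) - 1
Z, grad = Z_and_grad(w)
# Edge marginals p((i -> j) in d) = w_ij * dZ/dw_ij / Z; they sum to N because
# every tree contains exactly N edges.
marginals = {(i, j): w[i][j] * g / Z for (i, j), g in grad.items()}
```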

### 5.2 Complexity Analysis

The efficiency of our approach is rooted in the following result from automatic differentiation, which relates the cost of gradient evaluation to the cost of function evaluation. Given a function *f*, we denote the number of differentiable elementary operations (e.g., +, *, /, −, cos, pow) of *f* by $\mathrm{Cost}\{f\}$.

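**Theorem 2.** *For any function* $f : \mathbb{R}^K \mapsto \mathbb{R}^M$ *and any vector* $v \in \mathbb{R}^M$, *we can evaluate* $(\nabla f(x))^{\top} v \in \mathbb{R}^K$ *with cost satisfying the following bound via reverse-mode AD* (Griewank and Walther, 2008, page 44),

$$\mathrm{Cost}\{(\nabla f(x))^{\top} v\} \le c \cdot \mathrm{Cost}\{f\}$$

*for a small constant c that does not depend on f. Thus,* $\mathcal{O}(\mathrm{Cost}\{(\nabla f(x))^{\top} v\}) = \mathcal{O}(\mathrm{Cost}\{f\})$.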

As a special (and common) case, Theorem 2 implies a *cheap gradient principle*: The cost of evaluating the gradient of a function of one output (*M* = 1) is as fast as evaluating the function itself.

#### Algorithm T_{1}.

The cheap gradient principle tells us that ∇Z can be evaluated as quickly as Z itself, and that numerically accurate procedures for Z give rise to similarly accurate procedures for ∇Z. Additionally, many widely used software libraries can do this work for us, such as JAX, PyTorch, and TensorFlow. The runtime of evaluating Z is dominated by evaluating the determinant of the Laplacian matrix. Therefore, we can find both Z and ∇Z in the same complexity: $\mathcal{O}(N^3)$. Line 4 of Fig. 2 is a sum over $N^2$ scalar–vector multiplications of size *R*, which suggests a runtime of $\mathcal{O}(N^2 R)$. However, in many applications, *r* is a sparse function. Therefore, we find it useful to consider the complexities of our algorithms in terms of the size *R* and the maximum density *R′* of each $r_{ij}$. We can then evaluate Line 4 in $\mathcal{O}(N^2 R')$, leading to an overall runtime for T_{1} of $\mathcal{O}(N^3 + N^2 R')$. The call to Z uses $\mathcal{O}(N^2)$ space to store the Laplacian matrix. Computing the gradient of Z similarly takes $\mathcal{O}(N^2)$ space. Since storing $\overline{r}$ takes $\mathcal{O}(R)$ space, T_{1} has a space complexity of $\mathcal{O}(N^2 + R)$.

#### Algorithm $\mathrm{T}_2^{v}$.

Second-order quantities ($\overline{t}$) appear to require ∇^{2} Z and so do not directly fit the conditions of the cheap gradient principle: the Hessian (∇^{2} Z) is the Jacobian of the gradient. The approach of $\mathrm{T}_2^{v}$ to work around this is to make several calls to Theorem 2, one for each element of $\overline{r}$. In this case, the function in question is (11), which has output dimensionality *R*. Computing $\nabla \overline{r}$ can thus be evaluated with *R* calls to reverse-mode AD, requiring $\mathcal{O}(R(N^3 + N^2 R'))$ time. We can somewhat support fast accumulation of *S′*-sparse *s* in the summation of $\mathrm{T}_2^{v}$ (Line 6). Unfortunately, $\frac{\partial \overline{r}}{\partial w_{ij}}$ will generally be dense, so the cost of the outer product on Line 6 is $\mathcal{O}(R S')$. Thus, $\mathrm{T}_2^{v}$ has an overall runtime of $\mathcal{O}(R(N^3 + N^2 R') + N^2 R S')$.^{19} Additionally, $\mathrm{T}_2^{v}$ requires $\mathcal{O}(N^2 R + R S)$ space because $\mathcal{O}(N^2 R)$ is needed to compute and store the Jacobian of $\overline{r}$ and $\overline{t}$ has size $\mathcal{O}(R S)$.

#### Algorithm $\mathrm{T}_2^{h}$.

The downside of $\mathrm{T}_2^{v}$ is that no work is shared between the *R* evaluations of the loop on Line 3. For our computation of Z, it turns out that substantial work can be shared among evaluations. Specifically, ∇^{2} Z only relies on the inverse of the Laplacian matrix, as seen in (26), leading to an alternative algorithm for second-order quantities, $\mathrm{T}_2^{h}$. This is essentially the same observation made in Druck and Smith (2009). Exploiting this allows us to compute ∇^{2} Z in $\mathcal{O}(N^4)$ time. Note that this runtime is only achievable due to the sparsity of ∇**L**. The accumulation component (Line 12) of $\mathrm{T}_2^{h}$ can be done in $\mathcal{O}(N^4 R' S')$. Considering space complexity, while not evident in our pseudocode, a benefit of $\mathrm{T}_2^{h}$ is that we do not need to materialize the Hessian of Z as it only makes use of the inverse of the Laplacian matrix. Therefore, we only need $\mathcal{O}(N^2)$ space for the Laplacian inverse and $\mathcal{O}(R S)$ space for $\overline{t}$. Consequently, $\mathrm{T}_2^{h}$ requires $\mathcal{O}(N^2 + R S)$ space.

#### Algorithm T_{2}.

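The above analysis shows that, when *R* is small, $\mathrm{T}_2^{v}$ can be much faster than $\mathrm{T}_2^{h}$. On the other hand, when *R* is large and *R′* ≪ *R*, $\mathrm{T}_2^{h}$ can be much faster than $\mathrm{T}_2^{v}$. Can we get the best of $\mathrm{T}_2^{v}$ and $\mathrm{T}_2^{h}$? Our unified algorithm, T_{2} in Fig. 3, does just that. To derive it, we refactor the bottleneck of $\mathrm{T}_2^{h}$ using (25) and the distributive property^{20}

$$\sum_{(i \to j) \in \mathcal{E}} \sum_{(k \to l) \in \mathcal{E}} \frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}}\, w_{ij}\, w_{kl}\, r_{ij}\, s_{kl}^{\top} = \frac{1}{\mathrm{Z}}\, \overline{r}\, \overline{s}^{\top} - \mathrm{Z} \sum_{j', l'} \widehat{r}_{j'l'}\, \widehat{s}_{j'l'}^{\,\top}$$

where

$$\widehat{r}_{j'l'} = \sum_{(i \to j) \in \mathcal{E}} \; \sum_{i' :\, (i', j') \in \mathcal{L}_{ij}} w_{ij}\, L'_{i'j',ij}\, B_{i'l'}\, r_{ij} \qquad \widehat{s}_{j'l'} = \sum_{(k \to l) \in \mathcal{E}} \; \sum_{k' :\, (k', l') \in \mathcal{L}_{kl}} w_{kl}\, L'_{k'l',kl}\, B_{k'j'}\, s_{kl}$$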

Now, we can compute $\overline{r}$ and $\overline{s}$ using T_{1} in $\mathcal{O}(N^3 + N^2 (R' + S'))$ and their outer product in $\mathcal{O}(R S)$. Additionally, we can compute all $\widehat{r}_{j'l'}$ and $\widehat{s}_{j'l'}$ values in $\mathcal{O}(N^3 R')$ and $\mathcal{O}(N^3 S')$, respectively. If *r* is *R′*-sparse, then each $\widehat{r}_{j'l'}$ is $\overline{R} \overset{\mathrm{def}}{=} \min(R, N R')$-sparse. We can compute the sum over all $\widehat{r}_{j'l'}\, \widehat{s}_{j'l'}^{\,\top}$ in $\mathcal{O}(N^2\, \overline{R}\, \overline{S})$ time. Combining these runtimes, we have that T_{2} runs in $\mathcal{O}(N^3 (R' + S') + R S + N^2\, \overline{R}\, \overline{S})$. T_{2} requires a total of $\mathcal{O}(R S + N^2 (\overline{R} + \overline{S}))$ space: $\mathcal{O}(R S)$ space for $\overline{t}$, and $\mathcal{O}(N^2 (\overline{R} + \overline{S}))$ space for the $\widehat{r}$ and $\widehat{s}$ values.

We return to our original question: Can we get the best of $\mathrm{T}_2^{v}$ and $\mathrm{T}_2^{h}$? In the case when *R* is small, T_{2} matches the runtime of $\mathrm{T}_2^{v}$. Furthermore, in the case when *R* is large and *R′* ≪ *R*, T_{2} matches the runtime of $\mathrm{T}_2^{h}$. Therefore, T_{2} is able to achieve the best runtime regardless of the functions *r* and *s*.
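The refactoring behind the unified algorithm can be sketched in a few dozen lines. The code below is an illustration under stated assumptions rather than a transcription of Fig. 3: it uses the Laplacian of §2.1 (so each edge touches at most two Laplacian entries, both in the column of its child), scalar edge values (*R* = *S* = 1, where the outer product is an ordinary product), node 0 as the root, and hypothetical weights and edge values. The result is compared against brute-force enumeration.

```python
from fractions import Fraction
from itertools import product
from math import prod

def inverse_and_det(M):
    """Gauss-Jordan elimination over exact fractions; returns (M^{-1}, det M)."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(a == b)) for b in range(n)]
         for a, row in enumerate(M)]
    d = Fraction(1)
    for c in range(n):
        p = next(a for a in range(c, n) if A[a][c] != 0)  # assumes M is invertible
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        pivot = A[c][c]
        d *= pivot
        A[c] = [x / pivot for x in A[c]]
        for a in range(n):
            if a != c and A[a][c] != 0:
                f = A[a][c]
                A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return [row[n:] for row in A], d

def t2(w, r, s):
    """Second-order total sum_d w(d) r(d) s(d) for scalar edge values r[i, j], s[i, j]."""
    N = len(w) - 1
    nodes = range(1, N + 1)
    edges = [(i, j) for i in range(N + 1) for j in nodes if i != j]
    L = [[Fraction(0)] * N for _ in nodes]
    for i, j in edges:
        L[j - 1][j - 1] += w[i][j]
        if i != 0:
            L[i - 1][j - 1] -= w[i][j]
    Linv, Z = inverse_and_det(L)
    B = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]  # B = (L^{-1})^T, B[0][.] = 0
    for a in nodes:
        for b in nodes:
            B[a][b] = Linv[b - 1][a - 1]
    # Total weight of trees containing each edge: w_ij * dZ/dw_ij.
    wt = {(i, j): w[i][j] * Z * (B[j][j] - B[i][j]) for i, j in edges}
    r_bar = sum(wt[e] * r[e] for e in edges)
    s_bar = sum(wt[e] * s[e] for e in edges)
    first = sum(wt[e] * r[e] * s[e] for e in edges)      # pairs where both edges coincide
    r_hat = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    s_hat = [[Fraction(0)] * (N + 1) for _ in range(N + 1)]
    for i, j in edges:                                   # O(N^2) edges ...
        for m in nodes:                                  # ... times N columns of B
            coeff = w[i][j] * (B[j][m] - B[i][m])
            r_hat[j][m] += coeff * r[i, j]               # indexed (j', l') = (j, m)
            s_hat[m][j] += coeff * s[i, j]               # indexed (j', l') = (m, j)
    cross = sum(r_hat[a][b] * s_hat[a][b] for a in nodes for b in nodes)
    return first + r_bar * s_bar / Z - Z * cross

def brute_force_t(w, r, s):
    """Same total by explicit enumeration of all trees (exponential time)."""
    N, total = len(w) - 1, Fraction(0)
    for heads in product(range(N + 1), repeat=N):        # heads[j-1] is the head of node j
        ok = True
        for j in range(1, N + 1):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            ok = ok and j == 0
        if ok:
            d = [(h, j) for j, h in enumerate(heads, start=1)]
            total += prod(w[i][j] for i, j in d) * sum(r[e] for e in d) * sum(s[e] for e in d)
    return total

w = [[0, 1, 2, 3], [0, 0, 4, 5], [0, 6, 0, 7], [0, 8, 9, 0]]  # hypothetical weights
all_edges = [(i, j) for i in range(4) for j in range(1, 4) if i != j]
gold = {(0, 1), (1, 2), (2, 3)}                               # hypothetical gold tree
r = {(i, j): Fraction(i + 2 * j) for i, j in all_edges}       # arbitrary dense edge values
s = {e: Fraction(int(e in gold)) for e in all_edges}          # sparse indicator edge values

t_fast = t2(w, r, s)
t_enum = brute_force_t(w, r, s)
```

For vector-valued *r* and *s*, the scalar products would become outer products and the accumulators would hold sparse vectors, which is where the sparsity parameters in the runtime analysis enter.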

## 6 Applications and Prior Work

In this section, we apply our framework to compute a number of important quantities that are used when working with probabilistic models. We relate our approach to existing algorithms in the literature (where applicable), and mention existing and potential applications. Many of our quantities were covered in Li and Eisner (2009) for B-hypergraphs; we extend their results to spanning trees.

In most applications that involve training a probabilistic model, the edge weights in the model will be parameterized in some fashion. Traditional approaches (Koo et al., 2007; Smith and Smith, 2007; McDonald et al., 2005a; Druck, 2011) use log-linear parameterizations, whereas more recent work (Dozat and Manning, 2017; Liu and Lapata, 2018; Ma and Xia, 2014) use neural-network parameterizations. Our algorithms are agnostic as to how edges are parameterized.

### 6.1 Risk

Risk is the expected value, under *p*, of a function *r*(*d*) that scores a tree *d* against a gold tree *d*^{*}. In the context of dependency parsing, *r*(*d*) can be the labeled or unlabeled attachment score (LAS and UAS, respectively), both of which are additively decomposable. The unlabeled case decomposes as follows:

$$r_{ij} = \frac{1}{N}\, \mathbb{1}\{(i \to j) \in d^{*}\}$$

where *d*^{*} is the gold tree and *N* is the length of the sentence. Note that the use of $\frac{1}{N}$ ensures that *r*(*d*) will be a score between 0 and 1. We can then obtain the expected attachment score using T_{1}, and we can evaluate its gradient in the same runtime using reverse-mode AD or T_{2}. In this case, $s : \mathcal{D} \mapsto \mathbb{R}^S$ is the one-hot representation of the edges; thus, we have *S* = *N*^{2}. However, because *s* is 1-sparse, we have *S′* = 1. Additionally, as *r* does not depend on *w*, we do not need to add a first-order term to find the gradient. Therefore, the runtime for the gradient is also $\mathcal{O}(N^3)$.
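As a worked instance of the first-order machinery, take the unlabeled attachment score with edge values $r_{ij} = \frac{1}{N}\, \mathbb{1}\{(i \to j) \in d^{*}\}$ (the indicator notation is used here for illustration). Its expectation is the average marginal probability of the gold edges, each of which is a scaled partial derivative of Z:

$$\mathbb{E}_p[r] = \frac{1}{N} \sum_{(i \to j) \in d^{*}} p\big((i \to j) \in d\big) = \frac{1}{N\, \mathrm{Z}} \sum_{(i \to j) \in d^{*}} \frac{\partial \mathrm{Z}}{\partial w_{ij}}\, w_{ij}$$

Only the *N* gold edges contribute to the sum, so once ∇Z is available the accumulation costs $\mathcal{O}(N)$.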

### 6.2 Shannon Entropy

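The Shannon entropy of *p* is defined as

$$\mathrm{H}(p) \overset{\mathrm{def}}{=} -\sum_{d \in \mathcal{D}} p(d) \log p(d)$$

Smith and Eisner (2007) compute H(*p*) with an $\mathcal{O}(N^4)$ algorithm.^{21} Recall from §3 that $-\log p(d)$ is additively decomposable; thus, running T_{1} with $r_{ij} = \frac{1}{N} \log \mathrm{Z} - \log w_{ij}$ computes H(*p*) in $\mathcal{O}(N^3)$. Martins et al.’s (2010) algorithm for computing H(*p*) is precisely the same as ours. However, they do not describe how to compute its gradient. As with risk, we can find the gradient of entropy using T_{2} or using reverse-mode AD. When using T_{2}, since the gradient of *r* with respect to *w* is not 0, we add the first-order quantity $\mathrm{T}_1(w, \nabla r)$ as in (21). For entropy, we have that $\nabla r_{ij} = \frac{1}{N \mathrm{Z}} \nabla \mathrm{Z} - \frac{1}{w_{ij}} \mathbf{1}_{i \to j}$.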
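The entropy computation can be sketched as follows, assuming the Laplacian of §2.1 and strictly positive edge weights (so that every logarithm is defined). Node 0 is the root, and the function names and weights are illustrative. The fast version needs one matrix inverse; the brute-force version enumerates all trees and is included only as a check.

```python
import math
from fractions import Fraction
from itertools import product

def inverse_and_det(M):
    """Gauss-Jordan elimination over exact fractions; returns (M^{-1}, det M)."""
    n = len(M)
    A = [[Fraction(x) for x in row] + [Fraction(int(a == b)) for b in range(n)]
         for a, row in enumerate(M)]
    d = Fraction(1)
    for c in range(n):
        p = next(a for a in range(c, n) if A[a][c] != 0)  # assumes M is invertible
        if p != c:
            A[c], A[p] = A[p], A[c]
            d = -d
        pivot = A[c][c]
        d *= pivot
        A[c] = [x / pivot for x in A[c]]
        for a in range(n):
            if a != c and A[a][c] != 0:
                f = A[a][c]
                A[a] = [x - f * y for x, y in zip(A[a], A[c])]
    return [row[n:] for row in A], d

def entropy(w):
    """Shannon entropy (nats) of the tree distribution via edge marginals."""
    N = len(w) - 1
    edges = [(i, j) for i in range(N + 1) for j in range(1, N + 1) if i != j]
    L = [[Fraction(0)] * N for _ in range(N)]
    for i, j in edges:
        L[j - 1][j - 1] += w[i][j]
        if i != 0:
            L[i - 1][j - 1] -= w[i][j]
    Linv, Z = inverse_and_det(L)
    log_Z = math.log(float(Z))
    H = 0.0
    for i, j in edges:
        # dZ/dw_ij = Z * (B_jj - B_ij), with B the transpose of L^{-1} and B_0j = 0.
        b_ij = Linv[j - 1][i - 1] if i != 0 else 0
        marginal = w[i][j] * (Linv[j - 1][j - 1] - b_ij)   # p((i -> j) in d)
        H += float(marginal) * (log_Z / N - math.log(w[i][j]))  # edge value r_ij
    return H

def entropy_brute_force(w):
    """Entropy by explicit enumeration of all trees (exponential time)."""
    N, weights = len(w) - 1, []
    for heads in product(range(N + 1), repeat=N):  # heads[j-1] is the head of node j
        ok = True
        for j in range(1, N + 1):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            ok = ok and j == 0
        if ok:
            weights.append(math.prod(w[h][j] for j, h in enumerate(heads, start=1)))
    Z = sum(weights)
    return -sum(x / Z * math.log(x / Z) for x in weights)

w = [[0, 1, 2, 3], [0, 0, 4, 5], [0, 6, 0, 7], [0, 8, 9, 0]]  # hypothetical weights
H_fast = entropy(w)
H_enum = entropy_brute_force(w)
```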

#### Experiment.

We briefly demonstrate the practical speed-up over Smith and Eisner’s (2007) $\mathcal{O}(N^4)$ algorithm. We compare the average runtime per sentence of five different UD corpora.^{22} The languages have different average sentence lengths to demonstrate the extra speed-up gained when calculating the entropy of longer sentences (that is, $\mathcal{D}$ would be a larger set). Tab. 1 shows that even for a corpus of short sentences (Finnish), we achieve a 4 times speed-up. This increases to 15 times as we move to corpora with longer sentences (Arabic).

| Language | Sentence length | Entropy (nats / word) | Average runtime (ms), T_{1} (Fig. 2) | Average runtime (ms), past approach | Speed-up |
|---|---|---|---|---|---|
| Finnish | 9.23 | 0.6092 | 0.4623 | 1.882 | 4.1 |
| English | 12.45 | 0.8264 | 0.5102 | 2.778 | 5.4 |
| German | 17.56 | 0.8933 | 0.5583 | 4.104 | 7.3 |
| French | 24.65 | 0.8923 | 0.5635 | 5.742 | 10.2 |
| Arabic | 36.05 | 0.7163 | 0.6220 | 9.368 | 15.1 |


### 6.3 Kullback–Leibler Divergence

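The Kullback–Leibler (KL) divergence between two distributions over trees, *p* and *q*, is

$$\mathrm{KL}(p \,\|\, q) \overset{\mathrm{def}}{=} \sum_{d \in \mathcal{D}} p(d) \log \frac{p(d)}{q(d)}$$

When *q* is also edge-factored, with edge weights $q_{ij}$ and normalization constant $\mathrm{Z}_q$, the function $\log p(d) - \log q(d)$ is additively decomposable with $r_{ij} = \log w_{ij} - \log q_{ij} - \frac{1}{N}\left(\log \mathrm{Z} - \log \mathrm{Z}_q\right)$. Running T_{1} with these weights computes the KL divergence in $\mathcal{O}(N^3)$ time. To find the gradient of the KL divergence, we compute $\mathrm{T}_2(w, r, s)$, where we choose $s_{ij} = \frac{1}{w_{ij}} \mathbf{1}_{i \to j}$, and add $\mathrm{T}_1(w, \nabla r)$. For the KL divergence, we have that $\nabla r_{ij} = \frac{1}{w_{ij}} \mathbf{1}_{i \to j} - \frac{1}{N \mathrm{Z}} \nabla \mathrm{Z}$.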

### 6.4 Gradient of the GE Objective

The generalized expectation criterion (McCallum et al., 2007; Druck et al., 2009) is a method for semi-supervised training using weakly labeled data. GE fits model parameters by encouraging models to match certain expectation constraints, such as marginal-label distributions, on the unlabeled data. More formally, let *f* be a feature function with *f*(*d*) ∈ ℝ^{F}, and let *f*^{*} ∈ ℝ^{F} be a target value that has been specified using domain knowledge. For example, given an English part-of-speech tagged sentence, we can provide the following light supervision to our model: determiners should attach to the nearest noun on their right. This is an example of a high-precision heuristic for dependency parsing of English.

We note that by application of the chain rule, the gradient of the GE objective is a second-order quantity, and so we can use T_{2} to compute it. As we discussed in §1, the gradient of the GE has led to confusion in the literature (Druck et al., 2009; Druck and Smith, 2009; Druck, 2011). The best runtime bound prior to our work is Druck et al. (2009)’s $\mathcal{O}(N^4 F')$ algorithm. T_{2} is strictly better at $\mathcal{O}(N^3 + N^2 F')$ time.^{23} Alternatively, as the GE objective is a scalar, we can compute its gradient in $\mathcal{O}(N^3 + N^2 F')$ using reverse-mode AD. Druck (2011) acknowledges that AD can be used, but questions its practicality and numerical accuracy. We hope to dispel this misconception in the following experiment.
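To illustrate why the gradient is a second-order quantity, suppose the GE objective takes the common squared-error form $\frac{1}{2} \lVert \mathbb{E}_p[f] - f^{*} \rVert_2^2$ and that *f* does not depend on *w* (both are assumptions made for this sketch). Let $c = \mathbb{E}_p[f] - f^{*}$ be held fixed and define the scalar, additively decomposable function $g(d) = c^{\top} f(d)$. The chain rule then gives

$$\frac{\partial}{\partial w_{ij}}\, \tfrac{1}{2} \lVert \mathbb{E}_p[f] - f^{*} \rVert_2^2 = c^{\top} \frac{\partial\, \mathbb{E}_p[f]}{\partial w_{ij}} = \frac{1}{w_{ij}} \Big( \mathbb{E}_p\big[g(d)\, \mathbb{1}\{(i \to j) \in d\}\big] - \mathbb{E}_p[g]\; p\big((i \to j) \in d\big) \Big)$$

The first expectation pairs the one-dimensional function *g* with a one-hot edge indicator, which is a second-order expectation with *R* = 1 and *S′* = 1; the second term needs only first-order quantities.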

#### Experiment.

We compute the GE objective and its gradient for almost 1500 sentences of the English UD Treebank^{24} (Nivre et al., 2018) using 20 features extracted using the methodology of Druck et al. (2009). We note that T_{2} obtains a speed-up of 9 times over Druck and Smith (2009)’s strategy of materializing the covariance matrix (i.e., $T2h$). Additionally, the gradients from both approaches are equivalent with an absolute tolerance of 10^{−16}.

## 7 Conclusion

We presented a general framework for computing first- and second-order expectations for additively decomposable functions. We did this by exploiting a key connection between gradients and expectations that allows us to solve our problems using automatic differentiation. The algorithms we provide are simple, efficient, and extendable to many expectations. The automatic differentiation principle has been applied in other settings, such as weighted context-free grammars (Eisner, 2016) and chain-structured models (Vieira et al., 2016). We hope that this paper will also serve as a tutorial on how to compute expectations over trees so that the list of *cautionary tales* does not grow further. Particularly, we hope that this will allow for the KL divergence to be used in semi-supervised training of dependency parsers. Our aim is for our approach for computing expectations to be extended to other structured prediction models.

## Acknowledgments

We would like to thank action editor Dan Gildea and the three anonymous reviewers for their valuable feedback and suggestions. The first author is supported by the University of Cambridge School of Technology Vice-Chancellor’s Scholarship as well as by the University of Cambridge Department of Computer Science and Technology’s EPSRC.

## Notes

^{1}

The more precise graph-theoretic term is *arborescence*.

^{2}

For simplicity, we assume that the runtime of matrix determinants is $\mathcal{O}(N^3)$. However, we would be remiss if we did not mention that algorithms exist to compute the determinant more efficiently (Dumas and Pan, 2016).

^{3}

The reader may want to skip this section on their first reading.

^{4}

We follow the conventions of Koo et al. (2007) and say “single-root” and “multi-root” when we *technically* mean the number of outgoing edges from the root *ρ*, and *not* the number of root nodes in a tree, which is always one.

^{5}

The choice to replace row 1 by the root edges is done by convention; we can replace *any* row in the construction of $\widehat{\mathbf{L}}$.

^{6}

The algorithms given in later sections will not provide full details for the labeled case due to space constraints, but we assure the reader that our algorithms can be straightforwardly generalized to the labeled setting.

^{7}

In fact, Jerrum and Snir (1982) proved that the partition function for spanning trees requires an exponential number of additions and multiplications in the semiring model of computation (i.e., assuming that subtraction is not allowed). Interestingly, division is not required, but algorithms for division-free determinant computation run in $\mathcal{O}(N^4)$ (Kaltofen, 1992). An excellent overview of *the power of subtraction* in the context of dynamic programming is given in Miklós (2019, Ch. 3). It would appear as if commutative rings would make a good level of abstraction as they admit efficient determinant computation. Interestingly, this means that we cannot use the MTT in the max-product semiring to (efficiently) find the maximum weight tree because max does not have an inverse. Fortunately, there exist $\mathcal{O}(N^2)$ algorithms to find the maximum weight tree for both the single-root and multi-root settings (Zmigrod et al., 2020; Gabow and Tarjan, 1984).

^{8}

Of course, one could use sampling methods, such as Monte Carlo, to approximate (9) . Sampling methods may be efficient if the variance of *f* under *p* is not too large.

^{9}

Proof: $-\log p(d) = -\log\left(\frac{1}{\mathrm{Z}} \prod_{(i \to j) \in d} w_{ij}\right) = \log \mathrm{Z} - \sum_{(i \to j) \in d} \log w_{ij}$ ⇒ $r_{ij} = \frac{1}{N} \log \mathrm{Z} - \log w_{ij}$.

^{10}

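The $\ell_k$ norm of the distribution *p* is often denoted as $\lVert p \rVert_k \overset{\mathrm{def}}{=} \left(\sum_{d \in \mathcal{D}} p(d)^k\right)^{1/k}$ for *k* ≥ 0. It is computable from a zero^{th}-order expectation because it can be written as $\left(\frac{\mathrm{Z}^{(k)}}{\mathrm{Z}^k}\right)^{1/k}$ where $\mathrm{Z}^{(k)} = \sum_{d \in \mathcal{D}} w(d)^k = \sum_{d \in \mathcal{D}} \prod_{(i \to j) \in d} w_{ij}^k$, which is clearly a zero^{th}-order expectation. Similarly, the Rényi entropy of order *α* ≥ 0 with *α* ≠ 1 is $\mathrm{H}_\alpha(p) \overset{\mathrm{def}}{=} \frac{1}{1 - \alpha} \log \sum_{d \in \mathcal{D}} p(d)^\alpha = \frac{1}{1 - \alpha} \log \frac{\mathrm{Z}^{(\alpha)}}{\mathrm{Z}^\alpha}$.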

^{11}

Li and Eisner (2009, Section 5.1) provide a similar derivation to Proposition 3 and Proposition 4 for hypergraphs.

^{12}

Some authors (e.g., Wainwright and Jordan, 2008) prefer to work with an exponentiated representation $w_{ij} = \exp(\theta_{ij})$ so that $\nabla_{\theta_{ij}} \log \mathrm{Z} = p((i \to j) \in d)$. This avoids an explicit division by Z and multiplication by $w_{ij}$, as these operations happen by virtue of the chain rule.

^{13}

As each edge can only appear once in a tree, $\widetilde{w}_{ij,ij} = 0$.

^{14}

More precisely, $\frac{\partial r(d)}{\partial w_{ij}} = 0$ for all $d \in \mathcal{D}$ and $(i \to j) \in \mathcal{E}$.

^{15}

Note that when $w_{ij} = 0$, we can set $s_{ij} = \mathbf{0}$.

^{16}

The derivative of |**L**| can also be given using the matrix adjugate, ∇Z = adj(**L**)^{⊤}. There are benefits to using the adjugate as it is more numerically stable and equally efficient (Stewart, 1998). In fact, any algorithm that computes the determinant can be algorithmically differentiated to obtain an algorithm for the adjugate.

^{17}

We have that $|\mathcal{L}_{ij}| \le 2 |\mathcal{Y}|$ in the labeled case.

^{18}

We provide a derivation in Appendix A. Druck and Smith (2009) give a similar derivation for the Hessian, which we have generalized to any second-order quantity.

^{19}

If *S* < *R*, we can change the order of $\mathrm{T}_2^{v}$ to compute $\overline{t}^{\top}$ in $\mathcal{O}(S(N^3 + N^2 S') + N^2 R' S)$.

^{21}

Their algorithm calls MTT *N* times, where the *i*^{th} call to MTT multiplies the set of incoming edges to the *i*^{th} non-root node by their log weight.

^{22}

Times were measured using an Intel(R) Core(TM) i7-7500U processor with 16GB RAM.

^{23}

We must apply a chain rule in order to use T_{2}. To do this, we first run T_{1} to obtain $\overline{f}$ in $\mathcal{O}(N^3 + N^2 F')$. We then run T_{2} with the dot product of *f* and $\overline{f} - f^{*}$, which has a dimensionality of 1, and the sparse one-hot vectors as before. The execution of T_{2} then takes $\mathcal{O}(N^3)$, giving us the desired runtime. Full detail is available in our code.

^{24}

We used all sentences in the test set, which were between 5 and 150 words.

^{25}

Note that we do not have to take the derivative of $L'_{k'l',kl}$ as it is either 1 or −1.

## References

#### A Derivation of ∇^{2} Z

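In this section, we derive the expression for ∇^{2} Z given in (25). We begin by taking the derivative of ∇Z using (22)^{25}

$$\frac{\partial^2 \mathrm{Z}}{\partial w_{ij}\, \partial w_{kl}} = \frac{\partial |\mathbf{L}|}{\partial w_{kl}} \sum_{(i', j') \in \mathcal{L}_{ij}} B_{i'j'}\, L'_{i'j',ij} + |\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} \frac{\partial B_{i'j'}}{\partial w_{kl}}\, L'_{i'j',ij}$$

The first term of the product rule is

$$\frac{\partial |\mathbf{L}|}{\partial w_{kl}} \sum_{(i', j') \in \mathcal{L}_{ij}} B_{i'j'}\, L'_{i'j',ij} = |\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} \; \sum_{(k', l') \in \mathcal{L}_{kl}} L'_{i'j',ij}\, L'_{k'l',kl}\, B_{i'j'}\, B_{k'l'}$$

For the second term, the derivative of a matrix inverse, $\frac{\partial \mathbf{L}^{-1}}{\partial w_{kl}} = -\mathbf{L}^{-1} \frac{\partial \mathbf{L}}{\partial w_{kl}} \mathbf{L}^{-1}$, gives $\frac{\partial B_{i'j'}}{\partial w_{kl}} = -\sum_{(k', l') \in \mathcal{L}_{kl}} B_{i'l'}\, L'_{k'l',kl}\, B_{k'j'}$, so that

$$|\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} \frac{\partial B_{i'j'}}{\partial w_{kl}}\, L'_{i'j',ij} = -|\mathbf{L}| \sum_{(i', j') \in \mathcal{L}_{ij}} \; \sum_{(k', l') \in \mathcal{L}_{kl}} L'_{i'j',ij}\, L'_{k'l',kl}\, B_{i'l'}\, B_{k'j'}$$

Summing the two terms yields (25).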

#### B Proof of T_{2}

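In this section, we prove the correctness of T_{2}. First, recall from Proposition 7 that we may find $\overline{t}$ by

$$\overline{t} = \sum_{(i \to j) \in \mathcal{E}} \widetilde{w}_{ij}\, r_{ij}\, s_{ij}^{\top} + \sum_{(i \to j) \in \mathcal{E}} \sum_{(k \to l) \in \mathcal{E}} \widetilde{w}_{ij,kl}\, r_{ij}\, s_{kl}^{\top}$$

The first summand is the first-order total of the function with edge values $r_{ij}\, s_{ij}^{\top}$ (given as $\overline{f}$ in T_{2}). We can write a sum over all edges as the sum over pairs of nodes in $\mathcal{N}$. Similarly, elements in $\mathcal{L}_{ij}$ can be considered as pairs of nodes. Therefore, unless specified otherwise, we assume all variables in the base of a summation are scoped to $\mathcal{N}$. Then, by Proposition 5, (25), and the distributive property, the second summand can be rewritten

$$\sum_{i,j} \sum_{k,l} \widetilde{w}_{ij,kl}\, r_{ij}\, s_{kl}^{\top} = \frac{1}{\mathrm{Z}}\, \overline{r}\, \overline{s}^{\top} - \mathrm{Z} \sum_{j', l'} \widehat{r}_{j'l'}\, \widehat{s}_{j'l'}^{\,\top}$$

where

$$\widehat{r}_{j'l'} = \sum_{i,j} \sum_{i'} w_{ij}\, L'_{i'j',ij}\, B_{i'l'}\, r_{ij} \qquad \widehat{s}_{j'l'} = \sum_{k,l} \sum_{k'} w_{kl}\, L'_{k'l',kl}\, B_{k'j'}\, s_{kl}$$

Written this way, the summations range over the indices *i*, *j*, *k*, *l*, *i′*, *j′*, *k′*, and *l′*, which suggests that we can compute all $\widehat{r}_{j'l'}$ and $\widehat{s}_{j'l'}$ in $\mathcal{O}(N^5 (R' + S'))$. However, we can exploit the sparsity of ∇**L** to improve this. Specifically, the following algorithm computes $\widehat{r}_{j'l'}$ for all $j', l' \in \mathcal{N}$.

1. Initialize $\widehat{r}_{j'l'} \leftarrow \mathbf{0}$ for all $j', l' \in \mathcal{N}$.
2. For each edge $(i \to j) \in \mathcal{E}$, each $(i', j') \in \mathcal{L}_{ij}$, and each $l' \in \mathcal{N}$, add $w_{ij}\, L'_{i'j',ij}\, B_{i'l'}\, r_{ij}$ to $\widehat{r}_{j'l'}$.

Since there are $\mathcal{O}(N^2)$ edges, at most two elements in each $\mathcal{L}_{ij}$, and *N* choices of *l′*, this takes $\mathcal{O}(N^3 R')$ time. Each $\widehat{r}_{j'l'}$ has $\mathcal{O}(N)$ *R′*-sparse vectors added to it (by the inner loop). Hence, $\widehat{r}_{j'l'}$ is $\mathcal{O}(\overline{R})$ sparse where $\overline{R} \overset{\mathrm{def}}{=} \min(R, N R')$. This means that computing the sum of the outer products of all $\widehat{r}_{j'l'}$ and $\widehat{s}_{j'l'}$ can be done in $\mathcal{O}(N^2\, \overline{R}\, \overline{S})$. Then, given that we have $\overline{r}$, $\overline{s}$, and the first summand from T_{1}, combining these terms yields $\overline{t}$.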

## Author notes

Equal contribution.