We give a general framework for inference in spanning tree models. We propose unified algorithms for the important cases of first-order expectations and second-order expectations in edge-factored, non-projective spanning-tree models. Our algorithms exploit a fundamental connection between gradients and expectations, which allows us to derive efficient algorithms. These algorithms are easy to implement with or without automatic differentiation software. We motivate the development of our framework with several cautionary tales of previous research, which has developed numerous inefficient algorithms for computing expectations and their gradients. We demonstrate how our framework efficiently computes several quantities with known algorithms, including the expected attachment score, entropy, and generalized expectation criteria. As a bonus, we give algorithms for quantities that are missing in the literature, including the KL divergence. In all cases, our approach matches the efficiency of existing algorithms and, in several cases, reduces the runtime complexity by a factor of the sentence length. We validate the implementation of our framework through runtime experiments. We find our algorithms are up to 15 and 9 times faster than previous algorithms for computing the Shannon entropy and the gradient of the generalized expectation objective, respectively.

Dependency trees are a fundamental combinatorial structure in natural language processing. It follows that probability models over dependency trees are an important object of study. In terms of graph theory, one can view a (non-projective) dependency tree as an arborescence (commonly known as a spanning tree) of a graph. To build a dependency parser, we define a graph where the nodes are the tokens of the sentence, and the edges are possible dependency relations between the tokens. The edge weights are defined by a model, which is learned from data. In this paper, we focus on edge-factored models where the probability of a dependency tree is proportional to the product of the weights of its edges. As the number of trees is exponential in the length of the sentence, we require clever algorithms for finding the normalization constant. Fortunately, the normalization constant for edge-factored models is efficient to compute via the celebrated matrix–tree theorem.

The matrix–tree theorem (Kirchhoff, 1847)— more specifically, its counterpart for directed graphs (Tutte, 1984)—appeared before the NLP community in an onslaught of contemporaneous papers (Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007) that leverage the classic result to efficiently compute the normalization constant of a distribution over trees. The result is still used in more recent work (Ma and Hovy, 2017; Liu and Lapata, 2018). We build upon this tradition through a framework for computing expectations of a rich family of functions under a distribution over trees. Expectations appear in all aspects of the probabilistic modeling process: training, model validation, and prediction. Therefore, developing such a framework is key to accelerating progress in probabilistic modeling of trees.

Our framework is motivated by the lack of a unified approach for computing expectations over spanning trees in the literature. We believe this gap has resulted in the publication of numerous inefficient algorithms. We motivate the importance of developing such a framework by highlighting the following cautionary tales.

• McDonald and Satta (2007) proposed an inefficient $\mathcal{O}(N^5)$ algorithm for computing feature expectations, which was much slower than the $\mathcal{O}(N^3)$ algorithm obtained by Koo et al. (2007) and Smith and Smith (2007). The authors subsequently revised their paper.

• Smith and Eisner (2007) proposed an $\mathcal{O}(N^4)$ algorithm for computing entropy. Later, Martins et al. (2010) gave an $\mathcal{O}(N^3)$ method for entropy, but not its gradient. Our framework recovers Martins et al.’s (2010) algorithm, and additionally provides the gradient of entropy in $\mathcal{O}(N^3)$.

• Druck et al. (2009) proposed an $\mathcal{O}(N^5)$ algorithm for evaluating the gradient of the generalized expectation (GE) criterion (McCallum et al., 2007). The runtime bottleneck of their approach is the evaluation of a covariance matrix, which Druck and Smith (2009) later improved to $\mathcal{O}(N^4)$. We show that the gradient of the GE criterion can be evaluated in $\mathcal{O}(N^3)$.

We summarize our main results below:

• Unified Framework: We develop an algorithmic framework for calculating expectations over spanning arborescences. We give precise mathematical assumptions on the types of functions that are supported. We provide efficient algorithms that piggyback on automatic differentiation techniques, as our framework is rooted in a deep connection between expectations and gradients (Darwiche, 2003; Li and Eisner, 2009).

• Improvements to existing approaches: We give asymptotically faster algorithms for several quantities for which prior algorithms were known.

• Efficient algorithms for new quantities: We demonstrate how our framework calculates several new quantities, such as the Kullback–Leibler divergence, which (to our knowledge) had no prior algorithm in the literature.

• Practicality: We present practical speed-ups in the calculation of entropy compared to Smith and Eisner (2007). We observe speed-ups of between 4.1 and 15.1 times on five languages, depending on the typical sentence length. We also demonstrate a 9 times speed-up for evaluating the gradient of the GE objective compared to Druck and Smith (2009).

• Simplicity: Our algorithms are simple to implement, requiring only a few lines of PyTorch code (Paszke et al., 2019). We have released a reference implementation at the following URL: https://github.com/rycolab/tree_expectations.

We consider the distribution over trees in weighted directed graphs with a designated root node. A (rooted, weighted, and directed) graph is given by $G = (\mathcal{N}, \mathcal{E}, \rho)$. $\mathcal{N} = \{1, \ldots, N\} \cup \{\rho\}$ is a set of $N + 1$ nodes where $\rho$ is a designated root node. $\mathcal{E}$ is a set of weighted edges where each edge $(i \xrightarrow{w_{ij}} j) \in \mathcal{E}$ is a pair of distinct nodes such that the source node $i \in \mathcal{N}$ points to a destination node $j \in \mathcal{N}$ with an edge weight $w_{ij} \in \mathbb{R}$. We assume, without loss of generality, that the root node $\rho$ has no incoming edges. Furthermore, we assume that at most one edge can exist from any node $i$ to any node $j$. We consider the multi-graph case in §2.2.

In natural language processing applications, these weights are typically parametric functions, such as log-linear models (McDonald et al., 2005b) or neural networks (Dozat and Manning, 2017; Ma and Hovy, 2017), which are learned from data.

A tree¹ $d$ of a graph $G$ is a set of $N$ edges such that all non-root nodes $j$ have exactly one incoming edge and the root node $\rho$ has at least one outgoing edge. Furthermore, a tree does not contain any cycles. We denote the set of all trees in a graph by $\mathcal{D}$ and assume that $|\mathcal{D}| > 0$ (this is not necessarily true for all graphs).

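The weight of a tree $d \in \mathcal{D}$ is defined as:
$w(d) \overset{\text{def}}{=} \prod_{(i \to j) \in d} w_{ij}$
(1)
Normalizing the weight of each tree yields a probability distribution:
$p(d) \overset{\text{def}}{=} \frac{w(d)}{Z}$
(2)
where the normalization constant is defined as
$Z \overset{\text{def}}{=} \sum_{d \in \mathcal{D}} w(d) = \sum_{d \in \mathcal{D}} \prod_{(i \to j) \in d} w_{ij}$
(3)
Of course, for (2) to be a proper distribution, we require $w_{ij} \geq 0$ for all $(i \to j) \in \mathcal{E}$, and $Z > 0$.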
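To make definitions (1)–(3) concrete, the following is a minimal brute-force sketch that enumerates every tree of a tiny graph and normalizes the tree weights. The node encoding (node 0 as the root $\rho$), the function names, and the example weights are illustrative assumptions rather than part of the released implementation; the enumeration is exponential in $N$ and is only meant as a reference against which faster algorithms can be checked.

```python
from itertools import product

ROOT = 0  # assumed encoding: node 0 is the root, nodes 1..n are the tokens


def reaches_root(j, heads):
    """Follow head pointers from node j; return False if a cycle is hit first."""
    seen = set()
    while j != ROOT:
        if j in seen:
            return False
        seen.add(j)
        j = heads[j - 1]
    return True


def enumerate_trees(weights, n):
    """Yield each tree as a tuple of (head, modifier) edges.

    `weights[(i, j)]` is the weight of edge i -> j. Every non-root node picks
    exactly one head; only the cycle-free choices are spanning trees.
    """
    candidates = [[i for i in range(n + 1) if (i, j) in weights]
                  for j in range(1, n + 1)]
    for heads in product(*candidates):
        if all(reaches_root(j, heads) for j in range(1, n + 1)):
            yield tuple((heads[j - 1], j) for j in range(1, n + 1))


def tree_weight(tree, weights):
    """Eq. (1): product of the edge weights of a tree."""
    w = 1.0
    for edge in tree:
        w *= weights[edge]
    return w


# Hypothetical two-token graph: root->1, root->2, 1->2, 2->1.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0, (2, 1): 4.0}
trees = list(enumerate_trees(weights, n=2))
Z = sum(tree_weight(d, weights) for d in trees)        # eq. (3)
p = {d: tree_weight(d, weights) / Z for d in trees}    # eq. (2)
```

In this example the choice "1 heads 2 and 2 heads 1" is rejected as a cycle, leaving three trees whose probabilities sum to one.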

### 2.1  The Matrix–Tree Theorem

The normalization constant $Z$ involves a sum over $\mathcal{D}$, which can grow exponentially large with $N$. Fortunately, there is sufficient structure in the computation of $Z$ that it can be evaluated in $\mathcal{O}(N^3)$ time. The Matrix–Tree Theorem (MTT) (Tutte, 1984; Kirchhoff, 1847) establishes a connection between $Z$ and the determinant of the Laplacian matrix, $L \in \mathbb{R}^{N \times N}$. For all $i, j \in \mathcal{N} \setminus \{\rho\}$,
$L_{ij} \overset{\text{def}}{=} \begin{cases} \sum_{i' \in \mathcal{N} \setminus \{j\}} w_{i'j} & \text{if } i = j \\ -w_{ij} & \text{otherwise} \end{cases}$
(4)
Theorem 1
(Matrix–Tree Theorem; Tutte (1984, p. 140)). For any graph,
$Z = |L|$
(5)
Furthermore, the normalization constant can be computed in $\mathcal{O}(N^3)$ time.²
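As an illustration of Theorem 1, the sketch below builds the Laplacian of (4) and evaluates its determinant with plain Gaussian elimination. The node encoding (node 0 as $\rho$), the helper names, and the example weights are assumptions made for this example; in practice one would call a linear-algebra library instead of the hand-written `det`.

```python
ROOT = 0  # assumed encoding: node 0 is the root, nodes 1..n are the tokens


def laplacian(weights, n):
    """Multi-root Laplacian of eq. (4); row/column k-1 corresponds to node k."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[j - 1][j - 1] += w        # diagonal: total weight entering node j
        if i != ROOT:
            L[i - 1][j - 1] -= w    # off-diagonal: -w_ij
    return L


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting, O(N^3)."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result        # a row swap flips the sign
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def partition(weights, n):
    """Eq. (5): Z = |L|."""
    return det(laplacian(weights, n))


# Hypothetical two-token graph: root->1, root->2, 1->2, 2->1.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0, (2, 1): 4.0}
Z = partition(weights, n=2)

# Complete graph on three tokens with unit weights: Z counts the trees.
unit = {(i, j): 1.0 for i in range(4) for j in range(1, 4) if i != j}
num_trees = partition(unit, n=3)
```

For the two-token graph the three trees have weights $1 \cdot 2$, $1 \cdot 3$, and $2 \cdot 4$, so the determinant should agree with their sum up to rounding.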

### 2.2  Dependency Parsing and the Laplacian Zoo

Graph-based dependency parsing can be encoded as follows. For each sentence of length $N$, we create a graph $G = (\mathcal{N}, \mathcal{E}, \rho)$ where each non-root node represents a token of the sentence, and $\rho$ represents a special root symbol of the sentence. Each edge $(i \to j)$ in the graph represents a possible dependency relation between head word $i$ and modifier word $j$. Fig. 1 gives an example dependency tree. In the remainder of this section, we give several variations on the Laplacian matrix that encode different sets of valid trees.³

Figure 1:

Example of a dependency tree.


In many cases of dependency parsing, we want $\rho$ to have exactly one outgoing edge. This is motivated by linguistic theory, where the root of a sentence should be a token in the sentence rather than a special root symbol (Tesnière, 1959). There are exceptions to this, such as parsing Twitter (Kong et al., 2014) and parsing specific languages (e.g., the Prague Treebank [Bejček et al., 2013]). We call these multi-root trees,⁴ and they are represented by the set $\mathcal{D}$, as described earlier. Therefore, the normalization constant over all multi-root trees can be computed by a direct application of Theorem 1.

However, in most dependency parsing corpora, only one edge may emanate from the root (Nivre et al., 2018; Zmigrod et al., 2020). Thus, we consider the set of single-rooted trees, denoted $\mathcal{D}^{(1)}$. Koo et al. (2007) adapt Theorem 1 to efficiently compute $Z$ for the set $\mathcal{D}^{(1)}$ with the root-weighted Laplacian,⁵ $\hat{L} \in \mathbb{R}^{N \times N}$
$\hat{L}_{ij} = \begin{cases} w_{\rho j} & \text{if } i = 1 \\ \sum_{i' \in \mathcal{N} \setminus \{\rho, j\}} w_{i'j} & \text{if } i = j \\ -w_{ij} & \text{otherwise} \end{cases}$
(6)
Proposition 1.
For any graph, the normalization constant over all single-rooted trees is given by the determinant of the root-weighted Laplacian (Koo et al., 2007, Prop. 1)
$Z = |\hat{L}|$
(7)
Furthermore, the normalization constant for single-rooted trees can be computed in $\mathcal{O}(N^3)$ time.
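As a small worked example of (4) and (6), consider a hypothetical two-token graph with weights $w_{\rho 1} = a$, $w_{\rho 2} = b$, $w_{12} = c$, and $w_{21} = d$ (these symbols are introduced only for this illustration). The two Laplacians and their determinants are

```latex
L = \begin{pmatrix} a + d & -c \\ -d & b + c \end{pmatrix},
\qquad |L| = ab + ac + bd,
\\[6pt]
\hat{L} = \begin{pmatrix} a & b \\ -d & c \end{pmatrix},
\qquad |\hat{L}| = ac + bd .
```

The two determinants differ by $ab$, which is exactly the weight of the only tree in which both tokens attach to the root; that tree is valid in the multi-root setting but excluded from $\mathcal{D}^{(1)}$.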

#### Labeled Trees.

To encode labeled dependency relations in our set of trees, we simply augment edges with labels, resulting in a multi-graph in which multiple edges may exist between pairs of nodes. Now, edges take the form $(i \xrightarrow{y / w_{ijy}} j)$ where $i$ and $j$ are the source and destination nodes as before, $y \in \mathcal{Y}$ is the label, and $w_{ijy}$ is their weight.

Proposition 2.
For any multi-graph, the normalization constant for multi-root or single-rooted trees can be calculated using Theorem 1 or Proposition 1 (respectively) with the edge weights,
$w_{ij} = \sum_{y \in \mathcal{Y}} w_{ijy}$
(8)
Furthermore, the normalization constant can be computed in $\mathcal{O}(N^3 + |\mathcal{Y}| N^2)$ time.⁶
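The reduction in (8) amounts to a single pass over the labeled edges, which is where the $|\mathcal{Y}| N^2$ term comes from. A minimal sketch follows; the dictionary layout keyed by `(head, modifier, label)` and the label names are assumptions chosen for this example.

```python
from collections import defaultdict


def collapse_labels(labeled_weights):
    """Eq. (8): sum w_ijy over labels y to obtain unlabeled weights w_ij.

    `labeled_weights[(i, j, y)]` is the weight of edge i -> j with label y.
    """
    collapsed = defaultdict(float)
    for (i, j, _label), w in labeled_weights.items():
        collapsed[(i, j)] += w
    return dict(collapsed)


# Hypothetical labeled edges; node 0 plays the role of the root.
labeled = {(0, 1, "nsubj"): 0.5, (0, 1, "obj"): 1.5, (1, 2, "det"): 2.0}
collapsed = collapse_labels(labeled)
```

The collapsed weights can then be passed to whichever Laplacian construction matches the desired set of trees.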

#### Summary.

We give common settings in which the MTT can be adapted to efficiently compute $Z$ for different sets of trees. The choice is dependent upon the task of interest, and one must be careful to choose the correct Laplacian configuration. The results we present in this paper are modular in the specific choice of Laplacian. For the remainder of this paper, we assume the unlabeled tree setting and will refer to the set of trees as simply $\mathcal{D}$ and our choice of Laplacian as $L$.

In this section, we characterize the family of expectations that our framework supports. Our framework is an extension of Li and Eisner (2009) to distributions over spanning trees. In contrast, their framework considers expectations over distributions that can be factored as B-hypergraphs (Gallo et al., 1993). Our distributions over trees cannot be cast as polynomial-size B-hypergraphs. Another important distinction between our framework and that of Li and Eisner (2009) is that we do not use the semiring abstraction, as it is algebraically too weak to compute the determinant efficiently.⁷

The expected value of a function $f: \mathcal{D} \mapsto \mathbb{R}^F$ is defined as follows
$\mathbb{E}_d[f(d)] \overset{\text{def}}{=} \sum_{d \in \mathcal{D}} p(d) f(d)$
(9)
Without any assumptions on $f$, computing (9) is intractable.⁸ In the remainder of this section, we will characterize a class of functions $f$ whose expectations can be efficiently computed.
The first type of functions we consider are functions that are additively decomposable along the edges of the tree. Formally, a function $r: \mathcal{D} \mapsto \mathbb{R}^R$ is additively decomposable if it can be written as
$r(d) = \sum_{(i \to j) \in d} r_{ij}$
(10)
where, with a slight abuse of notation, for any function $r: \mathcal{D} \mapsto \mathbb{R}^R$ we write $r_{ij} \in \mathbb{R}^R$ for the vector of values that $r$ assigns to the edge $(i \to j)$. An example of an additively decomposable function is $r(d) = -\log p(d)$, whose expectation gives the Shannon entropy.⁹ Other first-order expectations include the expected attachment score and the Kullback–Leibler divergence. We demonstrate how to compute these in our framework in §6.1 and §6.3, respectively.
The second type of functions we consider are functions that are second-order additively decomposable along the edges of the tree. Formally, a function $t: \mathcal{D} \mapsto \mathbb{R}^{R \times S}$ is second-order additively decomposable if it can be written as the outer product of two additively decomposable functions, $r: \mathcal{D} \mapsto \mathbb{R}^R$ and $s: \mathcal{D} \mapsto \mathbb{R}^S$
$t(d) = r(d) s(d)^\top$
(11)
Thus, $t(d) \in \mathbb{R}^{R \times S}$ is generally a matrix.

An example of such a function is the gradient of entropy (see §6.2) or the GE objective (McCallum et al., 2007) (see §6.4) with respect to the edge weights. Another example of a second-order additively decomposable function is the covariance matrix. Given two feature functions $r: \mathcal{D} \mapsto \mathbb{R}^R$ and $s: \mathcal{D} \mapsto \mathbb{R}^S$, their covariance matrix is $\mathbb{E}_d[r(d) s(d)^\top] - \mathbb{E}_d[r(d)] \, \mathbb{E}_d[s(d)]^\top$. Thus, it is a second-order additively decomposable function as long as $r(d)$ and $s(d)$ are additively decomposable.

One family of functions that can be computed efficiently, but that we will not explore here, consists of those that are multiplicatively decomposable over the edges. A function $q: \mathcal{D} \mapsto \mathbb{R}^Q$ is multiplicatively decomposable if it can be written as
$q(d) = \prod_{(i \to j) \in d} q_{ij}$
(12)
where the product of the $q_{ij}$ is an element-wise vector product. The expectations of these functions form a family that we will call zeroth-order expectations; they can be computed with a constant number of calls to the MTT (usually two or three). Examples of these include the Rényi entropy and $p$-norms.¹⁰

In this section, we build upon a fundamental connection between gradients and expectations (Darwiche, 2003; Li and Eisner, 2009). This connection allows us to build on work in automatic differentiation to obtain efficient gradient algorithms. While the propositions in this section are inspired by past work, we believe that these propositions and their proofs have not previously been clearly presented.¹¹ We find it convenient to work with unnormalized expectations, or totals (for short). We denote the total of a function $f$ as $\bar{f} \overset{\text{def}}{=} \sum_{d \in \mathcal{D}} w(d) f(d)$. We recover the expectation with $\mathbb{E}_d[f(d)] = \bar{f} / Z$. We note that totals (on their own) may be of interest in some applications (Vieira and Eisner, 2017, Section 5.3).

### The First-Order Case.

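Specifically, the partial derivative $\frac{\partial Z}{\partial w_{ij}}$ is useful for determining the total weight of trees which include the edge $(i \to j)$,
$\widetilde{w_{ij}} \overset{\text{def}}{=} \sum_{d \in \mathcal{D}_{ij}} w(d)$
(13)
where $\mathcal{D}_{ij} \overset{\text{def}}{=} \{d \in \mathcal{D} \mid (i \to j) \in d\}$. Furthermore, $p((i \to j) \in d) = \widetilde{w_{ij}} / Z = \frac{w_{ij}}{Z} \frac{\partial Z}{\partial w_{ij}}$.¹²

Proposition 3.
For any edge $(i \to j)$,
$\widetilde{w_{ij}} = \frac{\partial Z}{\partial w_{ij}} w_{ij}$
(14)

Proof.
$\begin{aligned} \widetilde{w_{ij}} &= \sum_{d \in \mathcal{D}_{ij}} w(d) = \sum_{d \in \mathcal{D}_{ij}} \prod_{(i' \to j') \in d} w_{i'j'} \\ &= w_{ij} \sum_{d \in \mathcal{D}_{ij}} \prod_{(i' \to j') \in d \setminus \{i \to j\}} w_{i'j'} \\ &= w_{ij} \frac{\partial}{\partial w_{ij}} \sum_{d \in \mathcal{D}_{ij}} \prod_{(i' \to j') \in d} w_{i'j'} \\ &= w_{ij} \frac{\partial}{\partial w_{ij}} \sum_{d \in \mathcal{D}} \prod_{(i' \to j') \in d} w_{i'j'} = \frac{\partial Z}{\partial w_{ij}} w_{ij} \end{aligned}$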
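To see Proposition 3 at work on a case small enough to check by hand, take a hypothetical two-token graph with weights $w_{\rho 1} = a$, $w_{\rho 2} = b$, $w_{12} = c$, and $w_{21} = d$ (symbols introduced only for this illustration). Its three multi-root trees give $Z = ab + ac + bd$, and differentiating with respect to $a$ picks out exactly the trees that contain the edge $(\rho \to 1)$:

```latex
Z = ab + ac + bd, \qquad
\frac{\partial Z}{\partial a} = b + c, \qquad
\widetilde{w_{\rho 1}} = a \, \frac{\partial Z}{\partial a} = ab + ac, \qquad
p\big((\rho \to 1) \in d\big) = \frac{ab + ac}{ab + ac + bd}.
```

The tree with weight $bd$ does not contain $(\rho \to 1)$, so it does not depend on $a$ and drops out of the derivative.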

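Proposition 4 will establish a connection between the unnormalized expectation $\bar{r}$ and $\nabla Z$.

Proposition 4.
For any additively decomposable function $r: \mathcal{D} \mapsto \mathbb{R}^R$, the total $\bar{r}$ can be computed using a gradient–vector product
$\bar{r} = \sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}} r_{ij}$
(15)

Proof.
$\begin{aligned} \bar{r} &= \sum_{d \in \mathcal{D}} w(d) r(d) = \sum_{d \in \mathcal{D}} w(d) \sum_{(i \to j) \in d} r_{ij} = \sum_{d \in \mathcal{D}} \sum_{(i \to j) \in d} w(d) r_{ij} \\ &= \sum_{(i \to j) \in \mathcal{E}} \sum_{d \in \mathcal{D}_{ij}} w(d) r_{ij} = \sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}} r_{ij} \end{aligned}$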
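The following is a minimal sketch of how Propositions 3 and 4 combine into a first-order total. It is not the efficient algorithm: because $Z$ is affine in each individual edge weight, the sketch obtains $\frac{\partial Z}{\partial w_{ij}}$ as the exact difference $Z(w + e_{ij}) - Z(w)$, which costs one determinant per edge. The node encoding (node 0 as $\rho$), the function names, and the example weights and features are assumptions for illustration only.

```python
ROOT = 0  # assumed encoding: node 0 is the root, nodes 1..n are the tokens


def laplacian(weights, n):
    """Multi-root Laplacian of eq. (4)."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[j - 1][j - 1] += w
        if i != ROOT:
            L[i - 1][j - 1] -= w
    return L


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n, result = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return result


def edge_totals(weights, n):
    """Return Z and the edge totals of eq. (14), w~_ij = w_ij * dZ/dw_ij."""
    Z = det(laplacian(weights, n))
    totals = {}
    for edge, w in weights.items():
        bumped = dict(weights)
        bumped[edge] = w + 1.0   # Z is affine in w_ij, so a unit step is exact
        totals[edge] = w * (det(laplacian(bumped, n)) - Z)
    return Z, totals


def first_order_total(weights, n, r):
    """Eq. (15): r_bar = sum over edges of w~_ij * r_ij (r_ij is a list)."""
    Z, totals = edge_totals(weights, n)
    size = len(next(iter(r.values())))
    r_bar = [0.0] * size
    for edge, w_tilde in totals.items():
        for k in range(size):
            r_bar[k] += w_tilde * r[edge][k]
    return Z, r_bar


# Hypothetical two-token graph and a 2-dimensional edge function:
# dimension 0 counts edges, dimension 1 scores edges of a made-up gold tree.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0, (2, 1): 4.0}
gold = {(0, 1), (1, 2)}
r = {e: [1.0, 0.5 if e in gold else 0.0] for e in weights}
Z, r_bar = first_order_total(weights, 2, r)
expectation = [x / Z for x in r_bar]
```

Since every tree has exactly $N = 2$ edges, the first coordinate of `expectation` should equal 2, which provides a quick sanity check on the edge totals.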

### The Second-Order Case.

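We can similarly use $\frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}}$ to determine the total weight of trees which include both $(i \to j)$ and $(k \to l)$ with $(i \to j) \neq (k \to l)$¹³
$\widetilde{w_{ij,kl}} \overset{\text{def}}{=} \sum_{d \in \mathcal{D}_{ij,kl}} w(d)$
(16)
where $\mathcal{D}_{ij,kl} \overset{\text{def}}{=} \{d \in \mathcal{D} \mid (i \to j) \in d, (k \to l) \in d\}$. Furthermore, $\frac{\widetilde{w_{ij,kl}}}{Z} = p((i \to j) \in d, (k \to l) \in d)$.

Proposition 5.
For any pair of edges $(i \to j)$ and $(k \to l)$ such that $(i \to j) \neq (k \to l)$,
$\widetilde{w_{ij,kl}} = \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{ij} w_{kl}$
(17)

Proof.
$\begin{aligned} \widetilde{w_{ij,kl}} &= \sum_{d \in \mathcal{D}_{ij,kl}} w(d) = \sum_{d \in \mathcal{D}_{ij,kl}} \prod_{(k' \to l') \in d} w_{k'l'} \\ &= w_{ij} w_{kl} \frac{\partial^2}{\partial w_{ij} \partial w_{kl}} \sum_{d \in \mathcal{D}} \prod_{(i' \to j') \in d} w_{i'j'} = \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{ij} w_{kl} \end{aligned}$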

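Proposition 6 will relate $\nabla^2 Z$ to $\nabla \bar{r}$. This will be used in Proposition 7 to establish a connection between the total $\bar{t}$ and $\nabla^2 Z$, and additionally establishes a connection between $\bar{t}$ and $\nabla \bar{r}$.

Proposition 6.
For any additively decomposable function $r: \mathcal{D} \mapsto \mathbb{R}^R$ that does not depend on $w$,¹⁴ and edge $(i \to j) \in \mathcal{E}$,
$w_{ij} \frac{\partial \bar{r}}{\partial w_{ij}} = \widetilde{w_{ij}} r_{ij} + \sum_{(k \to l) \in \mathcal{E}} \widetilde{w_{ij,kl}} r_{kl}$
(18)

Proof.
$\begin{aligned} w_{ij} \frac{\partial \bar{r}}{\partial w_{ij}} &= w_{ij} \frac{\partial}{\partial w_{ij}} \sum_{(k \to l) \in \mathcal{E}} \frac{\partial Z}{\partial w_{kl}} w_{kl} r_{kl} \\ &= w_{ij} \frac{\partial Z}{\partial w_{ij}} r_{ij} + w_{ij} \sum_{(k \to l) \in \mathcal{E}} \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{kl} r_{kl} \\ &= \widetilde{w_{ij}} r_{ij} + \sum_{(k \to l) \in \mathcal{E}} \widetilde{w_{ij,kl}} r_{kl} \end{aligned}$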

Proposition 7.
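For any second-order additively decomposable function $t: \mathcal{D} \mapsto \mathbb{R}^{R \times S}$, which is expressed as the outer product of additively decomposable functions, $r: \mathcal{D} \mapsto \mathbb{R}^R$ and $s: \mathcal{D} \mapsto \mathbb{R}^S$, $t(d) = r(d) s(d)^\top$, where $r$ does not depend on $w$, the total $\bar{t}$ can be computed using a Jacobian–matrix product
$\bar{t} = \sum_{(i \to j) \in \mathcal{E}} \frac{\partial \bar{r}}{\partial w_{ij}} w_{ij} s_{ij}^\top$
(19)
or a Hessian–matrix product
$\bar{t} = \sum_{(i \to j) \in \mathcal{E}} \left[ \widetilde{w_{ij}} r_{ij} s_{ij}^\top + \sum_{(k \to l) \in \mathcal{E}} \widetilde{w_{ij,kl}} r_{ij} s_{kl}^\top \right]$
(20)

Proof.
We first prove (19)
$\begin{aligned} \bar{t} &= \sum_{d \in \mathcal{D}} w(d) r(d) s(d)^\top = \sum_{d \in \mathcal{D}} w(d) r(d) \sum_{(i \to j) \in d} s_{ij}^\top = \sum_{d \in \mathcal{D}} \sum_{(i \to j) \in d} w(d) r(d) s_{ij}^\top \\ &= \sum_{(i \to j) \in \mathcal{E}} \sum_{d \in \mathcal{D}_{ij}} w(d) r(d) s_{ij}^\top = \sum_{(i \to j) \in \mathcal{E}} w_{ij} \left[ \frac{\partial}{\partial w_{ij}} \sum_{d \in \mathcal{D}} w(d) r(d) \right] s_{ij}^\top \\ &= \sum_{(i \to j) \in \mathcal{E}} w_{ij} \frac{\partial \bar{r}}{\partial w_{ij}} s_{ij}^\top \end{aligned}$
Then (20) immediately follows by substituting (18) into (19) and expanding the summation.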
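Identity (19) can be checked numerically on a tiny graph. The sketch below does so by brute force for scalar-valued $r$ and $s$ ($R = S = 1$): it computes $\bar{t}$ directly from its definition and again through the Jacobian form, using the fact that $\bar{r}$ is affine in each edge weight when $r$ does not depend on $w$, so a unit forward difference gives $\frac{\partial \bar{r}}{\partial w_{ij}}$ exactly. The node encoding, function names, weights, and edge values are all illustrative assumptions, and the enumeration is exponential in $N$.

```python
from itertools import product

ROOT = 0  # assumed encoding: node 0 is the root, nodes 1..n are the tokens


def _reaches_root(j, heads):
    """Follow head pointers from node j; False if a cycle is hit first."""
    seen = set()
    while j != ROOT:
        if j in seen:
            return False
        seen.add(j)
        j = heads[j - 1]
    return True


def spanning_trees(weights, n):
    """Brute-force enumeration of trees as tuples of (head, modifier) edges."""
    candidates = [[i for i in range(n + 1) if (i, j) in weights]
                  for j in range(1, n + 1)]
    for heads in product(*candidates):
        if all(_reaches_root(j, heads) for j in range(1, n + 1)):
            yield tuple((heads[j - 1], j) for j in range(1, n + 1))


def total(weights, n, f):
    """Total of f: sum over trees of w(d) * f(d)."""
    out = 0.0
    for d in spanning_trees(weights, n):
        w = 1.0
        for e in d:
            w *= weights[e]
        out += w * f(d)
    return out


def additive(edge_values):
    """Build an additively decomposable scalar function from edge values."""
    return lambda d: sum(edge_values[e] for e in d)


# Hypothetical two-token graph with made-up scalar edge values.
n = 2
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0, (2, 1): 4.0}
r = {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 2.0, (2, 1): -1.0}
s = {(0, 1): 0.5, (0, 2): 1.0, (1, 2): 0.0, (2, 1): 3.0}
r_fn, s_fn = additive(r), additive(s)

# Definition: t_bar = sum_d w(d) r(d) s(d).
t_bar_direct = total(weights, n, lambda d: r_fn(d) * s_fn(d))

# Eq. (19): t_bar = sum_ij w_ij * (d r_bar / d w_ij) * s_ij.
r_bar = total(weights, n, r_fn)
t_bar_jacobian = 0.0
for e, w in weights.items():
    bumped = dict(weights)
    bumped[e] = w + 1.0
    d_r_bar = total(bumped, n, r_fn) - r_bar   # exact: r_bar is affine in w_e
    t_bar_jacobian += w * d_r_bar * s[e]
```

Both routes should agree up to rounding; the efficient algorithms replace the brute-force totals and finite differences with determinant-based computations.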

### Remark.

There is a simple recipe to compute $\nabla \bar{r}_n$ for each $n = 1, \ldots, R$. First, some notation: let $\vec{1}_{ij}$ be a vector over $\mathcal{E}$ with a 1 in dimension $(i \to j)$, and zeros elsewhere. By plugging $[r_{ij}]_n$ and $s_{ij} = \frac{1}{w_{ij}} \vec{1}_{ij}$ into (19), we can compute $\bar{t}_n = \nabla \bar{r}_n$.¹⁵ However, if $r$ depends on $w$, we must add the following first-order term, which is due to the product rule
$\nabla \bar{r}_n = \bar{t}_n + \underbrace{\sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}} \nabla [r_{ij}]_n}_{\text{first-order term}}$
(21)
We provide the details for computing the gradients of two first-order quantities, Shannon entropy and the KL divergence, using this recipe in §6.2 and §6.3, respectively.

Having reduced the computation of $\bar{r}$ and $\bar{t}$ to finding derivatives of $Z$ in §4, we now describe efficient algorithms that exploit this connection. The main algorithmic ideas used in this section are based on automatic differentiation (AD) techniques (Griewank and Walther, 2008). These are general-purpose techniques for efficiently evaluating gradients given algorithms that evaluate the functions. In our setting, the algorithm in question is an efficient procedure for evaluating $Z$, such as the procedure we described in §2.1. While our algorithms use the derivatives that we provide in §5.1, these can also be evaluated using any AD library, such as JAX (Bradbury et al., 2018), PyTorch (our choice) (Paszke et al., 2019), or TensorFlow (Abadi et al., 2015).

Proposition 4 is realized as $T_1$ in Fig. 2, and (19) and (20) are realized as $T_2^v$ and $T_2^h$ in Fig. 3, respectively. We provide the runtime complexity of each step in the algorithms. These will be discussed in more detail in §5.2.

Figure 2:

Algorithm for first-order totals.

Figure 3:

Three algorithms for computing second-order totals. We recommend $T_2$ as it achieves the best runtime in general. The algorithms $T_2^v$ and $T_2^h$ are presented for pedagogical purposes in §5.2.

### 5.1  Derivatives of Z

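All three algorithms rely on first- or second-order derivatives of $Z$. Since $Z = |L|$, we can express its gradient via Jacobi’s formula and an application of the chain rule¹⁶
$\frac{\partial Z}{\partial w_{ij}} = Z \sum_{(i', j') \in \mathcal{L}_{ij}} B_{i'j'} L'_{i'j',ij}$
(22)
where
$B = L^{-\top}$
(23)
is the transpose of $L^{-1}$, $L'_{i'j',ij} = \frac{\partial L_{i'j'}}{\partial w_{ij}}$, and $\mathcal{L}_{ij}$ is the set of pairs $(i', j')$ such that $L'_{i'j',ij} \neq 0$. We define $B_{\rho j'} \overset{\text{def}}{=} 0$ for any $j' \in \mathcal{N}$. Koo et al. (2007) show that for any $i$ and $j$, $|\mathcal{L}_{ij}| \leq 2$ in the unlabeled case; indeed, $L'_{i'j',ij}$ is given by
$L'_{i'j',ij} = \begin{cases} 1 & \text{if } i' \in \{1, j\},\ j' = j \\ -1 & \text{if } i' = i,\ j' = j \\ 0 & \text{otherwise} \end{cases}$
(24)
Their result holds for any Laplacian encoding we gave in §2.2.¹⁷
The second derivative of $Z$ can be evaluated as follows¹⁸
$\frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} = \sum_{(i', j') \in \mathcal{L}_{ij}} \sum_{(k', l') \in \mathcal{L}_{kl}} L'_{i'j',ij} \frac{\partial^2 Z}{\partial L_{i'j'} \partial L_{k'l'}} L'_{k'l',kl}$
(25)
where
$\frac{\partial^2 Z}{\partial L_{i'j'} \partial L_{k'l'}} = Z \left( B_{i'j'} B_{k'l'} - B_{i'l'} B_{k'j'} \right)$
(26)
Note that, in general, the product rule contributes an additional term to (25) involving $\nabla^2 L$. Because $L$ is a linear construction, its second derivative is zero and so we can drop this term.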
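For the multi-root Laplacian of (4), formulas (22)–(24) reduce to $\frac{\partial Z}{\partial w_{ij}} = Z (B_{jj} - B_{ij})$ with $B_{\rho j} = 0$. The sketch below implements this special case with a hand-written Gauss–Jordan routine so that it stays dependency-free; the node encoding (node 0 as $\rho$), the function names, and the example weights are assumptions, and the root-weighted Laplacian of (6) would need the corresponding entries of (24) instead.

```python
ROOT = 0  # assumed encoding: node 0 is the root, nodes 1..n are the tokens


def laplacian(weights, n):
    """Multi-root Laplacian of eq. (4)."""
    L = [[0.0] * n for _ in range(n)]
    for (i, j), w in weights.items():
        L[j - 1][j - 1] += w
        if i != ROOT:
            L[i - 1][j - 1] -= w
    return L


def det_and_inverse(matrix):
    """Gauss-Jordan elimination with partial pivoting: (determinant, inverse)."""
    n = len(matrix)
    a = [row[:] + [float(r == c) for c in range(n)]
         for r, row in enumerate(matrix)]
    determinant = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[pivot][c] == 0.0:
            raise ValueError("singular Laplacian (Z = 0)")
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            determinant = -determinant
        pivot_value = a[c][c]
        determinant *= pivot_value
        a[c] = [x / pivot_value for x in a[c]]
        for r in range(n):
            if r != c:
                factor = a[r][c]
                a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return determinant, [row[n:] for row in a]


def grad_Z(weights, n):
    """dZ/dw_ij = Z * (B_jj - B_ij), where B = L^{-T} and B_{rho j} = 0."""
    Z, L_inv = det_and_inverse(laplacian(weights, n))
    grad = {}
    for (i, j) in weights:
        # B[a][b] = L_inv[b][a] because B is the transpose of the inverse.
        b_jj = L_inv[j - 1][j - 1]
        b_ij = 0.0 if i == ROOT else L_inv[j - 1][i - 1]
        grad[(i, j)] = Z * (b_jj - b_ij)
    return Z, grad


# Hypothetical two-token graph: root->1, root->2, 1->2, 2->1.
weights = {(0, 1): 1.0, (0, 2): 2.0, (1, 2): 3.0, (2, 1): 4.0}
Z, grad = grad_Z(weights, 2)
```

Writing the weights as $a, b, c, d$ in the order above, $Z = ab + ac + bd$, so the gradient should come out as $(b + c,\ a + d,\ a,\ b)$; all $N^2$ partial derivatives are read off a single matrix inverse.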

### 5.2  Complexity Analysis

The efficiency of our approach is rooted in the following result from automatic differentiation, which relates the cost of gradient evaluation to the cost of function evaluation. Given a function $f$, we denote the number of differentiable elementary operations (e.g., +, *, /, −, cos, pow) of $f$ by $\mathrm{Cost}\{f\}$.

Theorem 2
(Cheap Jacobian–vector Products) For any function $f: \mathbb{R}^K \mapsto \mathbb{R}^M$ and any vector $v \in \mathbb{R}^M$, we can evaluate $(\nabla f(x))^\top v \in \mathbb{R}^K$ with cost satisfying the following bound via reverse-mode AD (Griewank and Walther, 2008, page 44),
$\mathrm{Cost}\{(\nabla f(x))^\top v\} \leq 4 \cdot \mathrm{Cost}\{f\}$
(27)
Thus, $\mathcal{O}\{(\nabla f(x))^\top v\} = \mathcal{O}\{f\}$.

As a special (and common) case, Theorem 2 implies a cheap gradient principle: The cost of evaluating the gradient of a function of one output (M = 1) is as fast as evaluating the function itself.

#### Algorithm T1.

The cheap gradient principle tells us that $\nabla Z$ can be evaluated as quickly as $Z$ itself, and that numerically accurate procedures for $Z$ give rise to similarly accurate procedures for $\nabla Z$. Additionally, many widely used software libraries can do this work for us, such as JAX, PyTorch, and TensorFlow. The runtime of evaluating $Z$ is dominated by evaluating the determinant of the Laplacian matrix. Therefore, we can find both $Z$ and $\nabla Z$ in the same complexity: $\mathcal{O}(N^3)$. Line 4 of Fig. 2 is a sum over $N^2$ scalar–vector multiplications of size $R$, which suggests a runtime of $\mathcal{O}(N^2 R)$. However, in many applications, $r$ is a sparse function. Therefore, we find it useful to consider the complexities of our algorithms in terms of the size $R$ and the maximum density $R'$ of each $r_{ij}$. We can then evaluate Line 4 in $\mathcal{O}(N^2 R')$, leading to an overall runtime for $T_1$ of $\mathcal{O}(N^3 + N^2 R')$. The call to $Z$ uses $\mathcal{O}(N^2)$ space to store the Laplacian matrix. The gradient of $Z$ similarly takes $\mathcal{O}(N^2)$ space to store. Since storing $\bar{r}$ takes $\mathcal{O}(R)$ space, $T_1$ has a space complexity of $\mathcal{O}(N^2 + R)$.

#### Algorithm $T_2^v$.

Second-order quantities ($\bar{t}$) appear to require $\nabla^2 Z$ and so do not directly fit the conditions of the cheap gradient principle: the Hessian ($\nabla^2 Z$) is the Jacobian of the gradient. The approach of $T_2^v$ to work around this is to apply Theorem 2 once for each element of $\bar{r}$. In this case, the function in question is $\bar{r}$ (see (15)), which has output dimensionality $R$. Computing $\nabla \bar{r}$ can thus be done with $R$ calls to reverse-mode AD, requiring $\mathcal{O}(R (N^3 + N^2 R'))$ time. We can partly exploit sparsity by supporting fast accumulation of the $S'$-sparse $s$ in the summation of $T_2^v$ (Line 6). Unfortunately, $\frac{\partial \bar{r}}{\partial w_{ij}}$ will generally be dense, so the cost of the outer product on Line 6 is $\mathcal{O}(R S')$. Thus, $T_2^v$ has an overall runtime of $\mathcal{O}(R (N^3 + N^2 R') + N^2 R S')$.¹⁹ Additionally, $T_2^v$ requires $\mathcal{O}(N^2 R + RS)$ space because $\mathcal{O}(N^2 R)$ is needed to compute and store the Jacobian of $\bar{r}$, and $\bar{t}$ has size $\mathcal{O}(RS)$.

#### Algorithm $T_2^h$.

The downside of $T_2^v$ is that no work is shared between the $R$ evaluations of the loop on Line 3. For our computation of $Z$, it turns out that substantial work can be shared among evaluations. Specifically, $\nabla^2 Z$ only relies on the inverse of the Laplacian matrix, as seen in (26), leading to an alternative algorithm for second-order quantities, $T_2^h$. This is essentially the same observation made in Druck and Smith (2009). Exploiting this allows us to compute $\nabla^2 Z$ in $\mathcal{O}(N^4)$ time. Note that this runtime is only achievable due to the sparsity of $\nabla L$. The accumulation component (Line 12) of $T_2^h$ can be done in $\mathcal{O}(N^4 R' S')$. Regarding space complexity, a benefit of $T_2^h$ (although this is not apparent in our pseudocode) is that we do not need to materialize the Hessian of $Z$, as the algorithm only makes use of the inverse of the Laplacian matrix. Therefore, we only need $\mathcal{O}(N^2)$ space for the Laplacian inverse and $\mathcal{O}(RS)$ space for $\bar{t}$. Consequently, $T_2^h$ requires $\mathcal{O}(N^2 + RS)$ space.

#### Algorithm T2.

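So far we have seen that when $R$ is small, $T_2^v$ can be much faster than $T_2^h$. On the other hand, when $R$ is large and $R' \ll R$, $T_2^h$ can be much faster than $T_2^v$. Can we get the best of $T_2^v$ and $T_2^h$? Our unified algorithm, $T_2$ in Fig. 3, does just that. To derive it, we refactor the bottleneck of $T_2^h$ using (25) and the distributive property²⁰
$\sum_{(i \to j) \in \mathcal{E}} \sum_{(k \to l) \in \mathcal{E}} \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{ij} w_{kl} r_{ij} s_{kl}^\top = \frac{1}{Z} \bar{r} \bar{s}^\top - Z \sum_{j', l' \in \mathcal{N}} \hat{r}_{j'l'} \hat{s}_{j'l'}^\top$
(28)
where
$\hat{r}_{j'l'} = \sum_{(k \to l) \in \mathcal{E}} \sum_{k' \in \mathcal{N}} B_{k'j'} L'_{k'l',kl} w_{kl} r_{kl}$
(29)
$\hat{s}_{j'l'} = \sum_{(i \to j) \in \mathcal{E}} \sum_{i' \in \mathcal{N}} B_{i'l'} L'_{i'j',ij} w_{ij} s_{ij}$
(30)
The remainder of $\bar{t}$ is given by
$\bar{f} \overset{\text{def}}{=} \sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}} r_{ij} s_{ij}^\top$
(31)
Therefore, we can find $\bar{t}$ by
$\bar{t} = \bar{f} + \frac{1}{Z} \bar{r} \bar{s}^\top - Z \sum_{j', l' \in \mathcal{N}} \hat{r}_{j'l'} \hat{s}_{j'l'}^\top$
(32)
We provide a proof in App. B.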

Now, we can compute $\bar{r}$ and $\bar{s}$ using $T_1$ in $\mathcal{O}(N^3 + N^2 (R' + S'))$ and their outer product in $\mathcal{O}(RS)$. Additionally, we can compute all $\hat{r}_{j'l'}$ and $\hat{s}_{j'l'}$ values in $\mathcal{O}(N^3 R')$ and $\mathcal{O}(N^3 S')$, respectively. If $r$ is $R'$-sparse, then each $\hat{r}_{j'l'}$ is $\bar{R}$-sparse, where $\bar{R} \overset{\text{def}}{=} \min(R, N R')$; we define $\bar{S}$ analogously for $s$. We can compute the sum over all $\hat{r}_{j'l'} \hat{s}_{j'l'}^\top$ in $\mathcal{O}(N^2 \bar{R} \bar{S})$ time. Combining these runtimes, we have that $T_2$ runs in $\mathcal{O}(N^3 (R' + S') + RS + N^2 \bar{R} \bar{S})$. $T_2$ requires a total of $\mathcal{O}(RS + N^2 (\bar{R} + \bar{S}))$ space: $\mathcal{O}(RS)$ space for $\bar{t}$, and $\mathcal{O}(N^2 (\bar{R} + \bar{S}))$ space for the $\hat{r}$ and $\hat{s}$ values.

We return to our original question: Can we get the best of $T_2^v$ and $T_2^h$? In the case when $R$ is small, $T_2$ matches the runtime of $T_2^v$. Furthermore, in the case when $R$ is large and $R' \ll R$, $T_2$ matches the runtime of $T_2^h$. Therefore, $T_2$ is able to achieve the best runtime regardless of the functions $r$ and $s$.

In this section, we apply our framework to compute a number of important quantities that are used when working with probabilistic models. We relate our approach to existing algorithms in the literature (where applicable), and mention existing and potential applications. Many of our quantities were covered in Li and Eisner (2009) for B-hypergraphs; we extend their results to spanning trees.

In most applications that involve training a probabilistic model, the edge weights in the model will be parameterized in some fashion. Traditional approaches (Koo et al., 2007; Smith and Smith, 2007; McDonald et al., 2005a; Druck, 2011) use log-linear parameterizations, whereas more recent work (Dozat and Manning, 2017; Liu and Lapata, 2018; Ma and Xia, 2014) use neural-network parameterizations. Our algorithms are agnostic as to how edges are parameterized.

### 6.1  Risk

Risk minimization is a technique for training structured prediction models (Li and Eisner, 2009; Smith and Eisner, 2006; Stoyanov and Eisner, 2012). Risk is the expectation of a cost function $r: \mathcal{D} \mapsto \mathbb{R}$ that measures the number of mistakes in comparison to a target tree $d^*$. In the context of dependency parsing, $r(d)$ can be the labeled or unlabeled attachment score (LAS and UAS, respectively), both of which are additively decomposable. The unlabeled case decomposes as follows:
$r_{ij} = \begin{cases} \frac{1}{N} & \text{if } (i \to j) \in d^* \\ 0 & \text{otherwise} \end{cases}$
(33)
where $d^*$ is the gold tree and $N$ is the length of the sentence. Note that the use of $\frac{1}{N}$ ensures that $r(d)$ will be a score between 0 and 1. We can then obtain the expected attachment score using $T_1$, and we can evaluate its gradient in the same runtime using reverse-mode AD or $T_2$. In this case, $s: \mathcal{D} \mapsto \mathbb{R}^S$ is the one-hot representation of the edges; thus, we have $S = N^2$. However, because $s$ is 1-sparse, we have $S' = 1$. Additionally, as $r$ does not depend on $w$, we do not need to add a first-order term to find the gradient. Therefore, the runtime for the gradient is also $\mathcal{O}(N^3)$.

### 6.2  Shannon Entropy

Entropy is a useful measure of uncertainty, which has been used a number of times in dependency parsing (Smith and Eisner, 2007; Druck and Smith, 2009; Ma and Xia, 2014) for semi-supervised learning. Smith and Eisner (2007) employ entropy regularization (Grandvalet and Bengio, 2004) to bootstrap dependency parsing. However, they give an algorithm for the Shannon entropy,
$H(p) \overset{\text{def}}{=} \mathbb{E}_d[-\log p(d)]$
(34)
that runs in $\mathcal{O}(N^4)$.²¹ Recall from §3 that $-\log p(d)$ is additively decomposable; thus, running $T_1$ with $r_{ij} = \frac{1}{N} \log Z - \log w_{ij}$ computes $H(p)$ in $\mathcal{O}(N^3)$. Martins et al.’s (2010) algorithm for computing $H(p)$ is precisely the same as ours. However, they do not describe how to compute its gradient. As with risk, we can find the gradient of entropy using $T_2$ or using reverse-mode AD. When using $T_2$, since the gradient of $r$ with respect to $w$ is not 0, we add the first-order quantity $T_1(w, \nabla r)$ as in (21). For entropy, we have that $\nabla r_{ij} = \frac{1}{NZ} \nabla Z - \frac{1}{w_{ij}} \vec{1}_{ij}$.
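The choice of $r_{ij}$ for entropy can be unpacked in a few lines. The derivation below is a sketch that uses only definitions (1)–(2), Proposition 4, and the fact that every tree has exactly $N$ edges, which is what allows the constant $\log Z$ to be spread evenly over the edges of a tree.

```latex
\begin{aligned}
H(p) &= -\sum_{d \in \mathcal{D}} p(d) \log p(d)
      = \sum_{d \in \mathcal{D}} p(d) \Big( \log Z - \sum_{(i \to j) \in d} \log w_{ij} \Big) \\
     &= \sum_{d \in \mathcal{D}} p(d) \sum_{(i \to j) \in d}
        \Big( \tfrac{1}{N} \log Z - \log w_{ij} \Big)
      = \frac{1}{Z} \sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}}
        \Big( \tfrac{1}{N} \log Z - \log w_{ij} \Big) \\
     &= \log Z - \sum_{(i \to j) \in \mathcal{E}} p\big((i \to j) \in d\big) \log w_{ij}.
\end{aligned}
```

The last line uses $\sum_{(i \to j) \in \mathcal{E}} \widetilde{w_{ij}} = N Z$ and shows that the entropy only requires $Z$ and the edge marginals.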

#### Experiment.

We briefly demonstrate the practical speed-up over Smith and Eisner’s (2007) $\mathcal{O}(N^4)$ algorithm. We compare the average runtime per sentence on five different UD corpora.²² The languages have different average sentence lengths to demonstrate the extra speed-up gained when calculating the entropy of longer sentences (that is, $\mathcal{D}$ would be a larger set). Tab. 1 shows that even for a corpus of short sentences (Finnish), we achieve a 4 times speed-up. This increases to 15 times as we move to corpora with longer sentences (Arabic).

Table 1:

Average runtime of computing entropy of dependency parser output on five languages. We use the weights of the Stanford Dependency Parser (Qi et al., 2018). The past approach is that of Smith and Eisner (2007).

| Language | Sentence length | Entropy (nats / word) | Average runtime of $T_1$ (Fig. 2) (ms) | Average runtime of past approach (ms) | Speed-up |
|---|---|---|---|---|---|
| Finnish | 9.23 | 0.6092 | 0.4623 | 1.882 | 4.1 |
| English | 12.45 | 0.8264 | 0.5102 | 2.778 | 5.4 |
| German | 17.56 | 0.8933 | 0.5583 | 4.104 | 7.3 |
| French | 24.65 | 0.8923 | 0.5635 | 5.742 | 10.2 |
| Arabic | 36.05 | 0.7163 | 0.6220 | 9.368 | 15.1 |

### 6.3  Kullback–Leibler Divergence

To the best of our knowledge, no algorithms to compute the Kullback–Leibler (KL) divergence between two graph-based parsers (nor its gradient) have been given in the literature. We show how this can be achieved easily within our framework. The KL divergence is defined as
$\mathrm{KL}(p \,\|\, q) \overset{\text{def}}{=} \sum_{d \in \mathcal{D}} p(d) \log \frac{p(d)}{q(d)}$
(35)
This takes a similar form to the Shannon entropy in (34). We can therefore choose our additively decomposable function to be $r_{ij} = \log \frac{w_{ij}}{q_{ij}} - \frac{1}{N} \log Z$. Running $T_1$ with these weights computes the KL divergence in $\mathcal{O}(N^3)$ time. To find the gradient of the KL divergence, we return the sum of $T_2(w, r, s)$, where we choose $s_{ij} = \frac{1}{w_{ij}} \vec{1}_{ij}$, and $T_1(w, \nabla r)$. For the KL divergence, we have that $\nabla r_{ij} = \frac{1}{w_{ij}} \vec{1}_{ij} - \frac{1}{NZ} \nabla Z$.

### 6.4  Gradient of the GE Objective

The generalized expectation criterion (McCallum et al., 2007; Druck et al., 2009) is a method for semi-supervised training using weakly labeled data. GE fits model parameters by encouraging models to match certain expectation constraints, such as marginal-label distributions, on the unlabeled data. More formally, let $f$ be a feature function with $f(d) \in \mathbb{R}^F$, and let $f^* \in \mathbb{R}^F$ be a target value that has been specified using domain knowledge. For example, given an English part-of-speech tagged sentence, we can provide the following light supervision to our model: determiners should attach to the nearest noun on their right. This is an example of a high-precision heuristic for dependency parsing English.

GE then minimizes the following objective,
$\mathrm{GE}(p, f^*) = \frac{1}{2} \left\| \mathbb{E}_d[f(d)] - f^* \right\|^2$
(36)
which encourages the model parameters to match the target expectations. Most methods for optimizing (36) will make use of the gradient.

We note that by application of the chain rule, the gradient of the GE objective is a second-order quantity, and so we can use $T_2$ to compute it. As we discussed in §1, the gradient of the GE has led to confusion in the literature (Druck et al., 2009; Druck and Smith, 2009; Druck, 2011). The best runtime bound prior to our work is Druck et al. (2009)’s $\mathcal{O}(N^4 F')$ algorithm. $T_2$ is strictly better at $\mathcal{O}(N^3 + N^2 F')$ time.²³ Alternatively, as the GE objective is a scalar, we can compute its gradient in $\mathcal{O}(N^3 + N^2 F')$ using reverse-mode AD. Druck (2011) acknowledges that AD can be used, but questions its practicality and numerical accuracy. We hope to dispel this misconception in the following experiment.
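The chain-rule step can be spelled out as follows. This is a sketch under the assumption that the edge features $f_{ij}$ do not depend on $w$; it uses the total $\bar{f} = \sum_{d \in \mathcal{D}} w(d) f(d)$, the quotient rule, and Proposition 6, and writes $\mathbb{1}[\cdot]$ for an indicator (notation introduced only here).

```latex
\begin{aligned}
\frac{\partial\, \mathrm{GE}(p, f^*)}{\partial w_{ij}}
  &= \big( \mathbb{E}_d[f(d)] - f^* \big)^{\top}
     \frac{\partial\, \mathbb{E}_d[f(d)]}{\partial w_{ij}},
\qquad
\frac{\partial\, \mathbb{E}_d[f(d)]}{\partial w_{ij}}
   = \frac{1}{Z} \frac{\partial \bar{f}}{\partial w_{ij}}
   - \frac{\bar{f}}{Z^2} \frac{\partial Z}{\partial w_{ij}}, \\
w_{ij} \frac{\partial\, \mathbb{E}_d[f(d)]}{\partial w_{ij}}
  &= \mathbb{E}_d\big[ f(d)\, \mathbb{1}[(i \to j) \in d] \big]
   - \mathbb{E}_d[f(d)]\; p\big((i \to j) \in d\big).
\end{aligned}
```

The second line is the covariance between $f(d)$ and the indicator of the edge $(i \to j)$, which is why the gradient of the GE objective is a second-order quantity and why materializing the full covariance matrix is one (more expensive) way to obtain it.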

#### Experiment.

We compute the GE objective and its gradient for almost 1500 sentences of the English UD Treebank$^{24}$ (Nivre et al., 2018) using 20 features extracted using the methodology of Druck et al. (2009). We note that T2 obtains a speed-up of 9 times over Druck and Smith (2009)’s strategy of materializing the covariance matrix (i.e., $\mathrm{T}_2^{h}$). Additionally, the gradients from both approaches are equivalent with an absolute tolerance of $10^{-16}$.

## 7  Conclusion

We presented a general framework for computing first- and second-order expectations for additively decomposable functions. We did this by exploiting a key connection between gradients and expectations that allows us to solve our problems using automatic differentiation. The algorithms we provide are simple, efficient, and extendable to many expectations. The automatic differentiation principle has been applied in other settings, such as weighted context-free grammars (Eisner, 2016) and chain-structured models (Vieira et al., 2016). We hope that this paper will also serve as a tutorial on how to compute expectations over trees so that the list of cautionary tales does not grow further. In particular, we hope that this will allow for the KL divergence to be used in semi-supervised training of dependency parsers. Our aim is for our approach for computing expectations to be extended to other structured prediction models.

## Acknowledgments

We would like to thank action editor Dan Gildea and the three anonymous reviewers for their valuable feedback and suggestions. The first author is supported by the University of Cambridge School of Technology Vice-Chancellor’s Scholarship as well as by the University of Cambridge Department of Computer Science and Technology’s EPSRC.

## Notes

1

The more precise graph-theoretic term is arborescence.

2

For simplicity, we assume that the runtime of matrix determinants is $O(N^3)$. However, we would be remiss if we did not mention that algorithms exist to compute the determinant more efficiently (Dumas and Pan, 2016).

3

The reader may want to skip this section on their first reading.

4

We follow the conventions of Koo et al. (2007) and say “single-root” and “multi-root” when we technically mean the number of outgoing edges from the root ρ, and not the number of root nodes in a tree, which is always one.

5

The choice to replace row 1 by the root edges is made by convention; we can replace any row in the construction of $\hat{L}$.

6

The algorithms given in later sections will not provide full details for the labeled case due to space constraints, but we assure the reader that our algorithms can be straightforwardly generalized to the labeled setting.

7

In fact, Jerrum and Snir (1982) proved that the partition function for spanning trees requires an exponential number of additions and multiplications in the semiring model of computation (i.e., assuming that subtraction is not allowed). Interestingly, division is not required, but algorithms for division-free determinant computation run in $O(N^4)$ (Kaltofen, 1992). An excellent overview of the power of subtraction in the context of dynamic programming is given in Miklós (2019, Ch. 3). It would appear as if commutative rings would make a good level of abstraction as they admit efficient determinant computation. Interestingly, this means that we cannot use the MTT in the max-product semiring to (efficiently) find the maximum weight tree because max does not have an inverse. Fortunately, there exist $O(N^2)$ algorithms to find the maximum weight tree for both the single-root and multi-root settings (Zmigrod et al., 2020; Gabow and Tarjan, 1984).

8

Of course, one could use sampling methods, such as Monte Carlo, to approximate (9). Sampling methods may be efficient if the variance of $f$ under $p$ is not too large.

9

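Proof: $-\log p(d) = -\log\left(\frac{1}{Z} \prod_{(i \to j) \in d} w_{ij}\right) = \log Z - \sum_{(i \to j) \in d} \log w_{ij}$ $\Rightarrow$ $r_{ij} = \frac{1}{N}\log Z - \log w_{ij}$.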

10

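The $k$-norm of the distribution $p$ is often denoted as $\lVert p \rVert_k \overset{\mathrm{def}}{=} \left(\sum_{d \in D} p(d)^k\right)^{1/k}$ for $k \geq 0$. It is computable from a zeroth-order expectation because it can be written as $\left(\frac{Z^{(k)}}{Z^k}\right)^{1/k}$ where $Z^{(k)} = \sum_{d \in D} w(d)^k = \sum_{d \in D} \prod_{(i \to j) \in d} w_{ij}^k$, which is clearly a zeroth-order expectation. Similarly, the Rényi entropy of order $\alpha \geq 0$ with $\alpha \neq 1$ is $H_\alpha(p) \overset{\mathrm{def}}{=} \frac{1}{1-\alpha} \log \sum_{d \in D} p(d)^\alpha = \frac{1}{1-\alpha} \log \frac{Z^{(\alpha)}}{Z^\alpha}$.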
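As an illustration of why this is a zeroth-order quantity, the sketch below computes the Rényi entropy from two partition functions, one with the original weights and one with the weights raised to the power $\alpha$, and compares it with explicit enumeration. It assumes the multi-root Laplacian (diagonal entries hold the total incoming weight of each non-root node, off-diagonal entries hold negated edge weights, and the root row and column are dropped); the helpers `det`, `partition`, and `spanning_trees` are hypothetical names.

```python
import itertools
import math
import random


def spanning_trees(n):
    """Yield each tree over nodes 0..n (root 0) as a list of (head, dependent) edges."""
    for heads in itertools.product(range(n + 1), repeat=n):
        def reaches_root(j):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            return j == 0
        if all(reaches_root(j) for j in range(1, n + 1)):
            yield [(heads[j - 1], j) for j in range(1, n + 1)]


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size, sign, out = len(m), 1.0, 1.0
    for c in range(size):
        pivot = max(range(c, size), key=lambda row: abs(m[row][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            sign = -sign
        out *= m[c][c]
        for row in range(c + 1, size):
            factor = m[row][c] / m[c][c]
            for k in range(c, size):
                m[row][k] -= factor * m[c][k]
    return sign * out


def partition(w, n):
    """Partition function via the matrix-tree theorem (multi-root Laplacian, an assumption)."""
    lap = [[0.0] * n for _ in range(n)]
    for (i, j), w_ij in w.items():
        lap[j - 1][j - 1] += w_ij      # incoming weight of node j
        if i != 0:
            lap[i - 1][j - 1] -= w_ij  # negated weight of the non-root edge i -> j
    return det(lap)


n, alpha = 3, 2.0
rng = random.Random(0)
edges = [(i, j) for i in range(n + 1) for j in range(1, n + 1) if i != j]
w = {e: rng.uniform(0.5, 2.0) for e in edges}

Z = partition(w, n)
Z_alpha = partition({e: w[e] ** alpha for e in edges}, n)
renyi_mtt = math.log(Z_alpha / Z ** alpha) / (1.0 - alpha)

# Reference value by explicit enumeration of all trees.
probs = [math.prod(w[e] for e in d) / Z for d in spanning_trees(n)]
renyi_brute = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
```

Only two determinants are needed, so no first- or second-order machinery is involved.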

11

Li and Eisner (2009, Section 5.1) provide a similar derivation to Proposition 3 and Proposition 4 for hypergraphs.

12

Some authors (e.g., Wainwright and Jordan, 2008) prefer to work with an exponentiated representation $w_{ij} = \exp(\theta_{ij})$ so that $\nabla_{\theta_{ij}} \log Z = p((i \to j) \in d)$. This avoids an explicit division by $Z$ and multiplication by $w_{ij}$, as these operations happen by virtue of the chain rule.
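The identity in this note can be verified numerically on a small example. The sketch below is a brute-force check under stated assumptions (explicit tree enumeration, hypothetical helper names, randomly drawn $\theta$), comparing a central finite difference of $\log Z$ with the edge marginal.

```python
import itertools
import math
import random


def spanning_trees(n):
    """Yield each tree over nodes 0..n (root 0) as a list of (head, dependent) edges."""
    for heads in itertools.product(range(n + 1), repeat=n):
        def reaches_root(j):
            seen = set()
            while j != 0 and j not in seen:
                seen.add(j)
                j = heads[j - 1]
            return j == 0
        if all(reaches_root(j) for j in range(1, n + 1)):
            yield [(heads[j - 1], j) for j in range(1, n + 1)]


n = 3
rng = random.Random(0)
edges = [(i, j) for i in range(n + 1) for j in range(1, n + 1) if i != j]
trees = list(spanning_trees(n))
theta = {e: rng.uniform(-1.0, 1.0) for e in edges}  # log-weights, w = exp(theta)


def log_Z(theta):
    """Log-partition function by enumeration."""
    return math.log(sum(math.exp(sum(theta[e] for e in d)) for d in trees))


def marginal(theta, edge):
    """Probability that `edge` appears in a tree drawn from p."""
    Z = math.exp(log_Z(theta))
    return sum(math.exp(sum(theta[e] for e in d)) for d in trees if edge in d) / Z


eps = 1e-5
gaps = []
for e in edges:
    up, down = dict(theta), dict(theta)
    up[e] += eps
    down[e] -= eps
    derivative = (log_Z(up) - log_Z(down)) / (2 * eps)
    gaps.append(abs(derivative - marginal(theta, e)))

max_gap = max(gaps)
total_marginal = sum(marginal(theta, e) for e in edges)
```

As a sanity check, the marginals sum to the number of edges in a tree, which is `n` here.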

13

As each edge can only appear once in a tree, $\widetilde{w}_{ij,ij} = 0$.

14

More precisely, $\frac{\partial r(d)}{\partial w_{ij}} = 0$ for all $d \in D$ and $(i \to j) \in E$.

15

Note that when $w_{ij} = 0$, we can set $s_{ij} = 0$.

16

The derivative of |L| can also be given using the matrix adjugate, ∇Z = adj(L). There are benefits to using the adjugate as it is more numerically stable and equally efficient (Stewart, 1998). In fact, any algorithm that computes the determinant can be algorithmically differentiated to obtain an algorithm for the adjugate.

17

We have that $|\mathcal{L}_{ij}| \leq 2|Y|$ in the labeled case.

18

We provide a derivation in Appendix A. Druck and Smith (2009) give a similar derivation for the Hessian, which we have generalized to any second-order quantity.

19

If $S < R$, we can change the order of $\mathrm{T}_2^{v}$ to compute $\bar{t}^{\top}$ in $O(S(N^3 + N^2 S') + N^2 R' S)$.

20

Refactoring sum–product expressions via the distributive property is the cornerstone of dynamic programming; similar examples in natural language processing include Eisner and Blatz (2007) and Gildea (2011).

21

Their algorithm calls MTT $N$ times, where the $i$th call to MTT multiplies the set of incoming edges to the $i$th non-root node by their $\log$ weight.

22

Times were measured using an Intel(R) Core(TM) i7-7500U processor with 16GB RAM.

23

We must apply a chain rule in order to use T2. To do this, we first run T1 to obtain $\bar{f}$ in $O(N^3 + N^2 F')$. We then run T2 with the dot product of $f$ and $\bar{f} - f^*$, which has a dimensionality of 1, and the sparse one-hot vectors as before. The execution of T2 then takes $O(N^3)$, giving us the desired runtime. Full detail is available in our code.

24

We used all sentences in the test set, which were between 5 and 150 words.

25

Note that we do not have to take the derivative of $L'_{k'l',kl}$ as it is either $1$ or $-1$.

## References

Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S., Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Eduard Bejček, Eva Hajičová, Jan Hajič, Pavlína Jínová, Václava Kettnerová, Veronika Kolářová, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Jarmila Panevová, Lucie Poláková, Magda Ševčíková, Jan Štěpánek, and Šárka Zikánová. 2013. Prague dependency treebank 3.0.

James, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. 2018. JAX: Composable transformations of Python+NumPy programs.

Darwiche. 2003. A differential approach to inference in Bayesian networks. Journal of the ACM, 50(3).

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the International Conference on Learning Representations.

Gregory Druck. 2011. Generalized Expectation Criteria for Lightly Supervised Learning. Ph.D. thesis, University of Massachusetts Amherst.

Gregory Druck, Gideon Mann, and Andrew McCallum. 2009. Semi-supervised learning of dependency parsers using generalized expectation criteria. In Proceedings of the International Joint Conference on Natural Language Processing.

Gregory Druck and David Smith. 2009. Computing conditional feature covariance in non-projective tree conditional random fields. Technical Report UM-CS-2009-060, University of Massachusetts.

Jean-Guillaume Dumas and Victor Pan. 2016. Fast matrix multiplication and symbolic computation. arXiv preprint arXiv:1612.05766.

Jason Eisner. 2016. Inside-outside and forward-backward algorithms are just backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP at EMNLP 2016, Austin, TX, USA, November 5, 2016.

Jason Eisner and John Blatz. 2007. Program transformations for optimization of parsing algorithms and other weighted logic programs. In Proceedings of the Conference on Formal Grammar, pages 45–85. CSLI Publications.

Harold N. Gabow and Robert Endre Tarjan. 1984. Efficient algorithms for a family of matroid intersection problems. Journal of Algorithms, 5(1).

Giorgio Gallo, Giustino Longo, and Stefano Pallottino. 1993. Directed hypergraphs and applications. Discrete Applied Mathematics, 42(2).

Daniel Gildea. 2011. Grammar factorization by tree decomposition. Computational Linguistics, 37(1):231–248.

Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems.

Andreas Griewank and Andrea Walther. 2008. Evaluating Derivatives–Principles and Techniques of Algorithmic Differentiation, second edition. SIAM.

M. Jerrum and M. Snir. 1982. Some exact complexity results for straight-line computations over semirings. Journal of the Association for Computing Machinery, 29(3).

Erich Kaltofen. 1992. On computing determinants of matrices without divisions. In Papers from the International Symposium on Symbolic and Algebraic Computation.

Gustav Kirchhoff. 1847. Über die auflösung der gleichungen, auf welche man bei der untersuchung der linearen vertheilung galvanischer ströme geführt wird. Annalen der Physik, 148(12).

Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris Dyer, and Noah A. Smith. 2014. A dependency parser for tweets. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Zhifei Li and Jason Eisner. 2009. First and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6.

Xuezhe Ma and Eduard Hovy. 2017. Neural probabilistic model for non-projective MST parsing. In Proceedings of the International Joint Conference on Natural Language Processing.

Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

André Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mário Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 34–44.

Andrew McCallum, Gideon Mann, and Gregory Druck. 2007. Generalized expectation criteria. University of Massachusetts.

Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005a. Online large-margin training of dependency parsers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005b. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing.

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Proceedings of the International Conference on Parsing Technologies.

István Miklós. 2019. Computational Complexity of Counting and Sampling. CRC Press. https://www.taylorfrancis.com/books/9781315266954.

Joakim Nivre, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Bhat, Bhat, Erica Biagetti, Eckhard Bick, Rogier Blokland, Victoria Bobicev, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Boyd, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Arantza Diaz de Ilarraza, Carly Dickerson, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Tomaž Erjavec, Aline Etienne, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta Gonzáles Saavedra, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Barbora, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Hüner Kaşıkara, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Kamil Kopacewicz, Natalia Kotsyba, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phuong Lê Hồng, Alessandro Lenci, Saran, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Nikola Ljubešić, Olga, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori, Shinsuke Mori, Bjartur Mortensen, Bohdan Moskalevskyi, Muischnek, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Luong Nguyễn Thị, Huyền Nguyễn Thị Minh, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Slav Petrov, Jussi Piitulainen, Emily Pitler, Barbara Plank, Thierry Poibeau, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Andriela Rääbis, Alexandre, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Michael Rießler, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roșca, Olga Rudina, Jack Rueter, Shoval, Benoît Sagot, Saleh, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko, Muh Shohibussirri, Dmitry Sichinava, Natalia Silveira, Maria Simi, Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Isabela Soares-Bastos, Carolyn, Antonio Stella, Milan Straka, Jana, Alane Suhr, Umut Sulubacak, Zsolt Szántó, Dima Taji, Yuta Takahashi, Takaaki Tanaka, Isabelle Tellier, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Lars Wallin, Jing Xian Wang, Jonathan North Washington, Seyi Williams, Mats Wirén, Tsegay Woldemariam, Tak-sum Wong, Chunxiao Yan, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Daniel Zeman, Manying Zhang, and Hanzhi Zhu. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Paszke, Sam Gross, Francisco Massa, Lerer, James, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies.

David A. Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 787–794, Sydney, Australia. Association for Computational Linguistics.

David A. Smith and Jason Eisner. 2007. Bootstrapping feature-rich dependency parsers with entropic priors. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

David A. Smith and Noah A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

G. W. Stewart. 1998. Linear Algebra and its Applications, 283(1–3).

Veselin Stoyanov and Jason Eisner. 2012. Minimum-risk training of approximate CRF-based NLP systems. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Lucien Tesnière. 1959. Eléments de syntaxe structurale. Klincksieck.

W. T. Tutte. 1984. Graph Theory.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2016. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Tim Vieira and Jason Eisner. 2017. Learning to prune: Exploring the frontier of fast and accurate parsing. Transactions of the Association for Computational Linguistics, 5:263–278.

Martin J. Wainwright and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc.

Ran Zmigrod, Tim Vieira, and Ryan Cotterell. 2020. Please mind the root: Decoding arborescences for dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4809–4819.

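#### A Derivation of $\nabla^2 Z$

In this section, we will provide a derivation for the expression of $\nabla^2 Z$ given in (25). We begin by taking the derivative of $\nabla Z$ using (22)
$\frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} = \frac{\partial}{\partial w_{ij}} \left[ Z \sum_{(k',l') \in \mathcal{L}_{kl}} B_{k'l'} L'_{k'l',kl} \right]$
We solve this by applying the product rule.$^{25}$ The first term of the product rule is
$\frac{\partial Z}{\partial w_{ij}} \sum_{(k',l') \in \mathcal{L}_{kl}} B_{k'l'} L'_{k'l',kl} = Z \sum_{(i',j') \in \mathcal{L}_{ij}} \sum_{(k',l') \in \mathcal{L}_{kl}} B_{i'j'} B_{k'l'} L'_{i'j',ij} L'_{k'l',kl}$
The second term of the product rule is
$Z \sum_{(k',l') \in \mathcal{L}_{kl}} \frac{\partial B_{k'l'}}{\partial w_{ij}} L'_{k'l',kl} = -Z \sum_{(i',j') \in \mathcal{L}_{ij}} \sum_{(k',l') \in \mathcal{L}_{kl}} B_{i'l'} B_{k'j'} L'_{i'j',ij} L'_{k'l',kl}$
Summing these together yields (25).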
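The second term relies on the derivative of a matrix inverse. As a brief sketch of that step, assuming that $B$ denotes the transpose of $L^{-1}$ (which is what the index pattern of the gradient expression suggests) and that $\mathcal{L}_{ij}$ indexes the entries of $L$ that depend on $w_{ij}$, it can be spelled out as:

```latex
\frac{\partial L^{-1}}{\partial w_{ij}}
  = -L^{-1} \, \frac{\partial L}{\partial w_{ij}} \, L^{-1}
\quad\Longrightarrow\quad
\frac{\partial B_{k'l'}}{\partial w_{ij}}
  = \frac{\partial (L^{-1})_{l'k'}}{\partial w_{ij}}
  = -\sum_{(i',j') \in \mathcal{L}_{ij}} (L^{-1})_{l'i'} \, L'_{i'j',ij} \, (L^{-1})_{j'k'}
  = -\sum_{(i',j') \in \mathcal{L}_{ij}} B_{i'l'} \, B_{k'j'} \, L'_{i'j',ij}
```

Substituting this expression for $\frac{\partial B_{k'l'}}{\partial w_{ij}}$ gives the crossed indices $B_{i'l'} B_{k'j'}$ and the minus sign in the second term.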

#### B Proof of T2

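In this section, we will prove the decomposition of $\bar{t}$ that allows for the efficient factoring used in T2. First, recall from Proposition 7 that we may find $\bar{t}$ by
$\bar{t} = \sum_{(i \to j) \in E} \frac{\partial Z}{\partial w_{ij}} w_{ij} r_{ij} s_{ij}^\top + \sum_{(i \to j) \in E} \sum_{(k \to l) \in E} \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{ij} w_{kl} r_{ij} s_{kl}^\top$
The first summand is the first-order total for the function $r_{ij} s_{ij}^\top$ (given as $\bar{f}$ in T2). We can write a sum over all edges as the sum over pairs of nodes in $N$. Similarly, elements in $\mathcal{L}_{ij}$ can be considered as pairs of nodes. Therefore, unless specified otherwise, we assume all variables in the base of a summation are scoped to $N$. The second summand can then be rewritten as
$\sum_{(i \to j) \in E} \sum_{(k \to l) \in E} \frac{\partial^2 Z}{\partial w_{ij} \partial w_{kl}} w_{ij} w_{kl} r_{ij} s_{kl}^\top = \sum_{i,j,k,l,i',j',k',l'} \left[ L'_{i'j',ij} Z B_{i'j'} B_{k'l'} L'_{k'l',kl} w_{ij} w_{kl} r_{ij} s_{kl}^\top - L'_{i'j',ij} Z B_{i'l'} B_{k'j'} L'_{k'l',kl} w_{ij} w_{kl} r_{ij} s_{kl}^\top \right]$
By distributivity, the first term equals
$Z \left[ \sum_{i,j,i',j'} B_{i'j'} L'_{i'j',ij} w_{ij} r_{ij} \right] \left[ \sum_{k,l,k',l'} B_{k'l'} L'_{k'l',kl} w_{kl} s_{kl} \right]^\top = \frac{1}{Z} \bar{r} \bar{s}^\top$
By distributivity and a relabeling of the summation indices, the second term equals, up to its leading minus sign,
$Z \sum_{j',l'} \underbrace{\left[ \sum_{k',k,l} B_{k'j'} L'_{k'l',kl} w_{kl} r_{kl} \right]}_{\overset{\mathrm{def}}{=} \hat{r}_{j'l'}} \underbrace{\left[ \sum_{i',i,j} B_{i'l'} L'_{i'j',ij} w_{ij} s_{ij} \right]^\top}_{\overset{\mathrm{def}}{=} \hat{s}_{j'l'}^\top} = Z \sum_{j',l'} \hat{r}_{j'l'} \hat{s}_{j'l'}^\top$
The above decomposition assumed we sum over all $i$, $j$, $k$, and $l$ and so suggests we can compute all $\hat{r}_{j'l'}$ and $\hat{s}_{j'l'}$ in $O(N^5 (R' + S'))$. However, we can exploit the sparsity of $\nabla L$ to improve this. Specifically, the following algorithm computes $\hat{r}_{j'l'}$ for all $j', l' \in N$.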
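A minimal sketch of such a sparsity-exploiting loop is given here, under stated assumptions rather than as the authors' exact procedure. The function names, the dictionary-based sparse vectors, and the input `dL` (which maps each edge $(k \to l)$ to the few Laplacian entries $(k', l')$ it touches, together with the derivative $\pm 1$) are all hypothetical; the example `dL` uses a simple multi-root Laplacian pattern purely for testing. The sparse loop visits $O(N^2)$ edges, at most two Laplacian entries per edge, $O(N)$ values of $j'$, and $O(R')$ nonzeros, for $O(N^3 R')$ work in total.

```python
import random
from collections import defaultdict


def r_hat_sparse(B, w, r, dL):
    """Accumulate r_hat[j', l'] by looping only over nonzero derivatives of L."""
    n = len(B)
    r_hat = defaultdict(lambda: defaultdict(float))
    for edge, entries in dL.items():            # O(N^2) edges
        for (kp, lp), sign in entries:          # at most two entries per edge (assumption)
            for jp in range(n):                 # O(N) choices of j'
                coef = B[kp][jp] * sign * w[edge]
                for idx, val in r[edge].items():  # O(R') nonzeros of r_kl
                    r_hat[jp, lp][idx] += coef * val
    return r_hat


def r_hat_dense(B, w, r, dL, dim):
    """Reference implementation that sums over every (j', l', k', edge) combination."""
    n = len(B)
    out = {}
    for jp in range(n):
        for lp in range(n):
            vec = [0.0] * dim
            for kp in range(n):
                for edge, entries in dL.items():
                    deriv = sum(s for pos, s in entries if pos == (kp, lp))
                    for idx, val in r[edge].items():
                        vec[idx] += B[kp][jp] * deriv * w[edge] * val
            out[jp, lp] = vec
    return out


n, dim = 3, 4
rng = random.Random(0)
edges = [(k, l) for k in range(n + 1) for l in range(1, n + 1) if k != l]
w = {e: rng.uniform(0.5, 2.0) for e in edges}
# One-hot r vectors (R' = 1), stored sparsely as {index: value}.
r = {e: {rng.randrange(dim): rng.uniform(-1.0, 1.0)} for e in edges}
# Any matrix B exercises the identity; in the paper B comes from the inverse Laplacian.
B = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
# Hypothetical sparsity pattern: w_kl appears on the diagonal (l, l) with +1 and,
# for non-root heads, at (k, l) with -1 (matrix indices are node indices minus one).
dL = {
    (k, l): [((l - 1, l - 1), 1.0)] + ([((k - 1, l - 1), -1.0)] if k != 0 else [])
    for (k, l) in edges
}

sparse = r_hat_sparse(B, w, r, dL)
dense = r_hat_dense(B, w, r, dL, dim)
max_gap = max(
    abs(sparse[jp, lp][idx] - dense[jp, lp][idx])
    for jp in range(n) for lp in range(n) for idx in range(dim)
)
```

The same loop with $s$ in place of $r$ (and the roles of the two $B$ indices swapped as in the definition of $\hat{s}_{j'l'}$) would give the second family of vectors.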

Therefore, we can compute all $\hat{r}_{j'l'}$ and $\hat{s}_{j'l'}$ in $O(N^3 (R' + S'))$. Each $\hat{r}_{ij}$ is at most $O(N R')$ dense, because there are at most $O(N)$ $R'$-sparse vectors added to it (by the inner loop). Hence, $\hat{r}_{ij}$ is $O(\bar{R})$ sparse where $\bar{R} \overset{\mathrm{def}}{=} \min(R, N R')$. This means that computing the sum of the outer-products of all $\hat{r}_{ij}$ and $\hat{s}_{ij}$ can be done in $O(N^2 \bar{R} \bar{S})$. Then, given that we have
$\bar{t} = \bar{f} + \frac{1}{Z} \bar{r} \bar{s}^\top - Z \sum_{j',l'} \hat{r}_{j'l'} \hat{s}_{j'l'}^\top$
(37)
We can find $\bar{t}$ in
$O(N^3 (R' + S') + RS + N^2 \bar{R} \bar{S})$
(38)

## Author notes

*

Equal contribution.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode