## Abstract

We show that a previously proposed algorithm for the *N*-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer *N* and a weighted tree automaton (wta) *M* over the tropical semiring, the algorithm computes *N* trees of minimal weight with respect to *M*. Compared with the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software Betty, and compared to the state-of-the-art algorithm for extracting the *N* best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets with respect to running time, while Tiburon seems to be the more memory-efficient choice.

## 1 Introduction

Trees are standard in natural language processing (NLP) to represent linguistic analyses of sentences. Similarly, tree automata provide a compact representation for a set of such analyses. *Bottom-up tree automata* act as recognizing devices and process their input trees in a step-wise fashion, working upwards from the leaf nodes towards the root of the tree. For memory, they have a finite set of **states**, some of which are said to be **accepting**, and their internal logic is represented as a finite set of **transition rules**. A **run** of a tree automaton *M* on an input tree *t* is a mapping from the nodes of *t* to the states, which is compatible with the transition rules. In general, *M* can have several distinct runs on *t*, and **accepts** *t* if one of these runs maps the root of *t* to an accepting state. By equipping the transition rules with weights, *M* can be made to associate *t* with a likelihood or score: The weight of a run of *M* on *t* is the product of the weights of the transition rules used in the run, and the weight of *t* is the sum of the weights of all runs on *t*. This type of automaton is called a weighted tree automaton (wta) and is popular in, for example, dependency parsing and machine translation.

A central task is the extraction of the highest ranking trees with respect to a weighted tree automaton. For example, when the automaton represents a large set of intermediate solutions, it may be desirable to prune these down to a more manageable number before continuing the computation. This problem is known as the **best trees problem**. It is related to the **best runs problem** that asks for highest-ranking runs of the automaton on not necessarily distinct trees. The computational difficulty of the *N*-best trees problem depends on the algebraic domain from which weights are taken. Here we assume this domain to be the tropical semiring. Hence, the weight of a tree *t* is the minimal weight of any run on *t*, and the weight of a run is the sum of the weights of the transitions used in the run, which are non-negative real numbers. The tropical semiring is particularly common in speech and text processing (Benesty, Sondhi, and Huang 2008), since probabilistic devices can be modeled using negative log likelihoods. Moreover, the semiring has the advantage of being extremal: The sum of two elements *a* and *b* always equals one of *a* and *b*. As a consequence, it is not necessary to consider all runs of an automaton on an input tree to find the weight of the tree, as its weight is equal to the weight of the optimal run. (In the non-extremal case, the problem is NP-complete even for strings (Lyngsø and Pedersen 2002).) The *N*-best trees problem for a given wta *M* over an extremal semiring can be solved indirectly by computing a list of *N′* best runs for *M*, for a sufficiently large number *N′*, and outputting the corresponding trees while discarding previously outputted trees. A complicating factor with this approach, however, is that *M* can have exponentially many runs on a single tree, so *N′* may have to be very large to guarantee that the output contains *N* distinct trees.
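The indirect approach can be illustrated by a minimal Python sketch. It assumes that some best-runs enumerator delivers (tree, weight) pairs in order of non-decreasing weight; the function name `distinct_trees`, the string encoding of trees, and the sample data are hypothetical choices made for this illustration only.

```python
from typing import Hashable, Iterable, Iterator, Tuple


def distinct_trees(
    runs: Iterable[Tuple[Hashable, float]], n: int
) -> Iterator[Tuple[Hashable, float]]:
    """Yield the first n distinct trees of a weight-sorted stream of runs."""
    if n <= 0:
        return
    seen = set()
    for tree, weight in runs:
        if tree in seen:
            continue  # another run on a tree that was already reported
        seen.add(tree)
        yield tree, weight
        if len(seen) == n:
            return


# Hypothetical input: each pair is (input tree of a run, weight of the run).
sample_runs = [
    ("a", 1),
    ("f[a,a]", 3),
    ("f[a,a]", 3),
    ("f[a,a]", 3),
    ("f[a,f[a,a]]", 5),
    ("f[f[a,a],a]", 5),
    ("f[a,f[a,a]]", 5),
]
best = list(distinct_trees(sample_runs, 4))
```

In this sample, six runs have to be inspected before four distinct trees are found, which is the overhead that a direct best trees algorithm tries to avoid.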

Best trees extraction is useful in any application that includes some type of re-ranking of hypotheses. One example that makes use of best trees extraction is the work by Socher et al. (2013) on syntactical language analysis. The team of authors improve the Stanford parser by composing a probabilistic context-free grammar (PCFG) with a recurrent neural network (RNN) that learns vector representations. Intuitively, each nonterminal in the PCFG is associated with a continuous vector space. The vector space induces an unbounded refinement of the category represented by the nonterminal into subcategories, and the RNN computes transitions between such vectors. For efficiency reasons, the device is not applied to the input sentence directly. Instead, the *N* = 200 highest-scoring parse trees with respect to the PCFG are computed, whereupon the RNN is used to rerank these to find the best parse tree. The work has raised interest in hybrid finite-state continuous-state approaches (see, e.g., the work by Zhao, Zhang, and Tu (2018)), and underlines the value of the *N*-best problem in language processing.

In previous work, Björklund, Drewes, and Zechner (2019) generalized an *N*-best algorithm by Mohri and Riley (2002) from strings to trees, resulting in the algorithm Best Trees v.1. Intuitively, the algorithm performs a lazy implicit determinization and uses a priority queue to output *N* best trees in the right order. The running time of Best Trees v.1 was shown to be in $O(\max(Nmn \cdot (Nr + r\log r + N\log N),\ N^2n^3,\ mr^2))$, where *m* and *n* are the numbers of transitions and states of *M*, respectively, and *r* is the maximum number of children (the **rank**) of symbols in the input alphabet. Best Trees v.1 was evaluated empirically in Björklund, Drewes, and Jonsson (2018) against the *N* best runs algorithm by Huang and Chiang (2005), which represents the state of the art. Although Büchse et al. (2010) proved that the algorithm by Huang and Chiang works for cyclic input wtas and generalized it by extending it to structured weight domains, the core idea of the algorithm remains the same. The algorithm by Huang and Chiang (2005) is implemented in the widely referenced Tiburon toolkit (May and Knight 2006). From here on, we refer to this implementation simply as Tiburon, even though the best runs procedure is only one out of many that the toolkit has to offer. The conclusion of Björklund, Drewes, and Jonsson (2018) was that Best Trees v.1 is faster if the input wtas exhibit a high degree of nondeterminism, whereas Tiburon is the better option when the input wtas are large but essentially deterministic.

We now improve Best Trees v.1 by exploring the search space in a more structured way, resulting in the algorithm Best Trees. In Best Trees v.1, all assembled trees were kept in a single queue. In this work, we split the queue into as many queues as there are transitions in the input automaton. The queue *K*_{τ} of transition τ contains trees that are instantiations of τ, that is, trees with a run that applies τ at the root. This makes it possible to improve the strategy to prune the queue that was used by Björklund, Drewes, and Zechner (2019), and avoid pruning altogether. The intuition is simple: To assemble *N* distinct output trees, at most *N* instantiations of any one transition may be needed. We furthermore assemble the instantiations of τ in a lazy fashion, constructing an instantiation explicitly only when it is dequeued from *K*_{τ}. We formally prove the correctness of Best Trees and derive an upper bound on its running time, namely, $O(Nm(\log(m) + r^2 + r\log(Nr)))$.

In addition to solving the best trees problem, Best Trees can also solve the best runs problem by removing the control structure that makes it discard duplicate trees (see Section 5). In this article, we make use of this possibility to compare this algorithm, implemented as Betty, with Tiburon on the home turf of the latter, that is, with respect to the computation of best runs rather than best trees. For our experiments, we use both largely deterministic wtas from a machine translation project and from Grammatical Framework (Ranta 2011), and more nondeterministic wtas that were artificially created to expose the algorithms to challenging instances. Our results show that Betty is generally more time efficient than Tiburon, despite the fact that the former is more general as it can also compute best trees (with almost the same efficiency as it computes best runs). Moreover, we perform a limited set of experiments measuring the memory usage of the applications, and can conclude that overall, the memory efficiency of Tiburon is slightly better than that of Best Trees.

### 1.1 Related Work

The proposed algorithm adds to a line of research that spans two decades. It originates with an algorithm by Eppstein (1998) that finds the *N* best paths from one source node to the remaining nodes in a weighted directed graph. When applied to graphs representing weighted string automata, the list returned by Eppstein’s algorithm may, in the case of nondeterminism, contain several paths that carry the same string, that is, the list is not guaranteed to be free from duplicate strings. Four years later, Mohri and Riley (2002) presented an algorithm that computes the *N* best strings with respect to a weighted string automaton and thereby creates duplicate-free lists. To reduce the amount of redundant computation, consisting in the exploration of alternative runs on one and the same substring, they incorporate the *N* shortest paths algorithm by Dijkstra (1959). Moreover, Mohri and Riley work with on-the-fly determinization of the input automaton *M*, which avoids the problem that the determinized automaton, although it has at most one run on each input string, can be exponentially larger than *M*. Jiménez and Marzal (2000) lift the problem to the tree domain by finding the *N* best parse trees with respect to a context-free grammar in Chomsky normal form for a given string. Independently, Huang and Chiang (2005) published an algorithm that computes the best runs for weighted hypergraphs (which are equivalent to weighted tree automata and weighted regular tree grammars), and that is a generalization of the algorithm by Jiménez and Marzal in that it does not require the input to be in normal form. Huang and Chiang combine dynamic programming and lazy evaluation to keep the number of intermediate computations small, and obtain a smaller bound on the worst-case running time of their algorithm than Jiménez and Marzal do.

As previously mentioned, the algorithm by Huang and Chiang (2005) is implemented in the Tiburon toolkit by May and Knight (2006). Its initial usage was as part of a machine-translation pipeline, to extract the *N* best trees from a weighted tree automaton. In connection with this work, Knight and Graehl (2005) noted that there was no known efficient algorithm to solve this problem directly. As an alternative way forward, Knight and Graehl thus enumerate the best runs and discard duplicate trees, until sufficiently many unique trees have been found. Since Tiburon is highly optimized and the current state-of-the-art tool for best runs extraction, it is a natural choice of reference implementation for an empirical evaluation of our solution. The algorithm by Huang and Chiang (2005) was later generalized by Büchse et al. (2010) to allow a linear pre-order on the weights, as opposed to a total order. Büchse et al. (2010) also prove that the algorithm is correct on cyclic input hypergraphs, provided that the Viterbi algorithm for finding an optimal run (Jurafsky and Martin 2009) is replaced by Knuth’s algorithm (Knuth 1977).^{1}

Also, Finkel, Manning, and Ng (2006) remark on the lack of sub-exponential algorithms for the best trees problem. They propose an algorithm that approximates the solution when the automaton is expressed as a cascade of probabilistic tree transducers. The authors model the cascade as a Bayesian network and consider every step as a variable. This allows them to sample a set of alternative labels from each prior step, to propagate onward in the current step. The approximation algorithm runs in polynomial time in the size of the input device and the number of samples, but the convergence rate to the exact solution is not analyzed. To explain why their approach is preferable to finding *N* best runs, they extract the *N* = 50 best runs from the Stanford parser and observe that about half of the output trees are actually duplicates—enough to affect the outcome of the processing pipeline. Thus, they argue, extracting the highest ranking trees rather than the highest ranking runs is not only theoretically better, but is also of practical significance.

Finally, we note that the best trees problem also has applications outside of NLP. In fact, whenever we are considering a set of objects, each of which can be expressed uniquely by an expression in some particular algebra, and the set of expressions is the language of a wta, then the best trees algorithm can be used to produce the best objects. For instance, Björklund, Drewes, and Ericson (2016) propose a restricted class of hypergraphs that are uniquely described by expressions in a certain graph algebra, so the best trees algorithm makes it possible to find the optimal such graphs with respect to a wta. The reason why the representation needs to be unique is that otherwise we will have to check equivalence between objects as an added step. In the case of graphs, this would mean deciding graph isomorphism, which is not known to be tractable in general.

## 2 Preliminaries

We write ℕ for the set of nonnegative integers, ℕ_{+} for ℕ ∖ {0}, and ℝ_{+} for the set of non-negative reals; $\mathbb{N}^{\infty}$ and $\mathbb{R}_{+}^{\infty}$ denote $\mathbb{N} \cup \{\infty\}$ and $\mathbb{R}_{+} \cup \{\infty\}$, respectively. For $n \in \mathbb{N}^{\infty}$, [*n*] = {*i* ∈ ℕ ∣ 1 ≤ *i* ≤ *n*}. Thus, in particular, [0] = *∅* and $[\infty] = \mathbb{N}_{+}$. The cardinality of a (countable) set *S* is written |*S*|. The *n*-fold Cartesian product of a set *S* with itself is denoted by *S*^{n}.

The set of all finite sequences over *S* is denoted by *S*^{*}, and the empty sequence by λ. A sequence of *l* copies of a symbol *s* is denoted by *s*^{l}. Given a sequence *σ* = *s*_{1}⋯*s*_{n} of *n* elements *s*_{i} ∈ *S*, we denote its length *n* by |*σ*|. Given an integer *i* ∈ [*n*], we write *σ*_{i} for the *i*-th element *s*_{i} of *σ*. For notational simplicity, we occasionally use sequences as if they were sets, for example, writing *s* ∈ *σ* to express that *s* occurs in *σ*, or *S* ∖ *σ* to denote the set of all elements of a set *S* that do not occur in the sequence *σ*.

A (commutative) semiring is a structure $(D, \oplus, \otimes, 0, 1)$ such that both $(D, \oplus, 0)$ and $(D, \otimes, 1)$ are commutative monoids, the semiring multiplication ⊗ distributes over the semiring addition ⊕ from both left and right, and 0 is an annihilator for ⊗, that is, 0 ⊗ *d* = 0 = *d* ⊗ 0 for all $d \in D$. In this article, we will exclusively consider the *tropical semiring*. Its domain is $\mathbb{R}_{+}^{\infty}$, with $\min$ serving as semiring addition and ordinary plus as semiring multiplication.

For a set *A*, an *A*-labeled **tree** is a partial function $t\colon \mathbb{N}_{+}^{*} \to A$ whose domain dom(*t*) is a finite non-empty set that is closed to the left and under taking prefixes; whenever *vi* ∈ dom(*t*) for some $v \in \mathbb{N}_{+}^{*}$ and *i* ∈ ℕ_{+}, it holds that *vj* ∈ dom(*t*) for all 1 ≤ *j* ≤ *i* (closedness to the left) and *v* ∈ dom(*t*) (prefix-closedness). The size of *t* is |*t*| = |dom(*t*)|. An element *v* of dom(*t*) is called a **node** of *t*, and |{*i* ∈ ℕ_{+} ∣ *vi* ∈ dom(*t*)}| is the **rank** of *v*. The **subtree** of *t* ∈ *T*_{Σ} rooted at *v* is the tree *t*/*v* defined by *t*/*v*(*u*) = *t*(*vu*) for every $u \in \mathbb{N}_{+}^{*}$. If *t*(λ) = *f* and *t*/*i* = *t*_{i} for all *i* ∈ [*k*], where *k* is the rank of λ in *t*, then we denote *t* by *f*[*t*_{1},…,*t*_{k}], which may be simplified to *f* if *k* = 0.

A **ranked alphabet** is a disjoint union of finite sets of symbols, $\Sigma = \bigcup_{k \in \mathbb{N}} \Sigma_{(k)}$. For *f* ∈ Σ, the *k* ∈ ℕ such that *f* ∈ Σ_{(k)} is the *rank* of *f*, denoted by rank(*f*). The set *T*_{Σ} of ranked trees over Σ consists of all Σ-labeled trees *t* in which the rank of every node *v* ∈ dom(*t*) equals the rank of *t*(*v*). For a set *T* of trees we denote by Σ(*T*) the set of trees which have a symbol from Σ at their root, with direct subtrees in *T*, more precisely, {*f*[*t*_{1},…,*t*_{k}] ∣ *k* ∈ ℕ, *f* ∈ Σ_{(k)}, and *t*_{1},…,*t*_{k} ∈ *T*}.

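The set of **contexts over Σ** is the set *C*_{Σ} of trees $c \in T_{\Sigma \cup \{\square\}}$ containing exactly one node *v* ∈ dom(*c*) with $c(v) = \square$, where $\square$ is a special symbol of rank 0 that does not occur in Σ. We define the **depth** of *c* to be *depth*(*c*) = |*v*|, i.e., the depth of *c* is the distance of $\square$ from the root of *c*. The **substitution** of another tree *t* into *c* results in the tree $c⟦t⟧$ given by $\mathrm{dom}(c⟦t⟧) = \mathrm{dom}(c) \cup \{vu \mid u \in \mathrm{dom}(t)\}$ and, for all $w \in \mathrm{dom}(c⟦t⟧)$,

$$c⟦t⟧(w) = \begin{cases} t(u) & \text{if } w = vu \text{ for some } u \in \mathrm{dom}(t), \\ c(w) & \text{otherwise.} \end{cases}$$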
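The definitions of trees, subtrees, and substitution into contexts can be illustrated with a small Python sketch. It is only one possible encoding: node addresses are tuples of positive integers, a tree is a dictionary from addresses to labels, and the names `make_tree`, `subtree`, `substitute`, and `BOX` are hypothetical helpers introduced here.

```python
from typing import Dict, Tuple

Node = Tuple[int, ...]   # a node address, e.g. (1, 2) for the node "12"
Tree = Dict[Node, str]   # partial function from addresses to labels

BOX = "□"  # placeholder symbol; assumed not to occur in the alphabet


def make_tree(label: str, *children: Tree) -> Tree:
    """Build f[t_1,...,t_k] from a root label and its direct subtrees."""
    tree: Tree = {(): label}
    for i, child in enumerate(children, start=1):
        for u, a in child.items():
            tree[(i,) + u] = a
    return tree


def subtree(t: Tree, v: Node) -> Tree:
    """Return t/v, defined by (t/v)(u) = t(vu)."""
    return {w[len(v):]: a for w, a in t.items() if w[:len(v)] == v}


def substitute(c: Tree, t: Tree) -> Tree:
    """Return the tree obtained by replacing the unique BOX node of c by t."""
    holes = [v for v, a in c.items() if a == BOX]
    if len(holes) != 1:
        raise ValueError("a context must contain exactly one BOX node")
    v = holes[0]
    result = {w: a for w, a in c.items() if w != v}
    result.update({v + u: a for u, a in t.items()})
    return result


a = make_tree("a")
context = make_tree("f", a, make_tree(BOX))        # the context f[a, □]
plugged = substitute(context, make_tree("f", a, a))  # f[a, f[a, a]]
```

With this encoding, the depth of a context is simply the length of the address of its `BOX` node, and the size of a tree is the number of dictionary entries.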

A **weighted tree language** over the tropical semiring is a mapping $L\colon T_{\Sigma} \to \mathbb{R}_{+}^{\infty}$, where Σ is a ranked alphabet. Weighted tree languages can be specified in a number of equivalent ways. Three of the standard ones, mirroring the ways in which regular string languages are traditionally specified, are weighted regular tree grammars, weighted tree automata, and weighted finite-state diagrams formalized as hypergraphs. The equivalence of the second and the third is shown explicitly in Jonsson (2021). All three have been used in the context of *N*-best problems: weighted regular tree grammars by May and Knight (2006), weighted tree automata by Björklund, Drewes, and Zechner (2019), and hypergraphs by Huang and Chiang (2005) and Büchse et al. (2010). In this article, we use weighted tree automata.

A weighted tree automaton (wta) over the tropical semiring is a system *M* = (*Q*, Σ, *R*, *ω*, *q*_{f}) consisting of:

- a finite set *Q* of symbols of rank 0 called *states*;
- a ranked alphabet Σ of *input symbols* disjoint from *Q*;
- a finite set $R \subseteq \bigcup_{k \in \mathbb{N}} Q^{k} \times \Sigma_{(k)} \times Q$ of *transition rules*;
- a mapping *ω*: *R* → ℝ_{+}; and
- a final state *q*_{f} ∈ *Q*.

From here on, we write $f[q_1,\ldots,q_k] \xrightarrow{w} q$ to denote that τ = (*q*_{1},…,*q*_{k},*f*,*q*) ∈ *R* and *ω*(τ) = *w*, and consider *R* to be the set of these weighted rules, thus dropping the component *ω* from the definition of *M*.

A transition rule $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$ will also be viewed as a symbol of rank *k*, turning *R* into a ranked alphabet. We let *tar*(τ) denote the target state *q* of τ, let *src*(τ) denote the sequence of source states *q*_{1}⋯*q*_{k}, and let rank(τ) = rank(*f*). In addition, we view every state *q* ∈ *Q* as a symbol of rank 0.

We define the set *runs*_{M} ⊆ *T*_{R∪Q} of *runs* ρ of *M*, their input trees *input*_{M}(ρ), their *intrinsic weights* *wt*_{M}(ρ), and their *target states* *tar*(ρ) inductively, as follows:

- For every *q* ∈ *Q*, we have that *q* ∈ *runs*_{M} with *input*_{M}(*q*) = *q*, *wt*_{M}(*q*) = 0, and *tar*(*q*) = *q*.
- For every transition rule $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$ in *R* and all runs ρ_{1},…,ρ_{k} ∈ *runs*_{M} such that *tar*(ρ_{i}) = *q*_{i} for all *i* ∈ [*k*], we let ρ = τ[ρ_{1},…,ρ_{k}] ∈ *runs*_{M} with

  $$\begin{aligned} \mathit{input}_M(\rho) &= f[\mathit{input}_M(\rho_1),\ldots,\mathit{input}_M(\rho_k)], \\ \mathit{wt}_M(\rho) &= w + \sum_{i \in [k]} \mathit{wt}_M(\rho_i), \text{ and} \\ \mathit{tar}(\rho) &= q. \end{aligned}$$

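The *weight* of a run ρ ∈ *runs*_{M} is

$$M(\rho) = \begin{cases} \mathit{wt}_M(\rho) & \text{if } \mathit{tar}(\rho) = q_f, \\ \infty & \text{otherwise.} \end{cases}$$

The weighted tree language *recognized by M* is given by

$$M(t) = \min\{M(\rho) \mid \rho \in \mathit{runs}_M \text{ and } \mathit{input}_M(\rho) = t\}$$

for all *t* ∈ *T*_{Σ} (where, by convention, $\min \emptyset = \infty$). In other words, *M*(*t*) is the minimal weight of any run resulting in *t*, which is the sum of the weights of all runs on *t* in the tropical semiring. Note that we, by a slight abuse of notation, denote by *M* both the wta and the weight assignments to runs and trees it computes. Moreover, for *q* ∈ *Q* we define the mapping $M_q\colon C_{\Sigma} \to \mathbb{R}_{+}^{\infty}$ by $M_q(c) = M(c⟦q⟧)$ for every *c* ∈ *C*_{Σ}.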

Throughout the rest of the article, we will generally drop the subscript *M* in *runs*_{M}, *input*_{M}, and *wt*_{M}, because the wta in question will always be clear from the context.
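To make these definitions concrete, the following Python sketch computes the weight of a tree bottom-up over the tropical semiring. The encoding is an assumption made for illustration: a tree *f*[*t*_{1},…,*t*_{k}] is a nested tuple `(f, t_1, ..., t_k)`, a rule is a tuple `(f, sources, w, target)`, and the example automaton is hypothetical rather than taken from the article.

```python
import math
from typing import Dict, List, Tuple

# A rule f[q_1,...,q_k] --w--> q is encoded as (f, (q_1,...,q_k), w, q).
Rule = Tuple[str, Tuple[str, ...], float, str]


def state_weights(rules: List[Rule], tree: tuple) -> Dict[str, float]:
    """Map each state q to the least intrinsic weight of a run on tree with target q."""
    symbol, children = tree[0], tree[1:]
    child_weights = [state_weights(rules, child) for child in children]
    best: Dict[str, float] = {}
    for f, sources, w, target in rules:
        if f != symbol or len(sources) != len(children):
            continue
        total = w  # tropical multiplication is ordinary addition
        for q_i, weights in zip(sources, child_weights):
            total += weights.get(q_i, math.inf)
        if total < best.get(target, math.inf):  # tropical addition is min
            best[target] = total
    return best


def tree_weight(rules: List[Rule], final_state: str, tree: tuple) -> float:
    """Return M(t): the least weight of a run on tree that ends in final_state."""
    return state_weights(rules, tree).get(final_state, math.inf)


# Hypothetical nondeterministic wta with states q (final) and p.
example_rules: List[Rule] = [
    ("a", (), 1.0, "q"),
    ("a", (), 2.0, "p"),
    ("f", ("q", "q"), 1.0, "q"),
    ("f", ("p", "p"), 0.0, "q"),
]
weight = tree_weight(example_rules, "q", ("f", ("a",), ("a",)))
```

The tree *f*[*a*,*a*] has two runs ending in `q` here, of weights 3 and 4, and because the semiring is extremal only the cheaper one determines the weight of the tree.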

Given as input a wta *M* and an integer *N* ∈ ℕ, the *N*-**best runs problem** consists in computing a sequence of *N* runs of minimal weight according to *M*. More precisely, an algorithm solving the problem will output a sequence ρ_{1},ρ_{2},… of *N* pairwise distinct runs such that there do not exist *i* ∈ [*N*] and ρ ∈ *runs* ∖ {ρ_{1},…,ρ_{i}} with *M*(ρ) < *M*(ρ_{i}).

*General Assumption.* To make sure that the *N*-best runs problem always possesses a solution, and to simplify the presentation of our algorithms, we assume from now on that all considered wtas *M* have infinitely many runs ρ such that *tar*(ρ) = *q*_{f}. In particular, *T*_{Σ} is assumed to be infinite. Apart from simplifying some technical details, this assumption does not affect any of the reasoning in the article.

Similarly to the *N*-best runs problem, the *N*-**best trees problem** for the wta *M* consists in computing a sequence of *N* pairwise distinct trees *t*_{1},*t*_{2},… in *T*_{Σ} of minimal weight. In other words, we seek a sequence of trees such that there do not exist *i* ∈ [*N*] and *t* ∈ *T*_{Σ} ∖ {*t*_{1},…,*t*_{i}} with *M*(*t*) < *M*(*t*_{i}). Note that the *N*-best trees problem always has a solution because we assume that *T*_{Σ} is infinite.

*t* is its size |*t*|. Looking at the rules, we obtain the following recursive equations for the number *#*_{i}(*t*) of runs ρ on a tree *t* ending in state *tar*(ρ) = *q*_{i}:

*t* will occur *#*_{0}(*t*) times in an *N*-best list based on best runs (provided that *N* is large enough).

**Table 1**

| Best runs: input tree | Best runs: weight | Best trees: tree | Best trees: weight |
|---|---|---|---|
| a | 1 | a | 1 |
| f[a,a] | 3 | f[a,a] | 3 |
| f[a,a] | 3 | f[f[a,a],a] | 5 |
| f[a,a] | 3 | f[a,f[a,a]] | 5 |
| f[a,f[a,a]] | 5 | f[f[a,a],f[a,a]] | 7 |
| f[f[a,a],a] | 5 | f[f[f[a,a],a],a] | 7 |
| f[a,f[a,a]] | 5 | f[f[a,f[a,a]],a] | 7 |
| f[f[a,a],a] | 5 | f[a,f[f[a,a],a]] | 7 |
| f[a,f[a,a]] | 5 | f[a,f[a,f[a,a]]] | 7 |
| f[f[a,a],a] | 5 | f[f[a,f[a,a]],f[a,a]] | 9 |


We end this section by discussing the choice of our particular weight structure, the tropical semiring. In the literature on wta, Definitions 1 and 2 are generalized to wta over arbitrary commutative semirings, and their resulting weighted tree languages, simply by replacing $\min$ and + by ⊕ and ⊗, respectively. The tropical semiring $(\mathbb{R}_{+}^{\infty}, \min, +, \infty, 0)$ used in Definitions 1 and 2 is frequently used in natural language processing. Equally popular is the *Viterbi semiring* $([0,1], \max, \cdot, 0, 1)$ that acts on the unit interval of probabilities, with maximum and standard multiplication as operations. In the setting discussed here, both semirings are equivalent. To see this, transform a wta *M* over the Viterbi semiring to a wta *M′* over the tropical semiring by simply mapping every weight *p* of a transition rule of *M* to $-\ln p$, that is, taking negative logarithms everywhere. Since $(-\ln p) + (-\ln p') = -\ln(p \cdot p')$ and $-\ln p < -\ln p' \Leftrightarrow p > p'$, it holds that *M*(*t*) (now calculated using the Viterbi semiring operations) is equal to $\exp(-M'(t))$ for all trees *t*. It follows that the trees *t*_{1},…,*t*_{N} form an *N*-best list according to *M* (now looking for trees with *maximal* weights) if and only if they form an *N*-best list according to *M′* in the sense defined above.
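The correspondence between the two semirings can be checked numerically. The following Python sketch is illustrative only; the function names `to_tropical` and `to_viterbi` and the sample probabilities are hypothetical.

```python
import math


def to_tropical(p: float) -> float:
    """Map a Viterbi-semiring weight p in [0, 1] to the tropical weight -ln p."""
    return math.inf if p == 0.0 else -math.log(p)


def to_viterbi(w: float) -> float:
    """Inverse mapping: exp(-w), with infinity mapped back to probability 0."""
    return math.exp(-w)


probabilities = [0.5, 0.125, 0.25]

# Multiplication of probabilities becomes addition of tropical weights.
product_as_sum = to_tropical(0.5) + to_tropical(0.25)

# Ranking by maximal probability equals ranking by minimal tropical weight.
by_probability = sorted(probabilities, reverse=True)
by_tropical_weight = sorted(probabilities, key=to_tropical)
```

Both rankings coincide, which is why an *N*-best list computed over one semiring can be read as an *N*-best list over the other.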

## 3 The Improved Best Trees Algorithm

In this section, we explain how the algorithm in Björklund, Drewes, and Zechner (2019) can be made lazier, and hence more efficient, by exploring the search space with respect to transitions rather than states. From here on, let *M* = (*Q*,Σ,*R*,*q*_{f}) be a wta with *m* transition rules, *n* states, and a maximum rank of *r* among the symbols in Σ.

### 3.1 Best Trees v.1

Best Trees v.1 maintains a set *T* that collects all processed trees, and a priority queue *K* of trees in Σ(*T*) that will be examined next. The priority of a tree *t* in *K* is determined by the minimal value in the set of all $M(c⟦t⟧)$, where *c* ranges over all possible contexts. Let *M*^{q} denote the wta obtained from *M* by making *q* its final state. Then, for every context *c* and all trees *t*,

$$M(c⟦t⟧) = \min_{q \in Q} \bigl(M_q(c) + M^q(t)\bigr).$$

Because *M*_{q}(*c*) is independent of *t*, it is possible to compute in advance a **best context** 𝕔_{q} that minimizes it. The technique will be discussed in Section 3.2.

A *best context* of a state *q* ∈ *Q* is a context 𝕔_{q} with $M_q(𝕔_q) = \min_{c \in C_{\Sigma}} M_q(c)$. The weight *M*_{q}(𝕔_{q}) is denoted by 𝕨_{q}.

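Best Trees v.1 also computes an *optimal state* *opt*(*t*) for each tree *t* that it encounters: $\mathit{opt}(t) = \operatorname{argmin}_{q \in Q} M(𝕔_q⟦t⟧)$. Throughout the article, we shall make use of the following weight functions derived from *M*, where *t* ∈ *T*_{Σ}:

$$\Delta_q(t) = M^q(t) + 𝕨_q \quad \text{and} \quad \Delta(t) = \min_{q \in Q} \Delta_q(t).$$

The queue *K* is initialized with the trees in Σ_{0}. Its priority order <_{K} is defined as follows, for all trees *t* and *t′* in *K*:

$$t <_K t' \iff \Delta(t) < \Delta(t'), \text{ or } \Delta(t) = \Delta(t') \text{ and } t <_{\mathit{lex}} t',$$

where <_{lex} is any lexical order that orders trees first by size and then by viewing them as strings to be compared alphabetically from left to right.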

We can now reproduce the pseudocode of the base algorithm from Björklund, Drewes, and Zechner (2019) in Algorithm 1. Given a wta *M* and *N* ∈ℕ, it solves the *N*-best trees problem. After outputting *i* ∈ [*N*] trees, the set of trees enqueued in line 23 is pruned so that for every *q* ∈ *Q*, at most *N* − *i* trees are kept for which *q* is an optimal state. The function expand(*T*,*t*), which computes the trees to be enqueued in each step, returns the set of all trees in Σ(*T*) such that the “new” tree *t* occurs at least once among the direct subtrees of the root.

The correctness of this approach is formally proved in Björklund, Drewes, and Zechner (2019).

### 3.2 Computation of Best Contexts

We now recall the computation of best contexts 𝕔_{q}, *q* ∈ *Q*, to the extent needed to understand the improved algorithm. This computation consists of two phases: first, a 1-best tree $t_q^{\mathit{best}}$ is computed for each state *q* ∈ *Q*. The desired property of $t_q^{\mathit{best}}$ is that it is a 1-best tree of *M*^{q}, that is, it is a tree *t* ∈ *T*_{Σ} that minimizes *M*^{q}(*t*). After that, the second phase computes the actual best context 𝕔_{q} for every *q* ∈ *Q*, in other words, a context *c* ∈ *C*_{Σ} with *M*_{q}(*c*) = 𝕨_{q}.

The first phase can be accomplished using a dynamic programming algorithm by Knuth (1977) that computes 1-best runs. (Note that if ρ is a 1-best run, then *input*(ρ) is a 1-best tree.) The algorithm maintains a min-priority queue of all transition rules and collects, in |*R*| iterations, the desired best runs ρ_{q}. Initially, ρ_{q} is undefined for every *q* ∈ *Q*. The value determining the priority of a transition rule $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$ is $\infty$ if any $\rho_{q_i}$ (*i* ∈ [*k*]) is still undefined. Otherwise, it is $w + \sum_{i \in [k]} \mathit{wt}(\rho_{q_i})$, that is, the weight of the run $\tau[\rho_{q_1},\ldots,\rho_{q_k}]$. The algorithm repeatedly dequeues the highest priority element $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$ from the queue. If ρ_{q} is still undefined, it sets $\rho_q = \tau[\rho_{q_1},\ldots,\rho_{q_k}]$ and $t_q^{\mathit{best}} = f[t_{q_1}^{\mathit{best}},\ldots,t_{q_k}^{\mathit{best}}]$. It then updates the priorities of transition rules having *q* among their source states and repeats. We note that the trees $t_q^{\mathit{best}}$ are discovered by the algorithm in the order of ascending weight. This observation will soon become important for the initialization phase of Algorithm 2.

As a side remark, we note that the set of 1-best runs and 1-best trees determined by this algorithm are **subtree closed**, meaning that every subtree of ρ_{q} and $t_q^{\mathit{best}}$ is itself one of the trees ρ_{q′} and $t_{q'}^{\mathit{best}}$, respectively. It follows that the entire set of these trees can be stored as a maximally shared directed acyclic graph with |*Q*| nodes.
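The first phase can be sketched in Python as follows. This is a simplified variant rather than a faithful reproduction of the procedure described above: instead of keeping all rules in the queue and updating their priorities, it enqueues a rule only once all of its source states are resolved. The rule encoding `(f, sources, w, target)`, the nested-tuple trees, and the name `one_best` are assumptions made for this illustration.

```python
import heapq
import itertools
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...], float, str]  # (f, (q_1,...,q_k), w, q)


def one_best(rules: List[Rule]) -> Dict[str, Tuple[float, tuple]]:
    """Return, for each reachable state q, (weight, tree) of a 1-best tree of M^q."""
    best: Dict[str, Tuple[float, tuple]] = {}
    # missing[j]: number of source positions of rule j that are still unresolved
    missing = [len(sources) for _, sources, _, _ in rules]
    uses: Dict[str, List[int]] = {}
    for j, (_, sources, _, _) in enumerate(rules):
        for q in sources:
            uses.setdefault(q, []).append(j)  # one entry per occurrence of q
    tie = itertools.count()  # tie-breaker so the heap never compares trees
    heap: list = []

    def push(j: int) -> None:
        f, sources, w, q = rules[j]
        weight = w + sum(best[p][0] for p in sources)
        tree = (f,) + tuple(best[p][1] for p in sources)
        heapq.heappush(heap, (weight, next(tie), q, tree))

    for j in range(len(rules)):
        if missing[j] == 0:
            push(j)
    while heap:
        weight, _, q, tree = heapq.heappop(heap)
        if q in best:
            continue  # a run of no greater weight was found earlier
        best[q] = (weight, tree)
        for j in uses.get(q, []):
            missing[j] -= 1
            if missing[j] == 0:
                push(j)
    return best


# Hypothetical rules over states q and p.
knuth_rules: List[Rule] = [
    ("a", (), 1.0, "q"),
    ("f", ("q", "q"), 1.0, "q"),
    ("b", (), 5.0, "p"),
    ("g", ("q",), 1.0, "p"),
]
best_trees = one_best(knuth_rules)
```

As in the description above, states are resolved in order of ascending weight, and every stored tree is built from trees that were stored earlier, which reflects the subtree-closedness noted in the side remark.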

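The second phase amounts to a shortest-path computation on *M*, where *M* is viewed as a weighted edge-labeled graph. More precisely, consider the graph with node set *Q* such that, for every transition rule $(\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q) \in R$ and every *q′* ∈ {*q*_{1},…,*q*_{k}}, there is an edge *e* = (*q*, τ, *q′*) from *q* to *q′* with label τ. The weight of such an edge is given by

$$\mathit{wt}(e) = w + \sum_{j \in [k] \setminus \{i\}} M^{q_j}(t_{q_j}^{\mathit{best}})$$

for any *i* ∈ [*k*] with *q′* = *q*_{i}; the value does not depend on the choice of *i*. Thus, *wt*(*e*) is the weight that needs to be added to 𝕨_{q} to obtain 𝕨_{q′} (under the assumption that, indeed, $𝕔_{q'} = 𝕔_q⟦f[t_{q_1}^{\mathit{best}},\ldots,t_{q_{i-1}}^{\mathit{best}},\square,t_{q_{i+1}}^{\mathit{best}},\ldots,t_{q_k}^{\mathit{best}}]⟧$).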

Now, having computed the paths of minimal weight in this graph using Dijkstra’s algorithm, consider the edge sequence π of least weight from *q*_{f} to a state *q′*. Then 𝕔_{q′} = 𝕔(π), where 𝕔(π) is defined recursively as follows:

- If π = λ, then *q′* = *q*_{f} and we set $𝕔(\pi) = \square$.
- If π = *π′e* where *e* = (*q*, τ, *q′*) for some transition rule $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$, we choose some *i* ∈ [*k*] with *q*_{i} = *q′*, and set

  $$𝕔(\pi) = 𝕔(\pi')⟦f[t_{q_1}^{\mathit{best}},\ldots,t_{q_{i-1}}^{\mathit{best}},\square,t_{q_{i+1}}^{\mathit{best}},\ldots,t_{q_k}^{\mathit{best}}]⟧.$$

We note here that this way of computing best contexts results in contexts of a very peculiar kind: Every such context 𝕔_{q} consists of a “spine” leading to the node *v* such that $𝕔_q(v) = \square$, and all subtrees that branch out from this spine are of the form $t_{q'}^{\mathit{best}}$ for the required states *q′*.
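The second phase can be sketched as an ordinary run of Dijkstra's algorithm. The Python code below computes only the weights of the best contexts; the contexts themselves could be rebuilt by additionally storing predecessor edges. The rule encoding `(f, sources, w, target)`, the dictionary `best_weight` holding the weights of the 1-best trees from the first phase, and the function name are assumptions made for this illustration.

```python
import heapq
import math
from typing import Dict, List, Tuple

Rule = Tuple[str, Tuple[str, ...], float, str]  # (f, (q_1,...,q_k), w, q)


def best_context_weights(
    rules: List[Rule], best_weight: Dict[str, float], final_state: str
) -> Dict[str, float]:
    """Return, for each state reachable from final_state, the weight of a best context."""
    edges: Dict[str, List[Tuple[float, str]]] = {}
    for _f, sources, w, q in rules:
        if any(p not in best_weight for p in sources):
            continue  # some source state has no tree at all
        for i, q_i in enumerate(sources):
            # weight of the rule plus the best trees at all positions except i
            rest = sum(best_weight[p] for j, p in enumerate(sources) if j != i)
            edges.setdefault(q, []).append((w + rest, q_i))
    dist = {final_state: 0.0}  # the empty context has weight 0
    heap = [(0.0, final_state)]
    done = set()
    while heap:
        d, q = heapq.heappop(heap)
        if q in done:
            continue
        done.add(q)
        for wt_e, q_next in edges.get(q, []):
            if d + wt_e < dist.get(q_next, math.inf):
                dist[q_next] = d + wt_e
                heapq.heappush(heap, (d + wt_e, q_next))
    return dist


# Hypothetical rules; p is the final state, and best_weight comes from phase one.
context_rules: List[Rule] = [
    ("a", (), 1.0, "q"),
    ("f", ("q", "q"), 1.0, "q"),
    ("b", (), 5.0, "p"),
    ("g", ("q",), 1.0, "p"),
]
context_weights = best_context_weights(context_rules, {"q": 1.0, "p": 2.0}, "p")
```

In this example, the best context of `q` is *g*[□] with weight 1, reached by the single edge from `p` to `q` that is labeled with the rule for *g*.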

### 3.3 Transition-Based Best Trees Computation

The improved algorithm for computing *N* best trees hinges on the observation that to generate *N* best trees, at most *N* distinct instantiations of each transition rule are needed. Here, an instantiation of a rule $\tau = f[q_1,\ldots,q_k] \xrightarrow{w} q$ is a tree *f*[*t*_{1},…,*t*_{k}]. Furthermore, the algorithm creates these instantiations in a lazy fashion.

To this end, we build sequences *T*_{q} of *N′* ≤ *N* best trees for each state *q*. Most importantly, a separate priority queue *K*_{τ} is kept for each transition rule $\tau = f[q_1,\ldots,q_k] \xrightarrow{w} q$ in *R*. Every tree in this queue is of the form $f[(T_{q_1})_{i_1},\ldots,(T_{q_k})_{i_k}]$ and is hence uniquely determined by the tuple (*i*_{1},…,*i*_{k}), each *i*_{j} working as a pointer into the sequence $T_{q_j}$.^{3} Hence, each such tuple can be understood as an abstract instruction of how to instantiate τ by previously dequeued trees. As outlined in Section 3.5, an efficient implementation of the algorithm can make this assembly “just in time,” so as to avoid unnecessary work.

In each iteration, the algorithm chooses the highest-priority element across all of the queues *K*_{τ}, where the priority order (to be described later) is similar to <_{K}, but improved by replacing the use of the lexical order by a more goal-oriented component. To efficiently pick the highest-priority element across all queues, we organize the queues in a meta-queue *K′*, where the priority of every *K*_{τ} in *K′* is given by the priority of its highest-priority element. This organization is schematically illustrated in Figure 2.
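One way to organize such a meta-queue is sketched below in Python. It is an illustration under simplifying assumptions, not the data structure used in Betty: priorities are plain numbers, each *K*_{τ} is a binary heap of (priority, item) pairs, and outdated entries of the meta-queue are skipped lazily instead of being updated in place. The class name `MetaQueue` and the queue names are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Tuple


class MetaQueue:
    """Pick the globally best element from several min-priority queues."""

    def __init__(self, queues: Dict[Hashable, List[Tuple[float, tuple]]]):
        self.queues = {name: list(q) for name, q in queues.items()}
        for q in self.queues.values():
            heapq.heapify(q)
        # one meta entry per non-empty queue: (priority of its top, name)
        self.meta = [(q[0][0], name) for name, q in self.queues.items() if q]
        heapq.heapify(self.meta)

    def push(self, name: Hashable, priority: float, item: tuple) -> None:
        q = self.queues.setdefault(name, [])
        heapq.heappush(q, (priority, item))
        if q[0][0] == priority:  # the top of this queue may have changed
            heapq.heappush(self.meta, (priority, name))

    def pop(self) -> Tuple[Hashable, float, tuple]:
        while self.meta:
            priority, name = heapq.heappop(self.meta)
            q = self.queues[name]
            if not q or q[0][0] != priority:
                continue  # stale meta entry, skip it
            _, item = heapq.heappop(q)
            if q:
                heapq.heappush(self.meta, (q[0][0], name))
            return name, priority, item
        raise IndexError("pop from an empty meta-queue")


mq = MetaQueue({"tau1": [(3.0, (1, 1)), (5.0, (1, 2))], "tau2": [(1.0, (1,))]})
first = mq.pop()
mq.push("tau2", 4.0, (2,))
rest = [mq.pop(), mq.pop(), mq.pop()]
```

The design choice here is to trade a few stale heap entries for simpler code; an implementation with an addressable heap could instead adjust the priority of a queue in place whenever its top element changes.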

The algorithm itself is outlined in Algorithm 2. The sequences *T*_{q} mentioned above are the previously dequeued best instantiations of rules τ with *tar*(τ) = *q*. Thus, as mentioned before, *T*_{q} is a prefix of a solution of the *N*-best problem for the wta *M*^{q}.

For every $\tau\colon f[q_1,\ldots,q_k] \xrightarrow{w} q$ and every index tuple *u* = (*i*_{1},…,*i*_{k}) ∈ ℕ^{k}, the instantiated transition rule is denoted by τ[*u*]. As previously mentioned, it is defined as the tree $f[(T_{q_1})_{i_1},\ldots,(T_{q_k})_{i_k}]$, but since this is well defined only if $i_j \le |T_{q_j}|$ for every *j* ∈ [*k*], we let τ[*u*] be undefined otherwise.

The function *inc* returns, for every tuple *u* = (*i*_{1},…,*i*_{k}) ∈ ℕ^{k} (*k* ∈ ℕ), the set of all its successors obtained by increasing exactly one of *i*_{j}, *j* ∈ [*k*], by 1. Formally, for every tuple *u* = (*i*_{1},…,*i*_{k}) ∈ ℕ^{k},

$$\mathit{inc}(u) = \{(i_1,\dots,i_{j-1},\, i_j + 1,\, i_{j+1},\dots,i_k) \mid j \in [k]\}.$$
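As a small illustration, the successor function *inc* can be written in Python as follows. Returning a list ordered by the increased position (rather than a set) is an assumption made here for readability.

```python
from typing import List, Tuple


def inc(u: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Return all successors of u obtained by increasing exactly one component by 1."""
    return [u[:j] + (u[j] + 1,) + u[j + 1:] for j in range(len(u))]
```

For a rule of rank 0 the tuple is empty and has no successors, which matches the remark that only queues of rank-0 rules can run empty.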

Each queue *K*_{τ} in Algorithm 2 is a min-priority queue in which the least priority is assigned to elements *u* such that τ[*u*] is undefined. This reflects that we do not yet know the weight of the instantiation of τ with respect to *u*, and that every instantiation for which we do know the weight of the resulting tree can be shown to be preferable.

To define the priority order on *K*_{τ}, we follow the previous approach, using a variant of <_{K}, but in addition to the necessary adaptations, we shall replace the lexical component by a more goal-oriented one. The priority of every element *u* of *K*_{τ} is primarily determined by a variant of Δ, denoted by Δ_{τ}(*u*). Recall that, for *q* ∈ *Q*, Δ_{q}(*t*) is the least weight of all runs on trees of the form $c⟦t⟧$, where only runs are considered whose target state at the root of the subtree *t* is *q*. Thus, Δ_{q}(*t*) ≥ Δ(*t*), where equality holds for *q* = *opt*(*t*). Now, Δ_{τ} specializes this further by assuming that the specific transition rule applied at the root of the subtree *t* is τ. Moreover, because we only apply this weight function to trees of the form τ[*u*], we let its argument be *u* rather than τ[*u*]. Formally, consider a transition rule $\tau : f[q_1,\dots,q_k] \xrightarrow{w} q$. If τ[*u*] is undefined, we simply put $\Delta_\tau(u) = \infty$. If τ[*u*] is defined, we define

$$\Delta_\tau(u) = 𝕨_q + w + \sum_{i=1}^{k} M^{q_i}((T_{q_i})_{u_i}).$$

… *q*_{i} is an optimal state for $(T_{q_i})_{u_i}$ for *i* ∈ [*k*]. Thus, if ρ_{i} is the lowest-weight run on $(T_{q_i})_{u_i}$ with $\mathit{input}(\rho_i) = (T_{q_i})_{u_i}$ and *tar*(ρ_{i}) = *q*_{i}, and we set ρ = τ[ρ_{1},…,ρ_{k}], then the expression $w + \sum_{i=1}^{k} M^{q_i}((T_{q_i})_{u_i})$ is simply the definition of *wt*(ρ), which is equal to *M*^{q}(τ[*u*]). Thus, adding 𝕨_{q}, Δ_{τ}(*u*) turns out to be the minimum of all *M*(ρ), where ρ ranges over all runs with ρ(*v*) = τ and *input*(ρ/*v*) = τ[*u*] for some *v* ∈ dom ρ. Thus, Δ_{τ}(*u*) ≥ Δ_{q}(τ[*u*]) ≥ Δ(τ[*u*]).
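The following Python sketch computes Δ_{τ}(*u*) as described above: the weight of the best context of the target state, plus the rule weight, plus the weights of the selected subtrees, and ∞ if τ[*u*] is undefined. The function name and the representation of each *T*_{q_i} as a list of already known tree weights are assumptions made for illustration.

```python
import math
from typing import Sequence


def delta_tau(w: float, context_weight: float,
              child_weights: Sequence[Sequence[float]],
              u: Sequence[int]) -> float:
    """Priority component for the instantiation of a rule by the index tuple u.

    child_weights[i] holds the weights of the trees currently in the sequence
    for the i-th source state; indices in u are 1-based, as in the text.
    """
    total = context_weight + w
    for weights, index in zip(child_weights, u):
        if index > len(weights):  # the instantiation is still undefined
            return math.inf
        total += weights[index - 1]
    return total
```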

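To define the priority order in the queues *K*_{τ} (τ ∈ *R*) and in *K′*, let $\Lambda_\tau(u) \in \mathbb{N}^{\infty} \times \mathbb{N}$ be given by

$$\Lambda_\tau(u) = (\Delta_\tau(u), \delta_{\mathit{tar}(\tau)}),$$

where δ_{q} = *depth*(𝕔_{q}) for every *q* ∈ *Q*. We order $(i,j),(i',j') \in \mathbb{N}^{\infty} \times \mathbb{N}$ as usual, namely, (*i*,*j*) < (*i′*,*j′*) if *i* < *i′* or *i* = *i′* and *j* < *j′*.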
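The order on these pairs is the usual lexicographic one, which coincides with the built-in comparison of tuples in Python. In the sketch below, the helper name `priority` is hypothetical and `math.inf` stands in for ∞.

```python
import math
from typing import Tuple


def priority(delta: float, depth: int) -> Tuple[float, int]:
    """Pair (weight component, depth of the best context), compared lexicographically."""
    return (delta, depth)


# A smaller first component wins regardless of the second one.
assert priority(3, 5) < priority(4, 0)
# Ties in the first component are broken by the second one.
assert priority(3, 1) < priority(3, 2)
# Undefined instantiations (weight infinity) come after every defined one.
assert priority(7, 9) < priority(math.inf, 0)
```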

Now, let *q* ∈ *Q* and $(\tau : f[q_1,\dots,q_k] \xrightarrow{w} q) \in R$. Having computed best contexts (and, in the process, also best trees) for all states, *T*_{q} is initialized to contain only the best tree $t_q^{\mathit{best}}$ for *q*. The queue *K*_{τ} is initialized to contain only 1^{k} if τ ≠ ρ_{q}(λ), since the latter means that τ was not used to build $t_q^{\mathit{best}}$, which means that the first instantiation of τ is still waiting to be added to *T*_{q} at a suitable position. Otherwise, *K*_{τ} is initialized to contain all successors of 1^{k} because, by the way in which $t_q^{\mathit{best}}$ was constructed and the definition of τ[*u*], we have $(T_q)_1 = t_q^{\mathit{best}} = \tau[1^k]$ and are now looking for the next best instantiation of τ.

After initializing *T*_{q} and *K*_{τ}, the algorithm enters the main loop. This loop is executed until *N* trees have been outputted. The first step in the loop is to extract a minimal *u* from the set of all queues *K*_{τ}, τ ∈ *R* (by first dequeuing *K*_{τ} from *K′* and then *u* from *K*_{τ}). Thanks to our assumption that *M* possesses infinitely many runs ending in *q*_{f}, it can be shown that Δ_{τ}(*u*) ∈ℕ, that is, the tree τ[*u*] is defined (see the next section). The tree is appended at the end of the list of the best (smallest-weighted) trees that reach the target state *tar*(τ) of τ. If *tar*(τ) = *q*_{f} and τ[*u*] is “new,” that is, has not been outputted before, then τ[*u*] is outputted now, and the counter *c* tracking the number of output trees is increased. Note that, in contrast to Algorithm 1, we have to check whether τ[*u*] was outputted before, because some *τ′*[*u′*] outputted earlier may actually have been equal to τ[*u*]. However, this can only be the case if τ≠*τ′*, and can thus only happen *m* times for every tree.

#### 3.4 Correctness

Let us now show that Algorithm 2 is correct. To simplify the reasoning, we shall first consider a variant of the algorithm, referred to as Algorithm 2*′*, obtained by removing the inequality |*T*_{tar(τ)}| < *N* from the condition on line 14. In the following lemma, we say that *T*_{q} = *t*_{1}⋯*t*_{m} is **appropriate** if *t*_{1},…,*t*_{m} is a solution of the *m*-best trees problem for *M*^{q}.

**Lemma 1.** Consider a run of Algorithm 2*′*. During every execution of the main loop the following statements hold:

- (1) When the loop is entered, each *T*_{q} (*q* ∈ *Q*) is appropriate.
- (2) When line 13 has been executed with *tar*(τ) = *q*, there do not exist any *q′* ∈ *Q* and *t* ∈ *T*_{Σ} ∖ *T*_{q′} such that Δ_{q′}(*t*) < Δ_{τ}(*u*) (where τ and *u* denote the values of the corresponding variables in Algorithm 2*′* at that point).

*Proof.*

We use (1) as a loop invariant. Because it holds before the first execution of the loop (as each *T*_{q} consists of a single best tree with respect to *M*^{q}), we need to show that, under the condition that (1) holds when the loop is entered, statement (2) holds as well and, when line 21 has been executed, (1) still holds.

To show (2), assume that such a tree *t* did actually exist and let $s = 𝕔_{q'}⟦t⟧$, where $𝕔_{q'}(v_0) = \square$, that is, *v*_{0} is the node such that *s*/*v*_{0} = *t*. By the definition of Δ_{q′}, there is a run ρ with *input*(ρ) = *s*, *tar*(ρ/*v*_{0}) = *q′*, and *wt*(ρ) = Δ_{q′}(*t*). For every *v* ∈ dom(*s*), let *q*_{v} = *tar*(ρ/*v*) be the state the run is in after having processed *s*/*v*. There is at least one node *v* ∈ dom(*s*) such that $s/v \notin T_{q_v}$ (namely, *v* = *v*_{0}). Now, choose *v* ∈ dom(*s*) with $s/v \notin T_{q_v}$ in such a way that |*s*/*v*| is minimal, and suppose that *s*/*v* = *g*[*s*_{1},…,*s*_{ℓ}] and $\rho(v) = (\tau' : g[p_1,\dots,p_\ell] \xrightarrow{w} q_v)$. (Thus, *p*_{i} = *q*_{vi} for all *i* ∈ [*ℓ*].) By the minimality of |*s*/*v*|, $s_1 \in T_{p_1},\dots,s_\ell \in T_{p_\ell}$, and thus there is *u′* = (*j*_{1},…,*j*_{ℓ}) ∈ ℕ^{ℓ} such that $(T_{p_i})_{j_i} = s_i$ for all *i* ∈ [*ℓ*]. With *p* = *opt*(*s*/*v*) it follows that

$$\Delta_{\tau'}(u') \le \mathit{wt}(\rho) = \Delta_{q'}(t) < \Delta_\tau(u).$$

Because $s/v\u2209Tqv$, the tuple *u′* has not yet been dequeued from *K*_{τ′}. Thus, while *u′* itself may not yet be in *K*_{τ′}, *K*_{τ′} must contain some *u*″ = (*j*_{1}*′*,…,*j*_{ℓ}*′*) with *j*_{i}*′* ≤ *j*_{i} for all *i* ∈ [*ℓ*]. This is because when an element is dequeued on line 13 then all of its direct successors will be enqueued on line 21. In particular, *K*_{τ′} cannot be empty at the start of an iteration, as long as there is some *u′* ∈ℕ^{ℓ} that has not yet been dequeued from *K*_{τ′} (which will always be the case if *ℓ* > 0).

Note that *j*_{i}*′* ≤ *j*_{i} for all *i* ∈ [*ℓ*] implies that $\Delta_{\tau'}(u'') \le \Delta_{\tau'}(u')$ since $T_{p_1},\dots,T_{p_\ell}$ are appropriate. Hence, the inequality Δ_{τ′}(*u*″) ≤ Δ_{τ′}(*u′*) < Δ_{τ}(*u*) contradicts the assumption that *u* was dequeued on line 13.

To see that (1) is preserved when τ[*u*] is appended to *T*_{q} on line 15, it remains to be shown that *M*^{q}(τ[*w*]) ≤ *M*^{q}(τ[*u*]) for all trees τ[*w*] that have been appended to *T*_{q} during earlier iterations. Choosing *q* as *q′*, τ[*u*] as *t*, and *w* as *u* in (2) we get

$$\Delta_\tau(w) \le \Delta_q(\tau[u]) \le \Delta_\tau(u),$$

where the first inequality holds because τ[*u*] was not yet in *T*_{q} and the second holds because *tar*(τ) = *q*. Furthermore, by (2) there is no tree *t′* ∈ *T*_{Σ} ∖ *T*_{q} such that *M*^{q}(*t′*) < *M*^{q}(τ[*u*]). Thus, *T*_{q} is still appropriate when τ[*u*] has been appended to it.

**Lemma 2.** Algorithm 2*′* computes a solution to the *N*-best trees problem.

*Proof.*

We first observe that, due to the output condition on line 17 of Algorithm 2*′*, the sequence of trees outputted by the algorithm does not contain repetitions.

Next, whenever a tree *s* = τ[*u*] is outputted on line 18, we show that *M*(*t*) ≥ *M*(*s*) for all trees *t* ∈ *T*_{Σ} that have not yet been outputted. Let $\tau : f[q_1,\dots,q_k] \xrightarrow{w} q$. Then *q* = *q*_{f} and thus $𝕔_q = \square$, 𝕨_{q} = 0, and *M*(*s*) = Δ_{τ}(*u*). Now, consider any *t* ∈ *T*_{Σ} that has not yet been outputted. Then we have $t \in T_\Sigma \setminus T_{q_f}$. Consequently, $M(t) = \Delta_{q_f}(t) \ge \Delta_\tau(u)$ by Lemma 1, as required.

To complete the proof, we have to show that there cannot be an infinite number of iterations without any tree being outputted. We show the following, stronger statement:

Claim 1. Let $\delta = \max_{q \in Q} \delta_q$, where δ_{q} is defined as in Equation 2 for all *q* ∈ *Q*. At any point in time during the execution of Algorithm 2*′*, it takes at most δ iterations until the selected transition rule τ satisfies *tar*(τ) = *q*_{f}.

Let *K*_{τ} be dequeued on line 12 and let *q* = *tar*(τ). If *q* = *q*_{f}, the statement holds. Otherwise, the context 𝕔_{q} has the form … *p*_{i} = *q*. The tree *t* = τ[*u*] is appended to *T*_{q}, say at position *n*. It follows that … *t′* = *τ′*[*u′*] with

- there is no *u*″ in any of the queues *K*_{τ″} with Δ_{τ″}(*u*″) < Δ_{τ}(*u*) (Lemma 1(2)),
- $\Delta_{\tau'}(u') = M(𝕔_q⟦t⟧) = M(𝕔_{q'}⟦t'⟧) = \Delta_\tau(u)$, and
- δ_{q′} = δ_{q} − 1.

It follows that the queue *K*_{τ″} from which a tuple is dequeued on line 12 at the start of the next iteration satisfies δ_{tar(τ″)} = δ_{q} − 1, thus bounding the number of iterations until this quantity reaches 0 from above by δ.

Now, to finish the proof, note that τ[*u*]≠τ[*u′*] whenever *u*≠*u′* because the sequences *T*_{q}, *q* ∈ *Q*, do not contain repetitions. Because, furthermore, no tuple *u* is enqueued twice in *K*_{τ}, line 18 will be reached after at most δ iterations.

We finally show that Algorithm 2 is correct as well.

**Theorem 1.** Algorithm 2 computes a solution to the *N*-best trees problem.

Consider an execution of Algorithm 2*′* and assume that we assign every tree a color, red or black, where black is the default color. (The color attribute of a subtree may differ from that of the tree itself.) Suppose that, in an iteration of the main loop, we have $\tau : f[q_1,\dots,q_k] \xrightarrow{w} q$ and *u* = (*i*_{1},…,*i*_{k}), and thus τ[*u*] = *f*[*t*_{1},…,*t*_{k}] where $t_j = (T_{q_j})_{i_j}$ for all *j* ∈ [*k*]. If |*T*_{q}| ≥ *N* on line 14 and we append τ[*u*] to it on line 15, we color τ[*u*] red while its subtrees *t*_{1},…,*t*_{k} keep their colors as given by their positions in $T_{q_1},\dots,T_{q_k}$.

Now, assume that some *T*_{q} contains a tree *t* = *f*[*t*_{1},…,*t*_{k}] such that *t*_{j} is red for some *j* ∈ [*k*]. We show that this implies that *t* is red. We know that *t* was appended to *T*_{q} as a tree of the form τ[*u*] for some *u* = (*i*_{1},…,*i*_{k}) with *i*_{j} > *N* (because of the assumption that *t*_{j} is red). We know also that earlier iterations have dequeued all tuples from *K*_{τ} of the form *u*^{i} = (*i*_{1},…,*i*_{j−1},*i*,*i*_{j+1},…,*i*_{k}) for *i* = 1,…,*N*. (This is an immediate consequence of Lemma 2 because, by Lemma 1(1), Δ_{τ}(*u*^{i}) < Δ_{τ}(*u*) for all *i* ∈ [*N*].) Since the τ[*u*^{i}] are pairwise distinct (as they differ in the *j*-th direct subtree) this means that |*T*_{q}| ≥ *N* before τ[*u*] was appended, thus proving that *t* is red, as claimed.

Because Algorithm 2*′* terminates when $|T_{q_f}| = N$, none of the trees in $T_{q_f}$ will ever be red. By the above, this implies that all subtrees of trees in $T_{q_f}$ are black as well. As subtrees inherit their color from the *T*_{q} they are taken from, this shows that no red tree occurring in an execution of Algorithm 2*′* can ever have an effect on the output of the algorithm. Hence, the output of Algorithm 2 is the same as that of Algorithm 2*′* and the result follows from Lemma 2.

#### 3.5 Time Complexity

Recall that the input *M* = (*Q*,Σ,*R*,*q*_{f}) is assumed to be a wta with *m* transition rules, *n* states, and a maximum rank of *r* among its symbols. In the complexity analysis, we consider an efficient implementation of Algorithm 2 along the lines illustrated in Figure 2, with priority queues based on heaps (Cormen et al. 2009). This enables us to implement the following details efficiently:

- Consider a transition $\tau : f[q_1,\dots,q_k] \xrightarrow{w} q$. At a given stage of the algorithm, some of the elements *u* = (*u*_{1},…,*u*_{k}) in *K*_{τ} may contain elements *u*_{i} such that $|T_{q_i}| < u_i$, that is, τ[*u*] is still undefined and hence $\Delta_\tau(u) = \infty$. Thus, Δ_{τ}(*u*) must be decreased from $\infty$ to its final value in $\mathbb{R}_+^{\infty}$ when $|T_{q_i}|$ has reached *u*_{i} for all *i* ∈ [*k*]. For this, we record for every *p* ∈ *Q* and for |*T*_{p}| < *j* < |*N*| a list of all *u* ∈ *K*_{τ} such that τ is as above, *q*_{i} = *p* for some *i* ∈ [*k*], and *u*_{i} = *j*. When |*T*_{p}| reaches the value *j*, this list is used to adjust the priority of each *u* on that list to the new value of Δ_{τ}(*u*) (which is either still $\infty$ or has reached its final value).
- The queue *K′* that contains the individual queues *K*_{τ} as elements is implemented in the straightforward way, also using priority queues based on heaps. As described earlier, the priority between *K*_{τ} and *K*_{τ′} is given by comparing their top-priority elements: if these elements are *u* and *u′*, respectively, then *K*_{τ} takes priority over *K*_{τ′} if (Δ_{τ}(*u*),δ_{tar(τ)}) < (Δ_{τ′}(*u′*),δ_{tar(τ′)}). If one of the queues *K*_{τ} runs empty (which can only happen if rank(τ) = 0), then *K*_{τ} is removed from *K′*.

We now establish an upper bound on the running time of the algorithm.

**Theorem 2.** Algorithm 2 runs in time $O(Nm(\log m + r^2 + r\log(Nr)))$.

*Proof.*

For the proof, we look at the maximum number of instantiations that we encounter during a run of the algorithm.

Because *K′* contains (at most) the *m* queues *K*_{τ}, enqueuing into *K′* is in $O(\log(m))$. Furthermore, each rule is limited to *N* instantiations, due to the fact that the sequences *T*_{q} do not contain repetitions and thus τ[*u*] ≠ τ[*u′*] for *u* ≠ *u′*. This implies that the maximum number of iterations of the main loop is *Nm*, yielding an upper bound of $O(Nm\log(m))$ for the management of *K′*.

Next, we have the rule-specific queues *K*_{τ}. Each time a tuple *u* ∈ ℕ^{k} is dequeued from *K*_{τ}, at most |*inc*(*u*)| = *k* ≤ *r* new tuples are enqueued. In total, the creation of these *k* tuples of size *k* each takes *k*^{2} ≤ *r*^{2} operations. Thus, there will be at most *Nr* elements in *K*_{τ} for any τ, which gives us a time bound of $O(\log(Nr))$ per queue operation, a total time of $O(N(r^2 + r\log(Nr)))$ for the management of *K*_{τ}, and thus a total of $O(Nm(r^2 + r\log(Nr)))$ for the *m* queues *K*_{τ} altogether.

To complete the analysis, we have to argue that the time needed to check whether τ[*u*] has been outputted before can be made negligible. We do this by implementing the forest of outputted trees in such a way that equal subtrees are shared. Hence, trees are equal if (and only if) they have the same address in memory. Assuming a good hashing function, the construction of τ[*u*] from the (already previously constructed) trees referred to by *u* can essentially be done in constant time. Now, if we maintain with every previously constructed tree a flag ✓ indicating whether that tree has already been outputted once, the test boils down to constructing τ[*u*] (which would return the already existing tree if it did exist) and checking the flag ✓.
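The sharing of equal subtrees can be sketched in Python as follows. The names `Node`, `TreeStore`, `make`, and `output_if_new` are hypothetical, and a dictionary keyed on the symbol and the identities of the (already shared) children stands in for the hashing scheme mentioned above.

```python
class Node:
    """A tree node; instances are only meant to be created through TreeStore.make."""
    __slots__ = ("symbol", "children", "outputted")

    def __init__(self, symbol, children):
        self.symbol = symbol
        self.children = children
        self.outputted = False  # plays the role of the flag in the text


class TreeStore:
    """Hypothetical store in which equal trees are represented by one object."""

    def __init__(self):
        self._nodes = {}  # (symbol, identities of children) -> node

    def make(self, symbol, children=()):
        children = tuple(children)
        # Children are shared already, so their identities determine the tree.
        key = (symbol, tuple(id(c) for c in children))
        node = self._nodes.get(key)
        if node is None:
            node = Node(symbol, children)
            self._nodes[key] = node
        return node

    def output_if_new(self, node):
        """Return True exactly once per distinct tree."""
        if node.outputted:
            return False
        node.outputted = True
        return True
```

Because the store keeps every node alive, object identities remain valid as dictionary keys for the lifetime of the store.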

… *O*(*Nn*).

We have not performed a detailed space complexity analysis, but because we know that the logarithmic factors are due to heap operations, we can conclude that the memory space consumption of the algorithm is in *O*(*Nm*).

In practical applications, we expect to see Algorithm 2 used in two ways. The first is, as discussed in the Introduction, the situation in which *N* best trees are computed for a relatively small value of *N*, for example, *N* = 200 as suggested by Socher et al. (2013). Here, based on a comparison of the upper bounds on the running time, and assuming that they are reasonably tight, Algorithm 2 will outperform Algorithm 1 if *N* is larger than $m$, which we believe is the common case. In the second scenario, Algorithm 2 is invoked with a very large *N* to enumerate the trees recognized by the input automaton, outputting them in ascending order by weight. In this scenario, our exploration by transition rule is even more valuable, as it saves redundant computation.

## 4 The Algorithm Best Runs

We now recall the algorithm by Huang and Chiang (2005), which we henceforth will refer to as Best Runs. To facilitate comparison, we express Best Runs in terms of wta.

The type of wta used as input to Best Runs differs slightly from the one used in Definition 1 in that the admissible weight structures are not restricted to the tropical semiring. Instead, each transition rule τ: *f*[*q*_{1},…,*q*_{k}] → *q* is equipped with a weight function $\mathit{wt}_\tau : \mathbb{R}_+^k \to \mathbb{R}_+$. The definition of the weight of a run ρ = τ[ρ_{1},…,ρ_{k}] is then changed to wt(ρ) = wt_{τ}(wt(ρ_{1}),…,wt(ρ_{k})).

For the algorithm to work, these weight functions wt_{τ} are required to be monotonic: wt_{τ}(*w*_{1},…,*w*_{k}) ≥ wt_{τ}(*w*_{1}*′*,…,*w*_{k}*′*) whenever *w*_{i} ≥ *w*_{i}*′* for all *i* ∈ [*k*]. Definition 1, which Best Trees is based on, corresponds to the special case where each of these weight functions is of the form $\mathit{wt}_\tau(w_1,\dots,w_k) = w + \sum_{i \in [k]} w_i$ for a constant *w*. In other words, the weight functions Best Trees can work with are a restriction of those Best Runs works on. This will be discussed in Section 5.

The input to Best Runs is a pair (*M*,*N*), where *M* is a wta with *m* transition rules and *n* states, and *N* ∈ℕ. The algorithm is outlined in Algorithm 3. Line 2 is a preprocessing step that can be performed in *O*(*m*) time using the Viterbi algorithm, given that the rank of the alphabet used is considered a constant. A list *input*[*q*] is used to store, for each state *q* ∈ *Q*, the at most *N* discovered best runs arriving at *q*. The search space of candidate runs is represented by an array of heaps, here denoted *cands*. For each state *q*, *cands*[*q*] holds a heap storing the (at most *N*) best, so far unexploited, runs arriving at *q*. That is, if we have already picked the *N′* best runs arriving at a node, the heap lets us pick the next best unpicked candidate efficiently when so requested by the recursive call.

To expand the search space, the *N′*-th best run ρ is used as follows: if τ = ([*q*_{1},…,*q*_{k}] → *q*), we obtain a new candidate by replacing the *i*-th direct subtree of ρ with the next (and thereby minimally worse) run in *input* arriving at *q*_{i}. Note that this is equivalent to the increment method used in Best Trees.
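The increment method can be illustrated for a single transition rule with additive weights. The sketch below is a simplification: it assumes that the sorted weight lists of the best runs (or trees) for the source states are already fully known, whereas both algorithms extend these lists lazily. The function name and the data layout are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def best_instantiations(w: float, children: Sequence[Sequence[float]],
                        n: int) -> List[Tuple[float, Tuple[int, ...]]]:
    """Return up to n pairs (weight, index tuple) in order of ascending weight.

    children[j] is the ascending list of weights for the j-th source state;
    index tuples are 1-based, as in the text.
    """
    k = len(children)
    if any(len(c) == 0 for c in children):
        return []

    def weight(u: Tuple[int, ...]) -> float:
        return w + sum(children[j][u[j] - 1] for j in range(k))

    start = (1,) * k
    heap = [(weight(start), start)]
    seen = {start}
    result = []
    while heap and len(result) < n:
        wt, u = heapq.heappop(heap)
        result.append((wt, u))
        # Increment: replace one direct subtree by the next best alternative.
        for j in range(k):
            v = u[:j] + (u[j] + 1,) + u[j + 1:]
            if v[j] <= len(children[j]) and v not in seen:
                seen.add(v)
                heapq.heappush(heap, (weight(v), v))
    return result
```

Monotonicity of the weight function is what makes this correct: every successor of a tuple weighs at least as much as the tuple itself, so no cheaper candidate can appear after a more expensive one has been dequeued.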

As shown by Huang and Chiang (2005), the worst case running time of Best Runs is $O(m\log|V| + s_{\max} N\log N)$ where $s_{\max}$ is the size of the largest run among the *N* results.^{4} Using similar reasoning for the memory complexity as for Best Trees, Best Runs achieves an $O(m + s_{\max} N)$ memory complexity bound.

## 5 Comparison

A major difference between Best Runs and Best Trees is that the former solves the *N*-best runs problem whereas the latter solves the *N*-best trees problem: On lines 14 and 17 of Algorithm 2, duplicate trees are discarded. If these conditions are removed, Best Trees solves the best runs problem. (Provided, of course, that the objects outputted are changed to being runs rather than trees.)

For the sake of comparison, we adopt the view of Huang and Chiang (2005) that the ranked alphabet Σ can be considered fixed. This yields that the running time of Best Trees is $O(Nm(\log m + \log N))$. Since *m* ∈ *O*(*n*^{r+1}), where *r* is the (now fixed) maximal rank of symbols in Σ, it follows that $\log m \in O(\log n)$, so the second expression simplifies to $Nm(\log n + \log N) \le m\log n \cdot N\log N$. Moreover, recall that the worst case running time of Best Runs is $O(m\log n + s_{\max} N\log N)$. If *N* is large enough to make the second term the dominating one, the difference between both running times is thus a factor of $(m\log n)/s_{\max}$ (assuming for the sake of the comparison that the given bounds are reasonably tight). A further comparison between the running times does not appear to be all that meaningful because $s_{\max}$ depends on both *N* and the structure of the input wta, and the algorithms are specialized for different problems.

A conceptual comparison of the two algorithms may be more insightful. The algorithms differ mainly in two ways. The first difference is that, while Best Runs enumerates candidates of best runs using one priority queue per node of *G*, Best Trees uses a more fine-grained approach, maintaining one priority queue per hyperedge. Since the upper bound on the length of queues is *O*(*N*) in both cases, in total Best Trees may need to handle *O*(*Nm*) candidates whereas Best Runs needs only *O*(*Nn*). This ostensible disadvantage of Best Trees is not a real one as it occurs only when solving the *N*-best trees problem. The difference vanishes if the algorithm is used to compute best runs (by changing lines 14 and 17). To see this, consider a given state *q* ∈ *Q*. Each time a queue element (*i*_{1}, …, *i*_{k}) is dequeued from a queue *K*_{τ} with τ: *f*[*q*_{1},…,*q*_{k}] → *q*, the corresponding run is appended to the list *T*_{q} on line 15, and *k* queue elements are inserted on line 21. As this can only happen at most *N* times per state *q*, in total only *O*(*Nn*) queue elements are ever created in the worst case. In other words, if Best Trees is set to solve the *N*-best runs problem, the splitting of queues does not result in a disadvantage compared to Best Runs.

The second difference is that Best Trees, like the *N*-best strings algorithm by Mohri and Riley (2002), uses best contexts. It precomputes and uses (the weight and depth of) a best context *c*_{q} for every state *q* in order to explore the search space in a more goal-oriented fashion. The possibility of using this optimization depends on the use of the tropical semiring. As mentioned in the Introduction, Büchse et al. (2010) extend the algorithm of Huang and Chiang to structured weight domains. This does not seem to be possible for Algorithm 2 (nor for Algorithm 1). The reason is that the computation and use of best contexts requires that the semiring is extremal, which is the case for the tropical semiring, but not for structured weight domains in general. Let *ℓ* be the maximum of the distances of states in *Q* to *q*_{f}, that is, $\ell = \max_{q \in Q} \mathit{dist}(q, q_f)$, where *dist*(*q*,*q*_{f}) denotes the length of the shortest sequence of transition rules τ_{0}⋯τ_{n} such that *q* ∈ *src*(τ_{0}), *tar*(τ_{i−1}) ∈ *src*(τ_{i}) for all *i* ∈ [*n*], and *q*_{f} = *tar*(τ_{n}). The argument used to show Claim 1 in the proof of Lemma 2 yields that, at every point in time, at most *ℓ* further loop executions are made before the next best run is outputted. To see this, consider an execution of the main loop, in which a run ρ = τ[⋯] arriving at a state *q* is constructed. The priority of the corresponding queue element on line 13 is given by (wt(ρ) + *w*, *d*), where *w* and *d* are the weight and depth of *c*_{q}. If *q* ≠ *q*_{f}, then $c_q \ne \square$, and thus it is of the form $c_{q'}⟦\tau'[\rho_1,\dots,\rho_{i-1},\square,\rho_{i+1},\dots,\rho_k]⟧$, where the ρ_{j} are the 1-best runs arriving at their respective nodes (see Figure 3).

It follows that line 21 inserts the element into *K*_{τ′} that represents the run *ρ′* = *τ′*[ρ_{1},…,ρ_{i−1},ρ,ρ_{i+1},…,ρ_{k}]. Because each of the runs ρ_{j}, arriving at some state *q*_{j}, is already in the respective list $T_{q_j}$, the run *ρ′* immediately becomes a current candidate arriving at *q′* = *tar*(*τ′*). The corresponding pair (*w*_{q′},*d*_{q′}) = (wt(*c*_{q′}),*depth*(*c*_{q′})) satisfies *w*_{q′} + wt(*ρ′*) = *w* + wt(ρ) and *d*_{q′} = *d* − 1. Hence, *ρ′* (or another run with the same priority) will be picked in the next loop execution, meaning that after at most *ℓ* steps the node arrived at by the constructed run will be *q*_{f}, that is, the run will be outputted. This also shows that queues whose elements are not needed for generating output trees will never become filled beyond their initial element.

We end the comparison by looking at Table 2, which summarizes the discussion above and provides comparison data for the other *N*-best algorithms discussed in this article. Keep in mind that *N* must be interpreted differently depending on the problem at hand. For example, the output of Best Runs is not equivalent to the output of Best Trees (unless the input is deterministic), which is why we cannot directly compare the time complexities. This output inequivalence should also be considered in Section 7 where, for simplicity, we plot data for Best Runs and Best Trees side by side.

**Table 2**

Algorithm | Objects | Time complexity | Best contexts | Search-space expansion |
---|---|---|---|---|
Eppstein (1998) | Paths | $O(n\log n + Nn + m)$ | No | Adding sidetracks to implicit heap representations of paths |
Mohri and Riley (2002) | Strings | No formal analysis provided | Yes | On-the-fly determinization |
Huang and Chiang (2005) | Runs | $O(m\log n + s_{\max} N\log N)$^{5} | No | Increment |
Best Trees v.1 | Trees | $O(N^2 n(n^2 + m\log N))$ | Yes | Eppstein’s algorithm |
Best Trees | Trees | $O(Nm(\log m + \log N))$ | Yes | Increment |
Best Trees | Runs | $O(N(\log m + \log N))$^{6} | Yes | Increment |

## 6 Implementation Details

In the upcoming section, we experimentally compare Best Runs and Best Trees. In preparation for that, we make a few comments on the implementation of Best Trees. As previously mentioned, Best Runs is implemented in the Java toolkit Tiburon by May and Knight, and this is the implementation we use in our experiments. Therefore, we simply refer to the Tiburon GitHub page^{7} for implementation details.

We have extended our code repository Betty,^{8} which originally provided an implementation of Best Trees v.1, to additionally implement the improved Best Trees as its standard choice of algorithm. A flag -runs can be passed on as an argument to compute best runs instead of best trees. Below follow a number of central facts about the Betty implementation.

First, recall Algorithm 2, and in particular that the best tree $t_q^{\mathit{best}}$ is inserted into *T*_{q} for all *q* ∈ *Q* prior to the start of the main loop rather than letting *T*_{q} be empty and initializing *K*_{τ} to 1^{rank(τ)} for all *q* ∈ *Q* and τ ∈ *R*. The reason is that the latter would not guarantee that at most *ℓ* iterations are made until a run is outputted because some ρ_{i} may still not be in *T*_{q}. If this happens, transition rules adding the weight 0 may repeatedly be picked because some other *τ′* is not yet enabled. However, with a trick the initialization can nevertheless be simplified as indicated. The idea is to delay the execution of line 22 for every queue *K*_{τ} until τ has actually appeared in an output tree. Thus, until this has happened, *K*_{τ} is disabled from contributing another tree to *T*_{tar(τ)}. This variant turned out to have efficiency advantages in practice and is therefore the variant implemented in Betty. Another advantage is that it allows Betty to handle a *set* of final states rather than a single one. This is not possible with the original initialization of Algorithm 2, because line 3 would have to be generalized to outputting the best trees of all final states, which cannot be done while maintaining the correctness of the algorithm, since the second best tree for a state *q* may have a lesser weight than the best tree for another state *q′*.

In lines 14 and 17 of Algorithm 2, trees are checked for equivalence. To perform the comparisons efficiently, we make use of hash tables. We use immutable trees, which need to be hashed only once when they are created; we then save the hash code together with the tree to make the former accessible in constant time for each tree. To compare trees for inequality, their hash codes are compared. If equal (which seldom happens if the trees are not equal), the comparison is continued recursively on the direct subtrees. This is theoretically less efficient than the method of representing trees uniquely in memory (as described in the last paragraph of the proof of Theorem 2), but practically sufficient and much easier to implement.
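The comparison strategy described above can be sketched in Python as follows. Betty itself is written in Java; the class below is only an illustration under that caveat, with a hypothetical class name, and it treats trees as immutable by convention rather than enforcing it.

```python
class Tree:
    """Tree whose hash code is computed once, at construction time."""
    __slots__ = ("symbol", "children", "_hash")

    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = tuple(children)
        # Children are complete when the parent is built, so their cached
        # hash codes can be combined without traversing the subtrees again.
        self._hash = hash((symbol, tuple(c._hash for c in self.children)))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, Tree):
            return NotImplemented
        if self is other:
            return True
        if self._hash != other._hash:
            return False  # different hash codes: the trees cannot be equal
        # Equal hash codes: confirm by comparing the direct subtrees recursively.
        return (self.symbol == other.symbol
                and len(self.children) == len(other.children)
                and all(a == b for a, b in zip(self.children, other.children)))
```

Unequal trees are almost always told apart by the first hash comparison, so the recursive part is rarely entered for them.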

When creating new tuples for a transition rule τ as given by line 21, we add the tuples that can be instantiated directly to the corresponding queue *K*_{τ}. The tuples that cannot be instantiated must, however, be stored until they can. We want to be able to efficiently access the tuples that are affected when adding a tree *t* to *T*_{q′} for some *q′* ∈ *Q* (line 15). Therefore, we connect each tuple to the memory locations that will contain the data needed by the tuple. In more detail: Let τ = (*f*[*q*_{1},…,*q*_{k}] → *q*) be any transition rule in *R* and let (*i*_{1},…,*i*_{k}) be a tuple originating from τ. For every *i*_{j} ∈ {*i*_{1},…,*i*_{k}}, the tuple is saved in a list of tuples affected by $T_{q_j}(i_j)$ and marked with a counter that shows how many trees remain until it can be instantiated. Thus, when *t* is added to *T*_{q′}(*i*_{j}) for *i*_{j} ∈ {*i*_{1},…,*i*_{k}} and *q′* = *q*_{j}, we can immediately access all tuples that can possibly be instantiated. If a tuple cannot be instantiated, its counter is decreased appropriately.
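A minimal Python sketch of this bookkeeping is given below. The class and method names are hypothetical, only the sizes of the sequences *T*_{q} are tracked rather than the trees themselves, and a tuple is registered only at the positions that are still missing, which is one possible reading of the description above.

```python
from collections import defaultdict


class PendingTuples:
    """Hypothetical bookkeeping for tuples that cannot be instantiated yet."""

    def __init__(self):
        self.sizes = defaultdict(int)     # state -> current length of its sequence
        self.waiting = defaultdict(list)  # (state, position) -> waiting entries
        self.ready = []                   # (rule, tuple) pairs that can be instantiated

    def add_tuple(self, rule, states, u):
        """Register the index tuple u for a rule with the given source states."""
        missing = [(q, i) for q, i in zip(states, u) if i > self.sizes[q]]
        if not missing:
            self.ready.append((rule, u))
            return
        entry = [rule, u, len(missing)]   # the last field is the counter
        for slot in missing:
            self.waiting[slot].append(entry)

    def add_tree(self, q):
        """Record that one more tree was appended to the sequence of state q."""
        self.sizes[q] += 1
        for entry in self.waiting.pop((q, self.sizes[q]), []):
            entry[2] -= 1                 # one fewer tree is missing
            if entry[2] == 0:
                self.ready.append((entry[0], entry[1]))
```

In a full implementation, a tuple moved to `ready` would be enqueued into the queue of its rule with its final priority.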

## 7 Experiments

Let us now describe our experiments. For each problem instance (i.e., combination of input file and value of *N*), we perform a number of test runs, measure the elapsed time, and compute the average over the test runs. To avoid noise in our data caused by, for example, garbage collection, we only measure the time consumed by the thread the actual application runs in. Also, we disregard the time it takes to read the input files.

The number of test runs that are performed per problem instance is decided by the relationship between the mean *μ* and the standard deviation *σ* of the recorded times—these values are computed every fifth test run, and to finish the testing of the current problem instance, we require that *σ* < 0.01*μ*. However, five test runs per problem instance has turned out to be sufficient for fulfilling the requirement in practically all instances seen.
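The stopping rule can be sketched in Python as follows. The function name, the cap `max_runs`, and the use of the sample standard deviation are assumptions; the text does not specify these details. The callable `run_once` is assumed to perform one test run and return its elapsed time.

```python
import statistics
from typing import Callable, Tuple


def average_time(run_once: Callable[[], float], batch: int = 5,
                 rel_tol: float = 0.01, max_runs: int = 1000) -> Tuple[float, int]:
    """Repeat test runs in batches until the standard deviation is below rel_tol * mean.

    Returns the mean of the recorded times and the number of runs performed.
    """
    times = []
    mean = float("nan")
    while len(times) < max_runs:
        times.extend(run_once() for _ in range(batch))
        mean = statistics.mean(times)
        # The requirement is checked after every batch of five runs by default.
        if statistics.stdev(times) < rel_tol * mean:
            break
    return mean, len(times)
```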

The memory usage is measured in terms of the maximum resident set size of the application process by using the Linux time command. Moreover, the strategy described above for computing the average over several test runs is used here as well.

All test scripts have been written in Python, in contrast to the tested implementations, which use Java. We run the experiments on a computer with a 3.60 GHz Intel Core i7-4790 processor.

### 7.1 Corpora

The corpora that were used in our experiments (see the list below) contain both real-world and synthetic data. The first corpus is derived from an actual machine-translation system and is thus representative for real-world usage. The second corpus consists of manually engineered grammars for a set of natural languages, used in a range of research and industry applications. The last two corpora are artificially created for the purpose of investigating the effect of increasing degrees of nondeterminism. Let us now present each in closer detail.

**MT-data** This data set consists of tree automata resulting from an English-to-German machine-translation task, described in the doctoral thesis of Quernheim (2017). The data consists of 927 files, each file containing a wta corresponding to one sentence; the files are indexed from 0 to 926. The smallest file has 24 lines and the largest one has 338,937 lines; all lines but the first hold a transition rule. Moreover, these wtas have a large number of states and are essentially deterministic in the sense that each state corresponds to a particular part-of-sentence structure and input node.

**GF-data** Unweighted context-free grammars from the Grammatical Framework (GF) by Ranta (2011) provided the basis for this corpus. GF is, among other things, a programming language and processing platform for multilingual grammar applications. It provides combinatory categorial grammars (Steedman 1987) for more than 60 natural languages, out of which we export a subset to context-free grammars in Backus–Naur form with the help of the built-in export tool, and assign every grammar rule the weight 1. The number of production rules varies between languages: The Latin grammar has, for example, 1.6 million productions, whereas the Italian grammar has 5.8 million. (These differences are due to the level of coverage chosen by the grammar designers, and do not necessarily reflect inherent complexities of the languages.)

**PolyNonDet** This is a family of automata of increasing size indexed by *i*. The states of member *i* are *q*_{0},…,*q*_{i}, where *q*_{i} is the final state and the rules are:

$a \xrightarrow{0} q_j$ for *j* = 0,…,*i*

$f[q_j, q_j] \xrightarrow{1} q_j$ for *j* = 0,…,*i*

$f[q_j, q_{j-1}] \xrightarrow{1} q_{j-1}$ for *j* = 1,…,*i*

There are then Θ(*n*^{i}) runs for a tree of size *n* (and the number of rules grows linearly). Therefore, we call these *polynomially nondeterministic*.

**ExpNonDet** Finally, we use another family of automata of increasing size, also indexed by *i* and with states *q*_{0},…,*q*_{i} and a final state *q*_{f}. The rules for member *i* of the family for *j*, *k* = 0,…,*i* and *j* ≠ *k* are:

$a \xrightarrow{0} q_j$

$f[q_j, q_k] \xrightarrow{1} q_j$

$f[q_k, q_j] \xrightarrow{1} q_j$

$q_j \xrightarrow{0} q_f$

Thus, the number of rules grows quadratically in *i* and the number of runs for a tree of size *n* is Ω((*i*+1)^{n}); we say that these are *exponentially nondeterministic*.
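To make the growth of the two rule sets concrete, the families can be generated programmatically. The sketch below is our own illustration, not the generator used for the experiments: the encoding of a rule as a tuple `(symbol, child states, weight, target state)` and the function names are assumptions, and the unary rules leading to *q*_{f} are encoded with the symbol `None`.

```python
def poly_nondet_rules(i):
    """Rules of member i of PolyNonDet; the final state is q_i."""
    rules = [("a", (), 0, f"q{j}") for j in range(i + 1)]
    rules += [("f", (f"q{j}", f"q{j}"), 1, f"q{j}") for j in range(i + 1)]
    rules += [("f", (f"q{j}", f"q{j - 1}"), 1, f"q{j - 1}")
              for j in range(1, i + 1)]
    return rules


def exp_nondet_rules(i):
    """Rules of member i of ExpNonDet; the final state is qf."""
    states = [f"q{j}" for j in range(i + 1)]
    rules = [("a", (), 0, q) for q in states]
    for qj in states:
        for qk in states:
            if qj != qk:
                rules.append(("f", (qj, qk), 1, qj))
                rules.append(("f", (qk, qj), 1, qj))
    # State-to-state rules q_j -> q_f, encoded here without a symbol.
    rules += [(None, (qj,), 0, "qf") for qj in states]
    return rules
```

Under this encoding, member *i* of PolyNonDet has 3*i* + 2 rules and member *i* of ExpNonDet has 2(*i* + 1)² rules, in line with the linear and quadratic growth stated above.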

The variables in all result-displaying plots in this article are either *N* (the number of trees or runs to output), *m* (the number of transition rules, i.e., lines in the input file), or both. For the artificially created corpora PolyNonDet and ExpNonDet, we exchange *m* for the variable *i* with which they are indexed. Note that for the former, *m* and *i* are interchangeable, but for the latter, *m* grows quadratically with increasing *i*.

### 7.2 Comparison with Best Trees v.1

First, we verify that the algorithm Best Trees is at least as efficient as its predecessor Best Trees v.1. Therefore, we run experiments on the ExpNonDet data set for both of the algorithms—both solving the best trees problem. The ExpNonDet data set was chosen because it has the largest degree of nondeterminism achievable, which should challenge both of the algorithms maximally. The results are presented in figures 4 and 5 for varying *N* and *i*, respectively. Note that both of these plots have logarithmic *y* axes. We see that Best Trees outperforms Best Trees v.1 considerably for both increasing *N* and increasing *i*.

More interestingly, in Figure 4, Best Trees displays a step-like behavior at *N* ≈ 100, 200, and 600. The nature of the ExpNonDet corpus is the explanation of this: When we have found all of the distinct trees of size *s* (all of the same weight), then the algorithm has to discard all of the duplicates of size *s* that are still in the queue, before arriving at a tree of size *s* + 1 with higher weight.

### 7.3 Comparison with Tiburon

Before turning to the comparison between Betty and Tiburon, recall from Section 5 that when using Betty to find the best trees, the input *N* cannot be equated with the *N* that is input to Tiburon and Betty to find the best runs. When the practical application actually demands best trees, a best runs implementation like Tiburon will potentially have to compute many more runs than trees needed, thus by necessity resulting in larger *N*. We have, however, for simplicity chosen to show the best trees data in the same plot as the data for best runs. It should also be noted that this comparison between Betty and Tiburon is essentially a “stress test” for Betty, which was designed to solve best trees but is here reduced to computing best runs.
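The gap between the two values of *N* can be illustrated by post-filtering a best-runs list. The sketch below is not how Best Trees works internally; it only shows how many runs a best-runs implementation must emit before *N* distinct trees have been seen. The function name and the representation of runs as `(weight, tree)` pairs are assumptions. Over the tropical semiring the weight of a tree is the minimum weight of its runs, so the first occurrence of a tree in a list sorted by weight carries the weight of that tree.

```python
def best_trees_from_runs(runs, n):
    """Extract n distinct trees from runs sorted by nondecreasing weight.

    `runs` yields (weight, tree) pairs. Returns the selected pairs and the
    number of runs that had to be consumed to obtain them.
    """
    seen, best, consumed = set(), [], 0
    for weight, tree in runs:
        if len(best) == n:
            break
        consumed += 1
        if tree not in seen:      # later runs on the same tree are duplicates
            seen.add(tree)
            best.append((weight, tree))
    return best, consumed
```

The difference `consumed - len(best)` is the number of duplicate runs that were discarded, which is what forces the larger *N* on the best-runs side.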

Next, we compare Betty with Tiburon for all data sets, starting with the MT-data corpus. Figure 6 shows the running times for finding the *N* best runs using Tiburon for every file of the MT-data corpus and *N* = 1,000, 2,000, …, 20,000; recall that *m* is the number of transition rules in the input file. The corresponding results for Betty can be seen in Figure 7. When we instead let Betty solve the best trees problem for the same input, we achieve the result in Figure 8. Figures 9 and 10 show the results for all three implementations in the same plots for a fixed file and value of *N*, respectively. It seems that Betty is faster than Tiburon on both tasks, but to be certain, we perform a statistical test. Because we have paired results from two populations, we run a Wilcoxon signed rank test on the results. We prepare both result sets for statistical testing by splitting them into 9 vectors of size 103 for each value of *N*; the splitting is done to allow us to find patterns in which implementation performs better given the parameters. Let *v*_{Betty} and *v*_{Tiburon} be such result vectors from the test runs of Betty and Tiburon, respectively. Our null hypothesis *H*_{0} is that *v*_{diff} ≔ *v*_{Betty} − *v*_{Tiburon} comes from a distribution with median 0, and our alternative hypothesis *H*_{1} is that *v*_{diff} comes from a distribution with median less than 0. For our statistical tests, we use a significance level of 0.01. The result has the form of probabilities of observing values as or more extreme than the data under the null hypothesis *H*_{0}, and it shows that each probability is smaller than 10^{−18}. (Because the numbers are insignificantly small, we exclude a table of detailed results.) We can therefore conclude that *H*_{0} is rejected for all partitions of our data. This implies that Betty is statistically significantly faster than Tiburon on the MT-data corpus for both tasks. For the rest of the experiments, we omit similar statistical tests.
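For readers who wish to reproduce this kind of test, a one-sided Wilcoxon signed rank test can be sketched with the Python standard library alone. This is our own minimal version and not the procedure used to obtain the reported probabilities: it relies on the normal approximation of the test statistic without tie or continuity correction, it drops zero differences, and the function name `wilcoxon_less` is hypothetical.

```python
import math


def wilcoxon_less(x, y):
    """One-sided Wilcoxon signed rank test of H1: median(x - y) < 0.

    Returns (w_plus, p), where w_plus is the sum of the ranks of the positive
    differences and p is the normal-approximation p-value.
    """
    diffs = [d for d in (a - b for a, b in zip(x, y)) if d != 0]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        while (end + 1 < n
               and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]])):
            end += 1
        avg = (pos + end) / 2 + 1          # mean of 1-based ranks pos+1..end+1
        for r in range(pos, end + 1):
            ranks[order[r]] = avg
        pos = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z), accurate in the tail
    return w_plus, p
```

A small `w_plus` means that few and only small differences are positive, that is, the first implementation is rarely slower than the second; *H*_{0} is rejected when `p` falls below the chosen significance level. Using `math.erfc` avoids the loss of precision that a naive `1 + erf(...)` would suffer for very small probabilities.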

Next, we apply the three algorithms to the GF-data corpus. We run tests for 42 different languages, and present a selection of typical results along with a single atypical one. The typical results are shown in figures 11, 12, and 13, and concern the languages Persian, Thai, and Somali. The Thai result shows one of the extremes among the typical results with the three implementations starting at about the same running time but then diverging, while the Somali plot shows the other extreme, that is, when the running times differ significantly already at the start. In all of these plots, Betty displays a better performance than Tiburon, regardless of the task solved. The one atypical result is found in Figure 14, which is the result from the Latin file. Here we see that Tiburon is faster than Betty for all values of *N* tested, although the slope of the Tiburon curve indicates that its running time will eventually exceed that of both Betty curves as *N* increases. To verify this, we ran the same experiment for *N* = 120,000, and confirmed that this is indeed the case: Tiburon runs in 3,105 ms whereas Betty runs in 2,681 and 3,055 ms for best runs and best trees, respectively. We conjecture that this unusual result is due to certain characteristics of the Latin corpus: The Latin and the Somali file are approximately the same size, but the Latin output trees are small and many represent one-word utterances, whereas the Somali output trees are larger and represent more complex sentence structures. The combination of a large file and small output trees makes the computation of the best contexts a large overhead, which is also visible in the plot: Even for zero output runs Betty seems to require about 2 seconds of computation time. Even though this is a combination that we rarely see in practice, it is an example of when computing the best contexts could be considered superfluous and thus detrimental to efficiency.

Finally, we measure the memory usage for the various implementations on one file from each natural language corpus. The results for the largest MT-data file and the Persian GF-data file can be seen in figures 15 and 16, respectively. For the former, the memory usage of Tiburon seems to grow faster than that of Betty, but for the latter, the roles are reversed. To explain why this happens, the memory usage of the implementations needs to be investigated in greater detail, a task left for future work.

Now we have arrived at the artificial corpora that were created to investigate the effect of nondeterminism on the algorithms. Running Tiburon on the PolyNonDet corpus yields the running times in Figure 17, and using Betty to solve the best runs and the best trees problems results in the numbers in figures 18 and 19, respectively. We observe that Tiburon’s running times are significantly larger than those of Betty. When solving the best trees problem, Betty does not seem to be affected by increasing *i*, which means it can handle increasing degrees of polynomial nondeterminism well. By inspection of Figure 20, which compares the algorithms with respect to *N*, it is possible to conclude that they are all in *O*(*N*) on this particular example.

We now use the ExpNonDet corpus to expose the algorithms to exponential nondeterminism. The expectation is that the best runs algorithms will continue to do their job well, simply because the exponential nondeterminism does not play a significant role in this case. Indeed, this turns out to be the case, with Betty (Figure 22) being slightly faster than Tiburon (Figure 21). However, the best trees problem is no longer that easily solvable. As can be seen in Figure 23, we had to restrict the intervals for both *i* and *N* to make the task feasible for the computer used. This outcome is expected since there is now an exponential number of runs on each tree, and these duplicates have to be discarded before a larger tree can be found: The plateaus in Figure 23 indicate where we go from including trees of sizes smaller than *s* to including trees of size *s*. Asymptotic testing verifies that the best trees curve grows quadratically with *i* (meaning that it is linear in *m*) and is not worse than $N \log N$ for the second dimension. Figures 24 and 25 make a closer comparison of Tiburon and Betty for the best runs task with respect to the variables *N* and *i*. The effect of increasing *N* is linear as for the PolyNonDet data, but when considering *i*, the previously constant behavior is now linear.
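The text does not spell out how the asymptotic testing was carried out. One common way to check a conjectured polynomial growth rate, shown here purely as an illustration, is to fit a line to the measurements on a log-log scale and read off the slope as the estimated exponent. The function name `growth_exponent` is hypothetical.

```python
import math


def growth_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For data following y = c * x**p, the returned slope is p.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den
```

A slope close to 2 for running time against *i* would be consistent with quadratic growth in *i*, and a slope only slightly above 1 against *N* would be consistent with a bound of $N \log N$; measured data are of course noisier than an exact power law.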

We conclude the experiments by presenting a smaller number of results for memory usage: Figures 26 and 27 show the memory usage measured in kbytes for varying *N* while figures 28 and 29 show the corresponding numbers for varying *i*. As expected, the memory usage is much larger for the best trees task than for the best runs task. Moreover, for the latter, Tiburon seems to have a slight advantage over Betty with respect to *i* but a clear advantage with respect to *N*.

## 8 Conclusion

We have presented an improved version of the algorithm by Björklund, Drewes, and Zechner (2019) that solves the *N*-best trees problem for weighted tree automata over the tropical semiring. The main novelty lies in the exploration of the search space with a focus on instantiations of transition rules rather than on states, and the lazy assembly of these instantiations. We have proved the new algorithm to be correct and derived an upper bound on its running time—a bound that is smaller than that of the previous algorithm. We believe that this speed-up makes it superior for usage in typical language-processing applications.

Moreover, we have complemented the theoretical work with an experimental evaluation. Because Best Trees can be easily modified to produce the best runs instead of the best trees, we considered both tasks in our evaluation. To achieve a large coverage, we used two types of data: data from real-world language processing tasks and artificially created data. The real-world data consist of machine translation output and corpus-based rule sets for natural languages; these corpora are meant to display the kind of behavior one can expect when applying our algorithm to language processing tasks. The artificial corpora were designed to expose Best Trees to its worst-case scenario for the best trees task: the case when we have an exponential number of duplicate trees in a best runs list. We also covered the more moderate case where the nondeterminism only gives rise to a polynomial number of duplicates.

In the experiments focusing on running time, we first used the exponentially nondeterministic data to show that Best Trees is better at the best trees task than its predecessor Best Trees v.1 that uses a less efficient pruning scheme. Then, we compared Best Trees with the state-of-the-art best-runs algorithm of Huang and Chiang (2005), implemented in Tiburon by May and Knight (2006) for both tasks on all data sets. The results made it clear that Best Trees is preferable when extracting *N*-best lists of both runs and trees: Betty outperforms Tiburon for the best runs task on all 2,269 input wtas except one. The single exception revealed a corner case where the Huang and Chiang algorithm is faster, namely, when there are millions of rules and only a small percentage of them are used to produce *N* very small (height 0 or 1) runs. Additionally, we performed a smaller number of experiments to measure the memory usage of the three applications, and, while no final conclusion could be made, it seemed as though Tiburon had an advantage over Best Trees with respect to memory usage in total.

Prior to the experiments, we compared the two algorithms at a conceptual level and discussed the expected effects on their running time. The allocation of queues to transition rules instead of states mainly serves to structure the implementation. As we saw, it does not have a disadvantage with respect to running time. The use of best contexts, generalizing the idea of Mohri and Riley (2002) to trees, has a positive effect: it ensures that the maximum distance of a state to the final state is an upper bound on the maximum number of main loop iterations before the next run is output. This is because the best contexts guide the algorithm to take the shortest route in constructing the next best run that reaches a final state.^{9}

The disadvantage of using best contexts is that it limits which semirings can be used. The technique is compatible with the tropical semiring and, as discussed in Section 2, with the equivalent Viterbi semiring, but seemingly not with semirings that are not extremal. It is currently unclear to us whether an appropriate extension is possible, and we leave this question for future work. However, such extensions seem only relevant for the *N*-best runs problem, because it appears highly unlikely that the *N*-best trees problem could ever be solved with reasonable efficiency in cases where the weight semiring is not extremal. The reason is that, in that case, the best run on a tree does not determine the weight of that tree, making it unclear how the problem can be solved even if disregarding efficiency aspects. However, because the tropical semiring and the Viterbi semiring are the most prominent ones used to rank hypotheses in NLP, the computational advantages seen in this article seem to justify the restriction to these semirings, even when looking only at the *N*-best runs problem.

## Acknowledgments

We are thankful to Andreas Maletti for providing the machine translation data set used in our experiments; to Jonathan May for his support in the application of Tiburon to the *N*-best problem; to André Berg for sharing his expertise in mathematical statistics; to Aarne Ranta, Peter Ljunglöf, and Krasimir Angelov for introducing us to Grammatical Framework; and to the reviewers for suggesting numerous improvements to the article.

## Notes

^{1}

This replacement had already been done in Tiburon, without an explicit remark.

^{2}

We assume that enq(*K*_{τ},*U*) enqueues all *u* ∈*U* in *K*_{τ} except those already in it, thus skipping duplicates.

^{3}

Recall that $(T_{q_j})_{i_j}$ denotes the *i*_{j}-th element of the sequence $T_{q_j}$.

^{4}

In fact, the running time obtained by Huang and Chiang (2005) is $O(m + s_{\max} N \log N)$, but this assumes (the graph representation of) *M* to be acyclic. When *M* is cyclic, line 2 must be implemented by using Knuth’s algorithm, resulting in an additional factor $\log n$ in the first term.

^{5}

Allows for cyclic input wta; $s_{\max}$ is the size of the largest output.

^{6}

Note that a factor *m* is removed, compared with when the same algorithm is used for finding the best trees. This is because all runs originating at the same rule queue are distinct (and naturally the same also holds for different rule queues).

^{9}

We have conducted a small experiment not mentioned in the previous section, by switching off that feature. It showed that the use of best contexts (in that particular, randomly chosen case) reduced the size of rule queues by a factor of 10.

## References

*k* shortest paths

*k*-best parsing

*Best Trees Extraction and Contextual Grammars for Language Processing*

*n*-best-strings problem

*Bimorphism Machine Translation*