Abstract
We show that a previously proposed algorithm for the N-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared with the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software Betty, and compared to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets with respect to running time, while Tiburon seems to be the more memory-efficient choice.
1 Introduction
Trees are standard in natural language processing (NLP) to represent linguistic analyses of sentences. Similarly, tree automata provide a compact representation for a set of such analyses. Bottom-up tree automata act as recognizing devices and process their input trees in a step-wise fashion, working upwards from the leaf nodes towards the root of the tree. For memory, they have a finite set of states, some of which are said to be accepting, and their internal logic is represented as a finite set of transition rules. A run of a tree automaton M on an input tree t is a mapping from the nodes of t to the states, which is compatible with the transition rules. In general, M can have several distinct runs on t, and accepts t if one of these runs maps the root of t to an accepting state. By equipping the transition rules with weights, M can be made to associate t with a likelihood or score: The weight of a run of M on t is the product of the weights of the transition rules used in the run, and the weight of t is the sum of the weights of all runs on t. This type of automaton is called a weighted tree automaton (wta) and is popular in, for example, dependency parsing and machine translation.
A central task is the extraction of the highest ranking trees with respect to a weighted tree automaton. For example, when the automaton represents a large set of intermediate solutions, it may be desirable to prune these down to a more manageable number before continuing the computation. This problem is known as the best trees problem. It is related to the best runs problem that asks for highest-ranking runs of the automaton on not necessarily distinct trees. The computational difficulty of the N-best trees problem depends on the algebraic domain from which weights are taken. Here we assume this domain to be the tropical semiring. Hence, the weight of a tree t is the minimal weight of any run on t, and the weight of a run is the sum of the weights of the transitions used in the run, which are non-negative real numbers. The tropical semiring is particularly common in speech and text processing (Benesty, Sondhi, and Huang 2008), since probabilistic devices can be modeled using negative log likelihoods. Moreover, the semiring has the advantage of being extremal: The sum of two elements a and b always equals one of a and b. As a consequence, it is not necessary to consider all runs of an automaton on an input tree to find the weight of the tree, as its weight is equal to the weight of the optimal run. (In the non-extremal case, the problem is NP-complete even for strings (Lyngsø and Pedersen 2002).) The N-best trees problem for a given wta M over an extremal semiring can be solved indirectly by computing a list of N′ best runs for M, for a sufficiently large number N′, and outputting the corresponding trees while discarding previously outputted trees. A complicating factor with this approach, however, is that M can have exponentially many runs on a single tree, so N′ may have to be very large to guarantee that the output contains N distinct trees.
Best trees extraction is useful in any application that includes some type of re-ranking of hypotheses. One example that makes use of best trees extraction is the work by Socher et al. (2013) on syntactical language analysis. The team of authors improve the Stanford parser by composing a probabilistic context-free grammar (PCFG) with a recurrent neural network (RNN) that learns vector representations. Intuitively, each nonterminal in the PCFG is associated with a continuous vector space. The vector space induces an unbounded refinement of the category represented by the nonterminal into subcategories, and the RNN computes transitions between such vectors. For efficiency reasons, the device is not applied to the input sentence directly. Instead, the N = 200 highest-scoring parse trees with respect to the PCFG are computed, whereupon the RNN is used to rerank these to find the best parse tree. The work has raised interest in hybrid finite-state continuous-state approaches (see, e.g., the work by Zhao, Zhang, and Tu (2018)), and underlines the value of the N-best problem in language processing.
In previous work, Björklund, Drewes, and Zechner (2019) generalized an N-best algorithm by Mohri and Riley (2002) from strings to trees, resulting in the algorithm Best Trees v.1. Intuitively, the algorithm performs a lazy implicit determinization and uses a priority queue to output N best trees in the right order. The running time of Best Trees v.1 was shown to be in , where m and n are the numbers of transitions and states of M, respectively, and r is the maximum number of children (the rank) of symbols in the input alphabet. Best Trees v.1 was evaluated empirically in Björklund, Drewes, and Jonsson (2018) against the N best runs algorithm by Huang and Chiang (2005), which represents the state of the art. Although Büchse et al. (2010) proved that the algorithm by Huang and Chiang works for cyclic input wtas and generalized it by extending it to structured weight domains, the core idea of the algorithm remains the same. The algorithm by Huang and Chiang (2005) is implemented in the widely referenced Tiburon toolkit (May and Knight 2006). From here on, we refer to this implementation simply as Tiburon, even though the best runs procedure is only one out of many that the toolkit has to offer. The conclusion of Björklund, Drewes, and Jonsson (2018) was that Best Trees v.1 is faster if the input wtas exhibit a high degree of nondeterminism, whereas Tiburon is the better option when the input wtas are large but essentially deterministic.
We now improve Best Trees v.1 by exploring the search space in a more structured way, resulting in the algorithm Best Trees. In Best Trees v.1, all assembled trees were kept in a single queue. In this work, we split the queue into as many queues as there are transitions in the input automaton. The queue Kτ of transition τ contains trees that are instantiations of τ, that is, trees with a run that applies τ at the root. This makes it possible to improve the strategy to prune the queue that was used by Björklund, Drewes, and Zechner (2019), and avoid pruning altogether. The intuition is simple: To assemble N distinct output trees, at most N instantiations of any one transition may be needed. We furthermore assemble the instantiations of τ in a lazy fashion, constructing an instantiation explicitly only when it is dequeued from Kτ. We formally prove the correctness of Best Trees and derive an upper bound on its running time, namely, .
In addition to solving the best trees problem, Best Trees can also solve the best runs problem by removing the control structure that makes it discard duplicate trees (see Section 5). In this article, we make use of this possibility to compare this algorithm, implemented as Betty, with Tiburon on the home turf of the latter, that is, with respect to the computation of best runs rather than best trees. For our experiments, we use both largely deterministic wtas from a machine translation project and from Grammatical Framework (Ranta 2011), and more nondeterministic wtas that were artificially created to expose the algorithms to challenging instances. Our results show that Betty is generally more time efficient than Tiburon, despite the fact that the former is more general as it can also compute best trees (with almost the same efficiency as it computes best runs). Moreover, we perform a limited set of experiments measuring the memory usage of the applications, and can conclude that overall, the memory efficiency of Tiburon is slightly better than that of Best Trees.
1.1 Related Work
The proposed algorithm adds to a line of research that spans two decades. It originates with an algorithm by Eppstein (1998) that finds the N best paths from one source node to the remaining nodes in a weighted directed graph. When applied to graphs representing weighted string automata, the list returned by Eppstein’s algorithm may, in the case of nondeterminism, contain several paths that carry the same string, that is, the list is not guaranteed to be free from duplicate strings. Four years later, Mohri and Riley (2002) presented an algorithm that computes the N best strings with respect to a weighted string automaton and thereby creates duplicate-free lists. To reduce the amount of redundant computation, consisting in the exploration of alternative runs on one and the same substring, they incorporate the N shortest paths algorithm by Dijkstra (1959). Moreover, Mohri and Riley work with on-the-fly determinization of the input automaton M, which avoids the problem that the determinized automaton can be exponentially larger than M, but has at most one run on each input string. Jiménez and Marzal (2000) lift the problem to the tree domain by finding the N best parse trees with respect to a context-free grammar in Chomsky normal form for a given string. Independently, Huang and Chiang (2005) published an algorithm that computes the best runs for weighted hypergraphs (which are equivalent to weighted tree automata and weighted regular tree grammars), and that is a generalization of the algorithm by Jiménez and Marzal in that it does not require the input to be in normal form. Huang and Chiang combine dynamic programming and lazy evaluation to keep the number of intermediate computations small, and derive a smaller bound on the worst-case running time of their algorithm than Jiménez and Marzal do for theirs.
As previously mentioned, the algorithm by Huang and Chiang (2005) is implemented in the Tiburon toolkit by May and Knight (2006). Its initial usage was as part of a machine-translation pipeline, to extract the N best trees from a weighted tree automaton. In connection with this work, Knight and Graehl (2005) noted that there was no known efficient algorithm to solve this problem directly. As an alternative way forward, Knight and Graehl thus enumerate the best runs and discard duplicate trees, until sufficiently many unique trees have been found. Since Tiburon is highly optimized and the current state-of-the-art tool for best runs extraction, it is a natural choice of reference implementation for an empirical evaluation of our solution. The algorithm by Huang and Chiang (2005) was later generalized by Büchse et al. (2010) to allow a linear pre-order on the weights, as opposed to a total order. Büchse et al. (2010) also prove that the algorithm is correct on cyclic input hypergraphs, provided that the Viterbi algorithm for finding an optimal run (Jurafsky and Martin 2009) is replaced by Knuth’s algorithm (Knuth 1974).
Also, Finkel, Manning, and Ng (2006) remark on the lack of sub-exponential algorithms for the best trees problem. They propose an algorithm that approximates the solution when the automaton is expressed as a cascade of probabilistic tree transducers. The authors model the cascade as a Bayesian network and consider every step as a variable. This allows them to sample a set of alternative labels from each prior step, to propagate onward in the current step. The approximation algorithm runs in polynomial time in the size of the input device and the number of samples, but the convergence rate to the exact solution is not analyzed. To explain why their approach is preferable to finding N best runs, they extract the N = 50 best runs from the Stanford parser and observe that about half of the output trees are actually duplicates—enough to affect the outcome of the processing pipeline. Thus, they argue, extracting the highest ranking trees rather than the highest ranking runs is not only theoretically better, but is also of practical significance.
Finally, we note that the best trees problem also has applications outside of NLP. In fact, whenever we are considering a set of objects, each of which can be expressed uniquely by an expression in some particular algebra, and the set of expressions is the language of a wta, then the best trees algorithm can be used to produce the best objects. For instance, Björklund, Drewes, and Ericson (2016) propose a restricted class of hypergraphs that are uniquely described by expressions in a certain graph algebra, so the best trees algorithm makes it possible to find the optimal such graphs with respect to a wta. The reason why the representation needs to be unique is that otherwise we will have to check equivalence between objects as an added step. In the case of graphs, this would mean deciding graph isomorphism, which is not known to be tractable in general.
2 Preliminaries
We write ℕ for the set of nonnegative integers, ℕ+ for ℕ ∖ {0}, and ℝ+ for the set of non-negative reals; $\mathbb{N}^\infty$ and $\mathbb{R}_+^\infty$ denote ℕ ∪ {∞} and ℝ+ ∪ {∞}, respectively. For $n \in \mathbb{N}^\infty$, [n] = {i ∈ ℕ ∣ 1 ≤ i ≤ n}. Thus, in particular, [0] = ∅ and [∞] = ℕ+. The cardinality of a (countable) set S is written |S|. The n-fold Cartesian product of a set S with itself is denoted by $S^n$.
The set of all finite sequences over S is denoted by S*, and the empty sequence by λ. A sequence of l copies of a symbol s is denoted by sl. Given a sequence σ = s1⋯sn of n elements si ∈ S, we denote its length n by |σ|. Given an integer i ∈ [n], we write σi for the i-th element si of σ. For notational simplicity, we occasionally use sequences as if they were sets, for example, writing s ∈ σ to express that s occurs in σ, or S ∖ σ to denote the set of all elements of a set S that do not occur in the sequence σ.
A (commutative) semiring is a structure $(D, \oplus, \otimes, 0, 1)$ such that both $(D, \oplus, 0)$ and $(D, \otimes, 1)$ are commutative monoids, the semiring multiplication ⊗ distributes over the semiring addition ⊕ from both left and right, and 0 is an annihilator for ⊗, that is, 0 ⊗ d = 0 = d ⊗ 0 for all d ∈ D. In this article, we will exclusively consider the tropical semiring. Its domain is ℝ+ ∪ {∞}, with min serving as semiring addition and ordinary plus as semiring multiplication.
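To make the definition concrete, the tropical semiring can be written out in a few lines of Python. This is only an illustrative sketch; the names `ZERO`, `ONE`, `t_add`, and `t_mul` are chosen here and do not appear in the article.

```python
import math

ZERO = math.inf  # neutral element of semiring addition (min), annihilator of +
ONE = 0.0        # neutral element of semiring multiplication (+)

def t_add(a, b):
    """Semiring addition of the tropical semiring: the minimum."""
    return min(a, b)

def t_mul(a, b):
    """Semiring multiplication of the tropical semiring: ordinary addition."""
    return a + b

# The semiring is extremal: t_add(a, b) always returns one of its arguments.
print(t_add(3.0, 5.0), t_mul(3.0, 5.0))
```

Because `t_add` always returns one of its arguments, the weight of a tree is attained by a single optimal run, which is the property the algorithms in this article rely on.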
A ranked alphabet is a disjoint union of finite sets of symbols, $\Sigma = \bigcup_{k \in \mathbb{N}} \Sigma(k)$. For f ∈ Σ, the k ∈ ℕ such that f ∈ Σ(k) is the rank of f, denoted by rank(f). The set TΣ of ranked trees over Σ consists of all Σ-labeled trees t in which the rank of every node v ∈ dom(t) equals the rank of t(v). For a set T of trees we denote by Σ(T) the set of trees which have a symbol from Σ at their root, with direct subtrees in T, more precisely, {f[t1,…,tk] ∣ k ∈ ℕ, f ∈ Σ(k), and t1,…,tk ∈ T}.
A weighted tree language over the tropical semiring is a mapping from TΣ to ℝ+ ∪ {∞}, where Σ is a ranked alphabet. Weighted tree languages can be specified in a number of equivalent ways. Three of the standard ones, mirroring the ways in which regular string languages are traditionally specified, are weighted regular tree grammars, weighted tree automata, and weighted finite-state diagrams formalized as hypergraphs. The equivalence of the second and the third is shown explicitly in Jonsson (2021). All three have been used in the context of N-best problems: weighted regular tree grammars by May and Knight (2006), weighted tree automata by Björklund, Drewes, and Zechner (2019), and hypergraphs by Huang and Chiang (2005) and Büchse et al. (2010). In this article, we use weighted tree automata.
A weighted tree automaton (wta) over the tropical semiring is a system M = (Q, Σ, R, ω, qf) consisting of:
- a finite set Q of symbols of rank 0 called states;
- a ranked alphabet Σ of input symbols disjoint from Q;
- a finite set $R \subseteq \bigcup_{k \in \mathbb{N}} Q^k \times \Sigma(k) \times Q$ of transition rules;
- a mapping ω: R → ℝ+; and
- a final state qf ∈ Q.
From here on, we write $\tau\colon f[q_1,\dots,q_k] \xrightarrow{w} q$ to denote that τ = (q1,…,qk,f,q) ∈ R and ω(τ) = w, and consider R to be the set of these weighted rules, thus dropping the component ω from the definition of M.
A transition rule τ = (q1,…,qk,f,q) will also be viewed as a symbol of rank k, turning R into a ranked alphabet. We let tar(τ) denote the target state q of τ, src(τ) the sequence of source states q1…qk, and rank(τ) = rank(f). In addition, we view every state q ∈ Q as a symbol of rank 0.
We define the set runsM ⊆ TR∪Q of runs ρ of M, their input trees inputM(ρ), their intrinsic weights wtM(ρ), and their target states tar(ρ) inductively, as follows:
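- For every q ∈ Q, we have that q ∈ runsM with inputM(q) = q, wtM(q) = 0, and tar(q) = q.
- For every transition rule τ = (q1,…,qk,f,q) in R with ω(τ) = w and all runs ρ1,…,ρk ∈ runsM such that tar(ρi) = qi for all i ∈ [k], we let ρ = τ[ρ1,…,ρk] ∈ runsM with
$$\mathrm{input}_M(\rho) = f[\mathrm{input}_M(\rho_1),\dots,\mathrm{input}_M(\rho_k)], \qquad \mathrm{wt}_M(\rho) = w + \sum_{i \in [k]} \mathrm{wt}_M(\rho_i), \qquad \mathrm{tar}(\rho) = q.$$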
Throughout the rest of the article, we will generally drop the subscript M in runsM, inputM, and wtM, because the wta in question will always be clear from the context.
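The inductive definition of runs translates almost literally into code. The following Python sketch assumes a hypothetical encoding that is not taken from the article: a state is a string, a transition rule is a `Rule` tuple, and a composite run is a pair of a rule and a tuple of child runs.

```python
from collections import namedtuple

# Hypothetical encoding of a transition rule (q1,...,qk,f,q) with weight w.
Rule = namedtuple("Rule", ["src", "sym", "tar", "weight"])

def tar(run):
    """Target state of a run; a state q is itself a run with target q."""
    return run if isinstance(run, str) else run[0].tar

def wt(run):
    """Intrinsic weight: 0 for a state, else rule weight plus child weights."""
    if isinstance(run, str):
        return 0.0
    rule, children = run
    return rule.weight + sum(wt(child) for child in children)

def input_tree(run):
    """Input tree of a run, encoded as (symbol, tuple of subtrees)."""
    if isinstance(run, str):
        return run
    rule, children = run
    return (rule.sym, tuple(input_tree(child) for child in children))

def make_run(rule, children):
    """Build the run rule[children], checking that target states match src."""
    assert len(children) == len(rule.src)
    assert all(tar(c) == q for c, q in zip(children, rule.src))
    return (rule, tuple(children))

# Example: two rules of weight 1 over a single state "q".
r_a = Rule((), "a", "q", 1.0)
r_f = Rule(("q", "q"), "f", "q", 1.0)
leaf = make_run(r_a, [])
run = make_run(r_f, [leaf, leaf])
print(input_tree(run), wt(run), tar(run))
```

The example run uses three transitions of weight 1, so its intrinsic weight is the sum of the three rule weights.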
Given as input a wta M and an integer N ∈ ℕ, the N-best runs problem consists in computing a sequence of N runs of minimal weight according to M. More precisely, an algorithm solving the problem will output a sequence ρ1,ρ2,… of N pairwise distinct runs such that there do not exist i ∈ [N] and ρ ∈ runs ∖ {ρ1,…,ρi} with M(ρ) < M(ρi).
General Assumption. To make sure that the N-best runs problem always possesses a solution, and to simplify the presentation of our algorithms, we assume from now on that all considered wtas M have infinitely many runs ρ such that tar(ρ) = qf. In particular, TΣ is assumed to be infinite. Apart from simplifying some technical details, this assumption does not affect any of the reasonings in the paper.
Similarly to the N-best runs problem, the N-best trees problem for the wta M consists in computing a sequence of N pairwise distinct trees t1,t2,… in TΣ of minimal weight. In other words, we seek a sequence of trees such that there do not exist i ∈ [N] and t ∈ TΣ ∖ {t1,…,ti} with M(t) < M(ti). Note that the N-best trees problem always has a solution because we assume that TΣ is infinite.
The input trees of 10 best runs (left) in comparison with 10 best trees (right) for the wta in Figure 1.
| Best runs: input tree | Weight | Best trees: tree | Weight |
|---|---|---|---|
| a | 1 | a | 1 |
| f[a,a] | 3 | f[a,a] | 3 |
| f[a,a] | 3 | f[f[a,a],a] | 5 |
| f[a,a] | 3 | f[a,f[a,a]] | 5 |
| f[a,f[a,a]] | 5 | f[f[a,a],f[a,a]] | 7 |
| f[f[a,a],a] | 5 | f[f[f[a,a],a],a] | 7 |
| f[a,f[a,a]] | 5 | f[f[a,f[a,a]],a] | 7 |
| f[f[a,a],a] | 5 | f[a,f[f[a,a],a]] | 7 |
| f[a,f[a,a]] | 5 | f[a,f[a,f[a,a]]] | 7 |
| f[f[a,a],a] | 5 | f[f[a,f[a,a]],f[a,a]] | 9 |
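The difference between the two columns can be reproduced with the indirect approach of enumerating best runs and discarding duplicate input trees. The Python sketch below is an illustration only: the function names are hypothetical, and the list `best_runs` simply restates the left half of the table, with trees written as strings.

```python
from itertools import islice

def distinct_trees(runs):
    """Yield (weight, tree) pairs, skipping trees that were yielded before.

    `runs` must list the input trees of the best runs by ascending weight.
    """
    seen = set()
    for weight, tree in runs:
        if tree not in seen:
            seen.add(tree)
            yield weight, tree

def n_best_trees(runs, n):
    """Return at most n distinct trees, in the order in which they are found."""
    return list(islice(distinct_trees(runs), n))

# Input trees of the 10 best runs, as listed in the table.
best_runs = [
    (1, "a"),
    (3, "f[a,a]"), (3, "f[a,a]"), (3, "f[a,a]"),
    (5, "f[a,f[a,a]]"), (5, "f[f[a,a],a]"),
    (5, "f[a,f[a,a]]"), (5, "f[f[a,a],a]"),
    (5, "f[a,f[a,a]]"), (5, "f[f[a,a],a]"),
]
print(n_best_trees(best_runs, 10))
```

Asking for 10 trees yields only the 4 distinct trees among these runs, which illustrates why the number N′ of runs may have to be much larger than N.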
A finite-state diagram representing an example wta. The input alphabet is Σ(0) ∪ Σ(2), where Σ(0) = {a} and Σ(2) = {f}. Circles represent states (double circles indicate the final state, i.e., qf = q0), and squares represent transitions (all of weight 1). The consumed input symbols are shown inside the squares. Undirected edges connect states in left-hand sides to consumed input symbols in counter-clockwise order, starting at noon. Directed arcs point from the consumed symbol to the right-hand side state of the transition in question.
We end this section by discussing the choice of our particular weight structure, the tropical semiring. In the literature on wta, Definitions 1 and 2 are generalized to wta over arbitrary commutative semirings, and their resulting weighted tree languages, simply by replacing min and + by ⊕ and ⊗, respectively. The tropical semiring used in Definitions 1 and 2 is frequently used in natural language processing. Equally popular is the Viterbi semiring that acts on the unit interval of probabilities, with maximum and standard multiplication as operations. In the setting discussed here, both semirings are equivalent. To see this, transform a wta M over the Viterbi semiring to a wta M′ over the tropical semiring by simply mapping every weight p of a transition rule of M to $-\log p$, that is, taking negative logarithms everywhere. Since $-\log(p \cdot p') = (-\log p) + (-\log p')$ and $-\log \max(p, p') = \min(-\log p, -\log p')$, it holds that M(t) (now calculated using the Viterbi semiring operations) is equal to $\exp(-M'(t))$ for all trees t. It follows that the trees t1,…,tN form an N-best list according to M (now looking for trees with maximal weights) if and only if they form an N-best list according to M′ in the sense defined above.
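The correspondence between the two semirings can be checked numerically. The following Python sketch uses a hypothetical helper `to_tropical` and made-up probabilities for three trees; it only illustrates that taking negative logarithms turns products into sums and reverses the ranking order.

```python
import math

def to_tropical(p):
    """Map a Viterbi weight p in [0, 1] to the tropical weight -log p."""
    return math.inf if p == 0.0 else -math.log(p)

# Products of probabilities become sums of tropical weights.
p, p2 = 0.5, 0.25
print(to_tropical(p * p2), to_tropical(p) + to_tropical(p2))

# Hypothetical tree probabilities: ranking by maximal probability and
# ranking by minimal tropical weight give the same order.
scores = {"t1": 0.2, "t2": 0.5, "t3": 0.3}
by_probability = sorted(scores, key=lambda t: -scores[t])
by_tropical = sorted(scores, key=lambda t: to_tropical(scores[t]))
print(by_probability == by_tropical)
```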
3 The Improved Best Trees Algorithm
In this section, we explain how the algorithm in Björklund, Drewes, and Zechner (2019) can be made lazier, and hence more efficient, by exploring the search space with respect to transitions rather than states. From here on, let M = (Q,Σ,R,qf) be a wta with m transition rules, n states, and a maximum rank of r among the symbols in Σ.
3.1 Best Trees v.1
We can now reproduce the pseudocode of the base algorithm from Björklund, Drewes, and Zechner (2019) in Algorithm 1. Given a wta M and N ∈ℕ, it solves the N-best trees problem. After outputting i ∈ [N] trees, the set of trees enqueued in line 23 is pruned so that for every q ∈ Q, at most N − i trees are kept for which q is an optimal state. The function expand(T,t), which computes the trees to be enqueued in each step, returns the set of all trees in Σ(T) such that the “new” tree t occurs at least once among the direct subtrees of the root.
The correctness of this approach is formally proved in Björklund, Drewes, and Zechner (2019).
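The function expand(T,t) described above can be sketched as follows. The encoding is an assumption made for illustration: a tree is a pair of a symbol and a tuple of subtrees, the ranked alphabet is a dictionary from symbols to ranks, and T is expected to contain t already. The sketch enumerates Σ(T) by brute force and is not meant to reflect how an efficient implementation would proceed.

```python
from itertools import product

def expand(alphabet, T, t):
    """Return all trees f[t1,...,tk] with subtrees in T that use t at least once.

    alphabet: dict mapping each symbol to its rank (hypothetical encoding).
    T: collection of trees, assumed to contain t.
    """
    pool = list(T)
    result = set()
    for symbol, rank in alphabet.items():
        for subtrees in product(pool, repeat=rank):
            if t in subtrees:  # the "new" tree must occur among the children
                result.add((symbol, subtrees))
    return result

alphabet = {"a": 0, "f": 2}
a = ("a", ())
faa = ("f", (a, a))
print(expand(alphabet, {a}, a))
print(len(expand(alphabet, {a, faa}, faa)))
```

Symbols of rank 0 never contribute, since a tree without children cannot contain t as a direct subtree.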
3.2 Computation of Best Contexts
We now recall the computation of best contexts 𝕔q, q ∈ Q, to the extent needed to understand the improved algorithm. This computation consists of two phases. In the first phase, a 1-best tree is computed for each state q ∈ Q, that is, a tree t ∈ TΣ that minimizes Mq(t). The second phase then computes the actual best context 𝕔q for every q ∈ Q, in other words, a context c ∈ CΣ with Mq(c) = 𝕨q.
The first phase can be accomplished using a dynamic programming algorithm by Knuth (1977) that computes 1-best runs. (Note that if ρ is a 1-best run, then input(ρ) is a 1-best tree.) The algorithm maintains a min-priority queue of all transition rules and collects, in |R| iterations, the desired best runs ρq. Initially, ρq is undefined for every q ∈ Q. The value determining the priority of a transition rule τ = (q1,…,qk,f,q) of weight w is ∞ if any $\rho_{q_i}$ (i ∈ [k]) is still undefined. Otherwise, it is $w + \sum_{i \in [k]} \mathrm{wt}(\rho_{q_i})$, that is, the weight of the run $\tau[\rho_{q_1},\dots,\rho_{q_k}]$. The algorithm repeatedly dequeues the highest-priority element τ from the queue. If ρq is still undefined for q = tar(τ), it sets $\rho_q = \tau[\rho_{q_1},\dots,\rho_{q_k}]$ and records input(ρq) as the 1-best tree for q. It then updates the priorities of the transition rules that have q among their source states and repeats. We note that the 1-best trees are discovered by the algorithm in the order of ascending weight. This observation will soon become important for the initialization phase of Algorithm 2.
As a side remark, we note that the sets of 1-best runs and 1-best trees determined by this algorithm are subtree closed, meaning that every subtree of a 1-best run ρq is itself one of the runs ρq′, and every subtree of a 1-best tree is itself one of the 1-best trees. It follows that the entire set of these trees can be stored as a maximally shared directed acyclic graph with |Q| nodes.
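The first phase can be sketched in Python as follows. This is a simplified variant rather than the exact procedure described above: instead of keeping all rules in the queue and updating their priorities, a rule is inserted only once all of its source states have a best run, which is when its priority becomes finite. The `Rule` encoding and all function names are assumptions made for illustration. The returned dictionary stores one rule per state and thus mirrors the shared acyclic representation with |Q| nodes.

```python
import heapq
from collections import namedtuple

Rule = namedtuple("Rule", ["src", "sym", "tar", "weight"])  # hypothetical encoding

def one_best_runs(rules):
    """Return {state: (weight, rule)}: weight and root rule of a 1-best run."""
    best = {}
    # Number of distinct source states of each rule that have no best run yet.
    missing = [len(set(r.src)) for r in rules]
    uses = {}  # state -> indices of the rules having it as a source state
    for i, r in enumerate(rules):
        for q in set(r.src):
            uses.setdefault(q, []).append(i)
    heap = [(r.weight, i) for i, r in enumerate(rules) if not r.src]
    heapq.heapify(heap)
    while heap:
        w, i = heapq.heappop(heap)
        q = rules[i].tar
        if q in best:
            continue  # a run of no greater weight was already found for q
        best[q] = (w, rules[i])
        for j in uses.get(q, []):
            missing[j] -= 1
            if missing[j] == 0:  # all sources defined: priority becomes finite
                r = rules[j]
                total = r.weight + sum(best[p][0] for p in r.src)
                heapq.heappush(heap, (total, j))
    return best

def best_tree(best, q):
    """Unfold the shared representation into the 1-best tree for state q."""
    rule = best[q][1]
    if not rule.src:
        return rule.sym
    return rule.sym + "[" + ",".join(best_tree(best, p) for p in rule.src) + "]"

rules = [
    Rule((), "a", "p", 1.0),
    Rule(("p", "p"), "f", "q", 2.0),
    Rule(("p",), "g", "q", 5.0),
    Rule(("q",), "g", "p", 0.5),
]
best = one_best_runs(rules)
print(best_tree(best, "q"), best["q"][0])
```

Since all weights are non-negative, the weight of a rule is final at the moment it enters the heap, which is why states are fixed in order of ascending weight.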
Now, having computed the paths of minimal weight in this graph using Dijkstra’s algorithm, consider the edge sequence π of least weight from qf to a state q′. Then 𝕔q′ = 𝕔(π), where 𝕔(π) is defined recursively as follows:
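- If π = λ, then q′ = qf and we set 𝕔(π) to the trivial context, which consists of the placeholder node only.
- If π = π′e where e = (q, τ, q′) for some transition rule τ = (q1,…,qk,f,q), we choose some i ∈ [k] with qi = q′, and set 𝕔(π) to the context obtained from 𝕔(π′) by replacing its placeholder with the tree f[t1,…,tk], where ti is the new placeholder and, for every j ≠ i, tj is the 1-best tree computed for qj.
We note here that this way of computing best contexts results in contexts of a very peculiar kind: Every such context 𝕔q consists of a “spine” leading to the placeholder node, and all subtrees that branch out from this spine are 1-best trees for the required states q′.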
3.3 Transition-Based Best Trees Computation
The improved algorithm for computing N best trees hinges on the observation that to generate N best trees, at most N distinct instantiations of each transition rule are needed. Here, an instantiation of a rule is a tree f[t1,…,tk]. Furthermore, the algorithm creates these instantiations in a lazy fashion.
To this end, we build sequences Tq of N′ ≤ N best trees for each state q. Most importantly, a separate priority queue Kτ is kept for each transition rule τ = (q1,…,qk,f,q) in R. Every tree in this queue is of the form $f[(T_{q_1})_{i_1},\dots,(T_{q_k})_{i_k}]$ and is hence uniquely determined by the tuple (i1,…,ik), each ij working as a pointer into the sequence $T_{q_j}$. Hence, each such tuple can be understood as an abstract instruction of how to instantiate τ by previously dequeued trees. As outlined in Section 3.5, an efficient implementation of the algorithm can make this assembly “just in time,” so as to avoid unnecessary work.
In each iteration, the algorithm chooses the highest-priority element across all of the queues Kτ, where the priority order (to be described later) is similar to <K, but improved by replacing the use of the lexical order by a more goal-oriented component. To efficiently pick the highest-priority element across all queues, we organize the queues in a meta-queue K′, where the priority of every Kτ in K′ is given by the priority of its highest-priority element. This organization is schematically illustrated in Figure 2.
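One way to realize such a two-level organization is sketched below in Python. The class `MetaQueue` and its methods are hypothetical, priorities are simplified to plain numbers (the actual priority order is described later in the article), and stale meta-queue entries are skipped lazily instead of being updated in place.

```python
import heapq

class MetaQueue:
    """Two-level min-priority queue: one heap per rule plus a heap over them."""

    def __init__(self):
        self.queues = {}  # rule id -> heap of (priority, index tuple)
        self.meta = []    # heap of (head priority, rule id); may hold stale entries

    def push(self, rule_id, priority, u):
        queue = self.queues.setdefault(rule_id, [])
        heapq.heappush(queue, (priority, u))
        if queue[0][0] == priority:
            # u is (tied for) the best element of its queue: announce this
            # priority in the meta-queue.
            heapq.heappush(self.meta, (priority, rule_id))

    def pop(self):
        """Remove and return (rule id, priority, u) of least priority, or None."""
        while self.meta:
            priority, rule_id = heapq.heappop(self.meta)
            queue = self.queues[rule_id]
            if not queue or queue[0][0] != priority:
                continue  # stale entry: the head of this queue has changed
            _, u = heapq.heappop(queue)
            if queue:
                heapq.heappush(self.meta, (queue[0][0], rule_id))
            return rule_id, priority, u
        return None

mq = MetaQueue()
mq.push("t1", 5.0, (1, 1))
mq.push("t2", 3.0, (1,))
mq.push("t1", 4.0, (2, 1))
print(mq.pop(), mq.pop(), mq.pop(), mq.pop())
```

The lazy treatment of stale entries keeps the sketch short; a heap that supports decreasing the key of an entry would serve the same purpose.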
The algorithm keeps a dedicated priority queue Kτ for each transition rule τ, and the collection of these queues is itself arranged in a main priority queue K′ (drawn on the right).
The algorithm itself is outlined in Algorithm 2. The sequences Tq mentioned above are the previously dequeued best instantiations of rules τ with tar(τ) = q. Thus, as mentioned before, Tq is a prefix of a solution of the N-best problem for the wta Mq.
For every transition rule τ = (q1,…,qk,f,q) in R and every index tuple u = (i1,…,ik) ∈ $\mathbb{N}^k$, the instantiated transition rule is denoted by τ[u]. As previously mentioned, it is defined as the tree $f[(T_{q_1})_{i_1},\dots,(T_{q_k})_{i_k}]$, but since this is well defined only if $i_j \in [|T_{q_j}|]$ for every j ∈ [k], we let τ[u] be undefined otherwise.
Each queue Kτ in Algorithm 2 is a min-priority queue in which the least priority is assigned to elements u such that τ[u] is undefined. This reflects that we do not yet know the weight of the instantiation of τ with respect to u, and that every instantiation for which we do know the weight of the resulting tree can be shown to be preferable.
Now, let q ∈ Q and let τ ∈ R be a transition rule of rank k with tar(τ) = q. Having computed best contexts (and, in the process, also best trees) for all states, Tq is initialized to contain only the best tree for q. The queue Kτ is initialized to contain only $1^k$ if τ ≠ ρq(λ), since the latter means that τ was not used to build the best tree for q, which means that the first instantiation of τ is still waiting to be added to Tq at a suitable position. Otherwise, Kτ is initialized to contain all successors of $1^k$ because, by the way in which the best tree for q was constructed and the definition of τ[u], the instantiation $\tau[1^k]$ is exactly this tree, and we are now looking for the next best instantiation of τ.
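The index tuples and their instantiation can be sketched as follows. The sketch assumes that the direct successors of a tuple are obtained by incrementing exactly one component, and it uses a hypothetical encoding in which `T` maps each state to its list Tq and trees are pairs of a symbol and a tuple of subtrees. Avoiding that the same tuple is enqueued twice is left to the caller.

```python
def successors(u):
    """Direct successors of an index tuple: increment exactly one component."""
    return [u[:j] + (u[j] + 1,) + u[j + 1:] for j in range(len(u))]

def instantiate(symbol, src, u, T):
    """Return the tree symbol[(T[q1])_i1, ..., (T[qk])_ik], or None if undefined.

    Indices are 1-based, as in the text; the instantiation is undefined as
    long as some index points beyond the end of the corresponding list.
    """
    children = []
    for q, i in zip(src, u):
        if not 1 <= i <= len(T[q]):
            return None
        children.append(T[q][i - 1])
    return (symbol, tuple(children))

T = {"q": [("a", ())]}  # only the best tree for q is known so far
print(instantiate("f", ("q", "q"), (1, 1), T))
print(instantiate("f", ("q", "q"), (2, 1), T))
print(successors((1, 1)))
```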
After initializing Tq and Kτ, the algorithm enters the main loop. This loop is executed until N trees have been outputted. The first step in the loop is to extract a minimal u from the set of all queues Kτ, τ ∈ R (by first dequeuing Kτ from K′ and then u from Kτ). Thanks to our assumption that M possesses infinitely many runs ending in qf, it can be shown that Δτ(u) ∈ ℕ, that is, the tree τ[u] is defined (see the next section). The tree τ[u] is appended at the end of the list $T_{\mathrm{tar}(\tau)}$ of the best (smallest-weighted) trees that reach the target state tar(τ) of τ. If tar(τ) = qf and τ[u] is “new,” that is, has not been outputted before, then τ[u] is outputted now, and the counter c tracking the number of output trees is increased. Note that, in contrast to Algorithm 1, we have to check whether τ[u] was outputted before, because some τ′[u′] outputted earlier may actually have been equal to τ[u]. However, this can only be the case if τ ≠ τ′, and can thus only happen m times for every tree.
3.4 Correctness
Let us now show that Algorithm 2 is correct. To simplify the reasoning, we shall first consider a variant of the algorithm, referred to as Algorithm 2′, obtained by removing the inequality |Ttar(τ)| < N from the condition on line 14. In the following lemma, we say that Tq = t1⋯tm is appropriate if t1,…,tm is a solution of the m-best trees problem for Mq.
Lemma 1. Consider a run of Algorithm 2′. During every execution of the main loop the following statements hold:
- (1) When the loop is entered, each Tq (q ∈ Q) is appropriate.
- (2) When line 13 has been executed with tar(τ) = q, there do not exist any q′ ∈ Q and t ∈ TΣ ∖ Tq′ such that Δq′(t) < Δτ(u) (where τ and u denote the values of the corresponding variables in Algorithm 2′ at that point).
We use (1) as a loop invariant. Because it holds before the first execution of the loop (as each Tq consists of a single best tree with respect to Mq), we need to show that, under the condition that (1) holds when the loop is entered, statement (2) holds as well and, when line 21 has been executed, (1) still holds.
Because , the tuple u′ has not yet been dequeued from Kτ′. Thus, while u′ itself may not yet be in Kτ′, Kτ′ must contain some u″ = (j1′,…,jℓ′) with ji′ ≤ ji for all i ∈ [ℓ]. This is because when an element is dequeued on line 13 then all of its direct successors will be enqueued on line 21. In particular, Kτ′ cannot be empty at the start of an iteration, as long as there is some u′ ∈ℕℓ that has not yet been dequeued from Kτ′ (which will always be the case if ℓ > 0).
Note that ji′ ≤ ji for all i ∈ [ℓ] implies that Δτ′(u″) ≤ Δτ′(u′), since the sequences Tq involved are appropriate. Hence, the inequality Δτ′(u″) ≤ Δτ′(u′) < Δτ(u) contradicts the assumption that u was dequeued on line 13.
Lemma 2. Algorithm 2′ computes a solution to the N-best trees problem.
We first observe that, due to the output condition on line 17 of Algorithm 2′, the sequence of trees outputted by the algorithm does not contain repetitions.
Next, whenever a tree s = τ[u] is outputted on line 18, we show that M(t) ≥ M(s) for all trees t ∈ TΣ that have not yet been outputted. Let q = tar(τ). Then q = qf and thus the best context 𝕔q is the trivial one, 𝕨q = 0, and M(s) = Δτ(u). Now, consider any t ∈ TΣ that has not yet been outputted. Then we have $t \notin T_{q_f}$. Consequently, by Lemma 1, M(t) ≥ Δτ(u) = M(s), as required.
To complete the proof, we have to show that there cannot be an infinite number of iterations without any tree being outputted. We show the following, stronger statement:
Claim 1. Let $\delta = \max_{q \in Q} \delta_q$, where δq is defined as in Equation 2 for all q ∈ Q. At any point in time during the execution of Algorithm 2′, it takes at most δ iterations until the selected transition rule τ satisfies tar(τ) = qf.
there is no u″ in any of the queues Kτ″ with Δτ″(u″) < Δτ(u) (Lemma 1(2)),
, and
δq′ = δq − 1.
It follows that the queue Kτ″ from which a tuple is dequeued on line 12 at the start of the next iteration satisfies δtar(τ″) = δq − 1, thus bounding the number of iterations until this quantity reaches 0 from above by δ.
Now, to finish the proof, note that τ[u]≠τ[u′] whenever u≠u′ because the sequences Tq, q ∈ Q, do not contain repetitions. Because, furthermore, no tuple u is enqueued twice in Kτ, line 18 will be reached after at most δ iterations.
We finally show that Algorithm 2 is correct as well.
Algorithm 2 computes a solution to the N-best trees problem.
Consider an execution of Algorithm 2′ and assume that we assign every tree a color, red or black, where black is the default color. (The color attribute of a subtree may differ from that of the tree itself.) Suppose that, in an iteration of the main loop, we have and u = (i1,…,ik), and thus τ[u] = f(t1,…,tk) where for all j ∈ [k]. If |Tq|≥ N on line 14 and we append τ[u] to it on line 15, we color τ[u] red while its subtrees t1,…,tk keep their colors as given by their positions in .
Now, assume that some Tq contains a tree t = f[t1,…,tk] such that tj is red for some j ∈ [k]. We show that this implies that t is red. We know that t was appended to Tq as a tree of the form τ[u] for some u = (i1,…,ik) with ij > N (because of the assumption that tj is red). We know also that earlier iterations have dequeued all tuples from Kτ of the form ui = (i1,…,ij−1,i,ij +1,…,ik) for i = 1,…,N. (This is an immediate consequence of Lemma 2 because, by Lemma 1(1), Δτ(ui) < Δτ(u) for all i ∈ [k].) Since the τ[ui] are pairwise distinct (as they differ in the j-th direct subtree) this means that |Tq|≥ N before τ[u] was enqueued, thus proving that t is red, as claimed.
Because Algorithm 2′ terminates when , none of the trees in will ever be red. By the above, this implies that all subtrees of trees in are black as well. As subtrees inherit their color from the Tq they are taken from, this shows that no red tree occurring in an execution of Algorithm 2′ can ever have an effect on the output of the algorithm. Hence, the output of Algorithm 2 is the same as that of Algorithm 2′ and the result follows from Lemma 2.
3.5 Time Complexity
Recall that the input M = (Q,Σ,R,qf) is assumed to be a wta with m transition rules, n states, and a maximum rank of r among its symbols. In the complexity analysis, we consider an efficient implementation of Algorithm 2 along the lines illustrated in Figure 2, with priority queues based on heaps (Cormen et al. 2009). This enables us to implement the following details efficiently:
Consider a transition . At a given stage of the algorithm, some of the elements u = (u1,…,uk) in Kτ may contain elements ui such that , that is, τ[u] is still undefined and hence . Thus, Δτ(u) must be decreased from to its final value in when has reached ui for all i ∈ [k]. For this, we record for every p ∈ Q and for |Tp| < j < |N| a list of all u ∈ Kτ such that τ is as above, qi = p for some i ∈ [k], and ui = j. When |Tp| reaches the value j, this list is used to adjust the priority of each u on that list to the new value of Δτ(u) (which is either still or has reached its final value).
The queue K′ that contains the individual queues Kτ as elements is implemented in the straightforward way, also using priority queues based on heaps. As described earlier, the priority between Kτ and Kτ′ is given by comparing their top-priority elements: if these elements are u and u′, respectively, then Kτ takes priority over Kτ′ if (Δτ(u),δtar(τ)) < (Δτ′(u′),δtar(τ′)). If one of the queues Kτ runs empty (which can only happen if rank(τ) = 0), then Kτ is removed from K′.
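The two-level queue organization described above can be sketched as follows. This is only an illustration under simplifying assumptions, not the Betty implementation: the names `build_outer_queue`, `dequeue_best`, `rule_queues`, `target`, and `delta` are hypothetical, weights are plain floats, and the outer heap stands in for K′ by storing one key (top weight, δ of the target state) per non-empty rule queue.

```python
import heapq

def build_outer_queue(rule_queues, target, delta):
    """Build the outer heap (a stand-in for K') from the per-rule heaps.

    rule_queues: rule name -> heap of (weight, index tuple)
    target:      rule name -> target state of the rule
    delta:       state -> distance value used to break ties
    """
    outer = [(heap[0][0], delta[target[rule]], rule)
             for rule, heap in rule_queues.items() if heap]
    heapq.heapify(outer)
    return outer

def dequeue_best(outer, rule_queues, target, delta):
    """Pop the tuple with the smallest (weight, delta) key over all rule queues."""
    _, _, rule = heapq.heappop(outer)
    weight, u = heapq.heappop(rule_queues[rule])
    # A rule queue that has run empty is dropped from the outer heap;
    # otherwise it is re-inserted under the key of its new top element.
    if rule_queues[rule]:
        new_top = rule_queues[rule][0][0]
        heapq.heappush(outer, (new_top, delta[target[rule]], rule))
    return rule, weight, u
```

In the algorithm itself, the successors of the dequeued tuple would be enqueued into the rule queue before it is re-inserted into the outer heap; the sketch omits this step to keep the comparison of keys in focus.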
We now establish an upper bound on the running time of the algorithm.
For the proof, we look at the maximum number of instantiations that we encounter during a run of the algorithm.
Because K′ contains (at most) the m queues Kτ, enqueuing into K′ is in O(log m). Furthermore, each rule is limited to N instantiations, due to the fact that the sequences Tq do not contain repetitions and thus τ[u]≠τ[u′] for u≠u′. This implies that the maximum number of iterations of the main loop is Nm, yielding an upper bound of O(Nm log m) for the management of K′.
Next, we have the rule-specific queues Kτ. Each time a tuple u ∈ ℕ^k is dequeued from Kτ, at most |inc(u)| = k ≤ r new tuples are enqueued. In total, the creation of these k tuples of size k each takes k^2 ≤ r^2 operations. Thus, there will be at most Nr elements in Kτ for any τ, which gives us a time bound of per queue operation, a total time of for the management of Kτ, and thus a total of for the m queues Kτ altogether.
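The successor construction counted in this step can be illustrated with a short sketch. It assumes that inc(u) consists of the tuples obtained by incrementing exactly one coordinate of u, as the bound |inc(u)| = k suggests; the function names, the `seen` set used to avoid enqueuing a tuple twice, and the `weight_of` callback are hypothetical choices for this example.

```python
import heapq

def successors(u):
    """Direct successors of an index tuple: add 1 to exactly one coordinate.

    For a tuple of length k this builds k tuples of length k each,
    which is where the k^2 operations in the analysis come from.
    """
    return [u[:i] + (u[i] + 1,) + u[i + 1:] for i in range(len(u))]

def enqueue_successors(queue, seen, u, weight_of):
    """Push every not yet seen successor of u onto the heap `queue`."""
    for v in successors(u):
        if v not in seen:          # each tuple enters the queue at most once
            seen.add(v)
            heapq.heappush(queue, (weight_of(v), v))
```

For instance, `successors((1, 1, 2))` yields `(2, 1, 2)`, `(1, 2, 2)`, and `(1, 1, 3)`, while a rule of rank 0 has the empty tuple as its only instantiation and therefore no successors.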
To complete the analysis, we have to argue that the time that needs to be spent to check whether τ[u] has been outputted before can be made negligible. We do this by implementing the forest of outputted trees in such a way that equal subtrees are shared. Hence, trees are equal if (and only if) they have the same address in memory. Assuming a good hashing function, the construction of τ[u] from the (already previously constructed) trees referred to by u can essentially be done in constant time. Now, if we maintain with every previously constructed tree a flag indicating whether that tree has already been outputted once, the test boils down to constructing τ[u] (which would return the already existing tree if it did exist) and checking the flag.
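The sharing scheme described in this paragraph is commonly known as hash-consing. The following is a minimal sketch of the idea, with hypothetical class and method names (`Node`, `Forest`, `make`, `first_output`); it is meant to illustrate why the duplicate test reduces to a table lookup plus a flag check, not to reproduce any existing implementation.

```python
class Node:
    """A tree node; equal trees are represented by one shared object."""
    __slots__ = ("symbol", "children", "outputted")

    def __init__(self, symbol, children):
        self.symbol = symbol
        self.children = children
        self.outputted = False

class Forest:
    """Table of all trees built so far, keyed by symbol and child identities."""

    def __init__(self):
        self._table = {}

    def make(self, symbol, children=()):
        # Children are already shared, so their identities determine equality.
        # The table keeps every node alive, which keeps id() values stable.
        key = (symbol, tuple(id(c) for c in children))
        node = self._table.get(key)
        if node is None:
            node = Node(symbol, tuple(children))
            self._table[key] = node
        return node

    def first_output(self, node):
        """Return True exactly once per distinct tree."""
        if node.outputted:
            return False
        node.outputted = True
        return True
```

With this representation, constructing a tree of rank k costs one dictionary lookup on a key of length k + 1, independently of the size of the subtrees.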
We have not performed a detailed space complexity analysis, but because we know that the logarithmic factors are due to heap operations, we can conclude that the memory space consumption of the algorithm is in O(Nm).
In practical applications, we expect to see Algorithm 2 used in two ways. The first is, as discussed in the Introduction, the situation in which N best trees are computed for a relatively small value of N, for example, N = 200 as suggested by Socher et al. (2013). Here, based on a comparison of the upper bounds on the running time, and assuming that they are reasonably tight, Algorithm 2 will outperform Algorithm 1 if N is larger than , which we believe is the common case. In the second scenario, Algorithm 2 is invoked with a very large N to enumerate the trees recognized by the input automaton, outputting them in ascending order by weight. In this scenario, our exploration by transition rule is even more valuable, as it saves redundant computation.
4 The Algorithm Best Runs
We now recall the algorithm by Huang and Chiang (2005), which we henceforth will refer to as Best Runs. To facilitate comparison, we express Best Runs in terms of wta.
The type of wta used as input to Best Runs differs slightly from the one used in Definition 1 in that the admissible weight structures are not restricted to the tropical semiring. Instead, each transition rule τ: f[q1,…,qk] → q is equipped with a weight function . The definition of the weight of a run ρ = τ[ρ1,…,ρk] is then changed to wt(ρ) =wtτ(wt(ρ1),…,wt(ρk)).
For the algorithm to work, these weight functions wtτ are required to be monotonic: wtτ(w1,…,wk) ≥wtτ(w1′,…,wk′) whenever wi ≥ wi′ for all i ∈ [k]. Definition 1, which Best Trees is based on, corresponds to the special case where each of these weight functions is of the form for a constant w. In other words, the weight functions Best Trees can work with are a restriction of those Best Runs works on. This will be discussed in Section 5.
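To make the relation between the two settings concrete, the sketch below computes run weights from per-rule weight functions. It assumes that the special case used by Best Trees is the additive one suggested by the tropical semiring, that is, a constant rule weight added to the sum of the subrun weights; this reading, the helper names, and the encoding of a run as a pair (rule, list of subruns) are assumptions made for the example.

```python
def additive_weight(w):
    """Weight function of a rule with constant weight w (assumed additive form)."""
    def wt(*child_weights):
        return w + sum(child_weights)
    return wt

def max_plus_weight(w):
    """A monotonic weight function that is not of the additive form."""
    def wt(*child_weights):
        return w + max(child_weights, default=0.0)
    return wt

def run_weight(run, weight_fn):
    """Weight of a run encoded as (rule, [subruns]); weight_fn maps rules to functions."""
    rule, subruns = run
    return weight_fn[rule](*(run_weight(s, weight_fn) for s in subruns))
```

Both functions are monotonic in the sense required above: increasing the weight of any subrun can never decrease the weight of the run built on top of it.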
The input to Best Runs is a pair (M,N), where M is a wta with m transition rules and n states, and N ∈ℕ. The algorithm is outlined in Algorithm 3. Line 2 is a preprocessing step that can be performed in O(m) time using the Viterbi algorithm, given that the rank of the alphabet used is considered a constant. A list input[q] is used to store, for each state q ∈ Q, the at most N discovered best runs arriving at q. The search space of candidate runs is represented by an array of heaps, here denoted cands. For each state q, cands[q] holds a heap storing the (at most N) best, so far unexploited, runs arriving at q. That is, if we have already picked the N′ best runs arriving at a node, the heap lets us pick the next best unpicked candidate efficiently when so requested by the recursive call.
To expand the search space, the N′-th best run ρ is used as follows: if τ = ([q1,…,qk] → q), we obtain a new candidate by replacing the i-th direct subtree of ρ with the next (and thereby minimally worse) run in input arriving at qi. Note that this is equivalent to the increment method used in Best Trees.
5 Comparison
A major difference between Best Runs and Best Trees is that the former solves the N-best runs problem whereas the latter solves the N-best trees problem: On lines 14 and 17 of Algorithm 2, duplicate trees are discarded. If these conditions are removed, Best Trees solves the best runs problem. (Provided, of course, that the objects outputted are changed to being runs rather than trees.)
For the sake of comparison, we adopt the view of Huang and Chiang (2005) that the ranked alphabet Σ can be considered fixed. This yields that the running time of Best Trees is . Since m ∈ O(n^{r+1}), where r is the (now fixed) maximal rank of symbols in Σ, it follows that , so the second expression simplifies to . Moreover, recall that the worst-case running time of Best Runs is . If N is large enough to make the second term the dominating one, the difference between both running times is thus a factor of (assuming for the sake of the comparison that the given bounds are reasonably tight). A further comparison between the running times does not appear to be all that meaningful because depends on both N and the structure of the input wta, and the algorithms are specialized for different problems.
A conceptual comparison of the two algorithms may be more insightful. The algorithms differ mainly in two ways. The first difference is that, while Best Runs enumerates candidates of best runs using one priority queue per node of G, Best Trees uses a more fine-grained approach, maintaining one priority queue per hyperedge. Since the upper bound on the length of queues is O(N) in both cases, in total Best Trees may need to handle O(Nm) candidates whereas Best Runs needs only O(Nn). This ostensible disadvantage of Best Trees is not a real one as it occurs only when solving the N-best trees problem. The difference vanishes if the algorithm is used to compute best runs (by changing lines 14 and 17). To see this, consider a given state q ∈ Q. Each time a queue element (i1, …, ik) is dequeued from a queue Kτ with τ: f[q1,…,qk] → q, the corresponding run is appended to the list Tq on line 15, and k queue elements are inserted on line 21. As this can only happen at most N times per state q, in total only O(Nn) queue elements are ever created in the worst case. In other words, if Best Trees is set to solve the N-best runs problem, the splitting of queues does not result in a disadvantage compared to Best Runs.
When a run ρ reaching state q is dequeued with ℓ > 0 (top left), then its best context cq is of the form , where cp is the best context for p = tar(τ′) and the ρi are 1-best trees reaching the remaining states of τ′ (top right). Hence, ρ′ = τ′[ρ1, … ,ρi−1,ρ,ρi +1, … ,ρk] (or a run which likewise is less than ℓ steps away from an output run) will be the next run dequeued (bottom).
It follows that line 21 inserts the element into Kτ that represents the run ρ′ = τ′[ρ1,…,ρi−1,ρ,ρi +1,…,ρk]. Because each of the runs ρi, arriving at some state qi, is already in the respective list , the run ρ′ immediately becomes a current candidate arriving at q = tar(τ). The corresponding pair (wq,dq) = (wt(cq),depth(cq)) satisfies wq + wt(ρ′) = w + wt(ρ) and dq = d − 1. Hence, ρ′ (or another run with the same priority) will be picked in the next loop execution, meaning that after at most ℓ steps the node arrived at by the constructed run will be qf, that is, the run will be outputted. This also shows that queues the elements of which are not needed for generating output trees will never become filled beyond their initial element.
We end the comparison by looking at Table 2, which summarizes the discussion above and provides comparison data for the other N-best algorithms discussed in this article. Keep in mind that N must be interpreted differently depending on the problem at hand. For example, the output of Best Runs is not equivalent to the output of Best Trees (unless the input is deterministic), which is why we cannot directly compare the time complexities. This output inequivalence should also be considered in Section 7 where, for simplicity, we plot data for Best Runs and Best Trees side by side.
Summary of characteristics of N-best algorithms. For the time complexities, the alphabet is taken to be constant and the input is represented as a wta. The two right-most columns indicate whether the algorithms compute best contexts as a preprocessing step, and what optimization methods are used to find new candidate objects.
Algorithm | Objects | Time complexity | Best contexts | Search-space expansion |
---|---|---|---|---|
Eppstein (1998) | Paths | | No | Adding sidetracks to implicit heap representations of paths |
Mohri and Riley (2002) | Strings | No formal analysis provided | Yes | On-the-fly determinization |
Huang and Chiang (2005) | Runs | (see footnote 5) | No | Increment |
Best Trees v.1 | Trees | | Yes | Eppstein’s algorithm |
Best Trees | Trees | | Yes | Increment |
Best Trees | Runs | (see footnote 6) | Yes | Increment |
6 Implementation Details
In the upcoming section, we experimentally compare Best Runs and Best Trees. In preparation of that, we want to make a few comments on the implementation of Best Trees. As previously mentioned, Best Runs is implemented in the Java toolkit Tiburon by May and Knight, and this is the implementation we use in our experiments. Therefore, we simply refer to the Tiburon GitHub page7 for implementation details.
We have extended our code repository Betty,8 which originally provided an implementation of Best Trees v.1, to additionally implement the improved Best Trees as its standard choice of algorithm. A flag -runs can be passed on as an argument to compute best runs instead of best trees. Below follow a number of central facts about the Betty implementation.
First, recall Algorithm 2, and in particular that the best tree is inserted into Tq for all q ∈ Q prior to the start of the main loop rather than letting Tq be empty and initializing Kτ to 1^{rank(τ)} for all q ∈ Q and τ ∈ R. The reason is that the latter would not guarantee that at most ℓ iterations are made until a run is outputted because some ρi may still not be in Tq. If this happens, transition rules adding the weight 0 may repeatedly be picked because some other τ′ is not yet enabled. However, with a trick the initialization can nevertheless be simplified as indicated. The idea is to delay the execution of line 22 for every queue Kτ until τ has actually appeared in an output tree. Thus, until this has happened, Kτ is disabled from contributing another tree to Ttar(q). This variant turned out to have efficiency advantages in practice and is therefore the variant implemented in Betty. Another advantage is that it allows Betty to handle a set of final states rather than a single one. This is not possible with the original initialization of Algorithm 2, because line 3 would have to be generalized to outputting the best trees of all final states, which cannot be done while maintaining the correctness of the algorithm, since the second best tree for a state q may have a lesser weight than the best tree for another state q′.
In lines 14 and 17 of Algorithm 2, trees are checked for equivalence. To perform the comparisons efficiently, we make use of hash tables. We use immutable trees, which need to be hashed only once when they are created; we then save the hash code together with the tree to make the former accessible in constant time for each tree. To compare trees for inequality, their hash codes are compared. If equal (which seldom happens if the trees are not equal), the comparison is continued recursively on the direct subtrees. This is theoretically less efficient than the method of representing trees uniquely in memory (as described in the last paragraph of the proof of Theorem 2), but practically sufficient and much easier to implement.
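The comparison strategy described above can be sketched as follows. Betty itself is written in Java; the Python class below is only an illustration of the idea, with a hypothetical `Tree` class whose hash code is computed once at construction time from the symbol and the cached hash codes of the direct subtrees.

```python
class Tree:
    """A tree that is treated as immutable and caches its hash code."""
    __slots__ = ("symbol", "children", "_hash")

    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = tuple(children)
        # Computed once; the subtrees already carry their own cached hashes.
        self._hash = hash((symbol,) + tuple(c._hash for c in self.children))

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if self is other:
            return True
        if not isinstance(other, Tree):
            return NotImplemented
        if self._hash != other._hash:
            return False           # the common case for unequal trees
        # Equal hash codes: confirm recursively on the direct subtrees.
        return (self.symbol == other.symbol
                and len(self.children) == len(other.children)
                and all(a == b for a, b in zip(self.children, other.children)))
```

Because unequal trees rarely share a hash code, most inequality tests finish after a single integer comparison, and the recursive descent is only paid for trees that are in fact equal or that collide.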
When creating new tuples for a transition rule τ as given by line 21, we add the tuples that can be instantiated directly to the corresponding queue Kτ. The tuples that cannot be instantiated must, however, be stored until they can. We want to be able to efficiently access the tuples that are affected when adding a tree t to Tq′ for some q′ ∈ Q (line 15). Therefore, we connect each tuple to the memory locations that will contain the data needed by the tuple. In more detail: Let τ = ( f[q1⋯qk] → q) be any transition rule in R and let (i1,…,ik) be a tuple originating from τ. For every ij ∈{i1,…,ik}, the tuple is saved in a list of tuples affected by and marked with a counter that shows how many trees remain until it can be instantiated. Thus, when t is added to Tq′(ij) for ij ∈{i1,…,ik} and q′ = qj, we can immediately access all tuples that can possibly be instantiated. If a tuple cannot be instantiated, its counter is decreased appropriately.
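A possible way to organize this bookkeeping is sketched below. It is an illustration under assumptions rather than a description of Betty's data structures: the class name `PendingTuples` and its methods are hypothetical, indices are taken to be 1-based, and the counter of a stored tuple counts the positions whose tree is still missing, which is one of several ways to realize the counter mentioned above.

```python
from collections import defaultdict

class PendingTuples:
    """Index tuples waiting for the trees they refer to."""

    def __init__(self):
        self.size = defaultdict(int)       # state q -> current length of T_q
        self.waiting = defaultdict(list)   # (state, index) -> waiting entries

    def add_tuple(self, rule, states, u):
        """Return True if u can be instantiated now; otherwise store it."""
        missing = [(q, i) for q, i in zip(states, u) if i > self.size[q]]
        if not missing:
            return True
        entry = {"rule": rule, "tuple": u, "counter": len(missing)}
        for slot in missing:
            self.waiting[slot].append(entry)
        return False

    def add_tree(self, state):
        """Record a new tree in T_state; return the tuples that became ready."""
        self.size[state] += 1
        ready = []
        for entry in self.waiting.pop((state, self.size[state]), []):
            entry["counter"] -= 1
            if entry["counter"] == 0:
                ready.append((entry["rule"], entry["tuple"]))
        return ready
```

When a tree is appended to the list of some state, only the entries registered for exactly that position are visited, so tuples that are unaffected by the new tree are never touched.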
7 Experiments
Let us now describe our experiments. For each problem instance (i.e., combination of input file and value of N), we perform a number of test runs, measure the elapsed time, and compute the average over the test runs. To avoid noise in our data caused by, for example, garbage collection, we only measure the time consumed by the thread the actual application runs in. Also, we disregard the time it takes to read the input files.
The number of test runs that are performed per problem instance is decided by the relationship between the mean μ and the standard deviation σ of the recorded times—these values are computed every fifth test run, and to finish the testing of the current problem instance, we require that σ < 0.01μ. However, five test runs per problem instance has turned out to be sufficient for fulfilling the requirement in practically all instances seen.
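The stopping rule can be written down as a short loop. The sketch below assumes that σ is the sample standard deviation (the text does not say whether the sample or the population variant is used) and adds a hypothetical `max_runs` safeguard that is not part of the described procedure; `measure` stands for any callable that performs one test run and returns its elapsed time.

```python
import statistics

def average_until_stable(measure, batch=5, rel_tol=0.01, max_runs=1000):
    """Repeat measurements in batches until sigma < rel_tol * mu.

    Returns the mean of all recorded times and the number of runs used.
    """
    times = []
    while len(times) < max_runs:
        times.extend(measure() for _ in range(batch))
        mu = statistics.mean(times)
        sigma = statistics.stdev(times)   # assumption: sample standard deviation
        if sigma < rel_tol * mu:
            break
    return mu, len(times)
```

With the default parameters, the criterion is checked after every fifth run, and a problem instance with stable timings is finished after the first batch.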
The memory usage is measured in terms of the maximum resident set size of the application process by using the Linux time command. Moreover, the strategy described above for computing the average over several test runs is used here as well.
All test scripts have been written in Python, in contrast to the tested implementations, which use Java. We run the experiments on a computer with a 3.60 GHz Intel Core i7-4790 processor.
7.1 Corpora
The corpora that were used in our experiments (see the list below) contain both real-world and synthetic data. The first corpus is derived from an actual machine-translation system and is thus representative for real-world usage. The second corpus consists of manually engineered grammars for a set of natural languages, used in a range of research and industry applications. The last two corpora are artificially created for the purpose of investigating the effect of increasing degrees of nondeterminism. Let us now present each in closer detail.
- MT-data
This data set consists of tree automata resulting from an English-to-German machine-translation task, described in the doctoral thesis of Quernheim (2017). The data consists of 927 files, each file containing a wta corresponding to one sentence; the files are indexed from 0 to 926. The smallest file has 24 lines and the largest one has 338,937 lines; all lines but the first hold a transition rule. Moreover, these wtas have a large number of states and are essentially deterministic in the sense that each state corresponds to a particular part-of-sentence structure and input node.
- GF-data
Unweighted context-free grammars from the Grammatical Framework (GF) by Ranta (2011) provided the basis for this corpus. GF is, among other things, a programming language and processing platform for multilingual grammar applications. It provides combinatory categorial grammars (Steedman 1987) for more than 60 natural languages, out of which we export a subset to context-free grammars in Backus–Naur form with the help of the built-in export tool, and assign every grammar rule the weight 1. The number of production rules varies between languages: The Latin grammar has, for example, 1.6 million productions, whereas the Italian grammar has 5.8 million. (These differences are due to the level of coverage chosen by the grammar designers, and do not necessarily reflect inherent complexities of the languages.)
- PolyNonDet
This is a family of automata of increasing size indexed by i. The states of member i are q0,…,qi, where qi is the final state and the rules are:
for j = 0,…,i
for j = 0,…,i
for j = 1,…,i
There are then Θ(n^i) runs for a tree of size n (and the number of rules grows linearly). Therefore, we call these polynomially nondeterministic.
- ExpNonDet
Finally, we use another family of automata of increasing size, also indexed by i and with states q0,…,qi and a final state qf. The rules for member i of the family for j, k = 0,…,i and j≠k are:
Thus, the number of rules grows quadratically in i and the number of runs for a tree of size n is Ω((i+1)^n); we say that these are exponentially nondeterministic.
The variables in all result-displaying plots in this article are either N (the number of trees or runs to output), m (the number of transition rules, i.e., lines in the input file), or both. For the artificially created corpora PolyNonDet and ExpNonDet, we exchange m for the variable i with which they are indexed. Note that for the former, m and i are interchangeable, but for the latter, m grows quadratically with increasing i.
7.2 Comparison with Best Trees v.1
First, we verify that the algorithm Best Trees is at least as efficient as its predecessor Best Trees v.1. Therefore, we run experiments on the ExpNonDet data set for both of the algorithms—both solving the best trees problem. The ExpNonDet data set was chosen because it has the largest degree of nondeterminism achievable, which should challenge both of the algorithms maximally. The results are presented in figures 4 and 5 for varying N and i, respectively. Note that both of these plots have logarithmic y axes. We see that Best Trees outperforms Best Trees v.1 considerably for both increasing N and increasing i.
More interestingly, in Figure 4, Best Trees displays a step-like behavior at N ≈ 100, 200, and 600. The nature of the ExpNonDet corpus is the explanation of this: When we have found all of the distinct trees of size s (all of the same weight), then the algorithm has to discard all of the duplicates of size s that are still in the queue, before arriving at a tree of size s + 1 with higher weight.
7.3 Comparison with Tiburon
Before turning to the comparison between Betty and Tiburon, recall from Section 5 that when using Betty to find the best trees, the input N cannot be equated with the N that is input to Tiburon and Betty to find the best runs. When the practical application actually demands best trees, a best runs implementation like Tiburon will potentially have to compute many more runs than trees needed, thus by necessity resulting in larger N. We have, however, for simplicity chosen to show the best trees data in the same plot as the data for best runs. It should also be noted that this comparison between Betty and Tiburon is essentially a “stress test” for Betty, which was designed to solve best trees but is here reduced to compute best runs.
Next, we compare Betty with Tiburon for all data sets, starting with the MT-data corpus. Figure 6 shows the running times for finding the N best runs using Tiburon for every file of the MT-data corpus and N = 1,000, 2,000, …, 20,000; recall that m is the number of transition rules in the input file. The corresponding results for Betty can be seen in Figure 7. When we instead let Betty solve the best trees problem for the same input, we achieve the result in Figure 8. Figures 9 and 10 show the results for all three implementations in the same plots for a fixed file and value of N, respectively. It seems that Betty is faster than Tiburon on both tasks, but to be certain, we perform a statistical test. Because we have paired results from two populations, we run a Wilcoxon signed-rank test on the results. We prepare both result sets for statistical testing by splitting them into 9 vectors of size 103 for each value of N; the splitting is done to allow us to find patterns in which implementation performs better given the parameters. Let vBetty and vTiburon be such result vectors from the test runs of Betty and Tiburon, respectively. Our null hypothesis H0 is that vdiff ≔ vBetty − vTiburon comes from a distribution with median 0, and our alternative hypothesis H1 is that vdiff comes from a distribution with median less than 0. For our statistical tests, we use a significance level of 0.01. The result has the form of probabilities of observing values as or more extreme than the data under the null hypothesis H0, and it shows that each probability is smaller than 10^{−18}. (Because the numbers are insignificantly small, we exclude a table of detailed results.) We can therefore conclude that H0 is rejected for all partitions of our data. This implies that Betty is statistically significantly faster than Tiburon on the MT-data corpus for both tasks. For the rest of the experiments, we omit similar statistical tests.
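For readers who want to reproduce this kind of test, the following sketch computes a one-sided Wilcoxon signed-rank test for paired samples using only the standard library. It is not the procedure or software used for the reported numbers, which the text does not specify: the function name `wilcoxon_less` is hypothetical, zero differences are dropped, tied absolute differences receive average ranks, and the p-value comes from the normal approximation without tie or continuity correction.

```python
import math

def wilcoxon_less(x, y):
    """One-sided signed-rank test; H1: the median of x - y is below 0.

    Returns (W_plus, p_value). Requires at least one nonzero difference.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average of the ranks i+1, ..., j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))   # standard normal CDF at z
    return w_plus, p_value
```

Calling `wilcoxon_less(v_betty, v_tiburon)` on two paired timing vectors and comparing the returned p-value with 0.01 mirrors the decision rule described above; for real analyses, an established statistics library is preferable to this approximation.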
Comparison of all three implementations on the largest MT-data file.
Comparison of all three implementations for N = 25,000 on all MT-data files.
Next, we apply the three algorithms to the GF-data corpus. We run tests for 42 different languages, and present a selection of typical results along with a single atypical one. The typical results are shown in figures 11, 12, and 13, and concern the languages Persian, Thai, and Somali. The Thai result shows one of the extremes among the typical results with the three implementations starting at about the same running time but then diverging, while the Somali plot shows the other extreme, that is, when the running times differ significantly already at start. In all of these plots, Betty displays a better performance than Tiburon, regardless of the task solved. The one atypical result is found in Figure 14, which is the result from the Latin file. Here we see that Tiburon is faster than Best Trees for all values of N tested, although the slope of the Tiburon curve indicates that it will eventually surpass both Betty curves with increasing N. To verify this, we ran the same experiment for N = 120,000, and confirm that it has indeed happened: Tiburon runs in 3,105 ms whereas Betty runs in 2,681 and 3,055 ms for best runs and best trees, respectively. We conjecture that this unusual result depends on certain characteristics of the Latin corpus: The Latin and the Somali file are approximately the same size, but the Latin output trees are small and many represent one-word utterances, whereas the Somali output trees are larger and represent more complex sentence structures. The combination of a large file and small output trees makes the computation of the best contexts a large overhead, which is also visible in the plot: Even for zero output runs Betty seems to require about 2 seconds of computation time. Even though this is a combination that we rarely see in practice, it is an example of when computing the best contexts could be considered superfluous and thus detrimental to efficiency.
Finally, we measure the memory usage for the various implementations on one file from each natural language corpus. The results for the largest MT-data file and the Persian GF-data file can be seen in figures 15 and 16, respectively. For the former, the memory usage of Tiburon seems to grow faster than the one for Betty, but for the latter, the roles are reversed. To explain why this happens, the memory usage of the implementations needs to be investigated in greater detail, a task left for future work.
Comparison of the memory usage of all three implementations on the largest MT-data file.
Comparison of the memory usage of all three implementations on the Persian GF-data file.
Now we have arrived at the artificial corpora that were created to investigate the effect of nondeterminism on the algorithms. Running Tiburon on the PolyNonDet corpus yields the running times in Figure 17, and using Betty to solve the best runs and the best trees problems results in the numbers in figures 18 and 19, respectively. We observe that Tiburon’s running times are significantly larger than those of Betty. When solving the best trees problem, Betty does not seem to be affected by increasing i, which means it can handle increasing degrees of polynomial nondeterminism well. By inspection of Figure 20, which compares the algorithms with respect to N, it is possible to conclude that they are all in O(N) on this particular example.
We now use the ExpNonDet corpus to expose the algorithms to exponential nondeterminism. The expectation is that the best runs algorithms will continue to do their job well, simply because the exponential nondeterminism does not play a significant role in this case. Indeed, this turns out to be the case, with Betty (Figure 22) being slightly faster than Tiburon (Figure 21). However, the best trees problem is no longer that easily solvable. As can be seen in Figure 23, we had to restrict the intervals for both i and N to make the task feasible for the computer used. This outcome is expected since there is now an exponential number of runs on each tree, and these duplicates have to be discarded before a larger tree can be found: The plateaus in Figure 23 indicate where we go from including trees of sizes smaller than s to including trees of size s. Asymptotic testing verifies that the best trees curve grows quadratically with i (meaning that it is linear in m) and is not worse than for the second dimension. Figures 24 and 25 make a closer comparison of Tiburon and Betty for the best runs task with respect to the variables N and i. The effect of increasing N is linear as for the PolyNonDet data, but when considering i, the previously constant behavior is now linear.
Comparison between Tiburon and Betty for the best runs problem on the ExpNonDet corpus for i = 299.
Comparison between Tiburon and Betty for the best runs problem on the ExpNonDet corpus for N = 25,000.
We conclude the experiments by presenting a smaller number of results for memory usage: Figures 26 and 27 show the memory usage measured in kbytes for varying N while figures 28 and 29 show the corresponding numbers for varying i. As expected, the memory usage is much larger for the best trees task than for the best runs task. Moreover, for the latter, Tiburon seems to have a slight advantage over Betty with respect to i but a clear advantage with respect to N.
Comparison of the memory usage of the best runs implementations for the ExpNonDet file with i = 299, and with i = 19 for the best trees one.
Comparison of the memory usage of the two best runs implementations for the i = 299 ExpNonDet file.
Comparison of the memory usage of all three implementations for N = 25,000 on the i = 0,…,45 ExpNonDet files.
Comparison of the memory usage of the two best runs implementations for N = 25,000 on the i = 0,…,299 ExpNonDet files.
8 Conclusion
We have presented an improved version of the algorithm by Björklund, Drewes, and Zechner (2019) that solves the N-best trees problem for weighted tree automata over the tropical semiring. The main novelty lies in the exploration of the search space with a focus on instantiations of transition rules rather than on states, and the lazy assembly of these instantiations. We have proved the new algorithm to be correct and derived an upper bound on its running time—a bound that is smaller than that of the previous algorithm. We believe that this speed-up makes it superior for usage in typical language-processing applications.
Moreover, we have complemented the theoretical work with an experimental evaluation. Because Best Trees can be easily modified to produce the best runs instead of the best trees, we considered both tasks in our evaluation. To achieve a large coverage, we used two types of data: data from real-world language processing tasks and artificially created data. The real-world data consist of machine translation output and corpus-based rule sets for natural languages; these corpora are meant to display the kind of behavior one can expect when applying our algorithm to language processing tasks. The artificial corpora were designed to expose Best Trees to its worst-case scenario for the best trees task: the case when we have an exponential number of duplicate trees in a best runs list. We also covered the more moderate case where the nondeterminism only gives rise to a polynomial number of duplicates.
In the experiments focusing on running time, we first used the exponentially nondeterministic data to show that Best Trees is better at the best trees task than its predecessor Best Trees v.1, which uses a less efficient pruning scheme. Then, we compared Best Trees with the state-of-the-art best-runs algorithm of Huang and Chiang (2005), implemented in Tiburon by May and Knight (2006), for both tasks on all data sets. The results made it clear that Best Trees is preferable when extracting N-best lists of both runs and trees: Betty outperforms Tiburon for the best runs task on all 2,269 input wtas except one. The single exception revealed a corner case where the Huang and Chiang algorithm is faster, namely, when there are millions of rules and only a small percentage of them are used to produce N very small (height 0 or 1) runs. Additionally, we performed a smaller number of experiments to measure the memory usage of the three applications, and, while no definitive conclusion could be drawn, Tiburon appeared to have an overall advantage over Best Trees with respect to memory usage.
Prior to the experiments, we compared the two algorithms at a conceptual level and discussed the expected effects on their running time. The allocation of queues to transition rules instead of states mainly serves to structure the implementation. As we saw, it does not have a disadvantage with respect to running time. The use of best contexts, generalizing the idea of Mohri and Riley (2002) to trees, has a positive effect: it ensures that the maximum distance of a state to the final state is an upper bound on the maximum number of main loop iterations before the next run is output. This is because the best contexts guide the algorithm to take the shortest route in constructing the next best run that reaches a final state.9
The disadvantage of using best contexts is that it limits which semirings can be used. The technique is compatible with the tropical semiring and, as discussed in Section 2, with the equivalent Viterbi semiring, but seemingly not with semirings that are not extremal. It is currently unclear to us whether an appropriate extension is possible, and we leave this question for future work. However, such extensions seem only relevant for the N-best runs problem, because it appears highly unlikely that the N-best trees problem could ever be solved with reasonable efficiency in cases where the weight semiring is not extremal. The reason is that, in that case, the best run on a tree does not determine the weight of that tree, making it unclear how the problem can be solved even if disregarding efficiency aspects. However, because the tropical semiring and the Viterbi semiring are the most prominent ones used to rank hypotheses in NLP, the computational advantages seen in this article seem to justify the restriction to these semirings, even when looking only at the N-best runs problem.
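The distinction between extremal and non-extremal semirings can be illustrated with a small numerical sketch. The run weights below are invented for illustration, and the function names are hypothetical. In the Viterbi and tropical semirings the weight of a tree equals the weight of its best run, whereas in the probability semiring (a non-extremal one, chosen here as an assumption for the example) the weights of all runs are summed, so the best run no longer determines the ranking of trees.

```python
import math
from typing import Callable, Dict, List

# Invented toy data: the weights (as probabilities) of all runs on two trees.
RUNS: Dict[str, List[float]] = {"t1": [0.4, 0.1], "t2": [0.3, 0.3]}


def viterbi_weight(run_probs: List[float]) -> float:
    """Viterbi semiring: the tree weight is the maximum run probability."""
    return max(run_probs)


def tropical_weight(run_probs: List[float]) -> float:
    """Tropical semiring: the tree weight is the minimum negative log probability."""
    return min(-math.log(p) for p in run_probs)


def probability_weight(run_probs: List[float]) -> float:
    """Probability semiring (non-extremal): the tree weight is the sum over runs."""
    return sum(run_probs)


def best_tree(weight: Callable[[List[float]], float], minimize: bool) -> str:
    """Return the tree that is optimal under the given tree-weight function."""
    choose = min if minimize else max
    return choose(RUNS, key=lambda tree: weight(RUNS[tree]))


best_viterbi = best_tree(viterbi_weight, minimize=False)           # "t1"
best_tropical = best_tree(tropical_weight, minimize=True)          # "t1"
best_probability = best_tree(probability_weight, minimize=False)   # "t2"
```

Here t1 carries the single best run and wins under both extremal semirings, but t2 has the larger total weight once all runs are summed. An N-best runs list therefore gives no direct handle on the N best trees in the non-extremal case.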
Acknowledgments
We are thankful to Andreas Maletti for providing the machine translation data set used in our experiments; to Jonathan May for his support in the application of Tiburon to the N-best problem; to André Berg for sharing his expertise in mathematical statistics; to Aarne Ranta, Peter Ljunglöf, and Krasimir Angelov for introducing us to Grammatical Framework; and to the reviewers for suggesting numerous improvements to the article.
Notes
This replacement had already been done in Tiburon, without an explicit remark.
We assume that enq(Kτ,U) enqueues all u ∈U in Kτ except those already in it, thus skipping duplicates.
Recall that denotes the ij-th element of the sequence .
In fact, the running time obtained by Huang and Chiang (2005) is , but this assumes (the graph representation of) M to be acyclic. When M is cyclic, line 2 must be implemented by using Knuth’s algorithm, resulting in an additional factor in the first term.
Allows for cyclic input wta; is the size of the largest output.
Note that a factor m is removed, compared with when the same algorithm is used for finding the best trees. This is because all runs originating at the same rule queue are distinct (and naturally the same also holds for different rule queues).
We have conducted a small experiment not mentioned in the previous section, by switching off that feature. It showed that the use of best contexts (in that particular, randomly chosen case) reduced the size of rule queues by a factor of 10.
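The duplicate-skipping behavior assumed for enq in the notes above can be sketched as follows. This is a minimal illustration under our own assumptions, not the data structure used in Betty: the class name `DedupQueue` is hypothetical, entries are (weight, item) pairs with hashable and mutually comparable items, and membership is tracked with an auxiliary set.

```python
import heapq
from typing import Iterable, List, Set, Tuple

Entry = Tuple[float, str]  # (weight, item); items are assumed hashable and comparable


class DedupQueue:
    """A min-priority queue whose enq skips items that are already enqueued."""

    def __init__(self) -> None:
        self._heap: List[Entry] = []
        self._members: Set[str] = set()

    def enq(self, entries: Iterable[Entry]) -> None:
        """Enqueue every entry whose item is not currently in the queue."""
        for weight, item in entries:
            if item in self._members:
                continue  # already present: skip the duplicate
            self._members.add(item)
            heapq.heappush(self._heap, (weight, item))

    def deq(self) -> Entry:
        """Remove and return an entry of minimal weight."""
        weight, item = heapq.heappop(self._heap)
        self._members.discard(item)
        return weight, item

    def __len__(self) -> int:
        return len(self._heap)


queue = DedupQueue()
queue.enq([(2.0, "b"), (1.0, "a"), (2.0, "b")])  # the second "b" is skipped
queue_size = len(queue)
first = queue.deq()
```

The auxiliary set makes the membership test constant time on average, at the cost of storing each enqueued item twice.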