Abstract
Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because, among other things, they are efficient for recognition tasks in both time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a WFA such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library.
1. Introduction
Representing the target model as a WFA has many advantages, including efficient use, compact representation, interpretability, and composability. WFA models have been used in many applications including speech recognition (Mohri, Pereira, and Riley 2008), speech synthesis (Ebden and Sproat 2015), optical character recognition (Breuel 2008), machine translation (Iglesias et al. 2011), computational biology (Durbin et al. 1998), and image processing (Albert and Kari 2009). One particular problem of interest is language modeling for on-device (virtual) keyboard decoding (Ouyang et al. 2017), where WFA models are widely used because of space and time constraints. One example use of methods we present in this article comes from this domain, involving approximation of a neural LM. Storing training data from the actual domain of use (individual keyboard activity) in a centralized server and training k-gram or WFA models directly may not be feasible because of privacy constraints (Hard et al. 2018). To circumvent this, an LSTM model can be trained by federated learning (Konečný et al. 2016; McMahan et al. 2017; Hard et al. 2018), converted to a WFA at the server, and then used for fast on-device inference. This not only may improve performance over training the models just on out-of-domain publicly available data, but also benefits from the additional privacy provided by federated learning. Chen et al. (2019) use the methods presented in this article for this very purpose.
There are multiple reasons why one may choose to approximate a source model with a WFA. One may have strong constraints on system latency, such as the virtual keyboard example above. Alternatively, a specialized application may require a distribution over just a subset of possible strings, but must estimate this distribution from a more general model—see the example below regarding utterances for setting an alarm. To address the broadest range of use cases, we aim to provide methods that permit large classes of source models and target WFA topologies. We explore several distinct scenarios experimentally in Section 5.
Our methods allow failure transitions (Aho and Corasick 1975; Mohri 1997) in the target WFA, which are taken only when no immediate match is possible at a given state, for compactness. For example, in the WFA representation of a backoff k-gram model, failure transitions can compactly implement the backoff (Katz 1987; Chen and Goodman 1998; Allauzen, Mohri, and Roark 2003; Novak, Minematsu, and Hirose 2013; Hellsten et al. 2017). The inclusion of failure transitions complicates our analysis and algorithms but is highly desirable in applications such as keyboard decoding. Further, to avoid redundancy that leads to inefficiency, we assume the target model is deterministic, which requires that at each state there is at most one transition labeled with a given symbol.
The approximation problem can be divided into two steps: (1) select an unweighted automaton A that will serve as the topology of the target automaton and (2) weight the automaton A to form our weighted approximation Â. The main goal of this article is the latter step, determining the automaton's weighting in the approximation. If the topology is not known, we suggest a few techniques for inferring it later in the Introduction.
An unweighted automaton that specifies what one might say to set an alarm. The initial state is the bold circle and the final state is the double circle. By convention, we terminate all accepted strings with the symbol $.
If M is represented as a WFA, our approximation will in general give a different solution than forming the finite-state intersection with A and weight-pushing to normalize the result (Mohri, Pereira, and Riley 2008; Mohri 2009). Our approximation has the same states as A whereas weight-pushed M ∩ A has O(|M||A|) states. Furthermore, weight-pushed M ∩ A is an exact WFA representation of the distribution in Equation (1).
As stated earlier, some applications may simply require smaller models or those with lower latency of inference, and in such scenarios the specific target topology may be unknown. In such cases, one choice is to build a k-gram deterministic finite automaton (DFA) topology from a corpus drawn from S (Allauzen, Mohri, and Roark 2003). This could be from an existing corpus or from random samples drawn from S. Figure 2(a) shows a trigram topology for the very simple corpus aab. Figure 2(b) shows an alternative topology that allows skip-grams. Both of these representations make use of failure transitions. These allow modeling strings unseen in the corpus (e.g., abab) in a compact way by failing or backing-off to states that correspond to lower-order histories. Such models can be made more elaborate if some transitions represent classes, such as names or numbers, that are themselves represented by sub-automata. As mentioned previously, we will mostly assume we have a topology either pre-specified or inferred by some means and focus on how to weight that topology to best approximate the source distribution. We hence focus our evaluation on intrinsic measures of model quality such as perplexity or bits-per-character.1
Topology examples derived from the corpus aab. States are labeled with the context that is remembered, ∧ denotes the initial context, ϵ the empty context, $ the final context (and terminates accepted strings), and _ matches any symbol in a context. (a) 3-gram topology: failure transitions, labeled with φ, implement backoff from histories xy to _y to ϵ. (b) skip-gram topology: failure transitions implement backoff instead from histories xy to x_.
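For concreteness, the following sketch (ours, not the OpenGrm construction; the encoding of states and transitions is purely illustrative) builds an unweighted backoff k-gram topology of the kind shown in Figure 2(a) from a tokenized corpus: states are the observed histories, each observed n-gram adds a labeled transition to the longest suffix history, and each history state gets a failure transition to its longest proper suffix.

```python
def kgram_topology(sentences, order=3, bos="^", eos="$"):
    """Illustrative sketch: unweighted backoff k-gram topology from a corpus.

    States are the observed histories of length < order (plus a final state
    for eos); an observed n-gram (h, w) adds a w-labeled transition from
    state h to the longest suffix of h + (w,) that is itself a state; every
    other state backs off (phi) to its longest proper suffix.
    """
    histories, ngrams = {(), (eos,)}, set()
    for sentence in sentences:
        tokens = [bos] * (order - 1) + list(sentence) + [eos]
        for i in range(order - 1, len(tokens)):
            for n in range(1, order + 1):
                history = tuple(tokens[i - n + 1:i])
                histories.add(history)
                ngrams.add((history, tokens[i]))
    arcs, phi_arcs = set(), set()
    for h, w in ngrams:
        if w == eos:
            arcs.add((h, w, (eos,)))          # all eos arcs enter the final state
            continue
        hw = h + (w,)
        for j in range(max(0, len(hw) - order + 1), len(hw) + 1):
            if hw[j:] in histories:           # longest suffix that is a state
                arcs.add((h, w, hw[j:]))
                break
    for h in histories:
        if h and h != (eos,):                 # backoff from xy to y to the empty context
            phi_arcs.add((h, "phi", h[1:]))
    return histories, arcs, phi_arcs

# The toy corpus "aab" from Figure 2(a):
states, arcs, backoffs = kgram_topology([["a", "a", "b"]])
```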
This article expands upon an earlier, shorter version (Suresh et al. 2019) by also providing, beyond the additional motivating examples that have already been presented in this Introduction: an extended related work section and references; expansion of the theoretical analysis from three lemmas without proofs (omitted for space) to five lemmas (and a corollary) with full proofs; inclusion of additional algorithms for counting (for general WFA source and target in addition to φ-WFA); inclusion of additional experiments, including some illustrating the exact KL-minimization methods available for WFA sources of different classes; and documentation of the open-source library that provides the full functionality presented here.
The article is organized as follows. In Section 2 we review previous work in this area. In Section 3 we give the theoretical formulation of the problem and the minimum KL divergence approximation. In Section 4 we present algorithms to compute that solution. One algorithm is for the case that the source itself is finite-state. A second algorithm is for the case when it is not and involves a sampling approach. In Section 5 we show experiments using the approximation. In Section 6 we briefly describe the open-source software used in our experiments. Finally, in Section 7 we discuss the results and offer conclusions.
2. Related Work
In this section we review methods both for inferring unweighted finite-state models from data and for estimating the weight distribution in the weighted case. We start with the unweighted case.
There is a long history of unweighted finite-state model inference (Parekh and Honavar 2000; Cicchello and Kremer 2003). Gold (1967) showed that an arbitrary regular set L cannot be learned, identified in the limit, strictly from the presentation of a sequence of positive examples that eventually includes each string in L. This has led to several alternative lines of attack.
One approach is to include the negative examples in the sequence. Given such a complete sample, there are polynomial-time algorithms that identify a regular set in the limit (Gold 1978). For example, a prefix tree of the positive examples can be built and then states can be merged so long as they do not cause a negative example to be accepted (Oncina and Garcia 1992; Dupont 1996). Another approach is to train a recurrent neural network (RNN) on the positive and negative examples and then extract a finite automaton by quantizing the continuous state space of the RNN (Giles et al. 1992; Jacobsson 2005).
A second approach is to assume a teacher is available that determines not only if a string is a positive or negative example but also if the language of the current hypothesized automaton equals L or, if not, provides a counterexample. In this case the minimal m-state DFA corresponding to L can be learned in time polynomial in m (Angluin 1987). Weiss, Goldberg, and Yahav (2018) apply this method for DFA extraction from an RNN.
A third approach is to assume a probability distribution over the (positive only) samples. With some reasonable restrictions on the distribution, such as that the probabilities are generated from a weighted automaton A with L = 𝓛(A), then L is identifiable in the limit with “high probability” (Angluin 1988; Pitt 1989).
There have been a variety of approaches for estimating weighted automata. A variant of the prefix tree construction can be used that merges states with sufficiently similar suffix distributions, estimated from source frequencies (Carrasco and Oncina 1994, 1999). Approaches that produce (possibly highly) non-deterministic results include the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) applied to fully connected Hidden Markov models or spectral methods applied to automata (Balle and Mohri 2012; Balle et al. 2014). Eisner (2001) describes an algorithm for estimating probabilities in a finite-state transducer from data using EM-based methods. Weiss, Goldberg, and Yahav (2019) and Okudono et al. (2020) provide adaptations to the Weiss, Goldberg, and Yahav (2018) DFA extraction algorithm to yield weighted automata.
For approximating neural network (NN) models as WFAs, Deoras et al. (2011) used an RNN LM to generate samples that they then used to train a k-gram LM. Arisoy et al. (2014) used deep neural network (DNN) models of different orders to successively build and prune a k-gram LM with each new order constrained by the previous order. Adel et al. (2014) also trained DNNs of different orders, built a k-gram LM on the same data to obtain a topology, and then transferred the DNN probabilities of each order onto that k-gram topology. Tiño and Vojtek (1997) quantized the continuous state space of an RNN and then estimated the transition probabilities from the RNN. Lecorvé and Motlicek (2012) quantized the hidden states in an LSTM to form a finite-state model and then used an entropy criterion to back off to low-order k-grams to limit the number of transitions. See Section 4.3 for a more detailed comparison with the most closely related methods, once the details of our algorithms have been provided.
Our article is distinguished in several respects from previous work. First, our general approach does not depend on the form of the source distribution although we specialize our algorithms for (known) finite-state sources with an efficient direct construction and for other sources with an efficient sampling approach. Second, our targets are a wide class of deterministic automata with failure transitions. These are considerably more general than k-gram models but retain the efficiency of determinism and the compactness failure transitions allow, which is especially important in applications with large alphabets like language modeling. Third, we show that our approximation searches for the minimal KL divergence between the source and target distributions, given a fixed target topology provided by the application or some earlier computation.
3. Theoretical Analysis
3.1 Probabilistic Models
Lemma 1
The quantity γ(qs, qa) counts each string xi that reaches both state qs in Qs and state qa in Qa weighted by its probability according to ps. Equivalently, it counts each string xi reaching state (qs, qa) in Qs × Qa (an interpretation we develop in Section 4.1.1).5 Note that γ does not depend on distribution pa.
The following corollary is useful for finding the probabilistic model pa that has the minimal KL divergence from a model ps.
Corollary 1
The quantity c(xi, qa) counts each string xi that reaches a state (qs, qa) in Qs × {qa}, weighted by its probability according to ps (cf. Equation (18)).
3.2 Weighted Finite Automata
A weighted finite automaton A = (Σ, Q, E, i, f) over ℝ+ is given by a finite alphabet Σ, a finite set of states Q, a finite set of transitions E ⊆ Q × Σ × ℝ+ × Q, an initial state i ∈ Q, and a final state f ∈ Q. A transition e = (p[e], ℓ[e], w[e], n[e]) ∈ E represents a move from a previous (or source) state p[e] to the next (or destination) state n[e] with the label ℓ[e] and weight w[e]. The transitions with previous state q are denoted by E[q] and the labels of those transitions by L[q].
Transitions e1 and e2 are consecutive if n[e1] = p[e2]. A path π = e1 ⋯ en ∈ E* is a finite sequence of consecutive transitions. We denote the previous state of a path by p[π] and its next state by n[π]. The label of a path is the concatenation of its transition labels ℓ[π] = ℓ[e1] ⋯ ℓ[en]. The weight of a path is obtained by multiplying its transition weights w[π] = w[e1] × ⋯ × w[en]. For a non-empty path, the i-th transition is denoted by πi.
P(q, q′) denotes the set of all paths in A from state q to q′. We extend this to sets in the obvious way: P(q, R) denotes the set of all paths from state q to q′ ∈ R and so forth. A path π is successful if it is in P(i, f) and in that case the automaton is said to accept the input string α = ℓ[π].
For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic WFA A, define a distribution pa(x|q) ≜ w if (q, x, w, q′) ∈ E and pa(x|q) ≜ 0 otherwise. Then pa is a probabilistic model over Σ as defined in the previous section. If A = (Σ, Q, E, i, f) is an unweighted deterministic automaton, we denote by 𝒫(A) the set of all probabilistic models pa representable as a weighted WFA Â = (Σ, Q, Ê, i, f) with the same topology as A where Ê = {(q, x, pa(x|q), q′) : (q, x, 1, q′) ∈ E}.
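To make the per-state conditional weights concrete, here is a minimal sketch (ours; the dictionary encoding of transitions is an assumption, not the library's representation) of how a deterministic probabilistic WFA assigns probability to a $-terminated string: follow the unique matching transition for each symbol and multiply the weights pa(x|q) along the way.

```python
def wfa_probability(arcs, initial, string):
    """Probability a deterministic probabilistic WFA assigns to a
    $-terminated string.  `arcs` maps (state, symbol) -> (weight, next_state),
    i.e. weight = p_a(symbol | state); missing entries mean probability 0."""
    prob, state = 1.0, initial
    for symbol in string:                 # the string includes the final "$"
        if (state, symbol) not in arcs:
            return 0.0
        weight, state = arcs[(state, symbol)]
        prob *= weight
    return prob

# Toy two-state model over {a, $}.
arcs = {("q0", "a"): (0.6, "q0"), ("q0", "$"): (0.4, "qf")}
print(wfa_probability(arcs, "q0", ["a", "a", "$"]))   # 0.6 * 0.6 * 0.4 = 0.144
```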
Given an unweighted deterministic automaton A, our goal is to find the target distribution pa ∈ 𝒫(A) that has the minimum KL divergence from our source probability model ps.
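Written out, with the sum running over $-terminated strings x, this is the optimization problem

$$
p_a^{*} \;=\; \operatorname*{argmin}_{p_a \in \mathcal{P}(A)} D(p_s \,\|\, p_a)
\;=\; \operatorname*{argmin}_{p_a \in \mathcal{P}(A)} \sum_{x} p_s(x)\,\log\frac{p_s(x)}{p_a(x)},
$$

and, since ps is fixed, it is equivalent to maximizing the cross term ∑x ps(x) log pa(x).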
Lemma 2
3.3 Weighted Finite Automata with Failure Transitions
A weighted finite automaton with failure transitions (φ-WFA) A = (Σ, Q, E, i, f) is a WFA extended to allow a transition to have a special failure label denoted by φ. Then E ⊆ Q × (Σ ∪ {φ}) × ℝ+ × Q.
A φ transition does not add to a path label; it consumes no input.6 However, it is followed only when the input cannot be read immediately. Specifically, a path e1 ⋯ en in a φ-WFA is disallowed if it contains a subpath ei ⋯ ej such that ℓ[ek] = φ for all k, i ≤ k < j, and there is another transition e ∈ E such that p[ei] = p[e] and ℓ[ej] = ℓ[e] ∈ Σ (see Figure 3). Since the label x = ℓ[ej] can be read on e, we do not follow the failure transitions to read it on ej as well.
The (dashed red) path ei = (qi, φ, ωi, qi+1) to ej = (qj, x, ωj, qj+1) is disallowed since x can be read already on e = (qi, x, ω, q).
We use P*(q, q′) ⊆ P(q, q′) to denote the set of (not dis-) allowed paths from state q to q′ in a φ-WFA. This again extends to sets in the obvious way. A path π is successful in a φ-WFA if π ∈ P*(i, f) and only in that case is the input string α = ℓ[π] accepted.
The language accepted by the φ-automaton A is the set 𝓛(A) = {α ∈ Σ* : α = ℓ[π], π ∈ P*(i, f)}. The weight of α ∈ Σ* assigned by the automaton is A(α) = Σπ∈P*(i,f):ℓ[π]=αw[π]. We assume each string in 𝓛(A) is terminated by the symbol $ as before. We also assume there are no φ-labeled cycles and there is at most one exiting failure transition per state.
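The failure semantics can be made concrete with a small sketch (again ours, with an assumed encoding): to read a symbol at a state, failure transitions are followed, multiplying their weights, only until a state with a matching transition is reached, so that disallowed paths are never taken.

```python
def phi_wfa_probability(arcs, phi, initial, string):
    """Probability a deterministic phi-WFA assigns to a $-terminated string.
    `arcs` maps (state, symbol) -> (weight, next_state); `phi` maps
    state -> (weight, backoff_state) for states with a failure transition.
    A symbol is read at the first state along the failure chain where it
    has a match, so disallowed paths are never taken."""
    prob, state = 1.0, initial
    for symbol in string:
        while (state, symbol) not in arcs:
            if state not in phi:          # cannot back off further: reject
                return 0.0
            weight, state = phi[state]    # follow the failure transition
            prob *= weight
        weight, state = arcs[(state, symbol)]
        prob *= weight
    return prob

# Toy backoff example: state "h" reads "a" and "$" directly, backs off to "u" for "b".
arcs = {("h", "a"): (0.7, "h"), ("h", "$"): (0.1, "f"),
        ("u", "a"): (0.5, "h"), ("u", "b"): (0.3, "h"), ("u", "$"): (0.2, "f")}
phi = {"h": (0.2, "u")}
print(phi_wfa_probability(arcs, phi, "h", ["a", "b", "$"]))   # 0.7 * (0.2 * 0.3) * 0.1
```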
A deterministic φ-WFA is backoff-complete if a failure transition from state q to q′ implies L[q] ∩ Σ ⊆ L[q′] ∩ Σ. Further, if φ ∉ L[q′], then the containment is strict: L[q] ∩ Σ ⊂ L[q′] ∩ Σ. In other words, if a symbol can be read immediately from a state q it can also be read from a state failing (backing-off) from q and if q′ does not have a backoff arc, then at least one additional label can be read from q′ that cannot be read from q. For example, both topologies depicted in Figure 2 have this property. We restrict our target automata to have a topology with the backoff-complete property since it will simplify our analysis, make our algorithms efficient, and is commonly found in applications. For example, backoff n-gram models, such as the Katz model, are φ-cycle-free, backoff-complete φ-WFAs (Katz 1987; Chen and Goodman 1998).
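Under the same illustrative encoding, the backoff-complete property can be checked directly; the sketch below is ours, not a library routine.

```python
def is_backoff_complete(arcs, phi):
    """Check backoff-completeness: if q fails to q', the symbols readable
    at q are a subset of those readable at q', and the containment is
    strict whenever q' has no failure transition of its own."""
    labels = {}
    for (state, symbol), _ in arcs.items():
        labels.setdefault(state, set()).add(symbol)
    for q, (_, q_backoff) in phi.items():
        l_q = labels.get(q, set())
        l_b = labels.get(q_backoff, set())
        if not l_q <= l_b:
            return False
        if q_backoff not in phi and not l_q < l_b:
            return False
    return True
```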
For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic φ-WFA A, define p̂a(x|q) ≜ w if (q, x, w, q′) ∈ E*[q] and p̂a(x|q) ≜ 0 otherwise. Then p̂a is a probabilistic model over Σ as defined in Section 3.1. Note the distribution at a state q is defined over the φ-extended transitions E*[q], whereas pa in the previous section is defined over the transitions E[q]. It is convenient to define a companion distribution pa ∈ 𝒫(A) to p̂a as follows:7 Given a symbol x ∈ Σ ∪ {φ} and state q ∈ Q, define pa(x|q) ≜ p̂a(x|q) when x ∈ L[q] ∩ Σ, pa(φ|q) ≜ 1 − ∑x∈L[q]∩Σ p̂a(x|q), and pa(x|q) ≜ 0 otherwise. The companion distribution is thus defined solely over the transitions E[q].
For Equation (10) to be well-defined, we need d(p[e], n[e]) > 0. To ensure this condition, we restrict 𝒫(A) to contain distributions such that pa(x|q) ≥ ϵ for each x ∈ L[q].8
Given an unweighted deterministic, backoff-complete automaton A, our goal is to find the target distribution p̂a ∈ 𝒫*(A) that has the minimum KL divergence from our source probability model ps.
Lemma 3
Proof. This follows immediately from Lemma 2.□
The requirement that the p* of Lemma 3 is in 𝒫*(A) will be true if, for instance, the target has no failure transitions or if the source and target are both φ-WFAs with the same topology and failure transitions. In general, this requirement cannot be assured. While membership in 𝒫(A) principally requires the weights of the transitions leaving a state are non-negative and sum to 1, membership in 𝒫*(A) imposes additional constraints due to the failure transitions, indicated in Equation (11).
As such, we restate our goal in terms of the companion distribution pa ∈ 𝒫(A) rather than its corresponding distribution p̂a ∈ 𝒫*(A) directly. Let Bn(q) be the set of states in A that back off to state q in n failure transitions and let B(q) = ∪n Bn(q).
Lemma 4
Substituting these results into Equation (14) proves the lemma.□
If there are no failure transitions in the target automaton, then C(x, q) = c(x, q), if x is in Σ, and is 0 otherwise. In this case, the statement of Lemma 4 simplifies to that of Corollary 1.
Unfortunately, we do not have a closed-form solution, analogous to Lemma 2 or Lemma 3, for the general φ-WFA case. Instead we will present numerical optimization algorithms to find the KL divergence minimum in the next section. This is aided by the observation that the quantity in braces in the statement of Lemma 4 depends on the distribution pa only at state q. Thus the KL divergence minimum can be found by maximizing that quantity independently for each state.
4. Algorithms
Approximating a probabilistic source algorithmically as a WFA requires two steps: (1) compute the quantity c(x, qa) found in Corollary 1 and Lemma 2 or C(x, q) in Lemma 4 and (2) use this quantity to find the minimum KL divergence solution. The first step, which we will refer to as counting, is covered in the next section and the KL divergence minimization step is covered afterwards, followed by an explicit comparison of the presented algorithms with closely related prior work.
4.1 Counting
How the counts are computed will depend on the form of the source and target models. We break this down into several cases.
4.1.1 WFA Source and Target.
The worst-case time complexity for the counting step is dominated by the shortest distance algorithm on the intersection S ∩ A. The shortest distance computation is a meta-algorithm that depends on the queue discipline selected (Mohri 2002). If s is the maximum number of times a state in the intersection is inserted into the shortest distance queue, C the maximum cost of a queue operation, and D the maximum out-degree in S and A (both assumed deterministic), then the algorithm runs in O(s(D + C)|QS||QA|) (Mohri 2002; Allauzen and Riley 2018). The worst-case space complexity is in O(D|QS||QA|), determined by the intersection size.
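For intuition, here is a sketch (ours, not the library's implementation) of the quantity computed in this step: γ(qs, qa) is the total source probability mass of string prefixes whose paths reach the pair (qs, qa), which is a single-source shortest-distance computation over the product S × A in the (+, ×) semiring. The sketch assumes both machines are deterministic, uses dictionary encodings as in the earlier sketches (the target here is unweighted), and assumes the forward sums converge.

```python
from collections import defaultdict, deque

def joint_state_mass(s_arcs, s_init, a_arcs, a_init, tol=1e-12):
    """Forward mass gamma[(qs, qa)]: total probability, under the source S,
    of prefixes whose paths reach source state qs and target state qa.
    `s_arcs`: (state, symbol) -> (prob, next_state) for the source WFA.
    `a_arcs`: (state, symbol) -> next_state for the unweighted target DFA.
    Generic shortest-distance relaxation on the product S x A."""
    out = defaultdict(list)                    # source arcs indexed by state
    for (q, x), (w, nq) in s_arcs.items():
        out[q].append((x, w, nq))
    gamma = defaultdict(float)                 # accumulated mass
    residual = defaultdict(float)              # mass not yet propagated
    start = (s_init, a_init)
    gamma[start] = residual[start] = 1.0       # the empty prefix
    queue, pending = deque([start]), {start}
    while queue:
        qs, qa = queue.popleft()
        pending.discard((qs, qa))
        mass, residual[(qs, qa)] = residual[(qs, qa)], 0.0
        for x, w, ns in out[qs]:
            if (qa, x) not in a_arcs or mass * w <= tol:
                continue
            key = (ns, a_arcs[(qa, x)])
            gamma[key] += mass * w
            residual[key] += mass * w
            if key not in pending:
                queue.append(key)
                pending.add(key)
    return dict(gamma)
```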
4.1.2 φ-WFA Source and Target.
We can simplify this computation by the following transformation. First we convert S ∩ A to an equivalent WFA by replacing each failure transition with an epsilon transition and introducing a negatively weighted transition to compensate for formerly disallowed paths (Allauzen and Riley 2018). The result is then promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated positive transition (see Figure 4).9 Algorithms to efficiently compute the intersection and shortest distance on φ-WFAs are available in the OpenGrm libraries (Allauzen and Riley 2018).
A φ-WFA is transformed into an equivalent WFA by replacing each failure transition by an ϵ-transition. To compensate for the formerly disallowed paths, new (dashed red) negatively weighted transitions are added. The result is promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated positive transition.
The worst-case time and space complexity for the counting step for φ-WFAs is the same as for WFAs (Allauzen and Riley 2018).
4.1.3 Arbitrary Source and φ-WFA Target.
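The counting step in this case is the sampling-based one used by WFA-SampleApprox(N) in Section 5: draw N sentences from the source and, walking each sentence through the target topology, accumulate how often each symbol is emitted while the prefix has led the topology to each state. The sketch below is ours and deliberately simplified; the encodings and the exact quantities the library accumulates are assumptions.

```python
from collections import defaultdict

def sampled_counts(sample_from_source, a_arcs, phi, a_init, num_samples):
    """Monte Carlo estimate of the counts c(x, q) for a backoff-complete
    target topology.  `sample_from_source()` returns one $-terminated list
    of symbols; `a_arcs` maps (state, symbol) -> next_state; `phi` maps
    state -> backoff_state.  Every symbol must be readable after enough
    backoffs (e.g., at a full-alphabet lowest-order state)."""
    counts = defaultdict(float)
    for _ in range(num_samples):
        state = a_init
        for symbol in sample_from_source():
            counts[(symbol, state)] += 1.0 / num_samples
            while (state, symbol) not in a_arcs:   # back off until readable
                state = phi[state]
            state = a_arcs[(state, symbol)]        # then read the symbol
    return counts
```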
4.2 KL Divergence Minimization
4.2.1 WFA Target.
When the target topology is a deterministic WFA, we use c(x, qa) from the previous section and Lemma 2 to immediately find the minimum KL divergence solution.
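Assuming, as the separate count and normalize steps of Section 6.2 also suggest, that the closed form of Lemma 2 is per-state relative-frequency normalization of the expected counts, the weighting step reduces to the following sketch (ours), which consumes counts keyed by (symbol, state) as in the earlier sketches.

```python
from collections import defaultdict

def normalize_counts(counts):
    """Per-state relative-frequency normalization: p_a(x | q) is the count
    of (x, q) divided by the total count at q."""
    totals = defaultdict(float)
    for (symbol, state), value in counts.items():
        totals[state] += value
    return {(symbol, state): value / totals[state]
            for (symbol, state), value in counts.items() if totals[state] > 0}
```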
4.2.2 φ-WFA Target.
When the target topology is a deterministic, backoff-complete φ-WFA, Lemma 3 can be applied in some circumstances to find the minimum KL divergence solution, but not in general. However, as noted before, the quantity in braces in the statement of Lemma 4 depends on the distribution pa only at state q, so the minimum KL divergence D(ps||p̂a) can be found by maximizing that quantity independently for each state.
The following lemma provides the solution to the optimization problem in Equation (22) that leads to a stationary point of the objective.
Lemma 5
From this, we form the KL-Minimization algorithm in Figure 5. Observe that if all the counts are zero, then u(y) − v(y) = 0 for any y, every solution is optimal, and the algorithm returns a uniform distribution over labels. Otherwise, we initialize the model based on counts such that y0 ∈ Ω. We then repeat the DC programming algorithm iteratively until convergence. Because Ω is a convex compact set and the functions u, v, and ∇v are continuous and differentiable in Ω, KL-Minimization converges to a stationary point (Sriperumbudur and Lanckriet 2009, Theorem 4). For each state q, the computational complexity of KL-Minimization is in O(|E[q]|) per iteration. Hence, if the maximum number of iterations per state is s, the overall computational complexity of KL-Minimization is in O(s|E|).
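The per-state structure just described can be sketched as follows (ours); the feasible set Ω is only approximated by an ε floor, and the concrete update of Lemma 5 is hidden behind a placeholder solve_dc_subproblem that we introduce purely for illustration.

```python
import numpy as np

def kl_minimization_state(counts_q, solve_dc_subproblem,
                          epsilon=1e-6, max_iters=100, tol=1e-9):
    """Per-state loop of KL-Minimization (sketch): with all-zero counts
    return the uniform distribution; otherwise initialize from the counts
    and iterate the DC update until convergence.  `counts_q` is a vector
    of counts over the state's outgoing labels; `solve_dc_subproblem(y, c)`
    stands in for the convex subproblem solved at each iteration."""
    counts_q = np.asarray(counts_q, dtype=float)
    n = len(counts_q)
    if counts_q.sum() <= 0:
        return np.full(n, 1.0 / n)            # all counts zero: uniform
    y = np.maximum(counts_q / counts_q.sum(), epsilon)
    y /= y.sum()                              # feasible, count-based start
    for _ in range(max_iters):
        y_next = solve_dc_subproblem(y, counts_q)
        if np.max(np.abs(y_next - y)) < tol:  # converged to a stationary point
            return y_next
        y = y_next
    return y
```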
4.3 Discussion
Our method for approximating from source φ-WFAs, specifically from backoff k-gram models, shares some similarities with entropy pruning (Stolcke 2000). Both can be used to approximate onto a more compact k-gram topology and both use a KL-divergence criterion to do so. They differ in that entropy pruning, by sequentially removing transitions, selects the target topology and in doing so uses a greedy rather than global entropy reduction strategy. We will empirically compare these methods in Section 5.1 for this use case.
Perhaps the closest to our work on approximating arbitrary sources is that of Deoras et al. (2011), where they used an RNN LM to generate samples that they then used to train a k-gram LM. In contrast, we use samples to approximate the joint state probability γ(qs, qa). If ℓ is the average number of words per sentence and N is the total number of sampled sentences, then the time complexity of Deoras et al. (2011) is 𝒪(Nℓ|Σ|) as it takes 𝒪(|Σ|) time to generate a sample from the neural model for every word. This is the same as that of our algorithm in Section 4.1.3. However, our approach has several advantages. First, for a given topology, our algorithm provably finds the best KL minimized solution, whereas their approach is optimal only when the size of the k-gram model tends to infinity. Second, as we show in our experiments, the proposed algorithm performs better compared with that of Deoras et al. (2011).
Arisoy et al. (2014) and Adel et al. (2014) both trained DNN models of different orders as the means to derive probabilities for a k-gram model, rather than focusing on model approximation as the above-mentioned methods do. Further, the methods are applicable to k-gram models specifically, not the larger class of target WFA to which our methods apply.
5. Experiments
We now provide experimental evidence of the theory’s validity and show its usefulness in various applications. For the ease of notation, we use WFA-Approx to denote the exact counting algorithm described in Section 4.1.2 followed by the KL-Minimization algorithm of Section 4.2. Similarly, we use WFA-SampleApprox(N) to denote the sampled counting described in Section 4.1.3 with N sampled sentences followed by KL-Minimization.
We first give experimental evidence that supports the theory in Section 5.1. We then show how to approximate neural models as WFAs in Section 5.2. We also use the proposed method to provide lower bounds on the perplexity given a target topology in Section 5.3. Finally, we present experiments in Section 5.4 addressing scenarios with source WFAs, hence not requiring sampling. First, motivated by low-memory applications such as (virtual) keyboard decoding (Ouyang et al. 2017), we use our approach to create compact language models in Section 5.4.1. Then, in Section 5.4.2, we use our approach to create compact open-vocabulary character language models from count-thresholded n-grams.
For most experiments (unless otherwise noted) we use the 1996 CSR Hub4 Language Model data, LDC98T31 (English Broadcast News data). We use the processed form of the corpus and further process it to downcase all the words and remove punctuation. The resulting data set has 132M words in the training set, 20M words in the test set, and 240K unique words. For all the experiments that use word models, we create a vocabulary of approximately 32K words consisting of all words that appeared more than 50 times in the training corpus. Using this vocabulary, we create a trigram Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke 2000). We use this pruned model as a baseline in all our word-based experiments. We use Katz smoothing because it is amenable to pruning (Chelba et al. 2010). The perplexity of this model on the test set is 144.4.11 All algorithms were implemented using the open-source OpenFst and OpenGrm n-gram and stochastic automata (SFst) libraries,12 with the last library including these implementations (Allauzen et al. 2007; Roark et al. 2012; Allauzen and Riley 2018).
5.1 Empirical Evidence of Theory
Recall that our goal is to find the distribution on a target DFA topology that minimizes the KL divergence to the source distribution. However, as stated in Section 4.2, if the target topology has failure transitions, the optimization objective is not convex so the stationary point solution may not be the global optimum. We now show that the model indeed converges to a good solution in various cases empirically.
Idempotency.
When the target topology is the same as the source topology, we show that the performance of the approximated model matches the source model. Let ps be the pruned Katz word model described above. We approximate ps onto the same topology using WFA-Approx and WFA-SampleApprox(⋅) and then compute perplexity on the test corpus. The results are presented in Figure 6. The test perplexity of the WFA-Approx model matches that of the source model and the performance of the WFA-SampleApprox(N) model approaches that of the source model as the number of samples N increases.
Idempotency: Test set perplexity for Katz baseline and approximations of that baseline trained on the same data. The Katz baseline and Katz WFA-Approx plots are identical (and thus overlap), while WFA-SampleApprox(N) plot converges to the baseline.
Comparison to Greedy Pruning.
Recall that entropy pruning (Stolcke 2000) greedily removes n-grams such that the KL divergence to the original model ps is small. Let pgreedy be the resulting model and Agreedy be the topology of pgreedy. If the KL-Minimization converges to a good solution, then approximating ps onto Agreedy would give a model that is at least as good as pgreedy. We show that this is indeed the case; in fact, approximating ps onto Agreedy performs better than pgreedy. As before, let ps again be the 2M n-gram Katz model described above. We prune it to have 1M n-grams and obtain pgreedy, which has a test perplexity of 157.4. We then approximate the source model ps on Agreedy, the topology of pgreedy. The resulting model has a test perplexity of 155.7, which is smaller than the test perplexity of pgreedy. This shows that the approximation algorithm indeed finds a good solution. We repeat the experiment with different pruning thresholds and observe that the approximation algorithm provides a good solution across the resulting model sizes. The results are in Table 1.
Test perplexity of greedy pruning and approximated models.
| Model size | Greedy pruning | Approximated model |
|---|---|---|
| 250K | 205.7 | 198.3 |
| 500K | 177.3 | 173.0 |
| 1M | 157.4 | 155.7 |
| 1.5M | 149.0 | 148.4 |
5.2 Neural Models to WFA Conversion
Since neural models such as LSTMs give improved performance over n-gram models, we investigate whether an LSTM distilled onto a WFA model can obtain better performance than the baseline WFA trained directly from Katz smoothing. As stated in the Introduction, this could then be used together with federated learning for fast and private on-device inference, which was done in Chen et al. (2019) using the methods presented here.
To explore this, we train an LSTM language model on the training data. The model has 2 LSTM layers with 1,024 states and an embedding size of 1,024. The resulting model has a test perplexity of 60.5. We approximate this model as a WFA in three ways.
The first way is to construct a Katz n-gram model on N LSTM samples and entropy-prune it to 2M n-grams, which we denote by WFA-SampleKatz(N). This approach is very similar to that of Deoras et al. (2011), except that we use Katz smoothing instead of Kneser-Ney smoothing. We used Katz because, as stated earlier, Kneser-Ney models are not amenable to pruning (Chelba et al. 2010). The second way is to approximate onto the baseline Katz 2M n-gram topology described above using WFA-SampleApprox(N). We refer to this experiment as WFA-SampleApprox(N) (KT), where KT stands for known topology. The results are shown in Figure 7. The WFA-SampleKatz(N) models perform significantly worse than the baseline Katz model even at 32M samples, while the WFA-SampleApprox(N) models have better perplexity than the baseline Katz model with as few as 1M samples. With 32M samples, this way of approximating the LSTM model as a WFA is 3.6 perplexity points better than the baseline Katz.
Approximated Neural Models for the English Broadcast News corpus. Test set perplexity for Katz baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz model (WFA-SampleKatz(N)). The remaining two use our approximation algorithm (WFA-SampleApprox(N)), but with different topologies. One topology (KT) is known, using the baseline Katz topology, and one (UT) is unknown, using the samples drawn for WFA-SampleApprox(N) to also infer the topology.
Finally, if the WFA topology is unknown, we use the samples obtained in WFA-SampleApprox(⋅) to create a Katz model entropy-pruned to 2M n-grams. We refer to this experiment as WFA-SampleApprox(N) (UT), where UT stands for unknown topology. The results are shown in Figure 7. The approximated models obtained by WFA-SampleApprox(N) (UT) perform better than WFA-SampleKatz(N), though not as well as WFA-SampleApprox(N) (KT), the models obtained with the known topology derived from the training data. Still, with enough samples, their performance is similar to that of the original Katz model.
We then compare our approximation method with other methods that approximate DNNs with n-gram LMs (Arisoy et al. 2014; Adel et al. 2014). Both these methods require training DNNs of different orders and we used DNNs with two layers with 1,024 nodes and an embedding size of 1,024. The results are in Table 2. The proposed algorithms WFA-SampleApprox(⋅) perform better than the existing approaches.
Test perplexity of k-gram models obtained by different approximation methods for the English Broadcast News corpus. We use 32M samples for both WFA-SampleKatz(⋅) and WFA-SampleApprox(⋅).
| Model | Test perplexity |
|---|---|
| Katz | 144.4 |
| WFA-SampleKatz(⋅) | 151.8 |
| Arisoy et al. (2014) | 159.6 |
| Adel et al. (2014) | 146.7 |
| WFA-SampleApprox(⋅) (KT) | 140.8 |
| WFA-SampleApprox(⋅) (UT) | 143.9 |
Experiment on Polish.
To repeat the above experiment on a different language and corpus, we turn to the Polish language section13 of the Europarl corpus (Koehn 2005). We chose Polish for this follow-up experiment because it belongs to a different language family than English, is relatively highly inflected (unlike English), and is found in a publicly available corpus. We use the processed form of the corpus and further process it to downcase all the words and remove punctuation. The resulting data set has approximately 13M words in the training set and 210K words in the test set. The selected vocabulary has approximately 30K words, consisting of all words that appeared more than 20 times in the training corpus. Using this vocabulary, we create a trigram Katz model and prune it to contain 2M n-grams using entropy pruning (Stolcke 2000). We use this pruned model as a baseline. The results are in Figure 8. The trend is similar to that of the English Broadcast News corpus, with the proposed algorithm WFA-SampleApprox(⋅) performing better than the other methods. We also compare the proposed algorithms with other neural approximation algorithms. The comparison results are shown in Table 3.
Approximated Neural Models for the Polish (Europarl) corpus. Test set perplexity for Katz baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz model (WFA-SampleKatz(N)). The remaining two use our approximation algorithm (WFA-SampleApprox(N)), but with different topologies. One topology (KT) is known, using the baseline Katz topology, and one (UT) is unknown, using the samples drawn for WFA-SampleApprox(N) to also infer the topology.
Test perplexity of k-gram models obtained by different approaches for the Polish (Europarl) corpus. We use 32M samples for both WFA-SampleKatz(⋅) and WFA-SampleApprox(⋅).
| Model | Test perplexity |
|---|---|
| Katz | 185.5 |
| WFA-SampleKatz(⋅) | 189.3 |
| Arisoy et al. (2014) | 197.8 |
| Adel et al. (2014) | 187.1 |
| WFA-SampleApprox(⋅) (KT) | 181.4 |
| WFA-SampleApprox(⋅) (UT) | 177.0 |
5.3 Lower Bounds on Perplexity
Best Approximation for Target Topology.
The neural model in Section 5.2 has a perplexity of 60.5, but the best perplexity for the approximated model is 140.8. Is there a better approximation algorithm for the given target topology? We place bounds on that next.
Approximation onto Best Trigram Topology.
If we approximate the LSTM onto the best trigram topology, how well does it perform over the test data? To test this, we build a trigram model from the test data and approximate the LSTM on the trigram topology. This approximated model has 11M n-grams and a perplexity of 81. This shows that for large data sets, the largest shortfall of n-gram models in the approximation is due to the n-gram topology.
5.4 WFA Sources
While distillation of neural models is an important use case for our algorithms, there are other scenarios where approximations will be derived from WFA sources. In such cases, no sampling is required and we can use the algorithms presented in Sections 4.1.1 or 4.1.2, namely, WFA-Approx. For example, it is not uncommon for applications to place maximum size restrictions on models, so that existing WFA models are too large and must be reduced in size prior to use. Section 5.4.1 presents a couple of experiments focused on character language model size reduction that were motivated by such size restrictions. Another scenario with WFA sources is when, for privacy reasons, no language model training corpus is available in a domain, but only minimum-count-thresholded (i.e., high frequency) word n-gram counts are provided. In Section 5.4.2 we examine the estimation of open-vocabulary character language models from such data.
5.4.1 Creating Compact Language Models.
Creating Compact Models for Infrequent Words.
In low-memory applications such as on-device keyboard decoding (Ouyang et al. 2017), it is often useful to have a character-level WFA representation of a large set of vocabulary words that act only as unigrams, for example, those words beyond the 32K words of our trigram model. We explore how to compactly represent such a unigram-only model.
To demonstrate our approach, we take all the words in the training set (without a count cutoff) and build a character-level deterministic WFA of those words weighted by their unigram probabilities. This is represented as a tree rooted at the initial state (a trie). This automaton has 820K transitions. Storing this many transitions can be prohibitive; we can reduce the size in two steps.
The first step is to minimize this WFA using weighted minimization (Mohri 2000) to produce pchar, which has a topology Achar. Although pchar is already much smaller (it has 378K transitions, a 54% reduction), we can go further by approximating onto the minimal deterministic unweighted automaton, Minimize(Achar). This gives us a model with only 283K transitions, a further 25% reduction. Because Minimize(Achar) accepts exactly the same words as Achar, we are not corrupting our model by adding or removing any vocabulary items. Instead, we find an estimate that is as close as possible to the original, but that is constrained to the minimal deterministic representation that preserves the vocabulary.
To evaluate this approach, we randomly select a 20K sentence subset of the original test set, and represent each selected string as a character-level sequence. We evaluate using cross entropy in bits-per-character, common for character-level models. The resulting cross entropy for pchar is 1.557 bits-per-character. By comparison, the cross entropy for pchar approximated onto Minimize(Achar) is 1.560 bits-per-character. In exchange for this small accuracy loss we are rewarded with a model that is 25% smaller. Wolf-Sonkin et al. (2019) used the methods presented here to augment the vocabulary of an on-device keyboard to deal with issues related to a lack of standard orthography.
Creating Compact WFA Language Models.
Motivated by the previous experiment, we also consider applying (unweighted) minimization to Agreedy, the word-based trigram topology pruned to 1M n-grams described earlier. In Table 4 we show that applying minimization to Agreedy and then approximating onto the resulting topology leads to a reduction of 7% in the number of transitions needed to represent the model. However, the test perplexity also increases slightly. To control for this, we prune the original model to a 1.08M n-gram topology A′greedy instead of the 1M as before and apply the same procedure to obtain an approximation on Minimize(A′greedy). We achieve a 0.4% perplexity reduction compared to the approximation on Agreedy with very nearly the same number of transitions.
Test perplexity of models when approximated onto smaller topologies.
| Topology | Test perplexity | # Transitions |
|---|---|---|
| Agreedy | 155.7 | 1.13M |
| Minimize(Agreedy) | 156.4 | 1.05M |
| A′greedy | 154.1 | 1.22M |
| Minimize(A′greedy) | 154.9 | 1.13M |
5.4.2 Count Thresholded Data for Privacy.
One increasingly common scenario that can benefit from these algorithms is modeling from frequency thresholded substring counts rather than raw text. For example, word n-grams and their frequencies may be provided from certain domains of interest only when they occur within at least k separate documents. With a sufficiently large k (say 50), no n-gram can be traced to a specific document, thus providing privacy in the aggregation. This is known as k-anonymity (Samarati 2001).
However, for any given domain, there are many kinds of models that one may want to build depending on the task, some of which may be trickier to estimate from such a collection of word n-gram counts than with standard approaches that estimate from a given corpus. For example, character n-gram models can be of high utility for tasks like language identification, and have the benefit of a relatively small memory footprint and low latency in use, yet they may be harder to learn from k-anonymized word n-gram counts than from raw text.
Here we will compare open-vocabulary character language models, which accept all strings in Σ* for a character vocabulary Σ, trained in several ways. Each approach relies on the training corpus and 32k vocabulary, with every out-of-vocabulary word replaced by a single OOV symbol ★. Additionally, for each approach we add 50 to the unigram character count of any printable ASCII character, so that even those that are unobserved in the words of our 32k vocabulary have some observations. Our three approaches are:
Baseline corpus trained models: We count character 5-grams from the k-anonymized corpus, then remove all n-grams that include the ★ symbol (in any position) prior to smoothing and normalization. Here we present both Kneser-Ney and Witten Bell smoothed models, as both are popular for character n-gram models.
Word trigram sampled model: First we count word trigrams in the k-anonymized corpus and discard any n-gram with the ★ symbol (in any position) prior to smoothing and normalization. We then sample one million strings from a Katz smoothed model and build a character 5-gram model from these strings. We also use this as our target topology for the next approach.
Word trigram KL minimization estimation: We create a source model by converting the 2M n-gram word trigram model into an open vocabulary model. We do this using a specialized construction related to the construction presented in Section 6 of Chen et al. (2019), briefly described below, that converts the word model into a character sequence model. As this model is still closed vocabulary (see below), we additionally smooth the unigram distribution with a character trigram model trained from the words in the symbol table (and including the 50 extra counts for every printable ASCII character as with the other methods). From this source model, we estimate a model on the sampled character 5-gram topology from the previous approach, using our KL minimization algorithm.
Converting Word n-gram Source to Character Sequence Source Model.
Briefly, for every state q in the n-gram automaton, the set of words labeling transitions leaving q are represented as a trie of characters including a final end-of-word symbol. Each resulting transition labeled with the end-of-word symbol represents the last transition for that particular word spelled out by that sequence of transitions, hence is assigned the same next state as the original word transition. If q has a backoff transition pointing to its backoff state q′, then each new internal state in the character trie backs off to the corresponding state in the character trie leaving q′. The presence of the corresponding state in the character trie leaving q′ is guaranteed because the n-gram automaton is backoff-complete, as discussed in Section 3.3.
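A sketch of the closed-vocabulary core of this construction (ours; weights are omitted and the trie-node encoding is purely illustrative) is given below; the open-vocabulary extension is described next.

```python
EOW = "#"   # end-of-word marker (name is illustrative)

def word_to_char_automaton(word_arcs, phi):
    """Spell out each word arc as a character path in a per-state trie.
    `word_arcs`: (state, word) -> next_state in the word n-gram automaton;
    `phi`: state -> backoff_state.  The end-of-word transition reuses the
    word arc's next state; each internal trie node backs off to the node
    reached by the same character prefix in the backoff state's trie,
    which exists because the word automaton is backoff-complete."""
    char_arcs, char_phi, trie = {}, {}, {}

    def node(q, prefix):                      # one trie node per (state, prefix)
        return trie.setdefault((q, prefix), ("trie", q, prefix))

    for (q, word), next_state in word_arcs.items():
        src, prefix = q, ""
        for ch in word:
            prefix += ch
            dst = node(q, prefix)
            char_arcs[(src, ch)] = dst
            src = dst
        char_arcs[(src, EOW)] = next_state    # last transition of the word
    for (q, prefix), state in trie.items():   # trie-internal backoff links
        if q in phi and (phi[q], prefix) in trie:
            char_phi[state] = trie[(phi[q], prefix)]
    for q in phi:                             # word states keep their backoffs
        char_phi[q] = phi[q]
    return char_arcs, char_phi
```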
As stated above, this construction converts from word sequences to character sequences, but will only accept character sequences consisting of strings of in-vocabulary words, that is, this is still closed vocabulary. To make it open vocabulary, we further back off the character trie leaving the unigram state to a character n-gram model estimated from the symbol table (and additional ASCII character observations). This is done using a very similar construction to that described above. The resulting model is used as the source model for our KL minimization algorithm, to estimate a distribution over the sampled character 5-gram topology.
We encode the test set as a sequence of characters, without using the symbol table because our models are intended to be open vocabulary. Following typical practice for open-vocabulary settings, we evaluate with bits-per-character. The results are presented in Table 5. Here we achieve slightly lower bits-per-character than even what we get straight from the corpus, perhaps due to better regularization of the word-based model than with either Witten-Bell or Kneser-Ney on the character n-grams.
Comparison of character 5-gram models derived from either the original corpus or a word trigram model. Size of the models is presented in terms of the number of character n-grams, the numbers of states and transitions in the automaton representation, and the file size in MB. The two corpus estimated models have the same topology, hence the same size; as do the two word trigram estimated models.
| Source | n-grams (×1000) | states (×1000) | transitions (×1000) | MB | Estimation | bits/char |
|---|---|---|---|---|---|---|
| Corpus | 336 | 60 | 381 | 6.5 | Kneser-Ney | 2.04 |
| | | | | | Witten-Bell (WB) | 2.01 |
| Word trigram | 277 | 56 | 322 | 5.6 | Sampled (WB) | 2.36 |
| | | | | | KL min | 1.99 |
6. Software Library
All of the algorithms presented here are available in the OpenGrm libraries at http://www.opengrm.org under the topic: SFST Library: operations to normalize, sample, combine, and approximate stochastic finite-state transducers. We illustrate the use of this library by showing how to implement some of the experiments in the previous section.
6.1 Example Data and Models
Instead of Broadcast News, we will use the text of Oscar Wilde’s The Importance of Being Earnest for our tutorial example. This is a small tutorial corpus that we make available at http://sfst.opengrm.org.
This corpus of approximately 1,688 sentences and 18,000 words has been upper-cased and had punctuation removed. The first 850 sentences were used as training data and the remaining 838 sentences as test data. From these, we produce two 1,000-word vocabulary Katz bigram models, the ∼6k n-gram earnest_train.mod and the ∼4k n-gram earnest_test.mod. We also used relative-entropy pruning to create the ∼2k n-gram earnest_train.pru. The data, the steps to generate these models, and how to compute their perplexity using OpenGrm NGram are all fully detailed in the QuickTour topic at http://sfst.opengrm.org.
6.2 Computing the Approximation
The following step shows how to compute the approximation of a φ-WFA model onto a φ-WFA topology. In the example below, the first argument, earnest_train.mod, is the source model, and the second argument, earnest_train.pru, provides the topology. The result is a φ-WFA whose perplexity can be measured as before.
$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru \
    > earnest_train.approx
An alternative, equivalent way to perform this approximation is to break it into two steps, with the counting and normalization (KL divergence minimization) done separately.
$ sfstcount -phi_label=0 earnest_train.mod earnest_train.pru \
    > earnest_train.approx_cnts
$ sfstnormalize -method=kl_min -phi_label=0 \
    earnest_train.approx_cnts > earnest_train.approx
We can now use these utilities to run some experiments analogous to our larger Broadcast News experiments by using different target topologies. The results are shown in Table 6. We see in the idempotency experiment that the perplexity of the approximation on the same topology matches the source. In the greedy-pruning experiment, the approximation onto the greedy-pruned topology yields a better perplexity than the greedily-pruned model. Finally, the approximation onto the test-set bigram topology gives a better perplexity than the training-set model since we include all the relevant bigrams.
Perplexities for example experiments.
| Experiment | Source Model | Topology | Approx. Perplexity |
|---|---|---|---|
| Idempotency | earnest_train.mod | earnest_train.mod | 73.41 |
| Comparison to greedy pruning | earnest_train.mod | earnest_train.pru | 83.12 |
| Approx. onto best bigram topology | earnest_train.mod | earnest_test.mod | 69.68 |
6.3 Available Operations
Table 7 lists the available command-line operations in the OpenGrm SFst library. We show command-line utilities here; there are corresponding C++ library functions that can be called from within a program; see http://sfst.opengrm.org.
Available Operations in the OpenGrm SFst Library.
| Operation | Description |
|---|---|
| sfstapprox | Approximates a stochastic φ-WFA wrt a backoff-complete φ-WFA. |
| sfstcount | Counts from a stochastic φ-WFA wrt a backoff-complete φ-WFA. |
| sfstintersect | Intersects two φ-WFAs. |
| sfstnormalize -method=global | Globally normalizes a φ-WFA. |
| sfstnormalize -method=kl_min | Normalizes a count φ-WFA using KL divergence minimization. |
| sfstnormalize -method=local | Locally normalizes a φ-WFA. |
| sfstnormalize -method=phi | φ-normalizes a φ-WFA. |
| sfstperplexity | Computes perplexity of a stochastic φ-WFA. |
| sfstrandgen | Randomly generates paths from a stochastic φ-WFA. |
| sfstshortestdistance | Computes the shortest distance on a φ-WFA. |
| sfsttrim | Removes useless states and transitions in a φ-WFA. |
7. Summary and Discussion
In this article, we have presented an algorithm for minimizing the KL divergence between a probabilistic source model over sequences and a WFA target model. That our algorithm is general enough to permit source models of arbitrary form (e.g., RNNs) is particularly important; deriving an appropriate WFA topology is something we do not really touch on in this article.
Notes
Model size or efficiency of inference may be a common motivation for the model approximations we present, which may suggest a demonstration of the speed/accuracy tradeoff for some downstream use of the models. However, as stated earlier in this Introduction, there are multiple reasons why an approximation may be needed, and our goal in this article is to establish that our methods provide a well-motivated approach should an approximation be required for any reason.
We define x0 ≜ ϵ, the empty string, and adopt p(ϵ) = 0.
In the most general case, q(xn) = xn.
If |Q(ps)| is finite, it can be shown that H(ps) is necessarily finite.
γ is not (necessarily) a probability distribution over Qs × Qa. In fact, γ(qs, qa) can be greater than 1.
In other words, a φ label, like an ϵ label, acts as an identity element in string concatenation (Allauzen and Riley 2018).
When A is a φ-WFA, 𝒫(A) is defined by interpreting A as a WFA with the failure labels treated as regular symbols.
For brevity, we do not include ϵ in the notation of 𝒫(A).
We fix some total order on Σ ∪ {φ} so that y is well-defined.
For all perplexity measurements, we treat the unknown word as a single token instead of a class. To compute the perplexity with the unknown token treated as a class, multiply the perplexity by k^0.0115, where k is the number of tokens in the unknown class and 0.0115 is the out-of-vocabulary rate in the test data set.
These libraries are available at www.openfst.org and www.opengrm.org.
References
Author notes
Some of the results in this article were previously presented in Suresh et al. (2019).