Abstract

Weighted finite automata (WFAs) are often used to represent probabilistic models, such as n-gram language models, because, among other things, they are efficient for recognition tasks in both time and space. The probabilistic source to be represented as a WFA, however, may come in many forms. Given a generic probabilistic model over sequences, we propose an algorithm to approximate it as a WFA such that the Kullback-Leibler divergence between the source model and the WFA target model is minimized. The proposed algorithm involves a counting step and a difference of convex optimization step, both of which can be performed efficiently. We demonstrate the usefulness of our approach on various tasks, including distilling n-gram models from neural models, building compact language models, and building open-vocabulary character models. The algorithms used for these experiments are available in an open-source software library.

1. Introduction

Given a sequence of symbols x1, x2, …, xn−1, where symbols are drawn from the alphabet Σ, a probabilistic model S assigns probability to the next symbol xn ∈ Σ by
$p_S[x_n \mid x_{n-1} \ldots x_1]$
Such a model might be Markovian, where
$p_S[x_n \mid x_{n-1} \ldots x_1] = p_S[x_n \mid x_{n-1} \ldots x_{n-k+1}]$
such as a k-gram language model (LM) (Chen and Goodman 1998) or it might be non-Markovian such as a long short-term memory (LSTM) neural network LM (Sundermeyer, Schlüter, and Ney 2012). Our goal is to approximate a probabilistic model as a weighted finite automaton (WFA) such that the weight assigned by the WFA is close to the probability assigned by the source model. Specifically, we will seek to minimize the Kullback-Leibler (KL) divergence between the source S and the target WFA model.
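The Markov property above can be illustrated with a minimal sketch (not the models or library of this article): a bigram (k = 2) model whose next-symbol distribution depends only on the single previous symbol, however long the history. The toy corpus, `^` initial-context marker, and smoothing constant are all hypothetical choices for illustration.

```python
# A minimal sketch of a Markovian (here bigram, k=2) next-symbol model.
# Corpus, context marker '^', and smoothing value are hypothetical.
from collections import Counter, defaultdict

class BigramLM:
    """p[x_n | x^{n-1}] depends only on the last symbol x_{n-1}."""
    def __init__(self, corpus, alphabet, smoothing=0.1):
        self.alphabet = alphabet
        self.smoothing = smoothing
        self.counts = defaultdict(Counter)
        for sent in corpus:
            prev = "^"                    # initial-context marker
            for sym in sent:
                self.counts[prev][sym] += 1
                prev = sym

    def prob(self, history, sym):
        # Only history[-1] matters: the Markov property in the text.
        prev = history[-1] if history else "^"
        c = self.counts[prev]
        total = sum(c.values()) + self.smoothing * len(self.alphabet)
        return (c[sym] + self.smoothing) / total

lm = BigramLM(["aab$", "ab$"], alphabet=list("ab$"))
# Distant context is ignored, and each conditional distribution sums to 1.
assert lm.prob("aa", "b") == lm.prob("ba", "b")
assert abs(sum(lm.prob("a", x) for x in "ab$") - 1.0) < 1e-9
```

An LSTM LM, by contrast, would condition on the entire history through its hidden state, so the first assertion would not hold in general.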

Representing the target model as a WFA has many advantages, including efficient use, compact representation, interpretability, and composability. WFA models have been used in many applications including speech recognition (Mohri, Pereira, and Riley 2008), speech synthesis (Ebden and Sproat 2015), optical character recognition (Breuel 2008), machine translation (Iglesias et al. 2011), computational biology (Durbin et al. 1998), and image processing (Albert and Kari 2009). One particular problem of interest is language modeling for on-device (virtual) keyboard decoding (Ouyang et al. 2017), where WFA models are widely used because of space and time constraints. One example use of methods we present in this article comes from this domain, involving approximation of a neural LM. Storing training data from the actual domain of use (individual keyboard activity) in a centralized server and training k-gram or WFA models directly may not be feasible because of privacy constraints (Hard et al. 2018). To circumvent this, an LSTM model can be trained by federated learning (Konečný et al. 2016; McMahan et al. 2017; Hard et al. 2018), converted to a WFA at the server, and then used for fast on-device inference. This not only may improve performance over training the models just on out-of-domain publicly available data, but also benefits from the additional privacy provided by federated learning. Chen et al. (2019) use the methods presented in this article for this very purpose.

There are multiple reasons why one may choose to approximate a source model with a WFA. One may have strong constraints on system latency, such as the virtual keyboard example above. Alternatively, a specialized application may require a distribution over just a subset of possible strings, but must estimate this distribution from a more general model—see the example below regarding utterances for setting an alarm. To address the broadest range of use cases, we aim to provide methods that permit large classes of source models and target WFA topologies. We explore several distinct scenarios experimentally in Section 5.

Our methods allow failure transitions (Aho and Corasick 1975; Mohri 1997) in the target WFA, which are taken only when no immediate match is possible at a given state, for compactness. For example, in the WFA representation of a backoff k-gram model, failure transitions can compactly implement the backoff (Katz 1987; Chen and Goodman 1998; Allauzen, Mohri, and Roark 2003; Novak, Minematsu, and Hirose 2013; Hellsten et al. 2017). The inclusion of failure transitions complicates our analysis and algorithms but is highly desirable in applications such as keyboard decoding. Further, to avoid redundancy that leads to inefficiency, we assume the target model is deterministic, which requires that at each state there is at most one transition labeled with a given symbol.

The approximation problem can be divided into two steps: (1) select an unweighted automaton A that will serve as the topology of the target automaton and (2) weight the automaton A to form our weighted approximation Â. The main goal of this article is the latter determination of the automaton’s weighting in the approximation. If the topology is not known, we suggest a few techniques for inferring topology later in the Introduction.

We will now give some simple topology examples to illustrate the approximation idea. In Section 5 we will give larger-scale examples. Consider the unweighted automaton A in Figure 1 that was designed for what one might say to set an alarm. To use this in an application such as speech recognition, we would want to weight the automaton with some reasonable probabilities for the alternatives. For example, people may be more likely to set their alarms for six or seven than four. In the absence of data specifically for this scenario, we can fall back on some available background LM M, trained on a large suitable corpus. In particular, we can use the conditional distribution
$p_M[x_1 \ldots x_n \mid x_1 \ldots x_n \in \mathcal{L}(A)] = \frac{p_M[x_1 \ldots x_n]\, \mathbb{1}_{x_1 \ldots x_n \in \mathcal{L}(A)}}{\sum_{x_1 \ldots x_n \in \mathcal{L}(A)} p_M[x_1 \ldots x_n]}$
(1)
where 1 is the indicator function and 𝓛(A) is the regular language accepted by the automaton A, as our source distribution S. We then use the unweighted automaton A as our target topology.
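When 𝓛(A) is small enough to enumerate, Equation (1) can be computed by brute force. The sketch below uses a hypothetical three-string stand-in for the alarm language and hypothetical background-model probabilities; it is illustrative only.

```python
# Brute-force illustration of Equation (1): renormalize a background model
# p_M over the strings accepted by A. Language and p_M values are
# hypothetical stand-ins so the sum over L(A) can be enumerated exactly.
lang = ["set alarm for six $", "set alarm for seven $", "set alarm for four $"]

p_M = {                                   # hypothetical p_M[x_1...x_n]
    "set alarm for six $":   4e-9,
    "set alarm for seven $": 5e-9,
    "set alarm for four $":  1e-9,
}

Z = sum(p_M[s] for s in lang)             # denominator of Equation (1)
p_S = {s: p_M[s] / Z for s in lang}       # conditional source distribution

assert abs(sum(p_S.values()) - 1.0) < 1e-12
assert p_S["set alarm for seven $"] > p_S["set alarm for four $"]
```

The renormalized p_S preserves the background model's relative preferences (seven over four) while placing all mass on 𝓛(A).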
Figure 1

An unweighted automaton that specifies what one might say to set an alarm. The initial state is the bold circle and the final state is the double circle. By convention, we terminate all accepted strings with the symbol $.

If M is represented as a WFA, our approximation will in general give a different solution than forming the finite-state intersection M ∩ A and weight-pushing to normalize the result (Mohri, Pereira, and Riley 2008; Mohri 2009). Our approximation has the same states as A, whereas the weight-pushed M ∩ A has O(|M||A|) states. Furthermore, the weight-pushed M ∩ A is an exact WFA representation of the distribution in Equation (1).

As stated earlier, some applications may simply require smaller models or those with lower latency of inference, and in such scenarios the specific target topology may be unknown. In such cases, one choice is to build a k-gram deterministic finite automaton (DFA) topology from a corpus drawn from S (Allauzen, Mohri, and Roark 2003). This could be from an existing corpus or from random samples drawn from S. Figure 2(a) shows a trigram topology for the very simple corpus aab. Figure 2(b) shows an alternative topology that allows skip-grams. Both of these representations make use of failure transitions. These allow modeling strings unseen in the corpus (e.g., abab) in a compact way by failing or backing-off to states that correspond to lower-order histories. Such models can be made more elaborate if some transitions represent classes, such as names or numbers, that are themselves represented by sub-automata. As mentioned previously, we will mostly assume we have a topology either pre-specified or inferred by some means and focus on how to weight that topology to best approximate the source distribution. We hence focus our evaluation on intrinsic measures of model quality such as perplexity or bits-per-character.1
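The state set of a k-gram topology like Figure 2(a) can be sketched directly: states are the length-(k−1) histories observed in the corpus, together with their backoff suffixes down to the empty context. The helper below is a loose illustration of this idea (with hypothetical `^`/`$` context markers), not the topology-construction algorithm of Allauzen, Mohri, and Roark (2003).

```python
# A sketch of inferring a k-gram (here trigram, k=3) state set from a
# corpus, loosely following Figure 2(a). Hypothetical illustration only.
def trigram_states(corpus, k=3):
    states = {""}                       # epsilon (empty) context
    for sent in corpus:
        padded = "^" + sent + "$"       # initial/final context markers
        for i in range(1, len(padded) + 1):
            hist = padded[max(0, i - (k - 1)):i]
            # add the full history and all of its backoff suffixes
            for j in range(len(hist) + 1):
                states.add(hist[j:])
    return states

states = trigram_states(["aab"])
assert "aa" in states and "ab" in states     # trigram contexts from 'aab'
assert "a" in states and "" in states        # backoff contexts
```

Failure transitions would then connect each history xy to its backoff suffix, as in the figure.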

Figure 2

Topology examples derived from the corpus aab. States are labeled with the context that is remembered, ∧ denotes the initial context, ϵ the empty context, $ the final context (and terminates accepted strings), and _ matches any symbol in a context. (a) 3-gram topology: failure transitions, labeled with φ, implement backoff from histories xy to _y to ϵ. (b) skip-gram topology: failure transitions implement backoff instead from histories xy to x_.

This article expands upon an earlier, shorter version (Suresh et al. 2019) by also providing, beyond the additional motivating examples that have already been presented in this Introduction: an extended related work section and references; expansion of the theoretical analysis from three lemmas without proofs (omitted for space) to five lemmas (and a corollary) with full proofs; inclusion of additional algorithms for counting (for general WFA source and target in addition to φ-WFA); inclusion of additional experiments, including some illustrating the exact KL-minimization methods available for WFA sources of different classes; and documentation of the open-source library that provides the full functionality presented here.

The article is organized as follows. In Section 2 we review previous work in this area. In Section 3 we give the theoretical formulation of the problem and the minimum KL divergence approximation. In Section 4 we present algorithms to compute that solution. One algorithm is for the case that the source itself is finite-state. A second algorithm is for the case when it is not and involves a sampling approach. In Section 5 we show experiments using the approximation. In Section 6 we briefly describe the open-source software used in our experiments. Finally, in Section 7 we discuss the results and offer conclusions.

2. Related Work

In this section we review methods for inferring unweighted finite-state models from data, as well as methods for estimating the weight distribution in the weighted case. We start with the unweighted case.

There is a long history of unweighted finite-state model inference (Parekh and Honavar 2000; Cicchello and Kremer 2003). Gold (1967) showed that an arbitrary regular set L cannot be learned, identified in the limit, strictly from the presentation of a sequence of positive examples that eventually includes each string in L. This has led to several alternative lines of attack.

One approach is to include the negative examples in the sequence. Given such a complete sample, there are polynomial-time algorithms that identify a regular set in the limit (Gold 1978). For example, a prefix tree of the positive examples can be built and then states can be merged so long as they do not cause a negative example to be accepted (Oncina and Garcia 1992; Dupont 1996). Another approach is to train a recurrent neural network (RNN) on the positive and negative examples and then extract a finite automaton by quantizing the continuous state space of the RNN (Giles et al. 1992; Jacobsson 2005).

A second approach is to assume a teacher is available that determines not only if a string is a positive or negative example but also if the language of the current hypothesized automaton equals L or, if not, provides a counterexample. In this case the minimal m-state DFA corresponding to L can be learned in time polynomial in m (Angluin 1987). Weiss, Goldberg, and Yahav (2018) apply this method for DFA extraction from an RNN.

A third approach is to assume a probability distribution over the (positive only) samples. With some reasonable restrictions on the distribution, such as that the probabilities are generated from a weighted automaton A with L = 𝓛(A), then L is identifiable in the limit with “high probability” (Angluin 1988; Pitt 1989).

There have been a variety of approaches for estimating weighted automata. A variant of the prefix tree construction can be used that merges states with sufficiently similar suffix distributions, estimated from source frequencies (Carrasco and Oncina 1994, 1999). Approaches that produce (possibly highly) non-deterministic results include the Expectation-Maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) applied to fully connected Hidden Markov models or spectral methods applied to automata (Balle and Mohri 2012; Balle et al. 2014). Eisner (2001) describes an algorithm for estimating probabilities in a finite-state transducer from data using EM-based methods. Weiss, Goldberg, and Yahav (2019) and Okudono et al. (2020) provide adaptations to the Weiss, Goldberg, and Yahav (2018) DFA extraction algorithm to yield weighted automata.

For approximating neural network (NN) models as WFAs, Deoras et al. (2011) used an RNN LM to generate samples that they then used to train a k-gram LM. Arisoy et al. (2014) used deep neural network (DNN) models of different orders to successively build and prune a k-gram LM with each new order constrained by the previous order. Adel et al. (2014) also trained DNNs of different orders, built a k-gram LM on the same data to obtain a topology, and then transferred the DNN probabilities of each order onto that k-gram topology. Tiño and Vojtek (1997) quantized the continuous state space of an RNN and then estimated the transition probabilities from the RNN. Lecorvé and Motlicek (2012) quantized the hidden states in an LSTM to form a finite-state model and then used an entropy criterion to back off to low-order k-grams to limit the number of transitions. See Section 4.3 for a more detailed comparison with the most closely related methods, once the details of our algorithms have been provided.

Our article is distinguished in several respects from previous work. First, our general approach does not depend on the form of the source distribution although we specialize our algorithms for (known) finite-state sources with an efficient direct construction and for other sources with an efficient sampling approach. Second, our targets are a wide class of deterministic automata with failure transitions. These are considerably more general than k-gram models but retain the efficiency of determinism and the compactness failure transitions allow, which is especially important in applications with large alphabets like language modeling. Third, we show that our approximation searches for the minimal KL divergence between the source and target distributions, given a fixed target topology provided by the application or some earlier computation.

3. Theoretical Analysis

3.1 Probabilistic Models

Let Σ be a finite alphabet. Let $x_i^n \in \Sigma^*$ denote the string $x_i x_{i+1} \cdots x_n$ and let $x^n \triangleq x_1^n$. A probabilistic model p over Σ is a probability distribution over the next symbol $x_n$, given the previous symbols $x^{n-1}$, such that2
$\sum_{x \in \Sigma} p(x_n = x \mid x^{n-1}) = 1 \quad\text{and}\quad p(x_n = x \mid x^{n-1}) \geq 0, \;\forall x \in \Sigma$
Without loss of generality, we assume that the model maintains an internal state q and updates it after observing the next symbol.3 Furthermore, the probability of the subsequent symbols depends only on the state q:
$p\left(x_{i+1}^n \mid x^i\right) = p\left(x_{i+1}^n \mid q(x^i)\right)$
for all $i, n, x^i, x_{i+1}^n$, where $q(x^i)$ is the state that the model has reached after observing the sequence $x^i$. Let Q(p) be the set of possible states. Let the language 𝓛(p) ⊆ Σ* defined by the distribution p be
$\mathcal{L}(p) \triangleq \{x^n \in \Sigma^* : p(x^n) > 0,\; x_n = \$, \;\text{and}\; x_i \neq \$, \,\forall i < n\}$
(2)
The symbol $ is used as a stopping criterion. Further, for all $x^n \in \Sigma^*$ such that $x_{n-1} = \$$, $p(x_n \mid x^{n-1}) = 0$.
The KL divergence between the source model ps and the target model pa is given by
$D(p_s \| p_a) = \sum_{x^n \in \Sigma^*} p_s(x^n) \log \frac{p_s(x^n)}{p_a(x^n)}$
(3)
where for notational simplicity, we adopt the conventions 0/0 = 1 and 0 log(0/0) = 0 throughout the article. Note that for the KL divergence to be finite, we need 𝓛(ps) ⊆ 𝓛(pa). We will assume throughout that the source entropy $H(p_s) = -\sum_{x^n} p_s(x^n) \log p_s(x^n)$ is finite.4 We first reduce the KL divergence between two models as follows (cf. Carrasco 1997; Cortes et al. 2008). In the following, let q* denote the states of the probability distribution p*.
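For distributions with finite support, Equation (3) can be implemented directly, with the 0 log(0/·) convention handled explicitly. This is a toy sketch for intuition (the toy distributions are hypothetical); the article's algorithms never enumerate strings this way.

```python
# Direct implementation of Equation (3) for finite-support distributions,
# using the article's convention 0 log(0/0) = 0. If some string has
# p_s > 0 but p_a = 0 (i.e. L(p_s) is not contained in L(p_a)), the
# divergence is infinite.
import math

def kl_divergence(p_s, p_a):
    d = 0.0
    for x, ps in p_s.items():
        if ps == 0.0:
            continue                    # 0 log(0/...) = 0 by convention
        pa = p_a.get(x, 0.0)
        if pa == 0.0:
            return math.inf             # L(p_s) not contained in L(p_a)
        d += ps * math.log(ps / pa)
    return d

p_s = {"ab$": 0.5, "b$": 0.5}
assert kl_divergence(p_s, p_s) == 0.0
assert kl_divergence(p_s, {"ab$": 0.5, "b$": 0.25, "a$": 0.25}) > 0.0
assert kl_divergence(p_s, {"ab$": 1.0}) == math.inf
```

The third assertion illustrates the containment requirement 𝓛(ps) ⊆ 𝓛(pa) stated above.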

Lemma 1

If 𝓛(ps) ⊆ 𝓛(pa), then
$D(p_s \| p_a) = \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log \frac{p_s(x \mid q_s)}{p_a(x \mid q_a)}$
(4)
where
$\gamma(q_s, q_a) = \sum_{i=0}^{\infty} \sum_{x^i : q_s(x^i) = q_s,\; q_a(x^i) = q_a} p_s(x^i)$
(5)
Proof.
$\begin{aligned}
D(p_s \| p_a) &= \sum_{n=1}^{\infty} \sum_{x^n} p_s(x^n) \log \frac{p_s(x^n)}{p_a(x^n)} \\
&= \sum_{n=1}^{\infty} \sum_{x^n} p_s(x^n) \sum_{i=1}^{n} \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&= \sum_{n=1}^{\infty} \sum_{i=1}^{n} \sum_{x^n} p_s(x^n) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&= \sum_{n=1}^{\infty} \sum_{i=1}^{n} \sum_{x^n} p_s(x^{i-1})\, p_s(x_i \mid x^{i-1})\, p_s(x_{i+1}^n \mid x^i) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \cdot \sum_{n \geq i} \sum_{x_{i+1}^n} p_s(x_{i+1}^n \mid x^i) \\
&= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})}
\end{aligned}$
By definition, the probability of the next symbol conditioned on the past depends only on the state. Hence grouping the terms that correspond to the same states in both ps and pa yields
$\begin{aligned}
&\sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid x^{i-1}) \log \frac{p_s(x_i \mid x^{i-1})}{p_a(x_i \mid x^{i-1})} \\
&\quad= \sum_{i=1}^{\infty} \sum_{x^{i-1}} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid q_s(x^{i-1})) \log \frac{p_s(x_i \mid q_s(x^{i-1}))}{p_a(x_i \mid q_a(x^{i-1}))} \\
&\quad= \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{i=1}^{\infty} \sum_{x^{i-1} : q_s(x^{i-1}) = q_s,\; q_a(x^{i-1}) = q_a} p_s(x^{i-1}) \sum_{x_i} p_s(x_i \mid q_s) \log \frac{p_s(x_i \mid q_s)}{p_a(x_i \mid q_a)} \\
&\quad= \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x_i} \gamma(q_s, q_a)\, p_s(x_i \mid q_s) \log \frac{p_s(x_i \mid q_s)}{p_a(x_i \mid q_a)}
\end{aligned}$
Replacing xi by x yields the lemma.□

The quantity γ(qs, qa) counts each string $x^i$ that reaches both state qs in Qs and state qa in Qa, weighted by its probability according to ps. Equivalently, it counts each string $x^i$ reaching state (qs, qa) in Qs × Qa (an interpretation we develop in Section 4.1.1).5 Note that γ does not depend on the distribution pa.

The following corollary is useful for finding the probabilistic model pa that has the minimal KL divergence from a model ps.

Corollary 1

Let 𝒫 be a set of probabilistic models pa for which 𝓛(ps) ⊆ 𝓛(pa). Then
$\operatorname*{argmin}_{p_a \in \mathcal{P}} D(p_s \| p_a) = \operatorname*{argmax}_{p_a \in \mathcal{P}} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a)$
where
$c(x, q_a) = \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s)$
(6)
Proof. By Lemma 1
$\begin{aligned}
\operatorname*{argmin}_{p_a \in \mathcal{P}} D(p_s \| p_a) &= \operatorname*{argmin}_{p_a \in \mathcal{P}} \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log \frac{p_s(x \mid q_s)}{p_a(x \mid q_a)} \\
&= \operatorname*{argmin}_{p_a \in \mathcal{P}} \Bigg[ \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log p_s(x \mid q_s) \\
&\qquad\qquad - \sum_{q_a \in Q_a} \sum_{q_s \in Q_s} \sum_{x \in \Sigma} \gamma(q_s, q_a)\, p_s(x \mid q_s) \log p_a(x \mid q_a) \Bigg] \\
&= \operatorname*{argmax}_{p_a \in \mathcal{P}} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a)
\end{aligned}$
(7)
since the first term in Equation (7) does not depend on pa.□

The quantity $c(x_i, q_a)$ counts each string $x^i$ that reaches a state $(q_s, q_a)$ in $Q_s \times \{q_a\}$ by $x^{i-1}$, weighted by its probability according to $p_s$ (cf. Equation (18)).
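When both models are finite and acyclic, γ and c can be accumulated by a single forward pass over pairs of source and target states (the product construction developed formally in Section 4.1.1). The toy automata below are hypothetical and tiny so the pass can be written in a few lines; the actual library uses general shortest-distance algorithms instead.

```python
# A forward-pass sketch of gamma(q_s, q_a) (Equation (5)) and c(x, q_a)
# (Equation (6)) for hypothetical acyclic toy models, processed in
# topological order over pair states. Illustration only.
from collections import defaultdict

# Source S over {a, b, $}: transitions (q, x, p_s(x|q), q').
S = [(0, "a", 0.5, 1), (0, "b", 0.5, 1), (1, "$", 1.0, 2)]
# Unweighted target topology A with its own state ids.
A = [(10, "a", 1.0, 11), (10, "b", 1.0, 11), (11, "$", 1.0, 12)]

def outgoing(trans):
    d = defaultdict(list)
    for q, x, w, q2 in trans:
        d[q].append((x, w, q2))
    return d

dS, dA = outgoing(S), outgoing(A)

gamma = defaultdict(float)
gamma[(0, 10)] = 1.0                         # empty prefix reaches (i_s, i_a)
c = defaultdict(float)                       # c(x, q_a)
for qs, qa in [(0, 10), (1, 11), (2, 12)]:   # topological order of pairs
    for x, w, qs2 in dS[qs]:
        for xa, _, qa2 in dA[qa]:
            if xa == x:                      # labels must match
                c[(x, qa)] += gamma[(qs, qa)] * w
                gamma[(qs2, qa2)] += gamma[(qs, qa)] * w

assert abs(c[("a", 10)] - 0.5) < 1e-12
assert abs(c[("$", 11)] - 1.0) < 1e-12
```

Normalizing c per target state then yields the weights of Lemma 2 below.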

3.2 Weighted Finite Automata

A weighted finite automaton A = (Σ, Q, E, i, f) over ℝ+ is given by a finite alphabet Σ, a finite set of states Q, a finite set of transitions E ⊆ Q × Σ × ℝ+ × Q, an initial state i ∈ Q, and a final state f ∈ Q. A transition e = (p[e], ℓ[e], w[e], n[e]) ∈ E represents a move from a previous (or source) state p[e] to the next (or destination) state n[e] with the label ℓ[e] and weight w[e]. The transitions with previous state q are denoted by E[q] and the labels of those transitions by L[q].

A deterministic WFA has at most one transition with a given label leaving each state. An unweighted (finite) automaton is a WFA that satisfies w[e] = 1, ∀eE. A probabilistic (or stochastic) WFA satisfies
$\sum_{e \in E[q]} w[e] = 1 \quad\text{and}\quad w[e] \geq 0, \;\forall q \in Q - \{f\}$

Transitions e1 and e2 are consecutive if n[e1] = p[e2]. A path π = e1⋯en ∈ E* is a finite sequence of consecutive transitions. The previous state of a path we denote by p[π] and the next state by n[π]. The label of a path is the concatenation of its transition labels ℓ[π] = ℓ[e1] ⋯ ℓ[en]. The weight of a path is obtained by multiplying its transition weights w[π] = w[e1] × ⋯ × w[en]. For a non-empty path, the i-th transition is denoted by πi.

P(q, q′) denotes the set of all paths in A from state q to q′. We extend this to sets in the obvious way: P(q, R) denotes the set of all paths from state q to some q′ ∈ R, and so forth. A path π is successful if it is in P(i, f), and in that case the automaton is said to accept the input string α = ℓ[π].

The language accepted by an automaton A is the regular set 𝓛(A) = {α ∈ Σ* : α = ℓ[π], π ∈ P(i, f)}. The weight of α ∈ 𝓛(A) assigned by the automaton is $A(\alpha) = \sum_{\pi \in P(i,f) : \ell[\pi] = \alpha} w[\pi]$. Similar to Equation (2), we assume a symbol $ ∈ Σ such that
$\mathcal{L}(A) \subseteq \{x^n \in \Sigma^* : x_n = \$ \;\text{and}\; x_i \neq \$, \,\forall i < n\}$
Thus all successful paths are terminated by the symbol $.

For a symbol x ∈ Σ and a state qQ of a deterministic, probabilistic WFA A, define a distribution pa(x|q) ≜ w if (q, x, w, q′) ∈ E and pa(x|q) ≜ 0 otherwise. Then pa is a probabilistic model over Σ as defined in the previous section. If A = (Σ, Q, E, i, f) is an unweighted deterministic automaton, we denote by 𝒫(A) the set of all probabilistic models pa representable as a weighted WFA Â = (Σ, Q, Ê, i, f) with the same topology as A where Ê = {(q, x, pa(x|q), q′) : (q, x, 1, q′) ∈ E}.
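The definitions above can be made concrete with a small sketch, under a hypothetical toy topology (this is illustrative, not the library's representation): a deterministic probabilistic WFA keeps at most one transition per (state, label) pair, induces pa(x|q), and assigns each accepted string the product of the weights along its unique path.

```python
# A minimal deterministic probabilistic WFA in the notation of this
# section: transitions are (q, x, w, q'), p_a(x|q) = w for the unique
# x-transition leaving q (0 otherwise). Toy topology is hypothetical.
class ProbWFA:
    def __init__(self, transitions, init, final):
        # deterministic: at most one transition per (state, label)
        self.delta = {(q, x): (w, q2) for q, x, w, q2 in transitions}
        self.init, self.final = init, final

    def p(self, x, q):
        w, _ = self.delta.get((q, x), (0.0, None))
        return w

    def weight(self, string):
        """A(alpha): product of weights along the unique successful path."""
        q, total = self.init, 1.0
        for x in string:
            if (q, x) not in self.delta:
                return 0.0
            w, q = self.delta[(q, x)]
            total *= w
        return total if q == self.final else 0.0

A = ProbWFA([(0, "a", 0.6, 0), (0, "b", 0.3, 1), (0, "$", 0.1, 2),
             (1, "$", 1.0, 2)], init=0, final=2)
assert abs(A.weight("ab$") - 0.6 * 0.3 * 1.0) < 1e-12
assert A.weight("ba$") == 0.0            # no 'a' transition from state 1
```

Note the weights leaving state 0 sum to 1, so pa(·|0) is a proper next-symbol distribution as required for a probabilistic WFA.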

Given an unweighted deterministic automaton A, our goal is to find the target distribution pa ∈ 𝒫(A) that has the minimum KL divergence from our source probability model ps.

Lemma 2

If 𝓛(ps) ⊆ 𝓛(A), then
$\operatorname*{argmin}_{p_a \in \mathcal{P}(A)} D(p_s \| p_a) = \tilde{p}(x \mid q_a) \triangleq \frac{c(x, q_a)}{c(q_a)}$
(8)
where
$c(q_a) = \sum_{x \in \Sigma} c(x, q_a)$
Proof. From Corollary 1
$\begin{aligned}
\operatorname*{argmin}_{p_a \in \mathcal{P}(A)} D(p_s \| p_a) &= \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(x, q_a) \log p_a(x \mid q_a) \\
&= \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} \sum_{q_a \in Q_a} \sum_{x \in \Sigma} c(q_a)\, \tilde{p}(x \mid q_a) \log p_a(x \mid q_a) \\
&= \operatorname*{argmin}_{p_a \in \mathcal{P}(A)} \sum_{q_a \in Q_a} c(q_a) \left\{ -\sum_{x \in \Sigma} \tilde{p}(x \mid q_a) \log p_a(x \mid q_a) \right\}
\end{aligned}$
The quantity in braces is minimized when pa(x|qa) = $p˜$(x|qa) because it is the cross entropy between the two distributions. Since 𝓛(ps) ⊆ 𝓛(A), it follows that $p˜$ ∈ 𝒫(A).□
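Lemma 2's closed form can be checked numerically at a single state: the normalized counts maximize the count-weighted log-likelihood over all distributions on the same support. The counts and the competing distribution below are hypothetical.

```python
# Numeric check of Lemma 2: with the topology fixed, the per-state
# objective sum_x c(x,q) log p(x|q) is maximized by the normalized counts
# p_tilde(x|q) = c(x,q)/c(q). Counts here are hypothetical toy values.
import math

c = {"a": 3.0, "b": 1.0, "$": 1.0}            # c(x, q_a) at one state
c_q = sum(c.values())                          # c(q_a)
p_tilde = {x: v / c_q for x, v in c.items()}   # Lemma 2 solution

def objective(p):
    # the per-state term from Corollary 1
    return sum(c[x] * math.log(p[x]) for x in c)

# Any other distribution on the same support scores no better:
other = {"a": 0.5, "b": 0.25, "$": 0.25}
assert objective(p_tilde) >= objective(other)
assert abs(sum(p_tilde.values()) - 1.0) < 1e-12
```

This is the cross-entropy argument in the proof above: fixing p̃, the bracketed quantity is minimized when pa = p̃.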

3.3 Weighted Finite Automata with Failure Transitions

A weighted finite automaton with failure transitions (φ-WFA) A = (Σ, Q, E, i, f) is a WFA extended to allow a transition to have a special failure label denoted by φ. Then E ⊆ Q × (Σ ∪ {φ}) × ℝ+ × Q.

A φ transition does not add to a path label; it consumes no input.6 However, it is followed only when the input cannot be read immediately. Specifically, a path e1⋯en in a φ-WFA is disallowed if it contains a subpath ei⋯ej such that ℓ[ek] = φ for all k, i ≤ k < j, and there is another transition e ∈ E such that p[ei] = p[e] and ℓ[ej] = ℓ[e] ∈ Σ (see Figure 3). Since the label x = ℓ[ej] can be read on e, we do not follow the failure transitions to read it on ej as well.

Figure 3

The (dashed red) path ei = (qi, φ, ωi, qi+1) to ej = (qj, x, ωj, qj+1) is disallowed since x can be read already on e = (qi, x, ω, q).


We use P*(q, q′) ⊆ P(q, q′) to denote the set of (not dis-) allowed paths from state q to q′ in a φ-WFA. This again extends to sets in the obvious way. A path π is successful in a φ-WFA if π ∈ P*(i, f), and only in that case is the input string α = ℓ[π] accepted.

The language accepted by the φ-automaton A is the set 𝓛(A) = {α ∈ Σ* : α = ℓ[π], π ∈ P*(i, f)}. The weight of α ∈ Σ* assigned by the automaton is $A(\alpha) = \sum_{\pi \in P^*(i,f) : \ell[\pi] = \alpha} w[\pi]$. We assume each string in 𝓛(A) is terminated by the symbol $ as before. We also assume there are no φ-labeled cycles and there is at most one exiting failure transition per state.

We express the φ-extended transitions leaving q as
$E^*[q] = \{(q, x, \omega, q') : \pi \in P^*(q, Q),\; x = \ell[\pi] = \ell[\pi_{|\pi|}] \in \Sigma,\; \omega = w[\pi],\; q' = n[\pi]\}$
This is a set of (possibly new) transitions (q, x, ω, q′), one for each allowed path from previous state q to next state q′ with optional leading failure transitions and a final x-labeled transition. Denote the labels of E*[q] by L*[q]. Note the WFA (Σ, Q, E*, i, f) accepts the same strings with the same weights as the φ-WFA A, and thus 𝓛(A) is regular (Allauzen and Riley 2018).

A probabilistic (or stochastic) φ-WFA satisfies
$\sum_{e \in E^*[q]} w[e] = 1 \quad\text{and}\quad w[e] \geq 0, \;\forall q \in Q - \{f\}$

A deterministic φ-WFA is backoff-complete if a failure transition from state q to q′ implies L[q] ∩ Σ ⊆ L[q′] ∩ Σ. Further, if φ ∉ L[q′], then the containment is strict: L[q] ∩ Σ ⊂ L[q′] ∩ Σ. In other words, if a symbol can be read immediately from a state q, it can also be read from a state failing (backing off) from q, and if q′ does not have a backoff arc, then at least one additional label can be read from q′ that cannot be read from q. For example, both topologies depicted in Figure 2 have this property. We restrict our target automata to have a topology with the backoff-complete property since it simplifies our analysis, makes our algorithms efficient, and is commonly found in applications. For example, backoff n-gram models, such as the Katz model, are φ-cycle-free, backoff-complete φ-WFAs (Katz 1987; Chen and Goodman 1998).

For a symbol x ∈ Σ and a state q ∈ Q of a deterministic, probabilistic φ-WFA A, define $p_a^*(x \mid q) \triangleq w$ if (q, x, w, q′) ∈ E*[q] and $p_a^*(x \mid q) \triangleq 0$ otherwise. Then $p_a^*$ is a probabilistic model over Σ as defined in Section 3.1.
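A concrete two-state instance may help: suppose state q reads 'a' directly and fails to a backoff state q′ that reads 'a', 'b', and '$'. With the Katz-style failure weight pa(φ|q)/d(q, q′) developed in the remainder of this section (d(q, q′) discounting from q′ the symbols already readable at q), the φ-extended distribution at q sums to 1. The probabilities below are hypothetical toy values.

```python
# Worked instance of the Katz-style backoff weighting on a two-state
# chain with the backoff-complete property. All probabilities are
# hypothetical; the check is that the phi-extended distribution at q
# is properly normalized.
p_q = {"a": 0.7}                         # direct p_a(x|q), x in L[q] ∩ Σ
p_phi_q = 0.3                            # p_a(φ|q) = 1 - sum of the above
p_qp = {"a": 0.5, "b": 0.3, "$": 0.2}    # p_a(x|q') at the backoff state

# d(q, q') removes from q' the mass of symbols readable directly at q.
d = 1.0 - sum(p_qp[x] for x in p_q)
alpha = p_phi_q / d                      # failure-transition weight

# phi-extended distribution: read x at q if possible, else back off.
p_star = {x: p_q.get(x, alpha * p_qp[x]) for x in p_qp}

assert abs(d - 0.5) < 1e-12 and abs(alpha - 0.6) < 1e-12
assert abs(sum(p_star.values()) - 1.0) < 1e-12
```

Note the normalization holds only because of the backoff-complete property: every symbol readable at q is also readable at q′, so d removes exactly the right mass.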
Note the distribution $p_a^*$ at a state q is defined over the φ-extended transitions E*[q], whereas pa in the previous section is defined over the transitions E[q]. It is convenient to define a companion distribution pa ∈ 𝒫(A) to $p_a^*$ as follows:7 Given a symbol x ∈ Σ ∪ {φ} and a state q ∈ Q, define $p_a(x \mid q) \triangleq p_a^*(x \mid q)$ when x ∈ L[q] ∩ Σ, $p_a(\varphi \mid q) \triangleq 1 - \sum_{x \in L[q] \cap \Sigma} p_a^*(x \mid q)$, and $p_a(x \mid q) \triangleq 0$ otherwise. The companion distribution is thus defined solely over the transitions E[q].

When A = (Σ, Q, E, i, f) is an unweighted deterministic, backoff-complete φ-WFA, we denote by 𝒫*(A) the set of all probabilistic models $p_a^*$ representable as a weighted φ-WFA Â = (Σ, Q, Ê, i, f) of the same topology as A with
$\hat{E} = \{(q, x, p_a(x \mid q), q') : (q, x, 1, q') \in E,\, x \in \Sigma\} \cup \{(q, \varphi, \alpha(q, q'), q') : (q, \varphi, 1, q') \in E\}$
where pa ∈ 𝒫(A) is the companion distribution to $p_a^*$ and α(q, q′) = pa(φ|q)/d(q, q′) is the weight of the failure transition from state q to q′ with
$d(q, q') = 1 - \sum_{x \in L[q] \cap \Sigma} p_a(x \mid q')$
(9)

Note that we have specified the weights on the automaton that represents $p_a^*$ ∈ 𝒫*(A) entirely in terms of the companion distribution pa ∈ 𝒫(A). The failure transition weight α is determined from the requirement that $\sum_{e \in E^*[q]} w[e] = 1$ (Katz 1987, Equation (16)). Thanks to the backoff-complete property, this transition weight depends only on its previous state q and its next state q′ and not on all the states in the failure path from q. This is a key reason for assuming that property; otherwise our algorithms could not benefit from this locality.

Conversely, each distribution pa ∈ 𝒫(A) can be associated with a distribution $p_a^*$ ∈ 𝒫*(A) given a deterministic, backoff-complete φ-WFA A. First extend α(q, q′) to any failure path as follows. Denote a failure path from state q to q′ by πφ(q, q′). Then define
$\alpha(q, q') = \prod_{e \in \pi_\varphi(q, q')} \frac{p_a(\varphi \mid p[e])}{d(p[e], n[e])}$
(10)
where this quantity is taken to be 1 when the failure path is empty (q = q′).
Finally define
$p_a^*(x \mid q) = \begin{cases} \alpha(q, q_x)\, p_a(x \mid q_x), & x \in L^*[q] \\ 0, & \text{otherwise} \end{cases}$
(11)
where for x ∈ L*[q], qx signifies the first state q′ on a φ-labeled path in A from state q for which x ∈ L[q′]. For Equation (10) to be well-defined, we need d(p[e], n[e]) > 0. To ensure this condition, we restrict 𝒫(A) to contain distributions such that pa(x|q) ≥ ϵ for each x ∈ L[q].8

Given an unweighted deterministic, backoff-complete automaton A, our goal is to find the target distribution $p_a^*$ ∈ 𝒫*(A) that has the minimum KL divergence from our source probability model ps.

Lemma 3

Assume 𝓛(ps) ⊆ 𝓛(A). Let p* = $\tilde{p}$(x|qa) from Lemma 2. If 𝓛(p*) ⊆ 𝓛(A) and p* ∈ 𝒫*(A), then
$p^* = \operatorname*{argmin}_{p_a^* \in \mathcal{P}^*(A)} D(p_s \| p_a^*)$

Proof. This follows immediately from Lemma 2.□

The requirement that the p* of Lemma 3 is in 𝒫*(A) will be true if, for instance, the target has no failure transitions or if the source and target are both φ-WFAs with the same topology and failure transitions. In general, this requirement cannot be assured. While membership in 𝒫(A) principally requires that the weights of the transitions leaving a state be non-negative and sum to 1, membership in 𝒫*(A) imposes additional constraints due to the failure transitions, indicated in Equation (11). As such, we restate our goal in terms of the companion distribution pa ∈ 𝒫(A) rather than its corresponding distribution $p_a^*$ ∈ 𝒫*(A) directly.

Let Bn(q) be the set of states in A that back off to state q in n failure transitions and let B(q) = $\bigcup_{n=0}^{|Q_a|} B_n(q)$.

Lemma 4

If 𝓛(ps) ⊆ 𝓛(A), then
$\operatorname*{argmin}_{p_a^* \in \mathcal{P}^*(A)} D(p_s \| p_a^*) = \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} \sum_{q \in Q_a} \left\{ \sum_{x \in L[q]} C(x, q) \log p_a(x \mid q) - \sum_{q_0 \in B_1(q)} C(\varphi, q_0) \log d(q_0, q) \right\}$
where
$C(x, q) = \sum_{q_a \in B(q)} c(x, q_a)\, \mathbb{1}_{q = q_a^x}, \quad x \in \Sigma$
(12)
$C(\varphi, q) = \sum_{q_a \in B(q)} \sum_{x \in \Sigma} c(x, q_a)\, \mathbb{1}_{x \notin L[q]}$
(13)
and these quantities do not depend on pa.

Proof.
From Corollary 1, Equation (11), and the previously shown 1:1 correspondence between each distribution $p_a^*$ ∈ 𝒫*(A) and its companion distribution pa ∈ 𝒫(A),
$\begin{aligned}
\operatorname*{argmin}_{p_a^* \in \mathcal{P}^*(A)} D(p_s \| p_a^*) &= \operatorname*{argmax}_{p_a^* \in \mathcal{P}^*(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log p_a^*(x \mid q_a) \\
&= \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \left[ \alpha(q_a, q_a^x)\, p_a(x \mid q_a^x) \right] \\
&= \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \left[ \prod_{e \in \pi_\varphi(q_a, q_a^x)} \frac{p_a(\varphi \mid p[e])}{d(p[e], n[e])}\, p_a(x \mid q_a^x) \right] \\
&= \operatorname*{argmax}_{p_a \in \mathcal{P}(A)} A_x + A_\varphi - A_d
\end{aligned}$
(14)
where we distribute the factors inside the logarithm in Equation (14) as follows:
$\begin{aligned}
A_x &= \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log p_a(x \mid q_a^x) \\
&= \sum_{q \in Q_a} \sum_{q_a \in B(q)} \sum_{x \in L[q] \cap \Sigma} c(x, q_a)\, \mathbb{1}_{q = q_a^x} \log p_a(x \mid q) \\
&= \sum_{q \in Q_a} \sum_{x \in L[q] \cap \Sigma} C(x, q) \log p_a(x \mid q)
\end{aligned}$
(15)
Equation (15) follows from q = $q_a^x$ implying qa ∈ B(q).
$\begin{aligned}
A_\varphi &= \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \prod_{e \in \pi_\varphi(q_a, q_a^x)} p_a(\varphi \mid p[e]) \\
&= \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \sum_{e \in \pi_\varphi(q_a, q_a^x)} \log p_a(\varphi \mid p[e]) \\
&= \sum_{q \in Q_a} \sum_{q_a \in B(q)} \sum_{x \in L^*[q_a]} c(x, q_a) \sum_{e \in \pi_\varphi(q_a, q_a^x)} \mathbb{1}_{q = p[e]} \log p_a(\varphi \mid q) \\
&= \sum_{q \in Q_a} \sum_{q_a \in B(q)} \sum_{x \in \Sigma} c(x, q_a)\, \mathbb{1}_{x \notin L[q]} \log p_a(\varphi \mid q) \\
&= \sum_{q \in Q_a} C(\varphi, q) \log p_a(\varphi \mid q)
\end{aligned}$
(16)
Equation (16) follows from e ∈ πφ(qa, $q_a^x$) implying x ∉ L[p[e]].
$\begin{aligned}
A_d &= \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \log \prod_{e \in \pi_\varphi(q_a, q_a^x)} d(p[e], n[e]) \\
&= \sum_{q_a \in Q_a} \sum_{x \in L^*[q_a]} c(x, q_a) \sum_{e \in \pi_\varphi(q_a, q_a^x)} \log d(p[e], n[e]) \\
&= \sum_{q \in Q_a} \sum_{q_0 \in B_1(q)} \sum_{q_a \in B(q_0)} \sum_{x \in L^*[q_a]} c(x, q_a) \sum_{e \in \pi_\varphi(q_a, q_a^x)} \mathbb{1}_{q_0 = p[e]} \log d(q_0, q) \\
&= \sum_{q \in Q_a} \sum_{q_0 \in B_1(q)} \sum_{q_a \in B(q_0)} \sum_{x \in \Sigma} c(x, q_a)\, \mathbb{1}_{x \notin L[q_0]} \log d(q_0, q) \\
&= \sum_{q \in Q_a} \sum_{q_0 \in B_1(q)} C(\varphi, q_0) \log d(q_0, q)
\end{aligned}$
Substituting these results into Equation (14) proves the lemma.□

If there are no failure transitions in the target automaton, then C(x, q) = c(x, q) if x is in Σ and is 0 otherwise. In this case, the statement of Lemma 4 simplifies to that of Corollary 1. Unfortunately, we do not have a closed-form solution, analogous to Lemma 2 or Lemma 3, for the general φ-WFA case. Instead we will present numerical optimization algorithms to find the KL divergence minimum in the next section. This is aided by the observation that the quantity in braces in the statement of Lemma 4 depends on the distribution pa only at state q. Thus the KL divergence minimum can be found by maximizing that quantity independently for each state.

4. Algorithms

Approximating a probabilistic source algorithmically as a WFA requires two steps: (1) compute the quantity c(x, qa) found in Corollary 1 and Lemma 2 or C(x, q) in Lemma 4 and (2) use this quantity to find the minimum KL divergence solution. The first step, which we will refer to as counting, is covered in the next section; the KL divergence minimization step is covered afterwards, followed by an explicit comparison of the presented algorithms with closely related prior work.

4.1 Counting

How the counts are computed will depend on the form of the source and target models. We break this down into several cases.

4.1.1 WFA Source and Target. When the source and target models are represented as WFAs, we compute c(x, qa) from Lemma 2. From Equation (6) this can be written as
$c(x, q_a) = \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s)$
(17)
where
$\gamma(q_s, q_a) = \sum_{i=0}^{\infty} \sum_{x^i : q_s(x^i) = q_s,\; q_a(x^i) = q_a} p_s(x^i)$
The quantity γ(qs, qa) can be computed as
$\gamma(q_s, q_a) = \sum_{\pi \in P_{S \cap A}((i_s, i_a), (q_s, q_a))} w[\pi]$
where S ∩ A is the weighted finite-state intersection of automata S and A (Mohri 2009). The above summation over this intersection is the (generalized) shortest distance from the initial state to a specified state computed over the positive real semiring (Mohri 2002; Allauzen and Riley 2018). Algorithms to efficiently compute the intersection and shortest distance on WFAs are available in OpenFst (Allauzen et al. 2007), an open-source WFA library. Then from Equation (17) we can form the sum
$c(x, q_a) = \sum_{((q_s, q_a), x, w, (q_s', q_a')) \in E_{S \cap A}} \gamma(q_s, q_a)\, w$
(18)
Equation (18) is the weighted count of the paths in S ∩ A that begin at the initial state and end in any transition leaving a state (qs, qa) labeled with x.

The worst-case time complexity for the counting step is dominated by the shortest distance algorithm on the intersection S ∩ A. The shortest distance computation is a meta-algorithm that depends on the queue discipline selected (Mohri 2002).
If s is the maximum number of times a state in the intersection is inserted into the shortest distance queue, C the maximum cost of a queue operation, and D the maximum out-degree in S and A (both assumed deterministic), then the algorithm runs in O(s(D + C)|QS||QA|) time (Mohri 2002; Allauzen and Riley 2018). The worst-case space complexity is in O(D|QS||QA|), determined by the intersection size.

4.1.2 φ-WFA Source and Target. When the source and target models are represented as φ-WFAs, we compute C(x, q) from Lemma 4. From Equation (12) and the previous case, this can be written as

$$C(x, q) = \sum_{q_a \in B(q)} \sum_{q_s \in Q_s} \gamma(q_s, q_a)\, p_s(x \mid q_s)\, \mathbb{1}_{q = q_a^x}, \quad x \in \Sigma \tag{19}$$

To compute this quantity, we first form S ∩ A using an efficient φ-WFA intersection that compactly retains failure transitions in the result, as described in Allauzen and Riley (2018). Equation (19) is the weighted count of the paths in S ∩ A allowed by the failure transitions that begin at the initial state and end in any transition labeled with x leaving a state (qs, q). We can simplify this computation by the following transformation. First we convert S ∩ A to an equivalent WFA by replacing each failure transition with an epsilon transition and introducing a negatively weighted transition to compensate for formerly disallowed paths (Allauzen and Riley 2018). The result is then promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated positive transition (see Figure 4).9 Algorithms to efficiently compute the intersection and shortest distance on φ-WFAs are available in the OpenGrm libraries (Allauzen and Riley 2018).

Figure 4
A φ-WFA is transformed into an equivalent WFA by replacing each failure transition by an ϵ-transition. To compensate for the formerly disallowed paths, new (dashed red) negatively weighted transitions are added. The result is promoted to a transducer T with the output label used to keep track of the previous state in A of the compensated positive transition.
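The transformation of Figure 4 can be illustrated on a single level of backoff. The sketch below is a simplified version of the Allauzen and Riley (2018) construction, written by us with a hypothetical arc encoding: each failure arc becomes an ε-arc, and negatively weighted arcs cancel the paths that the ε-arc wrongly re-allows.

```python
def remove_failure_arcs(arcs, phi='phi'):
    """arcs: list of (src, label, weight, dst), with at most one level of
    backoff (failure arcs do not chain, for brevity of this sketch).
    Returns the arcs of an equivalent WFA without failure transitions."""
    out_syms = {}
    for s, x, w, d in arcs:
        if x != phi:
            out_syms.setdefault(s, set()).add(x)
    result = []
    for s, x, w, d in arcs:
        if x != phi:
            result.append((s, x, w, d))
        else:
            result.append((s, 'eps', w, d))   # failure arc -> epsilon arc
            # A path taking the epsilon arc and then reading a symbol already
            # matchable at s was disallowed; cancel it with a negative arc.
            for s2, x2, w2, d2 in arcs:
                if s2 == d and x2 != phi and x2 in out_syms.get(s, ()):
                    result.append((s, x2, -w * w2, d2))
    return result
```

For a state q with arc a/0.6 and backoff φ/0.4 to a state r with arcs a/0.5 and b/0.5, the transformed machine assigns total weight 0.6 to reading a from q (0.6 + 0.4·0.5 − 0.2) and 0.2 to reading b, matching the failure semantics.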
Then

$$C(x, q) = \sum_{((q_s, q_a),\, x,\, q,\, w,\, (q_s', q_a')) \in E_T} \gamma_T(q_s, q_a)\, w, \quad x \in \Sigma \tag{20}$$

where e = (p[e], il[e], ol[e], w[e], n[e]) is a transition in T and γT(qs, qa) is the shortest distance from the initial state to (qs, qa) in T, computed over the real semiring as described in Allauzen and Riley (2018). Equation (20) is the weighted count of all paths in S ∩ A that begin at the initial state and end in any transition labeled with x leaving a state (qs, q), minus the weighted count of those paths that are disallowed by the failure transitions. Finally, we compute C(φ, q) as follows. The count mass entering a state q must equal the count mass leaving it:

$$\sum_{(q_a, x, 1, q) \in E_A} C(x, q_a) = \sum_{(q, x', 1, q_a') \in E_A} C(x', q) = \sum_{(q, x', 1, q_a') \in E_A,\; x' \in \Sigma} C(x', q) + C(\varphi, q)$$

Thus

$$C(\varphi, q) = \sum_{(q_a, x, 1, q) \in E_A} C(x, q_a) - \sum_{(q, x', 1, q_a') \in E_A,\; x' \in \Sigma} C(x', q)$$

This quantity can be computed iteratively in the topological order of states with respect to the φ-labeled transitions. The worst-case time and space complexity of the counting step for φ-WFAs is the same as for WFAs (Allauzen and Riley 2018).

4.1.3 Arbitrary Source and φ-WFA Target. In some cases, the source is a distribution with a possibly infinite number of states, for example, an LSTM. For such sources, computing C(x, q) can be computationally intractable, as Equation (19) requires a summation over all possible states Qs of the source. We propose a sampling approach to approximate C(x, q) in these cases. Let x(1), x(2), …, x(N) be independent random samples from ps.
Instead of C(x, q), we propose to use

$$\hat{C}(x, q) = \sum_{q_a \in B(q)} \sum_{q_s \in Q_s} \hat{\gamma}(q_s, q_a)\, p_s(x \mid q_s)\, \mathbb{1}_{q = q_a^x}, \quad x \in \Sigma$$

where

$$\hat{\gamma}(q_s, q_a) = \frac{1}{N} \sum_{j=1}^{N} \sum_{i \geq 1} \mathbb{1}_{q_s(x_i^{(j)}) = q_s,\; q_a(x_i^{(j)}) = q_a}$$

Observe that in expectation,

$$\mathbb{E}[\hat{\gamma}(q_s, q_a)] = \frac{1}{N} \sum_{j=1}^{N} \sum_{i \geq 1} \mathbb{E}\big[\mathbb{1}_{q_s(x_i^{(j)}) = q_s,\; q_a(x_i^{(j)}) = q_a}\big] = \sum_{i \geq 1} \sum_{x^i :\, q_s(x^i) = q_s,\; q_a(x^i) = q_a} p_s(x^i) = \gamma(q_s, q_a)$$

and hence $\hat{\gamma}(q_s, q_a)$ is an unbiased, asymptotically consistent estimator of γ(qs, qa). Given Ĉ(x, q), we compute C(φ, q) as in the previous section. If ℓ is the expected number of symbols per sample, then the computational complexity of counting is in O(Nℓ|Σ|) in expectation.

4.2 KL Divergence Minimization

4.2.1 WFA Target. When the target topology is a deterministic WFA, we use c(x, qa) from the previous section and Lemma 2 to immediately find the minimum KL divergence solution.

4.2.2 φ-WFA Target. When the target topology is a deterministic, backoff-complete φ-WFA, Lemma 3 can be applied in some circumstances to find the minimum KL divergence solution, but not in general. However, as noted before, the quantity in braces in the statement of Lemma 4 depends on the distribution pa only at state q, so the minimum of the KL divergence D(ps||pa*) can be found by maximizing that quantity independently for each state. Fix a state q, let yx ≜ pa(x|q) for x ∈ L[q], and let y ≜ [yx]x∈L[q].10 Then our goal reduces to

$$\operatorname*{argmax}_{y} \sum_{x \in L[q]} C(x, q) \log y_x - \sum_{q_0 \in B_1(q)} C(\varphi, q_0) \log \Big(1 - \sum_{x \in L[q_0] \cap \Sigma} y_x\Big) \tag{21}$$

subject to the constraints yx ≥ ϵ for x ∈ L[q] and ∑x∈L[q] yx = 1. This is a difference of two concave functions in y, since log(f(y)) is concave for any linear function f(y), the counts C(x, q) and C(φ, q0) are always non-negative, and a sum of concave functions is concave. We give a DC programming solution to this optimization problem (Horst and Thoai 1999). Let

$$\Omega = \Big\{ y : \forall x,\ y_x \geq \epsilon,\ \sum_{x \in L[q]} y_x \leq 1 \Big\}$$

and let u(y) = ∑x∈L[q] C(x, q) log yx and v(y) = ∑q0∈B1(q) C(φ, q0) log(1 − ∑x∈L[q0]∩Σ yx).
Then the optimization problem can be written as

$$\max_{y \in \Omega}\; u(y) - v(y)$$

The DC programming solution for such a problem uses an iterative procedure that linearizes the subtrahend of the concave difference about the current estimate and then solves the resulting concave objective for the next estimate (Horst and Thoai 1999), that is,

$$y^{n+1} = \operatorname*{argmax}_{y \in \Omega}\; u(y) - y \cdot \nabla v(y^n)$$

Substituting u and ∇v gives

$$y^{n+1} = \operatorname*{argmax}_{y \in \Omega} \sum_{x \in L[q]} C(x, q) \log y_x + y_x f(x, q, y^n) \tag{22}$$

where

$$f(x, q, y^n) = \sum_{q_0 \in B_1(q)} C(\varphi, q_0) \frac{\mathbb{1}_{x \in L[q_0] \cap \Sigma}}{1 - \sum_{x' \in L[q_0] \cap \Sigma} y_{x'}^n} \tag{23}$$

Observe that 1 − ∑x′∈L[q0]∩Σ yx′n ≥ ϵ, as the automaton is backoff-complete and yn ∈ Ω. Let C(q) be defined as

$$C(q) = \sum_{x' \in L[q]} C(x', q)$$

The following lemma provides the solution to the optimization problem in Equation (22), which leads to a stationary point of the objective.

Lemma 5
The solution to Equation (22) is given by

$$y_x^{n+1} = \max\Big( \frac{C(x, q)}{\lambda - f(x, q, y^n)},\ \epsilon \Big) \tag{24}$$

where

$$\lambda \in \Big[ \max_{x \in L[q]} \big( f(x, q, y^n) + C(x, q) \big),\ \max_{x \in L[q]} f(x, q, y^n) + \frac{C(q)}{1 - |L[q]|\,\epsilon} \Big]$$

is chosen such that ∑x yxn+1 = 1.

Proof. With KKT multipliers, the optimization problem can be written as

$$\max_{y,\, \lambda,\, \mu_x : \mu_x \leq 0} \sum_{x \in L[q]} \big( C(x, q) \log y_x + y_x f(x, q, y^n) \big) + \lambda \Big(1 - \sum_{x \in L[q]} y_x\Big) + \sum_{x \in L[q]} \mu_x (\epsilon - y_x)$$

We divide the proof into two cases depending on the value of C(x, q). First let C(x, q) ≠ 0. Differentiating with respect to yx and setting the derivative to zero, we get

$$y_x^{n+1} = \frac{C(x, q)}{\lambda + \mu_x - f(x, q, y^n)}$$

Furthermore, by the KKT conditions, μx(ϵ − yxn+1) = 0. Hence μx is non-zero only if yxn+1 = ϵ, and if μx is zero, then yxn+1 = C(x, q)/(λ − f(x, q, yn)). Because μx ≤ 0 for all x, for yxn+1 to be positive we need λ ≥ maxx f(x, q, yn). Hence, the above two conditions can be re-expressed as Equation (24). If C(x, q) = 0, then we get

$$f(x, q, y^n) = \lambda + \mu_x \quad \text{and} \quad \mu_x (\epsilon - y_x^{n+1}) = 0$$

and the solution is given by yxn+1 = ϵ and μx = f(x, q, yn) − λ. Since μx cannot be positive, we have f(x, q, yn) ≤ λ for all x. Hence, irrespective of the value of C(x, q), the solution is given by Equation (24). The above analysis restricts λ ≥ maxx f(x, q, yn).
If λ < f(x, q, yn) + C(x, q) for some x, then yxn+1 > 1, and if λ > maxx f(x, q, yn) + C(q)/(1 − |L[q]|ϵ), then ∑x yxn+1 < 1. Hence λ needs to lie in

$$\Big[ \max_{x \in L[q]} \big( f(x, q, y^n) + C(x, q) \big),\ \max_{x \in L[q]} f(x, q, y^n) + \frac{C(q)}{1 - |L[q]|\,\epsilon} \Big]$$

to ensure that ∑x yxn+1 = 1. □

From this, we form the KL-Minimization algorithm in Figure 5. Observe that if all the counts are zero, then u(y) − v(y) = 0 for any y, so any solution is optimal and the algorithm returns a uniform distribution over labels. In other cases, we initialize the model based on counts such that y0 ∈ Ω. We then repeat the DC programming update iteratively until convergence. Because Ω is a convex compact set and the functions u, v, and ∇v are continuous and differentiable in Ω, KL-Minimization converges to a stationary point (Sriperumbudur and Lanckriet 2009, Theorem 4). For each state q, the computational complexity of KL-Minimization is in O(|E[q]|) per iteration. Hence, if the maximum number of iterations per state is s, the overall computational complexity of KL-Minimization is in O(s|E|).

Figure 5
KL-Minimization algorithm.

4.3 Discussion

Our method for approximating from source φ-WFAs, specifically from backoff k-gram models, shares some similarities with entropy pruning (Stolcke 2000). Both can be used to approximate onto a more compact k-gram topology, and both use a KL divergence criterion to do so. They differ in that entropy pruning, by sequentially removing transitions, selects the target topology, and in doing so uses a greedy rather than global entropy reduction strategy. We will empirically compare these methods in Section 5.1 for this use case. Perhaps the closest to our work on approximating arbitrary sources is that of Deoras et al. (2011), who used an RNN LM to generate samples on which they then trained a k-gram LM. In contrast, we use samples to approximate the joint state probability γ(qs, qa).
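For concreteness, the per-state update at the heart of KL-Minimization (Lemma 5, iterated as in Figure 5) can be sketched as follows. This is our own illustrative Python, not the OpenGrm SFst implementation; the dictionary-based interface is hypothetical, and λ is located by simple bisection within the interval given in Lemma 5.

```python
def kl_min_state(C, C_phi, backoff_labels, eps=1e-6, iters=50, bisect=100):
    """Sketch of the per-state DC update of Lemma 5 (hypothetical interface).

    C: {label: count C(x, q)} over L[q] (the failure label may be a key);
    C_phi: {q0: C(phi, q0)} for states q0 backing off into q;
    backoff_labels: {q0: set of labels in L[q0] & Sigma} (subsets of C's keys)."""
    labels = list(C)
    total = sum(C.values())
    if total == 0:                      # all counts zero: uniform solution
        return {x: 1.0 / len(labels) for x in labels}
    y = {x: max(C[x] / total, eps) for x in labels}   # count-based init
    z = sum(y.values())
    y = {x: v / z for x, v in y.items()}
    for _ in range(iters):
        # f(x, q, y^n) of Equation (23)
        f = {x: sum(C_phi[q0] / (1.0 - sum(y[x2] for x2 in bl))
                    for q0, bl in backoff_labels.items() if x in bl)
             for x in labels}
        # bisect on lambda so that sum_x y_x^{n+1} = 1 in Equation (24)
        lo = max(f[x] + C[x] for x in labels)
        hi = max(f.values()) + total / (1.0 - len(labels) * eps)
        for _ in range(bisect):
            lam = 0.5 * (lo + hi)
            s = sum(max(C[x] / (lam - f[x]), eps) if C[x] > 0 else eps
                    for x in labels)
            lo, hi = (lam, hi) if s > 1.0 else (lo, lam)
        y = {x: max(C[x] / (lam - f[x]), eps) if C[x] > 0 else eps
             for x in labels}
    return y
```

With no states backing off into q, f vanishes and the update recovers the relative-frequency solution, for example y = {a: 0.75, b: 0.25} for counts C = {a: 3, b: 1}.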
If ℓ is the average number of words per sentence and N is the total number of sampled sentences, then the time complexity of the approach of Deoras et al. (2011) is 𝒪(Nℓ|Σ|), as it takes 𝒪(|Σ|) time to generate a sample from the neural model for each word. This is the same as that of our algorithm in Section 4.1.3. However, our approach has several advantages. First, for a given topology, our algorithm provably finds the KL-minimizing solution, whereas their approach is optimal only as the size of the k-gram model tends to infinity. Second, as we show in our experiments, the proposed algorithm performs better than that of Deoras et al. (2011). Arisoy et al. (2014) and Adel et al. (2014) both trained DNN models of different orders as a means to derive probabilities for a k-gram model, rather than focusing on model approximation as the above-mentioned methods do. Further, their methods are applicable specifically to k-gram models, not to the larger class of target WFAs to which our methods apply.

5. Experiments

We now provide experimental evidence of the theory's validity and show its usefulness in various applications. For ease of notation, we use WFA-Approx to denote the exact counting algorithm described in Section 4.1.2 followed by the KL-Minimization algorithm of Section 4.2. Similarly, we use WFA-SampleApprox(N) to denote the sampled counting described in Section 4.1.3 with N sampled sentences, followed by KL-Minimization. We first give experimental evidence that supports the theory in Section 5.1. We then show how to approximate neural models as WFAs in Section 5.2. We also use the proposed method to provide lower bounds on the perplexity achievable with a given target topology in Section 5.3. Finally, we present experiments in Section 5.4 addressing scenarios with source WFAs, which do not require sampling. First, motivated by low-memory applications such as (virtual) keyboard decoding (Ouyang et al. 2017), we use our approach to create compact language models in Section 5.4.1.
Then, in Section 5.4.2, we use our approach to create compact open-vocabulary character language models from count-thresholded n-grams. For most experiments (unless otherwise noted), we use the 1996 CSR Hub4 Language Model data, LDC98T31 (English Broadcast News data). We use the processed form of the corpus and further process it to downcase all words and remove punctuation. The resulting data set has 132M words in the training set, 20M words in the test set, and 240K unique words. For all experiments that use word models, we create a vocabulary of approximately 32K words, consisting of all words that appeared more than 50 times in the training corpus. Using this vocabulary, we create a trigram Katz model and prune it to 2M n-grams using entropy pruning (Stolcke 2000). We use this pruned model as the baseline in all our word-based experiments. We use Katz smoothing because it is amenable to pruning (Chelba et al. 2010). The perplexity of this model on the test set is 144.4.11 All algorithms were implemented using the open-source OpenFst, OpenGrm n-gram, and stochastic automata (SFst) libraries,12 with the last library including these implementations (Allauzen et al. 2007; Roark et al. 2012; Allauzen and Riley 2018).

5.1 Empirical Evidence of Theory

Recall that our goal is to find the distribution on a target DFA topology that minimizes the KL divergence to the source distribution. However, as stated in Section 4.2, if the target topology has failure transitions, the optimization objective is not convex, so the stationary point solution may not be the global optimum. We now show empirically that the model indeed converges to a good solution in various cases.

Idempotency. When the target topology is the same as the source topology, we show that the performance of the approximated model matches the source model. Let ps be the pruned Katz word model described above.
We approximate ps onto the same topology using WFA-Approx and WFA-SampleApprox(⋅) and then compute perplexity on the test corpus. The results are presented in Figure 6. The test perplexity of the WFA-Approx model matches that of the source model, and the performance of the WFA-SampleApprox(N) model approaches that of the source model as the number of samples N increases.

Figure 6
Idempotency: Test set perplexity for the Katz baseline and approximations of that baseline trained on the same data. The Katz baseline and Katz WFA-Approx plots are identical (and thus overlap), while the WFA-SampleApprox(N) plot converges to the baseline.

Comparison to Greedy Pruning. Recall that entropy pruning (Stolcke 2000) greedily removes n-grams such that the KL divergence to the original model ps is small. Let pgreedy be the resulting model and Agreedy be the topology of pgreedy. If KL-Minimization converges to a good solution, then approximating ps onto Agreedy should give a model that is at least as good as pgreedy. We show that this is indeed the case; in fact, approximating ps onto Agreedy performs better than pgreedy. As before, let ps be the 2M n-gram Katz model described above. We prune it to 1M n-grams and obtain pgreedy, which has a test perplexity of 157.4. We then approximate the source model ps onto Agreedy, the topology of pgreedy. The resulting model has a test perplexity of 155.7, which is smaller than the test perplexity of pgreedy. This shows that the approximation algorithm indeed finds a good solution. We repeat the experiment with different pruning thresholds and observe that the approximation algorithm provides a good solution across the resulting model sizes. The results are in Table 1.
Table 1
Test perplexity of greedy pruning and approximated models.

Model size   Greedy pruning   Approximated model
250K         205.7            198.3
500K         177.3            173.0
1M           157.4            155.7
1.5M         149.0            148.4

5.2 Neural Models to WFA Conversion

Since neural models such as LSTMs give improved performance over n-gram models, we investigate whether an LSTM distilled onto a WFA model can obtain better performance than the baseline WFA trained directly with Katz smoothing. As stated in the Introduction, this could then be used together with federated learning for fast and private on-device inference, as was done in Chen et al. (2019) using the methods presented here. To explore this, we train an LSTM language model on the training data. The model has 2 LSTM layers with 1,024 states and an embedding size of 1,024. The resulting model has a test perplexity of 60.5. We approximate this model as a WFA in three ways. The first way is to construct a Katz n-gram model on N LSTM samples and entropy-prune it to 2M n-grams, which we denote by WFA-SampleKatz(N). This approach is very similar to that of Deoras et al. (2011), except that we use Katz smoothing instead of Kneser-Ney smoothing. We use Katz because, as stated earlier, Kneser-Ney models are not amenable to pruning (Chelba et al. 2010). The second way is to approximate onto the baseline Katz 2M n-gram topology described above using WFA-SampleApprox(N). We refer to this experiment as WFA-SampleApprox(N) (KT), where KT stands for known topology. The results are shown in Figure 7. The WFA-SampleKatz(N) models perform significantly worse than the baseline Katz model even at 32M samples, while the WFA-SampleApprox(N) models have better perplexity than the baseline Katz model with as few as 1M samples. With 32M samples, this way of approximating the LSTM model as a WFA is 3.6 perplexity points better than the baseline Katz.
Figure 7
Approximated neural models for the English Broadcast News corpus. Test set perplexity for the Katz baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz model (WFA-SampleKatz(N)). The remaining two use our approximation algorithm (WFA-SampleApprox(N)) but with different topologies: one topology (KT) is known, using the baseline Katz topology, and one (UT) is unknown, using the samples drawn for WFA-SampleApprox(N) to also infer the topology.

Finally, if the WFA topology is unknown, we use the samples obtained in WFA-SampleApprox(⋅) to create a Katz model entropy-pruned to 2M n-grams. We refer to this experiment as WFA-SampleApprox(N) (UT), where UT stands for unknown topology. The results are shown in Figure 7. The approximated models obtained by WFA-SampleApprox(N) (UT) perform better than WFA-SampleKatz(N). However, they do not perform as well as WFA-SampleApprox(N) (KT), the models obtained with the known topology derived from the training data. Nevertheless, with enough samples, their performance is similar to that of the original Katz model. We then compare our approximation method with other methods that approximate DNNs with n-gram LMs (Arisoy et al. 2014; Adel et al. 2014). Both of these methods require training DNNs of different orders; we used DNNs with two layers of 1,024 nodes and an embedding size of 1,024. The results are in Table 2. The proposed algorithms WFA-SampleApprox(⋅) perform better than the existing approaches.
Table 2
Test perplexity of k-gram models obtained by different approximation methods for the English Broadcast News corpus. We use 32M samples for both WFA-SampleKatz(⋅) and WFA-SampleApprox(⋅).

Model                       Test perplexity
Katz                        144.4
WFA-SampleKatz(⋅)           151.8
Arisoy et al. (2014)        159.6
Adel et al. (2014)          146.7
WFA-SampleApprox(⋅) (KT)    140.8
WFA-SampleApprox(⋅) (UT)    143.9

Experiment on Polish. To repeat the above experiment on a different language and corpus, we turn to the Polish language section13 of the Europarl corpus (Koehn 2005). We chose Polish for this follow-up experiment because it belongs to a different language family than English, is relatively highly inflected (unlike English), and is found in a publicly available corpus. We use the processed form of the corpus and further process it to downcase all words and remove punctuation. The resulting data set has approximately 13M words in the training set and 210K words in the test set. The selected vocabulary has approximately 30K words, consisting of all words that appeared more than 20 times in the training corpus. Using this vocabulary, we create a trigram Katz model and prune it to 2M n-grams using entropy pruning (Stolcke 2000). We use this pruned model as the baseline. The results are in Figure 8. The trend is similar to that for the English Broadcast News corpus, with the proposed algorithm WFA-SampleApprox(⋅) performing better than the other methods. We also compare the proposed algorithms with the other neural approximation algorithms; the comparison results are shown in Table 3.

Figure 8
Approximated neural models for the Polish (Europarl) corpus. Test set perplexity for the Katz baseline and LSTM models approximated in three ways. One uses LSTM samples to build a new Katz model (WFA-SampleKatz(N)).
The remaining two use our approximation algorithm (WFA-SampleApprox(N)) but with different topologies: one topology (KT) is known, using the baseline Katz topology, and one (UT) is unknown, using the samples drawn for WFA-SampleApprox(N) to also infer the topology.

Table 3
Test perplexity of k-gram models obtained by different approaches for the Polish (Europarl) corpus. We use 32M samples for both WFA-SampleKatz(⋅) and WFA-SampleApprox(⋅).

Model                       Test perplexity
Katz                        185.5
WFA-SampleKatz(⋅)           189.3
Arisoy et al. (2014)        197.8
Adel et al. (2014)          187.1
WFA-SampleApprox(⋅) (KT)    181.4
WFA-SampleApprox(⋅) (UT)    177.0

5.3 Lower Bounds on Perplexity

Best Approximation for Target Topology. The neural model in Section 5.2 has a perplexity of 60.5, but the best perplexity for the approximated model is 140.8. Is there a better approximation algorithm for the given target topology? We place bounds on that next. Let T be the set of test sentences. The test-set log-perplexity of a model p can be written as

$$\frac{1}{|T|} \sum_{x^* \in T} \log \frac{1}{p(x^*)} = \sum_{x^*} \hat{p}_t(x^*) \log \frac{1}{p(x^*)}$$

where $\hat{p}_t$ is the empirical distribution of test sentences. Observe that the best model with topology A can be computed as

$$p_a' = \operatorname*{argmin}_{p_a \in \mathcal{P}(A)} \sum_{x^*} \hat{p}_t(x^*) \log \frac{1}{p_a(x^*)}$$

which is the model with topology A that has minimal KL divergence from the test distribution $\hat{p}_t$. This can be computed using WFA-Approx.
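The identity above is just a regrouping of repeated test sentences; a quick numerical check with toy values (illustrative only) makes this concrete:

```python
import math
from collections import Counter

def log_perplexity(test, p):
    """Average log-loss over the test sentences (left-hand side)."""
    return sum(math.log(1.0 / p[x]) for x in test) / len(test)

def cross_entropy(test, p):
    """Cross-entropy under the empirical test distribution (right-hand side)."""
    emp = Counter(test)
    n = len(test)
    return sum((c / n) * math.log(1.0 / p[x]) for x, c in emp.items())

test = ['a', 'b', 'a', 'a']     # toy "sentences"
p = {'a': 0.5, 'b': 0.25}       # toy model probabilities
assert abs(log_perplexity(test, p) - cross_entropy(test, p)) < 1e-12
```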
If we use this approach on the English Broadcast News test set with the 2M n-gram Katz topology, the resulting model has a perplexity of 121.1. This shows that, under the assumption that the algorithm finds the global KL divergence minimum, the test perplexity with this topology cannot be improved beyond 121.1, irrespective of the method.

Approximation onto Best Trigram Topology. If we approximate the LSTM onto the best trigram topology, how well does it perform on the test data? To test this, we build a trigram model from the test data and approximate the LSTM onto that trigram topology. The approximated model has 11M n-grams and a perplexity of 81. This shows that for large data sets, the largest shortfall of n-gram models in the approximation is due to the n-gram topology itself.

5.4 WFA Sources

While distillation of neural models is an important use case for our algorithms, there are other scenarios where approximations are derived from WFA sources. In such cases, no sampling is required and we can use the algorithms presented in Sections 4.1.1 and 4.1.2, namely, WFA-Approx. For example, it is not uncommon for applications to place maximum size restrictions on models, so that existing WFA models are too large and must be reduced in size prior to use. Section 5.4.1 presents a couple of experiments, motivated by such size restrictions, that focus on reducing the size of character language models. Another scenario with WFA sources arises when, for privacy reasons, no language model training corpus is available in a domain, but only minimum-count-thresholded (i.e., high-frequency) word n-gram counts are provided. In Section 5.4.2 we examine the estimation of open-vocabulary character language models from such data.

5.4.1 Creating Compact Language Models.

Creating Compact Models for Infrequent Words. In low-memory applications such as on-device keyboard decoding (Ouyang et al.
2017), it is often useful to have a character-level WFA representation of a large set of vocabulary words that act only as unigrams, for example, those words beyond the 32K words of our trigram model. We explore how to represent such a unigram-only model compactly. To demonstrate our approach, we take all the words in the training set (without a count cutoff) and build a character-level deterministic WFA of those words weighted by their unigram probabilities. This is represented as a tree rooted at the initial state (a trie). This automaton has 820K transitions. Storing this many transitions can be prohibitive; we can reduce the size in two steps. The first step is to minimize this WFA using weighted minimization (Mohri 2000) to produce pchar, which has topology Achar. Although pchar is already much smaller (it has 378K transitions, a 54% reduction), we can go further by approximating onto the minimal deterministic unweighted automaton, Minimize(Achar). This gives us a model with only 283K transitions, a further 25% reduction. Because Minimize(Achar) accepts exactly the same words as Achar, we are not corrupting our model by adding or removing any vocabulary items. Instead, we find an estimate that is as close as possible to the original, but that is constrained to the minimal deterministic representation preserving the vocabulary. To evaluate this approach, we randomly select a 20K-sentence subset of the original test set and represent each selected string as a character-level sequence. We evaluate using cross-entropy in bits per character, as is common for character-level models. The resulting cross-entropy for pchar is 1.557 bits per character. By comparison, the cross-entropy for pchar approximated onto Minimize(Achar) is 1.560 bits per character. In exchange for this small accuracy loss, we are rewarded with a model that is 25% smaller. Wolf-Sonkin et al.
(2019) used the methods presented here to augment the vocabulary of an on-device keyboard to deal with issues related to a lack of standard orthography.

Creating Compact WFA Language Models. Motivated by the previous experiment, we also consider applying (unweighted) minimization to Agreedy, the word-based trigram topology pruned to 1M n-grams described earlier. In Table 4 we show that applying minimization to Agreedy and then approximating onto the resulting topology leads to a 7% reduction in the number of transitions needed to represent the model. However, the test perplexity also increases somewhat. To control for this, we prune the original model to a 1.08M n-gram topology A′greedy instead of the 1M as before and apply the same procedure to obtain an approximation on Minimize(A′greedy). We achieve a 0.4% perplexity reduction compared to the approximation on Agreedy with very nearly the same number of transitions.

Table 4
Test perplexity of models when approximated onto smaller topologies.

Topology              Test perplexity   # Transitions
Agreedy               155.7             1.13M
Minimize(Agreedy)     156.4             1.05M
A′greedy              154.1             1.22M
Minimize(A′greedy)    154.9             1.13M

5.4.2 Count Thresholded Data for Privacy. One increasingly common scenario that can benefit from these algorithms is modeling from frequency-thresholded substring counts rather than raw text. For example, word n-grams and their frequencies may be provided from certain domains of interest only when they occur in at least k separate documents. With a sufficiently large k (say 50), no n-gram can be traced to a specific document, thus providing privacy in the aggregation. This is known as k-anonymity (Samarati 2001).
However, for any given domain, there are many kinds of models one may want to build depending on the task, some of which may be trickier to estimate from such a collection of word n-gram counts than with standard approaches that estimate from a given corpus. For example, character n-gram models can be of high utility for tasks like language identification, and have the benefit of a relatively small memory footprint and low latency in use. However, character n-gram models may be harder to learn from a k-anonymized corpus. Here we compare open-vocabulary character language models, which accept all strings in Σ* for a character vocabulary Σ, trained in several ways. Each approach relies on the training corpus and the 32K vocabulary, with every out-of-vocabulary word replaced by a single OOV symbol ★. Additionally, for each approach we add 50 to the unigram character count of every printable ASCII character, so that even those that are unobserved in the words of our 32K vocabulary have some observations. Our three approaches are:

1. Baseline corpus-trained models: We count character 5-grams from the k-anonymized corpus, then remove all n-grams that include the ★ symbol (in any position) prior to smoothing and normalization. Here we present both Kneser-Ney and Witten-Bell smoothed models, as both are popular for character n-gram models.

2. Word trigram sampled model: First we count word trigrams in the k-anonymized corpus and discard any n-gram with the ★ symbol (in any position) prior to smoothing and normalization. We then sample one million strings from a Katz smoothed model and build a character 5-gram model from these strings. We also use this as our target topology for the next approach.

3. Word trigram KL minimization estimation: We create a source model by converting the 2M n-gram word trigram model into an open-vocabulary model. We do this using a specialized construction related to the construction presented in Section 6 of Chen et al.
(2019), briefly described below, that converts the word model into a character sequence model. As this model is still closed-vocabulary (see below), we additionally smooth the unigram distribution with a character trigram model trained from the words in the symbol table (including the 50 extra counts for every printable ASCII character, as with the other methods). From this source model, we estimate a model on the sampled character 5-gram topology from the previous approach, using our KL minimization algorithm.

Converting a Word n-gram Source to a Character Sequence Source Model. Briefly, for every state q in the n-gram automaton, the set of words labeling transitions leaving q is represented as a trie of characters, including a final end-of-word symbol. Each transition labeled with the end-of-word symbol represents the last transition for the particular word spelled out by that sequence of transitions, and hence is assigned the same next state as the original word transition. If q has a backoff transition pointing to its backoff state q′, then each new internal state in the character trie leaving q backs off to the corresponding state in the character trie leaving q′. The presence of the corresponding state in the character trie leaving q′ is guaranteed because the n-gram automaton is backoff-complete, as discussed in Section 3.3. As stated above, this construction converts from word sequences to character sequences, but will only accept character sequences consisting of strings of in-vocabulary words; that is, it is still closed-vocabulary. To make it open-vocabulary, we further back off the character trie leaving the unigram state to a character n-gram model estimated from the symbol table (and the additional ASCII character observations). This is done using a construction very similar to that described above. The resulting model is used as the source model for our KL minimization algorithm, to estimate a distribution over the sampled character 5-gram topology.
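The trie layout just described can be sketched in a few lines. This is our own illustration with a hypothetical encoding, handling only the word-to-character-trie step for a single n-gram state ('#' stands for the end-of-word symbol):

```python
def build_char_trie(word_arcs, eow='#'):
    """word_arcs: {word: next_state} for the word transitions leaving one
    n-gram state q. Returns (trie, final): trie maps each character prefix
    to its outgoing characters, and final maps each complete word to the
    next state inherited from the original word transition."""
    trie, final = {}, {}
    for word, nxt in word_arcs.items():
        prefix = ''
        for ch in word:
            # one character transition per trie node; shared prefixes merge
            trie.setdefault(prefix, {})[ch] = prefix + ch
            prefix += ch
        trie.setdefault(prefix, {})[eow] = None   # end-of-word transition
        final[word] = nxt                         # keeps the word arc's next state
    return trie, final
```

For example, the words "cat" and "car" leaving the same state share the prefix "ca" in the trie, then branch on 't' and 'r', each end-of-word arc keeping its original word transition's destination.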
We encode the test set as a sequence of characters, without using the symbol table, because our models are intended to be open vocabulary. Following typical practice for open-vocabulary settings, we evaluate with bits-per-character. The results are presented in Table 5. Here we achieve slightly lower bits-per-character than even what we get straight from the corpus, perhaps due to better regularization of the word-based model than with either Witten-Bell or Kneser-Ney on the character n-grams.

Table 5
Comparison of character 5-gram models derived from either the original corpus or a word trigram model. Size of the models is presented in terms of the number of character n-grams, the numbers of states and transitions in the automaton representation, and the file size in MB. The two corpus-estimated models have the same topology, hence the same size; as do the two word-trigram-estimated models.

Source        n-grams (×1000)  states (×1000)  transitions (×1000)  MB   Estimation        bits/char
Corpus        336              60              381                  6.5  Kneser-Ney        2.04
                                                                        Witten-Bell (WB)  2.01
Word trigram  277              56              322                  5.6  Sampled (WB)      2.36
                                                                        KL min            1.99

6. Software Library

All of the algorithms presented here are available in the OpenGrm libraries at http://www.opengrm.org under the topic: SFST Library: operations to normalize, sample, combine, and approximate stochastic finite-state transducers. We illustrate the use of this library by showing how to implement some of the experiments in the previous section.

6.1 Example Data and Models

Instead of Broadcast News, we use the text of Oscar Wilde's The Importance of Being Earnest for our tutorial example. This is a small tutorial corpus that we make available at http://sfst.opengrm.org.
This corpus of 1,688 sentences and approximately 18,000 words has been upper-cased and had punctuation removed. The first 850 sentences were used as training data and the remaining 838 sentences as test data. From these, we produce two 1,000-word-vocabulary Katz bigram models, the ∼6k n-gram earnest_train.mod and the ∼4k n-gram earnest_test.mod. We also use relative-entropy pruning to create the ∼2k n-gram earnest_train.pru. The data, the steps to generate these models, and how to compute their perplexity using OpenGrm NGram are all fully detailed in the QuickTour topic at http://sfst.opengrm.org.

6.2 Computing the Approximation

The following step shows how to compute the approximation of a φ-WFA model onto a φ-WFA topology. In the example below, the first argument, earnest_train.mod, is the source model, and the second argument, earnest_train.pru, provides the topology. The result is a φ-WFA whose perplexity can be measured as before.

$ sfstapprox -phi_label=0 earnest_train.mod earnest_train.pru \
    >earnest_train.approx

An alternative, equivalent way to perform this approximation is to break it into two steps, with the counting and normalization (KL divergence minimization) done separately.

$ sfstcount -phi_label=0 earnest_train.mod earnest_train.pru \
    >earnest_train.approx_cnts
$ sfstnormalize -method=kl_min -phi_label=0 \
    earnest_train.approx_cnts >earnest_train.approx
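The two-step decomposition can be illustrated with a toy sketch in which the target topology is just a unigram model, so the KL-minimizing normalization reduces to normalizing expected symbol counts under the source (the general algorithm handles arbitrary backoff topologies). The toy source distribution below is an assumption for illustration only.

```python
# Toy sketch of counting followed by KL-minimizing normalization.
# Step 1 accumulates expected symbol counts under an (assumed) explicit
# source distribution over strings; step 2 normalizes the counts into a
# probability distribution, which for a unigram target topology is the
# KL-minimizing model.
from collections import Counter

source = {"ab": 0.5, "aab": 0.3, "b": 0.2}  # toy source over strings

# Step 1: counting -- expected count of each symbol under the source.
counts = Counter()
for string, prob in source.items():
    for ch in string:
        counts[ch] += prob

# Step 2: normalization -- per-symbol probabilities from expected counts.
total = sum(counts.values())
unigram = {ch: c / total for ch, c in counts.items()}

# Expected counts: a -> 0.5 + 2*0.3 = 1.1, b -> 0.5 + 0.3 + 0.2 = 1.0.
assert abs(unigram["a"] - 1.1 / 2.1) < 1e-9
assert abs(unigram["b"] - 1.0 / 2.1) < 1e-9
```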

We can now use these utilities to run experiments analogous to our larger Broadcast News experiments by using different target topologies. The results are shown in Table 6. In the idempotency experiment, the perplexity of the approximation on the same topology matches that of the source. In the greedy-pruning experiment, the approximation onto the greedily pruned topology yields a better perplexity than the greedily pruned model itself. Finally, the approximation onto the test-set bigram topology gives a better perplexity than the training-set model, since it includes all the relevant bigrams.
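The perplexity figures compared in these experiments follow the standard definition: the exponentiated average negative log-probability per token. A minimal sketch, assuming per-sentence probabilities and a token count as inputs (this is not the library's API):

```python
import math

# Perplexity as reported in Table 6: exp of the average negative
# log-probability per token over the test data. `sentence_probs` holds the
# total probability the model assigns to each test sentence (an assumed
# input format for this sketch), and `num_tokens` the total token count.

def perplexity(sentence_probs, num_tokens):
    total_log = sum(math.log(p) for p in sentence_probs)
    return math.exp(-total_log / num_tokens)

# Two sentences of probability 1/8 each over 6 tokens: average cost is
# one bit per token, i.e. perplexity 2.
assert abs(perplexity([0.125, 0.125], 6) - 2.0) < 1e-9
```

A lower perplexity on the same test data indicates a model that assigns higher probability to that data, which is the sense in which the approximations in Table 6 are compared.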

Table 6

Perplexities for example experiments.

Experiment                         Source Model       Topology           Approx. Perplexity
Idempotency                        earnest_train.mod  earnest_train.mod  73.41
Comparison to greedy pruning       earnest_train.mod  earnest_train.pru  83.12
Approx. onto best bigram topology  earnest_train.mod  earnest_test.pru   69.68

6.3 Available Operations

Table 7 lists the available command-line operations in the OpenGrm SFst library. We show command-line utilities here; there are corresponding C++ library functions that can be called from within a program; see http://sfst.opengrm.org.

Table 7

Available Operations in the OpenGrm SFst Library.

Operation                    Description
sfstapprox Approximates a stochastic φ-WFA wrt a backoff-complete φ-WFA.
sfstcount Counts from a stochastic φ-WFA wrt a backoff-complete φ-WFA.
sfstintersect Intersects two φ-WFAs.
sfstnormalize -method=global Globally normalizes a φ-WFA.
sfstnormalize -method=kl_min Normalizes a count φ-WFA using KL divergence minimization.
sfstnormalize -method=local Locally normalizes a φ-WFA.
sfstnormalize -method=phi φ-normalizes a φ-WFA.
sfstperplexity Computes perplexity of a stochastic φ-WFA.
sfstrandgen Randomly generates paths from a stochastic φ-WFA.
sfstshortestdistance Computes the shortest distance on a φ-WFA.
sfsttrim Removes useless states and transitions in a φ-WFA.

7. Summary and Discussion

In this article, we have presented an algorithm for minimizing the KL divergence between a probabilistic source model over sequences and a WFA target model. Because our algorithm is general enough to permit source models of arbitrary form (e.g., RNNs), deriving an appropriate WFA topology, something we do not really touch on in this article, is particularly important.

Notes

1

Model size or efficiency of inference may be a common motivation for the model approximations we present, which may suggest a demonstration of the speed/accuracy tradeoff for some downstream use of the models. However, as stated earlier in this Introduction, there are multiple reasons why an approximation may be needed, and our goal in this article is to establish that our methods provide a well-motivated approach should an approximation be required for any reason.

2

We define x0 = ϵ, the empty string, and adopt p(ϵ) = 0.

3

In the most general case, q(xn) = xn.

4

If |Q(ps)| is finite, it can be shown that H(ps) is necessarily finite.

5

γ is not (necessarily) a probability distribution over Qs × Qa. In fact, γ(qs, qa) can be greater than 1.

6

In other words, a φ label, like an ϵ label, acts as an identity element in string concatenation (Allauzen and Riley 2018).

7

The meaning of 𝒫(A) when A is φ-WFA is to interpret it as a WFA with the failure labels as regular symbols.

8

For brevity, we do not include ϵ in the notation of 𝒫(A).

9

The construction illustrated in Figure 4 is sufficient when SA is acyclic. In the cyclic case a slightly modified construction is needed to ensure convergence in the shortest distance calculation (Allauzen and Riley 2018).

10

We fix some total order on Σ ∪ {φ} so that y is well-defined.

11

For all perplexity measurements, we treat the unknown word as a single token instead of a class. To compute the perplexity with the unknown token treated as a class, multiply the perplexity by k^0.0115, where k is the number of tokens in the unknown class and 0.0115 is the out-of-vocabulary rate in the test data set.
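As a hedged illustration of this adjustment (the perplexity value and class size below are made up, not taken from the experiments):

```python
# Example of the class-based perplexity adjustment from this note:
# multiply the single-token perplexity by k ** 0.0115, where k is the
# size of the unknown class and 0.0115 is the test-set OOV rate.
# The values 70 and 1000 are illustrative assumptions.
single_token_ppl = 70.0
k = 1000
adjusted = single_token_ppl * k ** 0.0115
assert adjusted > single_token_ppl  # the class-based perplexity is larger
```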

12

These libraries are available at www.openfst.org and www.opengrm.org.

References

Adel, Heike, Katrin Kirchhoff, Ngoc Thang Vu, Dominic Telaar, and Tanja Schultz. 2014. Comparing approaches to convert recurrent neural networks into backoff language models for efficient decoding. In Fifteenth Annual Conference of the International Speech Communication Association (Interspeech), pages 651–655.

Aho, Alfred V. and Margaret J. Corasick. 1975. Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340.

Albert, Jürgen and Jarkko Kari. 2009. Digital image compression. In M. Droste, W. Kuich, and H. Vogler, editors, Handbook of Weighted Automata, Springer, pages 453–479.

Allauzen, Cyril, Mehryar Mohri, and Brian Roark. 2003. Generalized algorithms for constructing language models. In Proceedings of ACL, pages 40–47.

Allauzen, Cyril, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst Library. http://www.openfst.org.

Allauzen, Cyril and Michael D. Riley. 2018. Algorithms for weighted finite automata with failure transitions. In International Conference on Implementation and Application of Automata, pages 46–58.

Angluin, Dana. 1987. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106.

Angluin, Dana. 1988. Identifying languages from stochastic examples. Technical Report YALEU/DCS/RR-614, Yale University.

Arisoy, Ebru, Stanley F. Chen, Bhuvana Ramabhadran, and Abhinav Sethy. 2014. Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1):184–192.

Balle, Borja, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. 2014. Spectral learning of weighted automata. Machine Learning, 96(1–2):33–63.

Balle, Borja and Mehryar Mohri. 2012. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems, pages 2159–2167.

Breuel, Thomas M. 2008. The OCRopus open source OCR system. In Proceedings of IS&T/SPIE 20th Annual Symposium.

Carrasco, Rafael C. 1997. Accurate computation of the relative entropy between stochastic regular grammars. RAIRO-Theoretical Informatics and Applications, 31(5):437–444.

Carrasco, Rafael C. and José Oncina. 1994. Learning stochastic regular grammars by means of a state merging method. In International Colloquium on Grammatical Inference, pages 139–152.

Carrasco, Rafael C. and José Oncina. 1999. Learning deterministic regular grammars from stochastic samples in polynomial time. RAIRO-Theoretical Informatics and Applications, 33(1):1–19.

Chelba, Ciprian, Thorsten Brants, Will Neveitt, and Peng Xu. 2010. Study on interaction between entropy pruning and Kneser-Ney smoothing. In Eleventh Annual Conference of the International Speech Communication Association (Interspeech), pages 2422–2425.

Chen, Mingqing, Ananda Theertha Suresh, Rajiv Mathews, Adeline Wong, Françoise Beaufays, Cyril Allauzen, and Michael Riley. 2019. Federated learning of N-gram language models. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 121–130.

Chen, Stanley and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.

Cicchello, Orlando and Stefan C. Kremer. 2003. Inducing grammars from sparse data sets: A survey of algorithms and results. Journal of Machine Learning Research, 4(Oct):603–632.

Cortes, Corinna, Mehryar Mohri, Ashish Rastogi, and Michael Riley. 2008. On the computation of the relative entropy of probabilistic automata. International Journal of Foundations of Computer Science, 19(1):219–242.

Dempster, Arthur P., Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.

Deoras, Anoop, Tomáš Mikolov, Stefan Kombrink, Martin Karafiát, and Sanjeev Khudanpur. 2011. Variational approximation of long-span language models for LVCSR. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5532–5535.

Dupont, Pierre. 1996. Incremental regular inference. In International Colloquium on Grammatical Inference, pages 222–237.

Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme J. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

Ebden, Peter and Richard Sproat. 2015. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):333–353.

Eisner, Jason. 2001. Expectation semirings: Flexible EM for learning finite-state transducers. In Proceedings of the ESSLLI Workshop on Finite-state Methods in NLP, pages 1–5.

Giles, C. Lee, Clifford B. Miller, Dong Chen, Hsing-Hen Chen, Guo-Zheng Sun, and Yee-Chun Lee. 1992. Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3):393–405.

Gold, E. Mark. 1967. Language identification in the limit. Information and Control, 10(5):447–474.

Gold, E. Mark. 1978. Complexity of automaton identification from given data. Information and Control, 37(3):302–320.

Hard, Andrew, Kanishka Rao, Rajiv Mathews, Françoise Beaufays, Sean Augenstein, Hubert Eichner, Chloé Kiddon, and Daniel Ramage. 2018. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604.

Hellsten, Lars, Brian Roark, Prasoon Goyal, Cyril Allauzen, Françoise Beaufays, Tom Ouyang, Michael Riley, and David Rybach. 2017. Transliterated mobile keyboard input via weighted finite-state transducers. In FSMNLP 2017, pages 10–19.

Horst, Reiner and Nguyen V. Thoai. 1999. DC programming: Overview. Journal of Optimization Theory and Applications, 103(1):1–43.

Iglesias, Gonzalo, Cyril Allauzen, William Byrne, Adrià de Gispert, and Michael Riley. 2011. Hierarchical phrase-based translation representations. In EMNLP 2011, pages 1373–1383.

Jacobsson, Henrik. 2005. Rule extraction from recurrent neural networks: A taxonomy and review. Neural Computation, 17(6):1223–1263.

Katz, Slava M. 1987. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustic, Speech, and Signal Processing, 35(3):400–401.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, 5:79–86.

Konečný, Jakub, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492.

Lecorvé, Gwénolé and Petr Motlicek. 2012. Conversion of recurrent neural network language models to weighted finite state transducers for automatic speech recognition. In Thirteenth Annual Conference of the International Speech Communication Association (Interspeech), pages 1668–1671.

McMahan, Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282.

Mohri, Mehryar. 1997. String-matching with automata. Nordic Journal of Computing, 4(2):217–231.

Mohri, Mehryar. 2000. Minimization algorithms for sequential transducers. Theoretical Computer Science, 234(1–2):177–201.

Mohri, Mehryar. 2002. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350.

Mohri, Mehryar. 2009. Weighted automata algorithms. In M. Droste, W. Kuich, and H. Vogler, editors, Handbook of Weighted Automata. Springer, pages 213–254.

Mohri, Mehryar, Fernando C. N. Pereira, and Michael Riley. 2008. Speech recognition with weighted finite-state transducers. In J. Benesty, M. M. Sondhi, and Y. A. Huang, editors, Handbook on Speech Processing and Speech Communication. Springer, pages 559–584.

Novak, Josef R., Nobuaki Minematsu, and Keikichi Hirose. 2013. Failure transitions for joint n-gram models and G2P conversion. In Fourteenth Annual Conference of the International Speech Communication Association (Interspeech), pages 1821–1825.

Okudono, Takamasa, Masaki Waga, Taro Sekiyama, and Ichiro Hasuo. 2020. Weighted automata extraction from recurrent neural networks via regression on state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, 34:5306–5314.

Oncina, José and Pedro Garcia. 1992. Identifying regular languages in polynomial time. In Advances in Structural and Syntactic Pattern Recognition. World Scientific, pages 99–108.

Ouyang, Tom, David Rybach, Françoise Beaufays, and Michael Riley. 2017. Mobile keyboard input decoding with finite-state transducers. arXiv preprint arXiv:1704.03987.

Parekh, Rajesh and Vasant Honavar. 2000. Grammar inference, automata induction, and language acquisition. In R. Dale, H. Moisl, and H. Somers, editors, Handbook of Natural Language Processing, pages 727–764.

Pitt, Leonard. 1989. Inductive inference, DFAs, and computational complexity. In International Workshop on Analogical and Inductive Inference, pages 18–44.

Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66.

Samarati, Pierangela. 2001. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010–1027.

Sriperumbudur, Bharath K. and Gert R. G. Lanckriet. 2009. On the convergence of the concave-convex procedure. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 1759–1767.

Stolcke, Andreas. 2000. Entropy-based pruning of backoff language models. arXiv preprint cs/0006025.

Sundermeyer, Martin, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association (Interspeech), pages 194–197.

Suresh, Ananda Theertha, Brian Roark, Michael Riley, and Vlad Schogol. 2019. Distilling weighted finite automata from arbitrary probabilistic models. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing (FSMNLP), pages 87–97.

Tiño, Peter and Vladimir Vojtek. 1997. Extracting stochastic machines from recurrent neural networks trained on complex symbolic sequences. In Proceedings of the First International Conference on Knowledge-Based Intelligent Electronic Systems, volume 2, pages 551–558, IEEE.

Weiss, Gail, Yoav Goldberg, and Eran Yahav. 2018. Extracting automata from recurrent neural networks using queries and counterexamples. In International Conference on Machine Learning, pages 5244–5253.

Weiss, Gail, Yoav Goldberg, and Eran Yahav. 2019. Learning deterministic weighted automata with queries and counterexamples. In Advances in Neural Information Processing Systems, pages 8560–8571.

Wolf-Sonkin, Lawrence, Vlad Schogol, Brian Roark, and Michael Riley. 2019. Latin script keyboards for South Asian languages with finite-state normalization. In Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing (FSMNLP), pages 108–117.

Author notes

*

Some of the results in this article were previously presented in Suresh et al. (2019).

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits you to copy and redistribute in any medium or format, for non-commercial use only, provided that the original work is not remixed, transformed, or built upon, and that appropriate credit to the original source is given. For a full description of the license, please visit https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode.