Abstract
Probabilistic finite-state automata are a formalism that is widely used in many problems of automatic speech recognition and natural language processing. Probabilistic finite-state automata are closely related to other finite-state models such as weighted finite-state automata, word lattices, and hidden Markov models. Therefore, they share many similar properties and problems. Entropy measures of finite-state models have been investigated in the past in order to study the information capacity of these models. The derivational entropy quantifies the uncertainty that the model has about the probability distribution it represents. The derivational entropy of a finite-state automaton is computed from the probability that is accumulated in all of its individual state sequences. The computation of the entropy from a weighted finite-state automaton requires a normalized model. This article studies an efficient computation of the derivational entropy of left-to-right probabilistic finite-state automata, and it introduces an efficient algorithm for normalizing weighted finite-state automata. The efficient computation of the derivational entropy is also extended to continuous hidden Markov models.
1. Introduction
Probabilistic Finite-State Automata (PFA) and hidden Markov models (HMM) are well-known formalisms that have been widely used in automatic speech recognition (Ortmanns and Ney, 1997), machine translation (Ueffing, Och, and Ney, 2002), natural language processing (Mohri, Pereira, and Riley, 2002), and, more recently, handwritten text recognition (Romero, Toselli, and Vidal, 2012). PFA and HMM can be considered special cases of weighted finite-state automata (WFA) (Mohri, Pereira, and Riley, 2002; Dupont, Denis, and Esposito, 2005). PFA and HMM were extensively studied in Vidal et al. (2005), where interesting probabilistic properties were demonstrated. In formal language theory, automata are considered string acceptors, but PFA may also be considered generative processes (see Section 2.2 in Vidal et al. [2005]). In this article, we follow the point of view of Vidal et al. (2005) on this issue. PFA are closely related to word lattices (WL) (Tomita, 1986), which are currently a fundamental tool for many applications because WL convey most of the hypotheses produced by a decoder. WL have been used for parsing (Tomita, 1986), for computing confidence measures in speech recognition (Kemp and Schaaf, 1997; Ortmanns and Ney, 1997; Sanchis, Juan, and Vidal, 2012), machine translation (Ueffing, Och, and Ney, 2002), and handwritten text recognition (Puigcerver, Toselli, and Vidal, 2014), as well as for interactive transcription (Toselli, Vidal, and Casacuberta, 2011) and term detection (Can and Saraçlar, 2011). All of these applications require the WL to be correctly defined, and therefore knowing the stochastic properties of WL becomes very important.
An algorithm based on a matrix inversion was introduced in Grenander (1967) for computing Equation (3), and its time complexity is therefore cubic in the number of states. Note that, in the case of WL, the number of states can be in the thousands in practical situations, and therefore this solution is of limited practical interest. There is also interesting research related to the efficient computation of Equation (2). As mentioned, an efficient computation with HMM was proposed in Hernando, Crespi, and Cybenko (2005), and a generalized version was explored in Ilic (2011). When the word sequence is partially known, Mann and McCallum (2007) provide an algorithm for computing the entropy with conditional random fields. The derivational entropy of a probabilistic context-free grammar was investigated in Corazza and Satta (2007), where it was demonstrated that the derivational entropy coincides with the cross-entropy when the cross-entropy is used as the objective function for estimating the probabilities of the rules. The computation of the derivational entropy for finite-state models was studied in Nederhof and Satta (2008), but an approximate solution was stated as the solution of a linear system of equations. Note that Hernando, Crespi, and Cybenko (2005), Mann and McCallum (2007), and Ilic (2011) study the computation of Equation (2), whereas this article studies the computation of Equation (3), as in Grenander (1967).
Cubic algorithms with respect to the number of states have been proposed for computing the derivational entropy of PFA without restrictions (Corazza and Satta, 2007; Nederhof and Satta, 2008). This article presents an algorithm that is asymptotically more efficient for left-to-right PFA (and therefore for WL), that is, linear with respect to the number of edges. This algorithm is then extended to left-to-right HMM. If the PFA is obtained from a WFA, like a WL, then it has to be adequately normalized before computing the derivational entropy. In this article, we adopt the normalization described in Thompson (1974), and we improve its computation for left-to-right PFA. The proposed normalization guarantees that the relative weights of the state sequences are preserved after the normalization. Normalization of WFA and probabilistic grammars has been studied in several articles. Thompson (1974) proposed the normalization of probabilistic grammars, on which our normalization technique is based. Normalization of WFA has also been studied in Mohri (2009), who proposed the weight pushing algorithm. Grammar normalization is also investigated in Chi (1999) and Abney, McAllester, and Pereira (1999). Normalization of WL is also a necessary process for computing word confidence measures at the frame level for many purposes (Wessel et al., 2001).
This article is organized as follows: Section 2 specifies the notation related to PFA and the computation of like-forward and like-backward probabilities for these models. These forward and backward probabilities are not the classical ones, and they are necessary for subsequent computations. Models that may not be normalized, such as WL produced by a decoder, need to be normalized before computing the derivational entropy. This normalization is explained in Section 3, at the end of which we include a discussion of related normalization techniques. Section 4 explains the main contribution of this article, namely, the efficient computation of the derivational entropy. Section 5 extends the computation of the derivational entropy to continuous HMM.
2. Left-to-Right Probabilistic Finite-State Automata
We introduce the notation related to PFA that will be used in this article following Vidal et al. (2005).
For the sake of notation simplification, P is assumed to be extended with P(i, v, j) = 0 for all (i, v, j) ∉ δ. An automaton is said to be proper if it satisfies Equation (5).
In this article, we assume that all states are named with integers from 0 to |Q| − 1. For simplicity, we assume that a PFA has only one initial state, named 0, and, therefore, the sum in Equation (4) has only one term. We also assume, without loss of generality, that the PFA has only one final state, named |Q| − 1, without loops. These assumptions greatly simplify the notation. For practical reasons, and also to simplify the notation, we assume that the empty string is not in the language generated by the PFA. This last assumption implies that PFA with only one state are not allowed.
Definition 2. A valid path in a PFA is a path θ = (s₀, v₁, s₁)(s₁, v₂, s₂) … (sₖ₋₁, vₖ, sₖ) with nonzero probability from the initial state s₀ = 0 to the final state sₖ = |Q| − 1, for some w = v₁v₂ … vₖ. The set of valid paths in the PFA will be denoted as Θ.
The probability of generating w with the PFA is P(w) = Σ_{θ ∈ Θ(w)} P(θ), where Θ(w) ⊆ Θ is the set of all valid paths in the PFA for a given w. The language generated by a PFA is L = {w ∈ Σ⁺ : P(w) > 0}. A PFA is said to be consistent if Σ_{w ∈ L} P(w) = 1. A PFA is consistent if its states are useful, that is, if all states appear in at least one valid path (Vidal et al., 2005).
Definition 3. A left-to-right PFA3 is defined as a PFA in which each state can have loops but there are no other cycles; that is, if the loops are removed, then a directed acyclic automaton is obtained. The final state has no loops.
A left-to-right PFA has an equivalent left-to-right HMM4 representation and vice versa, and they share the characteristic that, once a state i has been reached, it is not possible to go back to a state j such that j < i. If the loops are removed from a left-to-right PFA, then a topological order can be induced on the resulting PFA. We call this order on the induced loop-free PFA a pseudo-topological order.5 From now on, we consider left-to-right PFA.
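For illustration, a pseudo-topological order can be obtained with a standard topological sort that simply skips the loops. The following Python sketch is ours, not the article's; it assumes a hypothetical edge-list representation of the transitions:

```python
from collections import defaultdict, deque

def pseudo_topological_order(num_states, edges):
    """Topological sort of a left-to-right PFA, ignoring its loops.

    `edges` is an iterable of transitions (i, v, j); loops (i == j)
    are skipped, so the remaining graph is a DAG (Definition 3).
    """
    succ = defaultdict(list)
    indeg = [0] * num_states
    for i, _, j in edges:
        if i != j:  # ignore loops
            succ[i].append(j)
            indeg[j] += 1
    queue = deque(q for q in range(num_states) if indeg[q] == 0)
    order = []
    while queue:
        q = queue.popleft()
        order.append(q)
        for r in succ[q]:
            indeg[r] -= 1
            if indeg[r] == 0:
                queue.append(r)
    return order
```

Different tie-breaking choices in the queue produce the different pseudo-topological orders mentioned in note 5; any of them is valid for the computations that follow.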
Proposition 1. For any i, 0 ≤ i < |Q|, it is fulfilled that .
Proof. Note that i affects the number of different states in the paths from i to |Q| − 1, and this is the reason why the previous sum ranges from l = 2 to |Q| − 1 − i + 1 = |Q| − i.
The proposition establishes that, given a consistent PFA, any PFA induced by a state together with the states reachable from it and the corresponding edges is also consistent.
Proposition 2.
Let us see an example of the computation of the like-forward and like-backward values with the left-to-right PFA in Figure 2. First, we show the computation of the like-forward values:
Second, we show the computation of the like-backward values:
Note that if the states are named with integers in increasing order according to the pseudo-topological order, then the like-forward values are null above the main diagonal. Something analogous can be demonstrated for the like-backward values, but, in that case, the null values are below the other diagonal.
The following proposition can be demonstrated from Proposition 1.
The like-forward and like-backward values are related to the values in(⋅) and out(⋅) in Nederhof and Satta (2008). These values can be efficiently computed with time complexity O(|δ|) by following the pseudo-topological order defined on the PFA. Note that |δ| is at most O(|Q|²|Σ|).
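As an illustration of this linear-time computation, the following Python sketch propagates probability mass through a left-to-right PFA in pseudo-topological order, absorbing the loops of each state through the geometric series 1/(1 − loop probability). The representation (a dictionary keyed by transitions) and the names are ours, and the article's like-forward values may be organized differently; the sketch only shows why the propagation costs O(|δ|):

```python
def like_forward(num_states, trans):
    """Probability mass accumulated from state 0 to every state.

    `trans[(i, v, j)]` holds P(i, v, j); states are assumed to be
    numbered 0..num_states-1 in pseudo-topological order.
    """
    loop = [0.0] * num_states            # accumulated loop probability
    out = [[] for _ in range(num_states)]
    for (i, v, j), p in trans.items():
        if i == j:
            loop[i] += p
        else:
            out[i].append((j, p))
    fwd = [0.0] * num_states
    fwd[0] = 1.0
    for i in range(num_states):          # each transition is touched once
        fwd[i] /= 1.0 - loop[i]          # absorb the loops of state i
        for j, p in out[i]:
            fwd[j] += fwd[i] * p
    return fwd
```

The like-backward values can be computed symmetrically, sweeping the states in reverse pseudo-topological order from the final state.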
3. Weighted Finite-State Automata Normalization
WFA are defined similarly to PFA, but the transition probabilities of a PFA are replaced by a weight function.
Definition 4. A WFA is a tuple ⟨Q, Σ, δ, I, W, F⟩, where: Q is a finite set of states; Σ is the alphabet; δ ⊆ Q × Σ × Q is a set of transitions; I : Q → ℝ≥0 is the weight of a state being an initial state; W : δ → ℝ≥0 is the weight associated with transitions between states; and F : Q → ℝ≥0 is the weight of a state being a final state.
As we previously did with PFA, we only consider WFA with one initial state and one final state. Each path through the WFA has an associated weight that is computed as the product of the weights of the edges in that path. In order to compute the derivational entropy as described in Section 4, it is necessary to guarantee that the automaton is consistent. Therefore, the weights of the edges in the WFA have to be normalized, because normalization is a sufficient condition for the WFA to be consistent and to become a PFA, provided that all states appear in at least one path.
It is a desirable property of the normalization process of a WFA to keep the relative weights of the paths unaffected once the WFA becomes a PFA. Note that this condition can be guaranteed only if the loops in each state give rise to an infinite sum that converges to some constant value. This is the case for WL, and, therefore, we only consider WFA for which this condition is fulfilled.
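For instance, if the loops of a state accumulate weight w, the infinite sum is the geometric series Σ_{k≥0} wᵏ, which converges to 1/(1 − w) when 0 ≤ w < 1 (e.g., to 1/0.6 ≈ 1.67 for w = 0.4) and diverges when w ≥ 1; in the latter case, the WFA cannot be normalized while preserving the relative weights of the paths.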
A normalization solution similar to the one proposed in Thompson (1974) for probabilistic regular grammars is adopted in this article and adapted to left-to-right finite-state automata.
The following definition is related to a definition introduced in Thompson (1974) for probabilistic grammars.
Definition 5. The normalizing vector of a WFA is a |Q| × 1 column vector t where each term t(i), 0 ≤ i < |Q| − 1, is defined such that, if all transition weights W(i, v, j) ∈ δ are multiplied by t(j)/t(i), with t(|Q| − 1) = 1, then the WFA is transformed into a proper PFA.
If i = j, then M(i, i) = rᵢ.
The following theorem for WFA is stated in Thompson (1974) for probabilistic regular grammars.
Note that t(|Q| − 1) = 1 according to Definition 5.
As a final remark, note how the normalization takes place: each transition weight W(i, v, j) in the WFA is changed inversely proportional to the weight accumulated in all the paths that start in i, that is, t(i), and directly proportional to the weight accumulated in all the paths that start in j, that is, t(j).
Expression (14) can be computed with an algorithm that is similar to the backward algorithm, followed by summing up all the values in row i. In this way, the time complexity for obtaining the normalizing vector is lower than the cubic time required by the matrix inversion.
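The following Python sketch illustrates one way of implementing this backward-style computation and the subsequent rescaling of the transition weights for a left-to-right WFA. The representation and the names are our assumptions rather than the article's notation; in particular, t[i] accumulates the weight of all the paths from state i to the final state, with the loops absorbed through a geometric series:

```python
def normalize_left_to_right(num_states, weights):
    """Turn a left-to-right WFA into a proper PFA.

    `weights[(i, v, j)]` holds W(i, v, j); states are numbered
    0..num_states-1 in pseudo-topological order, the final state
    has no loops, and every loop weight is assumed to be < 1 so
    that the accumulated path weights converge.
    """
    loop = [0.0] * num_states
    inc = [[] for _ in range(num_states)]
    for (i, v, j), w in weights.items():
        if i == j:
            loop[i] += w
        else:
            inc[j].append((i, w))
    # Backward pass: t[i] = weight of all paths from i to the end.
    t = [0.0] * num_states
    t[num_states - 1] = 1.0
    for j in range(num_states - 1, -1, -1):
        t[j] /= 1.0 - loop[j]
        for i, w in inc[j]:
            t[i] += w * t[j]
    # Rescale each weight inversely with t[i], directly with t[j].
    return {(i, v, j): w * t[j] / t[i]
            for (i, v, j), w in weights.items()}
```

One can check that the result is proper: the loops keep their weight, and the non-loop transitions leaving state i accumulate exactly 1 minus the loop weight of i, so the outgoing probabilities of every state sum to one.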
Let us see an example of the normalization process with the acyclic WFA of Figure 3. This WFA is neither proper nor consistent.
After the computations in Theorem 1, we obtain the normalizing vector. If we use the backward algorithm and sum up each row, then we obtain the same normalizing vector more efficiently:
If this normalizing vector is applied to the WFA of Figure 3, then the PFA of Figure 4 is obtained. The PFA of Figure 4 is proper and consistent.
This normalization technique is related to the weight pushing algorithm (Mohri, 2009) as follows: the value d[q] defined in expression (35) in Mohri (2009) is analogous to the value t(q) introduced in our Definition 5. Both values represent the accumulated weight of all state sequences from state q to the final state. The computations introduced in expressions (36)–(38) in Mohri (2009) are the same normalization that we explicitly introduce in the proof of Theorem 1 (see an interpretation of this normalization in the paragraph that follows our Equation (15)). The value t(q) is also related to the norm defined in Abney, McAllester, and Pereira (1999) and to the normalization described in Chi (1999; see Corollary 3).
One important contribution of our normalization algorithm with regard to the normalization algorithm presented in Mohri (2009) is that the time complexity for normalizing the graphs introduced in our article (left-to-right graphs with loops in the states) is O(|δ|), where δ is the set of edges of the automaton. Note that |δ| is at most |Q|²|Σ|, where Q is the set of states of the automaton. In terms of the discussion in Mohri (2009), the semiring used in our article is (ℝ, +, ×, 0, 1), which is a complete semiring. The normalization for semirings of this type is O(|Q|³) according to Mohri (2009). However, given the restricted graphs that we define, this complete semiring behaves as a k-closed semiring (see our Equations (12) and (13)) because the characteristic matrix M is upper triangular, and, therefore, the time complexity is O(|δ|).
The weight pushing algorithm is also related to the normalization algorithm described in Thompson (1974) as follows: the normalization in Thompson (1974) is applied to probabilistic regular grammars (see Theorem 6 in Thompson [1974]) and to probabilistic context-free grammars in Chomsky Normal Form (see Theorem 7 in Thompson [1974]), but here we only focus on probabilistic regular grammars. One important issue to be taken into account in Thompson (1974) is that the initial probabilistic regular grammar may not be proper (see Definition 3 in Thompson [1974]), but it must be consistent (see Definition 1 in Thompson [1974]). If the initial probabilistic regular grammar is not consistent, the normalization in Thompson (1974) may lead to a proper regular grammar that is not consistent. The main differences between our normalization proposal and Thompson (1974) are the following:
1. Thompson (1974) proposes a simultaneous normalization of the model that requires computing the inverse of a matrix similar to our Equation (10) (see the final expression in Theorem 8 in Thompson [1974]). Therefore, our proposal is more efficient for the left-to-right automata that we are dealing with.
2. Thompson (1974) points out a problem when the start symbol of the grammar is not proper (see Condition 4, page 611), although he mentions an "intermediation" operation to overcome this problem. As he states: "The conditions restricting these problems are left as an open problem." This problem is not present in our proposal, as we demonstrated in Theorem 1.
The relation between Thompson (1974) and Mohri (2009) regarding point 1 is that both normalization proposals are cubic for general regular models (grammars or automata). With regard to point 2, neither the normalization in Mohri (2009) nor ours has the problem of the consistency of the final model after normalization that may occur in Thompson (1974).
4. Derivational Entropy of a Left-to-Right PFA
The concept of derivational entropy was introduced in Grenander (1967) and was further studied in Soule (1974) for probabilistic grammars. Here we use a similar definition of derivational entropy for a left-to-right PFA.
The previous sum can have an infinite number of terms because of the loops in the states. We describe how the derivational entropy of a PFA can be efficiently computed.
Theorem 2. (Theorem 4.2 in Grenander [1967]) If the largest eigenvalue of the characteristic matrix M is strictly less than 1, then the derivational entropy of the model can be computed as ((I − M)⁻¹ h)₀, where h is the |Q| × 1 column vector of local state entropies, h(i) = −Σ_{(i,v,j) ∈ δ} P(i, v, j) log P(i, v, j), and the component 0 corresponds to the initial state.
According to Theorem 2, the derivational entropy can be computed with time complexity O(|Q|³) because of the inverse matrix computation. An analogous result about the computation of the derivational entropy is stated in Nederhof and Satta (2008) (see Lemma 8).
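Under our reading of Theorem 2 as stated above, the cubic computation can be sketched in Python with numpy; the names are ours, and the sketch assumes a consistent PFA whose characteristic matrix has largest eigenvalue strictly less than 1:

```python
import numpy as np

def derivational_entropy_cubic(num_states, trans):
    """Derivational entropy via Theorem 2 (our reading), O(|Q|^3).

    `trans[(i, v, j)]` holds P(i, v, j); state 0 is initial and
    state num_states-1 is final (no outgoing transitions).
    """
    M = np.zeros((num_states, num_states))   # characteristic matrix
    h = np.zeros(num_states)                  # local state entropies
    for (i, v, j), p in trans.items():
        M[i, j] += p
        if p > 0.0:
            h[i] -= p * np.log2(p)            # 0 log 0 = 0 by convention
    # Solving (I - M) x = h gives x = (I - M)^{-1} h; the component
    # of the initial state is the derivational entropy.
    x = np.linalg.solve(np.eye(num_states) - M, h)
    return x[0]
```

The linear-time algorithm of this section avoids this O(|Q|³) solve by exploiting the left-to-right structure, in the same way as the computation of the normalizing vector in Section 3.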
The two inner additions in Equation (18) can be interpreted as follows: given a path θ, a term log P(i, v, j) is accumulated each time the transition (i, v, j) is used in that path. We distinguish three different cases for computing these two inner additions: i) edges between different states, ii) the loops in the initial state, and iii) all the other loops.
Case iii in Equation (18), for a given i and v (0 < i ≤ |Q| − 2, v ∈ Σ), corresponds to the sum over all the paths that can be seen in Figure 6.
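The key observation for case iii is that the infinite sum over the number k of consecutive uses of a loop has a closed form. With p = P(i, v, i), the identities Σ_{k≥0} pᵏ = 1/(1 − p) and Σ_{k≥1} k·pᵏ = p/(1 − p)² hold whenever p < 1, so the contribution of each loop, which involves k copies of the term log p, can be evaluated in constant time; this matches Equation (18) up to the exact factors used in the article.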
5. Derivational Entropy of a Continuous Left-to-Right HMM
In this section, we extend the efficient computation of the derivational entropy to continuous left-to-right HMM. First, we introduce the notation that will be used for HMM following the notation in Romero, Toselli, and Vidal (2012).
Note that we are considering continuous left-to-right HMM with just one initial state and just one final state. The concept of left-to-right is analogous to Definition 3. The emission probabilities are associated with states, not with transitions. For continuous HMM, b(i, x) is a continuous probability density function (pdf).
Given a continuous left-to-right HMM, a similar algorithm for computing the derivational entropy can be defined. However, it is first necessary to compute the probability of all the real-valued sequences that can be generated in a loop.
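The underlying observation can be sketched as follows (our notation, with aᵢᵢ denoting the self-transition probability of state i): since each emission pdf integrates to one, integrating over all the real-valued sequences emitted during k consecutive uses of a loop leaves only the transition probabilities, aᵢᵢᵏ ∫ b(i, x₁) ⋯ b(i, xₖ) dx₁ ⋯ dxₖ = aᵢᵢᵏ, and the accumulated loop probability again reduces to the geometric series Σ_{k≥0} aᵢᵢᵏ = 1/(1 − aᵢᵢ).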
6. Conclusions
This article has studied the efficient computation of the derivational entropy for left-to-right PFA and HMM. This efficiency is mainly based on the left-to-right nature of the models. The computation relies on algorithms that are similar to the forward and backward algorithms defined for general finite-state models. The algorithms introduced in this article are also necessary for normalizing left-to-right WFA in order to guarantee that the derivational entropy is correctly computed.
Acknowledgments
This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943) and the MINECO/FEDER-UE project TIN2015-70924-C2-1-R. The second author was supported by the “División de Estudios de Posgrado e Investigación” of Instituto Tecnológico de León.
Notes
1. Although the concrete formal notation used in this article will be introduced in Section 2, in this introduction we assume that the reader is familiar with some concepts related to finite-state automata theory.
2. Throughout this article, we assume that 0 log 0 = 0. In addition, logarithm to base 2 is used in this article.
3. PFA of this type are also known as Bakis models (Bakis, 1976).
4. HMM will be defined formally in Section 5.
5. Note that different pseudo-topological orders can exist, but this is not relevant in this article. We will consider just one of them.