## Abstract

Probabilistic finite-state automata are a formalism that is widely used in many problems of automatic speech recognition and natural language processing. Probabilistic finite-state automata are closely related to other finite-state models such as weighted finite-state automata, word lattices, and hidden Markov models. Therefore, they share many similar properties and problems. Entropy measures of finite-state models have been investigated in the past in order to study the information capacity of these models. The derivational entropy quantifies the uncertainty that the model has about the probability distribution it represents. The derivational entropy in a finite-state automaton is computed from the probability that is accumulated in all of its individual state sequences. The computation of the entropy from a weighted finite-state automaton requires a normalized model. This article studies an efficient computation of the derivational entropy of left-to-right probabilistic finite-state automata, and it introduces an efficient algorithm for normalizing weighted finite-state automata. The efficient computation of the derivational entropy is also extended to continuous hidden Markov models.

## 1. Introduction

Probabilistic Finite-State Automata (PFA) and hidden Markov models (HMM) are well-known formalisms that have been widely used in automatic speech recognition (Ortmanns and Ney, 1997), machine translation (Ueffing, Och, and Ney, 2002), natural language processing (Mohri, Pereira, and Riley, 2002), and, more recently, in handwritten text recognition (Romero, Toselli, and Vidal, 2012). PFA and HMM can be considered special cases of weighted finite-state automata (WFA) (Mohri, Pereira, and Riley, 2002; Dupont, Denis, and Esposito, 2005). PFA and HMM were extensively researched in Vidal et al. (2005) and interesting probabilistic properties were demonstrated. In formal language theory, automata are considered as string acceptors, but PFA may be considered as generative processes (see Section 2.2 in Vidal et al. [2005]). We have followed the point of view of Vidal et al. (2005) in this article about this issue. PFA are closely related to word lattices (WL) (Tomita, 1986), which are currently a fundamental tool for many applications because WL convey most of the hypotheses produced by a decoder. WL have also been used for parsing (Tomita, 1986), for computing confidence measures in speech recognition (Kemp and Schaaf, 1997; Ortmanns and Ney, 1997; Sanchis, Juan, and Vidal, 2012), machine translation (Ueffing, Och, and Ney, 2002), and handwritten text recognition (Puigcerver, Toselli, and Vidal, 2014) for interactive transcription (Toselli, Vidal, and Casacuberta, 2011) and term detection (Can and Saraçlar, 2011). All of these applications require the WL to be correctly defined, and therefore, knowing the stochastic properties of WL becomes very important.

Sequence entropy (also known as **sentential entropy**) and derivational entropy for formal grammars were introduced in Grenander (1967). Given a finite-state automaton $A$, the sequence entropy^{1} of the model is defined as:^{2}

*w*, which is defined in Hernando, Crespi, and Cybenko (2005):

*w*. Note that Equation (2) requires that the condition $\sum_{\theta \in \Theta_A(w)} p_A(\theta \mid w) = 1.0$ must be fulfilled.

*aa* is:

An algorithm based on a matrix inversion was introduced in Grenander (1967) for computing Equation (3), and, therefore, its time complexity is cubic in the number of states. Note that in the case of WL, the number of states can be in the thousands in practical situations, and therefore this solution is not practical. There is also interesting research related to the efficient computation of Equation (2). As mentioned, an efficient computation with HMM was proposed in Hernando, Crespi, and Cybenko (2005), and a generalized version was explored in Ilic (2011). When the word sequence is partially known, Mann and McCallum (2007) provide an algorithm for computing the entropy with conditional random fields. The derivational entropy of a probabilistic context-free grammar was investigated in Corazza and Satta (2007), and it was demonstrated that the derivational entropy coincides with the cross-entropy when the cross-entropy is used as the objective function for estimating the probabilities of the rules. The computation of the derivational entropy for finite-state models was studied in Nederhof and Satta (2008), but an approximate solution was stated as the solution of a linear system of equations. Note that, in Hernando, Crespi, and Cybenko (2005), Mann and McCallum (2007), and Ilic (2011), the computation of Equation (2) is studied, whereas in this article the computation of Equation (3) is studied, as in Grenander (1967).

Cubic algorithms with respect to the number of states have been proposed for computing the derivational entropy of PFA without restrictions (Corazza and Satta, 2007; Nederhof and Satta, 2008). This article presents an algorithm that is asymptotically more efficient for left-to-right PFA (and therefore for WL), that is, linear with respect to the number of edges. This algorithm is then extended to left-to-right HMM. If the PFA is obtained from a WFA, like a WL, then it has to be adequately normalized for computing the derivational entropy. In this article, we adopt the normalization described in Thompson (1974), and we improve its computation for left-to-right PFA. The proposed normalization guarantees that the relative weights of the state sequences are preserved after the normalization. Normalization of WFA and probabilistic grammars has been studied in different articles. Thompson (1974) proposed the normalization of probabilistic grammars, on which our normalization technique is based. Normalization of WFA has also been studied in Mohri (2009), who proposed the *weight pushing* algorithm. Grammar normalization is also investigated in Chi (1999) and Abney, McAllester, and Pereira (1999). The normalization in WL is also a necessary process in order to compute word confidence measures at the frame level for many purposes (Wessel et al., 2001).

This article is organized as follows: Section 2 specifies the notation related to PFA, and the computation of forward-like and backward-like probabilities from these models. These forward and backward probabilities are not the classical probabilities, and they are necessary for subsequent computations. Models that may not be normalized, such as WL produced by a decoder, need to be normalized before computing the derivational entropy. This normalization is explained in Section 3. At the end of Section 3, we include a discussion on related normalization techniques. Section 4 explains the main contribution of this article, namely, the efficient computation of the derivational entropy. Section 5 extends the computation of the derivational entropy to continuous HMM.

## 2. Left-to-Right Probabilistic Finite-State Automata

We introduce the notation related to PFA that will be used in this article following Vidal et al. (2005).

**Definition 1.** A *PFA* is a tuple $A = \langle Q, \Sigma, \delta, I, F, P \rangle$, where: *Q* is a finite set of states; Σ is the alphabet; δ ⊆ *Q* × Σ × *Q* is a set of transitions; $I: Q \to \mathbb{R}^{\geq 0}$ is the probability function of a state being an initial state; $P: \delta \to \mathbb{R}^{\geq 0}$ is a probability function of transition between states; and $F: Q \to \mathbb{R}^{\geq 0}$ is the probability function of a state being a final state. *I*, *P*, and *F* are functions such that:

$$\sum_{i \in Q} I(i) = 1, \qquad (4)$$

$$\forall i \in Q: \quad F(i) + \sum_{v \in \Sigma,\, j \in Q} P(i, v, j) = 1. \qquad (5)$$

For the sake of notation simplification, *P* is assumed to be extended with *P*(*i*, *v*, *j*) = 0 for all (*i*, *v*, *j*) ∉ δ. An automaton is said to be **proper** if it satisfies Equation (5).

In this article, we will assume that all states are labeled with integers from 0 to |*Q*| − 1. For simplicity, we assume that the PFA has only one initial state, named 0, and, therefore, the sum in Equation (4) has only one term. We assume without loss of generality that the PFA has only one final state, named |*Q*| − 1, without loops. These assumptions greatly simplify the notation. For practical reasons, and also for simplifying the notation, we assume that the empty string is not in $L(A)$. This last assumption implies that PFA with only one state are not allowed.

*w* whose length is *k*; that is, there is a sequence of transitions (*i*_{0}, *v*_{1}, *i*_{1}), (*i*_{1}, *v*_{2}, *i*_{2}), …, (*i*_{k−1}, *v*_{k}, *i*_{k}) ∈ δ such that *w* = *v*_{1}*v*_{2}…*v*_{k}. We do not consider empty paths and therefore *k* ≥ 1. The probability of generating such a path is:

*i*, *v*, *j*) has been used in $\theta_A$.

**Definition 2.** A **valid path** in a PFA $A$ is a path $\theta_A = (i_0 = 0, v_1, i_1, v_2, i_2, \ldots, i_{k-1}, v_k, i_k = |Q| - 1)$ for some *w* = *v*_{1}*v*_{2} … *v*_{k}. The set of valid paths in $A$ will be denoted as $\Theta_A$.

The probability of generating *w* with $A$ is $p_A(w) = \sum_{\theta \in \Theta_A(w)} p_A(\theta)$, where $\Theta_A(w)$ is the set of all valid paths in $A$ for a given *w*. The language generated by a PFA $A$ is $L(A) = \{w : p_A(w) > 0\}$. A PFA is said to be consistent if $\sum_{w \in L(A)} p_A(w) = 1$. A PFA is consistent if its states are useful, that is, all states appear in at least one valid path (Vidal et al., 2005).
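The two quantities just defined can be illustrated with a minimal Python sketch. The automaton below is a hypothetical toy PFA chosen for illustration (it is not one of the article's figures), and the function names are assumptions of this sketch. `path_probability` multiplies the transition probabilities along one path, and `string_probability` sums the probabilities of all valid paths for a string by propagating probability mass symbol by symbol.

```python
from collections import defaultdict

# Hypothetical toy PFA (not an automaton from the article's figures).
# States are 0..2; state 0 is initial and state 2 is final (no loops).
# Each transition (i, v, j) maps to its probability P(i, v, j).
P = {
    (0, "a", 0): 0.2,
    (0, "a", 1): 0.5,
    (0, "b", 2): 0.3,
    (1, "a", 1): 0.4,
    (1, "b", 2): 0.6,
}
FINAL = 2


def path_probability(path):
    """Product of transition probabilities along a path written as
    (i0, v1, i1, v2, i2, ..., vk, ik)."""
    prob = 1.0
    for n in range(0, len(path) - 2, 2):
        prob *= P.get((path[n], path[n + 1], path[n + 2]), 0.0)
    return prob


def string_probability(w):
    """Sum of the probabilities of all valid paths that generate w."""
    mass = {0: 1.0}  # probability mass per state after reading a prefix
    for v in w:
        nxt = defaultdict(float)
        for (i, sym, j), p in P.items():
            if sym == v and i in mass:
                nxt[j] += mass[i] * p
        mass = nxt
    return mass.get(FINAL, 0.0)
```

For the string `"ab"`, this toy automaton has two valid paths, (0, a, 0, b, 2) and (0, a, 1, b, 2), so `string_probability("ab")` adds 0.2 × 0.3 and 0.5 × 0.6.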

**Definition 3.** A **left-to-right PFA**^{3} is defined as a PFA such that each state can have loops but it has no cycles, and if the loops are removed then a directed acyclic automaton is obtained. The final state has no loops.

A left-to-right PFA has an equivalent *left-to-right* HMM ^{4} representation and vice versa, and they share the characteristic that once a state *i* has been reached, then it is not possible to go back to state *j* such that *j* < *i*. If the loops are removed in a left-to-right PFA, then a topological order might be induced on the resulting PFA. In the induced loop-free PFA, we call this order *pseudo*-topological order.^{5} From now on, we consider left-to-right PFA.
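As an illustration of how such an order can be obtained, the following Python sketch drops the loops and applies a standard topological sort (Kahn's algorithm). The function name and the input representation (a list of `(i, v, j)` triples) are assumptions of this sketch, not notation from the article; the order it returns is just one of the possible *pseudo*-topological orders.

```python
from collections import deque


def pseudo_topological_order(num_states, transitions):
    """Return one topological order of the states after dropping loops.

    `transitions` is an iterable of (i, v, j) triples. A ValueError is
    raised if a cycle other than a loop exists, that is, if the
    automaton is not left-to-right.
    """
    successors = [set() for _ in range(num_states)]
    for i, _v, j in transitions:
        if i != j:  # loops are ignored
            successors[i].add(j)
    indegree = [0] * num_states
    for i in range(num_states):
        for j in successors[i]:
            indegree[j] += 1
    ready = deque(q for q in range(num_states) if indegree[q] == 0)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)
        for j in sorted(successors[i]):
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    if len(order) != num_states:
        raise ValueError("not a left-to-right automaton")
    return order
```

Renumbering the states according to the returned list gives the property used in the rest of this section: every transition between different states goes from a lower-numbered state to a higher-numbered one.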

*i*, and ending in *j* with *l* (*l* ≥ 1) different states (*i* and *j* inclusive). Note that *l* is not the length of the path, but the number of states that are different in each path from *i* to *j*. For example, in the PFA in Figure 2, the two paths (0, *a*, 1, *a*, 2) and (0, *a*, 0, *a*, 1, *a*, 1, *b*, 2) belong to $\Theta_A(0,2,3)$, because both start in state 0, arrive at state 2, and use three different states: 0, 1, and 2. But their lengths are different. The generated string when traversing a path is not relevant for the computations involved in this article, so, to simplify, the symbol information is omitted in subsequent definitions. We define the following probability:

*l* different states and ending in state *i*. This value is necessary for normalizing a WFA, as we describe subsequently. A similar expression to Equation (7) can be defined for suffixes:

*l* different states and starting in state *i*. This expression is also necessary for normalizing a WFA. Note that Equation (8) makes sense for *l* ≥ 2 because for *l* = 1 we have that $\Theta_A(|Q|-1,|Q|-1,1)$ is empty, since we assumed only one final state without loops. When *l* = |*Q*| then $p_A(\Theta_A(0,|Q|-1,|Q|)) = 1.0$.
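The distinction between the length of a path and its number of different states can be checked with a few lines of Python. The sketch below encodes the two example paths cited above for Figure 2 as tuples; the helper names are assumptions of this sketch.

```python
def states_of(path):
    """States visited by a path written as (i0, v1, i1, ..., vk, ik)."""
    return path[0::2]


def num_distinct_states(path):
    """The value l: how many different states the path uses."""
    return len(set(states_of(path)))


def path_length(path):
    """Number of transitions in the path."""
    return len(path) // 2


# The two example paths quoted in the text for the PFA of Figure 2.
short_path = (0, "a", 1, "a", 2)
long_path = (0, "a", 0, "a", 1, "a", 1, "b", 2)
```

Both paths use the three states 0, 1, and 2, so they fall in the same set indexed by *l* = 3, even though one has two transitions and the other has four.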

*r*, (*r* < 1), the addition of all loop probabilities in the same state, then we obtain:

*n* = 0, no loop in state is used and the probability of paths that go through the state is multiplied by 1 so its value is not changed.
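The treatment of loops rests on the geometric series: if *r* < 1 is the sum of all loop probabilities in a state, the mass of traversing the loops any number of times (including zero) is 1/(1 − *r*). The following Python sketch compares a truncated sum with the closed form; the loop probabilities are hypothetical values and the function names are assumptions of this sketch.

```python
def loop_mass(loop_probs, max_loops):
    """Partial sum of r**n for n = 0..max_loops, with r = sum(loop_probs)."""
    r = sum(loop_probs)
    return sum(r ** n for n in range(max_loops + 1))


def loop_mass_closed_form(loop_probs):
    """Limit of the geometric series, valid only when 0 <= r < 1."""
    r = sum(loop_probs)
    if not 0.0 <= r < 1.0:
        raise ValueError("the total loop probability must be in [0, 1)")
    return 1.0 / (1.0 - r)


# Hypothetical state with three loops, one per symbol.
example_loops = [0.2, 0.1, 0.1]
```

With `max_loops = 0` the partial sum is 1, which corresponds to the case *n* = 0 described above, where no loop is used.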

*a*, *b*, *c*}), with probabilities *a*_{1} = *P*(*i*, *a*, *i*), *b*_{1} = *P*(*i*, *b*, *i*), and *c*_{1} = *P*(*i*, *c*, *i*), respectively, then the result is analogous:

*i*, 0 ≤ *i* < |*Q*| − 1, is:

**Proposition 1.** For any *i*, 0 ≤ *i* < |*Q*|, it is fulfilled that $\sum_{l=2}^{|Q|-i} p_A(\Theta_A(i,|Q|-1,l)) = 1$.

*Proof.* Note that *i* affects the number of different states in the paths from *i* to |*Q*| − 1, and this is the reason why the previous sum ranges from *l* = 2 to |*Q*| − 1 − *i* + 1 = |*Q*| − *i*.

*l* and following the nodes in the reverse of the *pseudo*-topological order. Note that we start the proof for *i* = |*Q*| − 2 because the final state has no loops. In this case, the sum ranges from *l* = 2 to |*Q*| − (|*Q*| − 2) = 2:

*Q*| − 2 times the transitions to the final state.

*i* in the *pseudo*-topological order, such that 0 ≤ *i* < |*Q*| − 2,

The proposition establishes that, given a consistent PFA, any PFA induced by a node, the nodes reachable from it, and the corresponding edges is also consistent.

*forward* computation proposed for HMM (Vidal et al., 2005). We define $\hat{\alpha}_A(i,l)$, 0 ≤ *i* ≤ |*Q*| − 1 and 1 ≤ *l* ≤ |*Q*|, as the probability accumulated in all prefixes with paths each of which has *l* different states and reaching state *i*: $\hat{\alpha}_A(i,l) = p_A(\Theta_A(0,i,l))$. The computation of $\hat{\alpha}_A(\cdot,\cdot)$ can be performed with this new *forward* algorithm:

*r*_{i} is the addition of all loop probabilities in state *i*. If *i* = |*Q*| − 1, then *r*_{|Q|−1} = 0.

**Proposition 2.** $\sum_{l=1}^{|Q|} \hat{\alpha}_A(|Q|-1,l) = 1$.

*i* ≤ |*Q*| − 1 and 2 ≤ *l* ≤ |*Q*|,

*l* different states, start in state *i*, and reach the final state |*Q*| − 1. This expression can be computed with this *backward* algorithm:

Let us see an example of the computation of the $\hat{\alpha}_A(\cdot,\cdot)$ and $\hat{\beta}_A(\cdot,\cdot)$ values with the left-to-right PFA in Figure 2. First, we show the computation of the $\hat{\alpha}_A(\cdot,\cdot)$ values:

Second, we show the computation of the $\hat{\beta}_A(\cdot,\cdot)$ values:

Note that if each state is labeled with an integer in increasing order according to the *pseudo*-topological order, then $\hat{\alpha}_A(\cdot,\cdot)$ has null values above the main diagonal. Something analogous can be demonstrated for $\hat{\beta}_A(\cdot,\cdot)$, but, in that case, the null values are below the other diagonal.

The following proposition can be demonstrated from Proposition 1.

**Proposition 3.** For all *i*, 0 ≤ *i* ≤ |*Q*| − 1, we have:

Values $\hat{\alpha}_A(\cdot,\cdot)$ and $\hat{\beta}_A(\cdot,\cdot)$ are related to values *in*(⋅) and *out*(⋅) in Nederhof and Satta (2008). These values can be efficiently computed with time complexity *O*(|δ|) following the *pseudo*-topological order defined on $A$. Note that |δ| is at most *O*(|*Q*|^{2}|Σ|).
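The forward and backward values can be sketched in Python directly from their definitions as accumulated path probabilities. The recursions below are a reconstruction under stated assumptions, not a transcription of the article's algorithms: mass arriving at a state from lower-numbered states is multiplied by 1/(1 − *r*) for the loops of that state, the states are assumed to be numbered in *pseudo*-topological order, and the backward base case (value 1 at the final state for *l* = 1) is a convention of this sketch. The automaton is a hypothetical toy PFA, not the one in Figure 2. For simplicity the sketch iterates over every value of *l* for each edge, so it makes no attempt to match the complexity stated above.

```python
# Hypothetical toy left-to-right PFA (not the automaton of Figure 2).
# States are numbered in pseudo-topological order; 0 is initial, 2 is final.
P = {
    (0, "a", 0): 0.2,
    (0, "a", 1): 0.5,
    (0, "b", 2): 0.3,
    (1, "a", 1): 0.4,
    (1, "b", 2): 0.6,
}
NUM_STATES = 3


def split_transitions(P, n):
    """Separate the loop mass r[i] from the edges between different states."""
    r = [0.0] * n
    incoming = [[] for _ in range(n)]  # incoming[j] = [(i, p), ...], i < j
    outgoing = [[] for _ in range(n)]  # outgoing[i] = [(j, p), ...], i < j
    for (i, _v, j), p in P.items():
        if i == j:
            r[i] += p
        else:
            incoming[j].append((i, p))
            outgoing[i].append((j, p))
    return r, incoming, outgoing


def forward(P, n):
    """alpha[i][l]: mass of all prefixes from state 0 that reach state i
    using l different states (loops in state i included)."""
    r, incoming, _ = split_transitions(P, n)
    alpha = [[0.0] * (n + 1) for _ in range(n)]
    alpha[0][1] = 1.0 / (1.0 - r[0])
    for j in range(1, n):
        for l in range(2, n + 1):
            mass = sum(alpha[i][l - 1] * p for i, p in incoming[j])
            alpha[j][l] = mass / (1.0 - r[j])
    return alpha


def backward(P, n):
    """beta[i][l]: mass of all suffixes from state i to the final state
    using l different states (loops in state i included)."""
    r, _, outgoing = split_transitions(P, n)
    beta = [[0.0] * (n + 1) for _ in range(n)]
    beta[n - 1][1] = 1.0  # base case, a convention of this sketch
    for i in range(n - 2, -1, -1):
        for l in range(2, n + 1):
            mass = sum(p * beta[j][l - 1] for j, p in outgoing[i])
            beta[i][l] = mass / (1.0 - r[i])
    return beta
```

On this toy automaton the forward values at the final state add up to 1, as Proposition 2 states, and the backward values of every non-final state add up to 1 over *l*, as Proposition 1 states.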

## 3. Weighted Finite-State Automata Normalization

WFA are defined similarly to PFA, but the transition probability function of a PFA is replaced by a weight function.

**Definition 4.** A *WFA* is a tuple $A = \langle Q, \Sigma, \delta, I_W, F_W, W \rangle$, where: *Q* is a finite set of states; Σ is the alphabet; δ ⊆ *Q* × Σ × *Q* is a set of transitions; $I_W: Q \to \mathbb{R}^{\geq 0}$ is the weight of a state being an initial state; $W: \delta \to \mathbb{R}^{\geq 0}$ is the weight associated with transitions between states; and $F_W: Q \to \mathbb{R}^{\geq 0}$ is the weight of a state being a final state.

As we previously did with PFA, we only consider WFA with one initial state and one final state. Each path through the WFA has an associated weight that is computed as the product of the weights of the edges of that path. In order to compute the derivational entropy as described in Section 4, it is necessary to guarantee that the automaton is consistent. Therefore, the weights of the edges in the WFA have to be normalized, because the normalization is a sufficient condition for the WFA to be consistent and to become a PFA if all states appear at least in one path.

It is a desirable property for the normalization process of a WFA to keep the relative weights of the paths unaffected once the WFA becomes a PFA. Note that this condition can be guaranteed only if the loops in each state give rise to an infinite addition that converges to some constant value. This is the case of WL, and, therefore, we only consider WFA for which this condition is fulfilled.

In this article, we adopt a normalization similar to the one proposed in Thompson (1974) for probabilistic regular grammars, and we adapt it to left-to-right finite-state automata.

*Q*| − 1) × 1 column vector, where:

The following definition is related to a definition introduced in Thompson (1974) for probabilistic grammars.

**Definition 5.** The normalizing vector $N$ of a WFA $A$ is a |*Q*| × 1 column vector where each term $N(i)$, 0 ≤ *i* < |*Q*| − 1, is defined such that, if all transition weights *W*(*i*, *v*, *j*) ∈ δ are multiplied by $N(j)/N(i)$ with $N(|Q|-1) = 1$, then the WFA $A$ is transformed into a proper PFA.

**Definition 6.** Given a WFA $A$, we define the characteristic matrix *M* (Thompson, 1974) of dimensions |*Q*| × |*Q*| as:

If *i* = *j* then *M*(*i*, *i*) = *r*_{i}.

The following theorem for WFA is stated in Thompson (1974) for probabilistic regular grammars.

**Theorem 1.** Let $A$ be a WFA and *M*, ν, and $N$ be the corresponding characteristic matrix, final vector, and normalizing vector, where *M* and ν are known, and where the row and column of the characteristic matrix associated with state |*Q*| − 1 have been removed. Then $N$ can be computed as:

$$N = (I - M)^{-1}\,\nu. \qquad (10)$$

*Proof.* We have to demonstrate how to obtain Equation (10) and that the relative weights of the paths are unaffected. First, we demonstrate how to obtain Equation (10) as in Thompson (1974). By definition of $N$, normalization takes place by changing the transition weights *W*(*i*, *v*, *j*), 0 ≤ *i*, *j* < |*Q*| − 1, to $W(i,v,j)\,N(j)/N(i)$. When *j* = |*Q*| − 1, the new weight is $W(i,v,|Q|-1)/N(i)$. Then, the new weights have to add up to 1 for all *i*. That is:

(*I* − *M*)^{−1} is upper triangular with non-null diagonal for left-to-right WFA.

*k*:

Note that $N(|Q|-1) = 1$ according to Definition 5.

*M*^{k}(*i*, *j*) is in fact the addition of the weights of all paths from *i* to *j* (*i*, *j* ≠ |*Q*| − 1) with length *k* that have at most *k* + 1 different states. Therefore, if *i* = *j*, *i* ≠ |*Q*| − 1, expression (11) becomes:

*i* ≠ *j*, then Equation (11) becomes:

*k* ≠ |*Q*| − 1 because, as stated in Theorem 1, the row and column of the characteristic matrix *M* associated with state |*Q*| − 1 have been removed. Therefore:

*i* to |*Q*| − 1. Note that $N(0)$ represents the weight accumulated in all paths in the WFA $W_A(\Theta_A(0,|Q|-1,\cdot))$ (as we mentioned in Theorem 1):

As a final remark, note how the normalization takes place: Each transition weight *W*(*i*, *v*, *j*) in the WFA $A$ is scaled in inverse proportion to the weight accumulated in all the paths that start in *i*, that is, $N(i)$, and in direct proportion to the weight accumulated in all the paths that start in *j*, that is, $N(j)$.

Expression (14) can be computed with an algorithm that is similar to the backward algorithm, and then summing up for all values in row *i*. In this way, the time complexity for obtaining the normalizing vector is less than the cubic time required by the matrix inversion.
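A minimal Python sketch of this normalization for a left-to-right WFA follows. It is an illustration under stated assumptions rather than the article's algorithm: the states are assumed to be numbered in *pseudo*-topological order, every state is assumed to reach the final state, and the normalizing vector is obtained by back-substitution on the condition that the new outgoing weights of each state add up to 1, that is, $N(i)\,(1 - r_i) = \sum_{j > i} W(i,\cdot,j)\,N(j)$ with $N(|Q|-1) = 1$. The weights in `EXAMPLE_W` are hypothetical and the function names are assumptions of this sketch.

```python
def normalizing_vector(W, n):
    """N[i]: weight accumulated in all paths from state i to the final
    state n - 1, computed in one backward pass over the edges."""
    r = [0.0] * n
    outgoing = [[] for _ in range(n)]  # outgoing[i] = [(j, w), ...], i < j
    for (i, _v, j), w in W.items():
        if i == j:
            r[i] += w
        else:
            outgoing[i].append((j, w))
    N = [0.0] * n
    N[n - 1] = 1.0
    for i in range(n - 2, -1, -1):
        if r[i] >= 1.0:
            raise ValueError("the loop weights of a state must add up to < 1")
        N[i] = sum(w * N[j] for j, w in outgoing[i]) / (1.0 - r[i])
    return N


def normalize(W, n):
    """Multiply every weight W(i, v, j) by N(j) / N(i), as in Definition 5.
    Assumes every state reaches the final state, so that N[i] > 0."""
    N = normalizing_vector(W, n)
    return {(i, v, j): w * N[j] / N[i] for (i, v, j), w in W.items()}


# Hypothetical toy WFA (not the automaton of Figure 3).
EXAMPLE_W = {
    (0, "a", 0): 0.5,
    (0, "a", 1): 2.0,
    (0, "b", 2): 1.0,
    (1, "a", 1): 0.25,
    (1, "b", 2): 3.0,
}
```

In this toy example the path (0, a, 1, b, 2) has weight 6 before normalization and probability 6/18 afterwards, and the path (0, b, 2) goes from weight 1 to probability 1/18, so every path weight is divided by the same constant $N(0) = 18$ and the relative weights are preserved.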

Let us see an example of the normalization process with the acyclic WFA of Figure 3. This WFA is neither proper nor consistent.

After computations in Theorem 1 and given that $N(|Q|-1) = 1.0$, we get: $N^T = (0.2154, 0.2332, 0.3571, 0.5712, 1.0)$. If we use the backward algorithm and we sum up for each row, then we obtain the same normalizing vector but more efficiently:

If this normalizing vector is applied to the WFA of Figure 3, then the PFA of Figure 4 is obtained. The PFA of Figure 4 is proper and consistent.

This normalization technique is related to the weight pushing algorithm (Mohri, 2009) as follows: The value *d*[*q*] defined in expression (35) in Mohri (2009) is analogous to the value $N(q)$ that is introduced in our Definition 5. Both values represent the accumulated weight in all state sequences from state *q* to the final state. The computations introduced in expressions (36)–(38) in Mohri (2009) are the same normalization that we explicitly introduce in the proof of Theorem 1 (see an interpretation of this normalization in the paragraph that follows our Equation (15)). The value $N(q)$ is also related to the norm defined in Abney, McAllester, and Pereira (1999) and with the normalization described in Chi (1999; see Corollary 3).

One important contribution of our normalization algorithm with regard to the normalization algorithm presented in Mohri (2009) is that the time complexity for normalizing the graphs introduced in our article (left-to-right graphs with loops in the states) is *O*(|δ|), where δ is the set of edges of the automaton. Note that |δ| ≤ |*Q*|^{2}, where *Q* is the set of states in the automaton. In terms of the discussion in Mohri (2009), the semiring used in our article is (*R*, +, ×, 0, 1) and it is a complete semiring. The normalization for semirings of this type is *O*(|*Q*|^{3}) according to Mohri (2009). However, given the restricted graphs that we define, this complete semiring behaves as a *k*-closed semiring (see our Equations (12) and (13)) because the characteristic matrix *M* is an upper triangular matrix, and, therefore, the time complexity is *O*(|δ|).

The weight pushing algorithm is also related to the normalization algorithm described in Thompson (1974) as follows: The normalization in Thompson (1974) is applied to probabilistic regular grammars (see Theorem 6 in Thompson [1974]) and to probabilistic context-free grammars in Chomsky Normal Form (see Theorem 7 in Thompson [1974]), but here we only focus on probabilistic regular grammars. One important issue in Thompson (1974) to be taken into account is that the initial probabilistic regular grammar may not be proper (see Definition 3 in Thompson [1974]), but consistent (see Definition 1 in Thompson [1974]). If the initial probabilistic regular grammar is not consistent, the normalization in Thompson (1974) may lead to a proper regular grammar that is not consistent. The main differences between our normalization and that of Thompson (1974) are the following:

1. Thompson (1974) proposes a simultaneous normalization of the model that requires the computation of the inverse of a matrix that is similar to our Equation (10) (see the final expression in Theorem 8 in Thompson [1974]). Therefore, our proposal is more efficient for the given left-to-right automata that we are dealing with.

2. Thompson (1974) points out a problem when the start symbol of the grammar is not proper (see Condition 4, page 611), although he mentioned an “intermediation” operation to overcome this problem. As he mentioned: “The conditions restricting these problems are left as an open problem.” This problem is not present in our proposal, as we demonstrated in Theorem 1.

The relation between Thompson (1974) and Mohri (2009) related to point 1 is that both normalization proposals are cubic for general regular models (grammars or automata). With regard to point 2, normalization in Mohri (2009) and our normalization do not have the problem of the consistency of the final model after normalization as may occur in Thompson (1974).

## 4. Derivational Entropy of a Left-to-Right PFA

The concept of **derivational entropy** was introduced in Grenander (1967) and was further studied in Soule (1974) for probabilistic grammars. Here we use a similar definition of **derivational entropy** for a Left-to-Right PFA.

**Definition 7.** The derivational entropy of a PFA $A$ is defined as:

The previous sum can have an infinite number of terms because of the loops in the states. We describe how the derivational entropy of a PFA can be efficiently computed.

*i* < |*Q*| − 1, as in Soule (1974):

*Q*| − 1) = 0.

**Theorem 2.** (**Theorem 4.2 in Grenander [1967]**) If the largest eigenvalue of the characteristic matrix *M* is strictly less than 1, then the derivational entropy of the model can be computed as: $H_d(A) = (I - M)^{-1}\,\xi$.

*M* is obtained from a left-to-right PFA, then the largest eigenvalue of *M* is guaranteed to be strictly less than 1 (see Theorem 5 in Wetherell [1980]). The main idea in Theorem 2 is to compute the average number of times that state *i* has been used in all valid paths of $A$ times the entropy introduced by transitions leaving from that state according to Equation (16). Because we are dealing with left-to-right PFA with only one initial state, only this initial state is relevant for computing the derivational entropy, which then becomes the following scalar:

According to Theorem 2, the derivational entropy can be computed with time complexity *O*(|*Q*|^{3}) given the inverse matrix computation. An analogous result about the computation of the derivational entropy is stated in Nederhof and Satta (2008) (see Lemma 8).
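For a left-to-right PFA, the linear system behind Theorem 2 is triangular, so it can also be solved by back-substitution in one pass over the edges. The Python sketch below illustrates that idea; it is not the decomposition based on forward and backward values that this section develops. It assumes that the states are numbered in *pseudo*-topological order, that the entropy term of a state is $-\sum P(i,v,j)\log_2 P(i,v,j)$ over the transitions leaving it (with value 0 at the final state), and that base-2 logarithms are used, as stated in the notes. The function name is an assumption of this sketch.

```python
import math


def derivational_entropy(P, n):
    """Solve H = M H + xi by back-substitution and return H[0].

    P maps transitions (i, v, j) to probabilities; state 0 is initial
    and state n - 1 is final, with no outgoing transitions.
    """
    r = [0.0] * n    # total loop probability per state
    xi = [0.0] * n   # entropy of the transitions leaving each state
    outgoing = [[] for _ in range(n)]  # outgoing[i] = [(j, p), ...], i < j
    for (i, _v, j), p in P.items():
        if p > 0.0:  # convention 0 log 0 = 0
            xi[i] -= p * math.log2(p)
        if i == j:
            r[i] += p
        else:
            outgoing[i].append((j, p))
    H = [0.0] * n    # H[n - 1] = 0 at the final state
    for i in range(n - 2, -1, -1):
        H[i] = (xi[i] + sum(p * H[j] for j, p in outgoing[i])) / (1.0 - r[i])
    return H[0]
```

As a sanity check, consider a hypothetical two-state PFA with a loop of probability 0.5 in state 0 and an exit of probability 0.5: its paths have probabilities 1/2, 1/4, 1/8, and so on, and the entropy of that distribution is 2 bits, which is the value the function returns.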

The two inner additions in Equation (18) can be interpreted as follows: Given a path $\theta \in \Theta_A(0,|Q|-1,l)$, the probability $p_A(\theta)$ is accumulated each time the transition (*i*, *v*, *j*) is used in that path. We distinguish three different cases for computing these two inner additions: i) edges between different states, ii) the loops in the initial state, and iii) all the other loops.

*i* < *j*. Therefore, taking into account the numeration of the states according to the *pseudo*-topological order, Equation (18) for these nodes becomes:

Case iii in Equation (18) for a given *i* and *v* (0 < *i* ≤ |*Q*| − 2, *v* ∈ Σ) corresponds to the addition of all paths that can be seen in Figure 6.

*i* (1/(1 − *r*_{i})) multiplied by the probability of transition (*i*, *v*, *i*) multiplied again by the product of all strings that can be composed traversing the loops in state *i*.

*O*(|δ|); therefore, the time complexity of computing the derivational entropy with this method is *O*(|δ|). Note that this time complexity is clearly better than that of the method proposed in Grenander (1967), which is cubic. Note that for dense graphs the time complexity of the newly proposed method is at most quadratic in the number of nodes in the PFA.

(*I* − *M*)^{−1} is:

## 5. Derivational Entropy of a Continuous Left-to-Right HMM

In this section, we extend the efficient computation of the derivational entropy to continuous left-to-right HMM. First, we introduce the notation that will be used for HMM following the notation in Romero, Toselli, and Vidal (2012).

**Definition 8.** A continuous HMM is defined as a tuple $H = (Q, I, F, X, a, b)$ where *Q* is a finite set of states; *I* is the initial state, *I* ∈ *Q*; *F* is the final state with *F* ∈ *Q*; and *X* is a real *d*-dimensional space of observations: $X \subseteq \mathbb{R}^{d}$. To make the subsequent equations simpler, we will assume that *X* is a scalar (i.e., $X \subseteq \mathbb{R}$). The extension to multidimensional *X* is straightforward.

*a* is the state-transition probability function, such that for all *i*, 0 ≤ *i* < |*Q*| − 1:

*b* is an emission probability distribution function such that for all *i*, 0 ≤ *i* < |*Q*| − 1

Note that we are considering continuous left-to-right HMM with just one initial state and just one final state. The concept of left-to-right is analogous to Definition 3. The emission probabilities are associated with states, not with transitions. For continuous HMM, *b*(*i*, *x*) is a continuous probability density function (pdf).

Given a continuous left-to-right HMM $H$, a similar algorithm for computing $\hat{\alpha}_H(\cdot,\cdot)$ can be defined. However, first, it is necessary to compute the probability of all real sequences that can be generated in a loop.

*i* be a state of a HMM $H$ with *r*_{i} as the loop probability, and *b*(*i*, *x*) as the associated pdf. To compute the probability of all real sequences in the loop, we start from a discrete case by sampling the pdf and then use limits to extend it to continuous:

*i*, which for the case of Gaussian mixture models, which are commonly used in practice, the entropy can be approximated (Huber et al., 2008).

## 6. Conclusions

This article has studied the efficient computation of the derivational entropy for left-to-right PFA and HMM. This efficiency is mainly based on the left-to-right nature of the models. The computation relies on algorithms that are similar to the *forward* and *backward* algorithms defined for general finite-state models. The algorithms that are introduced in this article are also necessary for normalizing the left-to-right WFA in order to guarantee that the derivational entropy is correctly computed.

## Acknowledgments

This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943) and the MINECO/FEDER-UE project TIN2015-70924-C2-1-R. The second author was supported by the “División de Estudios de Posgrado e Investigación” of Instituto Tecnológico de León.

## Notes

1. Although the concrete formal notation used in this article will be introduced in Section 2, in this introduction we assume that the reader is familiar with some concepts related to finite-state automata theory.

2. Throughout this article, we assume that 0 log 0 = 0. In addition, logarithm to base 2 is used in this article.

3. PFA of this type are also known as Bakis models (Bakis, 1976).

4. HMM will be defined formally in Section 5.

5. Note that different *pseudo*-topological orders can exist, but this is not relevant in this article. We will consider just one of them.