## Abstract

Observable operator models (OOMs) are a class of models for stochastic processes that properly subsumes the class that can be modeled by finite-dimensional hidden Markov models (HMMs). One of the main advantages of OOMs over HMMs is that they admit asymptotically correct learning algorithms. A series of learning algorithms has been developed, with increasing computational and statistical efficiency, whose recent culmination was the error-controlling (EC) algorithm developed by the first author. The EC algorithm is an iterative, asymptotically correct algorithm that yields (and minimizes) an assured upper bound on the modeling error. Its run time is at least an order of magnitude shorter than that of EM-based HMM learning algorithms, and it yields significantly more accurate models. Here we present a significant improvement of the EC algorithm: the constructive error-controlling (CEC) algorithm. CEC inherits from EC the main idea of minimizing an upper bound on the modeling error but is constructive where EC needs iterations. As a consequence, we obtain further gains in learning speed without loss in modeling accuracy.

## 1. Introduction

Observable operator models (Jaeger, 2000b) are a mathematical model class for stochastic processes that generalizes hidden Markov models (HMMs) (Bengio, 1999) and can be represented in a structurally similar matrix formalism, with the crucial difference that the range of model parameters is extended from nonnegative to arbitrary real numbers. This relaxation on the parameter range endows OOMs with linear algebraic properties that are not available for HMMs. These properties engender a basic scheme for learning OOMs that is constructive and asymptotically correct. The statistical efficiency of this basic scheme, however, depends crucially on the design of two auxiliary matrices, which are now named the *characterizer* and the *indicator* (Zhao, Jaeger, & Thon, 2009) and will be denoted by *C* and *Q*, respectively, throughout this letter.

To optimize the design of the matrices *C* and *Q* is one of the major streams in OOM research. The early learning algorithms (Jaeger, 1998, 2000a, 2000b) relied on simple heuristics for creating *C* and *Q*, and the ensuing poor statistical efficiency made OOMs learned in this way no match for HMMs trained with the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). Kretzschmar (2003) identified the crucial role of the matrices *C* and *Q* in the basic scheme and provided a first learning algorithm whose rationale was to optimize their design. Since then, the importance and the challenge of optimizing *C* and *Q* have become increasingly better understood.

After Kretzschmar's lead work, two theoretically justified and computationally accessible optimality criteria for designing *C* and *Q* were worked out, resulting in the two current state-of-the-art instantiations of the basic learning scheme of OOMs: the efficiency-sharpening (ES) algorithm (Jaeger et al., 2005) and the error-controlling (EC) algorithm (Zhao, Jaeger, & Thon, 2009). Both ES and EC implement iterative schemes to optimize *C* and *Q*. Their computational and statistical efficiencies are by and large similar (and outperform EM-based HMM learning), with perhaps a small advantage for EC (Zhao, Jaeger, & Thon, 2009). The underlying ideas for the two are, however, interestingly and fundamentally different.

The perspective of ES is statistical. This algorithm aims at minimizing the effects of stochastic variation in *C* and *Q*. An algebraic characterization of minimal variance of the estimator is exploited to iteratively improve *C* and *Q* toward the goal of minimizing the variance of the overall algorithm.

The EC algorithm, adopting a complementary view, concentrates on algebraic properties of the learning equations. This algorithm optimizes *C* and *Q* by exploiting algebraic laws characterizing the condition of matrices, which in turn is related to an upper bound of the relative error in model parameters (or, for short, modeling error). As a side effect, this focus on matrix conditions affords a much improved numerical stability of EC over ES.

Although the basic OOM learning scheme is constructive, neither ES nor EC has this attractive property, for both employ an iterative procedure to compute the auxiliary matrices *C* and *Q*. In this letter, following the road of EC, we derive a new upper bound of modeling error that is tighter than the one used by the EC algorithm, and we propose a constructive method for minimizing this upper bound. Moreover, this constructive algorithm for optimizing *C* and *Q* is globally optimal, whereas the iterative methods employed by EC and ES are only locally optimal. Inserting this analytical minimizer into the basic learning scheme, we obtain an improved version of EC. The resulting constructive error-controlling (CEC) algorithm is the first learning algorithm for OOMs that is constructive (hence, fast), asymptotically correct (the hallmark of all OOM learning algorithms), and statistically efficient. This combination of properties recommends CEC, in our view, as a potentially eminently useful technique.

We perceive this work as a follow-up to Zhao, Jaeger, and Thon (2009). While self-contained, this letter is therefore more condensed in some expository parts than its precursor, and it may be helpful to consult the latter when more background or detail is desired. We first briefly reintroduce the EC algorithm in section 2, clarifying the problem to be addressed and fixing notation. Section 3 describes the derivation of CEC in detail, which includes a novel (tighter) upper bound of modeling error, an analytical minimizer of this error bound, and the overall CEC algorithm. We then assess the performance of the CEC algorithm in comparison with EC and ES on the same modeling tasks as those from Zhao, Jaeger, and Thon (2009) and document the empirical results in section 4. Finally, we summarize the letter and draw some conclusions in section 5.

## 2. The Error-Controlling Algorithm

Let *O* = {*a*^{1}, *a*^{2}, …, *a*^{ℓ}} be a finite set of observable values (an alphabet), and let (*X _{n}*) be a stochastic process with all *X _{n}* ∈ *O* and initial joint probabilities as in equation 2.1. Following the notation convention of Zhao, Jaeger, and Thon (2009), we use lowercase letters with a bar to denote finite-length sequences of symbols from *O* and *O** to denote the set of all such finite sequences, including the empty sequence ϵ. The above family of joint probabilities can then be rewritten as a probability function *P* on *O**, with the agreement that *P*(ϵ) = 1.

An *m*-dimensional OOM of the process (*X _{n}*) is a triple consisting of the *m*-dimensional Euclidean space, an *O*-indexed family {τ_{a}}_{a∈O} of square matrices of order *m* (called *observable operators*), and an initial vector *w*_{0} such that for any finite sequence *a*_{1}*a*_{2}…*a*_{n}, it holds that *P*(*a*_{1}*a*_{2}…*a*_{n}) = **1**^{⊺}_{m}τ_{a_{n}}⋯τ_{a_{1}}*w*_{0}, where **1**_{m} denotes the *m*-dimensional column vector of units and τ_{a_{n}}⋯τ_{a_{1}} the product of the observable operators associated with the sequence (notice the reversal of order in indexing).

We consider the following OOM learning task: assuming the process (*X _{n}*) is stationary, ergodic, and has initial distribution determined by equation 2.1, the objective is to estimate an OOM of (*X _{n}*) from one of its finite initial realizations (say, *s*_{1}*s*_{2}…*s*_{T}), which can be used to approximately reproduce the distribution of (*X _{n}*). A general constructive algorithm (the basic scheme) for this learning task is outlined as follows (Jaeger, 2000b; Jaeger et al., 2005):

- For some sufficiently large *K*, take two sets of basic strings (the *indicative strings* and the *characteristic strings*), which enumerate *O*^{k}, the set of all sequences over *O* of length *k*, so *K* = ℓ^{k}; and construct the *K* × *K* normalized counting matrices V̂ and Ŵ^{a} (∀*a* ∈ *O*), whose (*i*, *j*)th entries are determined by equation 2.2, where (*T* − 2*k*) is the normalization factor and the counts are the occurrence numbers, in the training realization *s*_{1}*s*_{2}…*s*_{T}, of the *j*th indicative string followed by the *i*th characteristic string (for Ŵ^{a}, with the symbol *a* inserted between them).
- Design two auxiliary matrices, the characterizer *C* and the indicator *Q*, such that **1**^{⊺}_{m}*C* = **1**^{⊺}_{K} and the *m* × *m* matrix *C*V̂*Q* is invertible.
- Obtain an (asymptotically correct) estimate of an OOM for the process (*X _{n}*) by computing the observable operators as τ̂_{a} = *C*Ŵ^{a}*Q*(*C*V̂*Q*)^{−1} and an initial vector *ŵ*_{0} by solving the equations (∑_{a∈O}τ̂_{a})*ŵ*_{0} = *ŵ*_{0} and **1**^{⊺}_{m}*ŵ*_{0} = 1.

We begin with a historical remark. Kretzschmar (2003) derived the learning equations and, noting the importance of the matrix *C*V̂*Q* in the above basic learning scheme, suggested that the design of *C* and *Q* should make its condition number κ = ‖*C*V̂*Q*‖_{2}‖(*C*V̂*Q*)^{−1}‖_{2} minimal, where, for any matrix *A*, ‖*A*‖_{2} denotes its spectral norm (i.e., the largest singular value of *A*). However, this algebraic criterion, while apparently reasonable at first glance, is actually misguided in the sense that it is always possible to obtain κ = 1, the smallest possible value of κ, by first creating an arbitrary *C* such that **1**^{⊺}_{m}*C* = **1**^{⊺}_{K} and *C*V̂ has full rank *m*, and then putting *Q* = (*C*V̂)^{†}, the pseudo-inverse of *C*V̂: with *C* and *Q* constructed in this way, we know *C*V̂*Q* = *I _{m}* (the identity matrix of order *m*) and hence κ = 1.
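This fallacy is easy to reproduce numerically. In the following Python/NumPy sketch, the matrix `V` is a hypothetical stand-in for the counting matrix of the learning scheme; an arbitrary characterizer with unit column sums, combined with a pseudo-inverse indicator, achieves condition number 1 no matter how poorly the characterizer was chosen:

```python
import numpy as np

rng = np.random.default_rng(0)
m, K = 3, 8

# Hypothetical stand-in for the K x K counting matrix (any generic
# full-rank matrix exhibits the same effect).
V = rng.random((K, K))

# Arbitrary characterizer C with unit column sums (1_m' C = 1_K').
C = rng.standard_normal((m, K))
C += (1.0 - C.sum(axis=0)) / m        # shift each column so it sums to 1

Q = np.linalg.pinv(C @ V)             # the "fallacious" indicator Q = (C V)^+

CVQ = C @ V @ Q                       # equals I_m, so its condition number is 1
assert np.allclose(C.sum(axis=0), 1.0)
assert np.allclose(CVQ, np.eye(m))
print(np.linalg.cond(CVQ))            # ~1.0, regardless of how poor C is
```

The condition number is thus trivially minimized without improving the statistical quality of the estimator at all, which is exactly the flaw in the criterion.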

A principled approach to the design of the matrices *C* and *Q*, one that does not fall prey to this fallacy, was developed in our previous work. It is based on two findings. First, if the probability matrices *V* and *W*^{a} (the “true” normalized counting matrices) are known, then the true OOM of the underlying process can be reconstructed using the learning equations τ_{a} = *CW*^{a}*Q*(*CVQ*)^{−1} (Jaeger et al., 2005). Second, as proven in proposition 3 of Zhao, Jaeger, and Thon (2009), the relative error in the estimated observable operators and the statistical error in the counting matrices V̂ and Ŵ are governed by inequality 2.3, where, for any matrix *A*, ‖*A*‖ denotes its Frobenius norm; τ is the tall matrix formed by stacking all τ_{a}’s one above another: τ = [τ_{a^{1}}; τ_{a^{2}}; …; τ_{a^{ℓ}}] (in Matlab's notation); τ̂, Ŵ, and *W* are tall matrices constructed, respectively, from the τ̂_{a}’s, Ŵ^{a}’s, and *W*^{a}’s by the same stacking scheme; V̂ − *V* and Ŵ − *W* are the error matrices of V̂ and Ŵ, respectively; and (recall that) ℓ is the alphabet size.

Minimizing this upper bound over *C* and *Q*, subject to the constraint **1**^{⊺}_{m}*C* = **1**^{⊺}_{K}, forms the optimization problem 2.4 for obtaining optimal *C* and *Q*. In the EC algorithm, this problem is solved by a simple numerical method: one randomly creates an *m* × *K* matrix *C* = *C*^{(0)} with column sums 1 and iteratively updates *Q* and *C* by equation 2.5 until some stopping criterion is met, say, κ^{(t−1)}_{1} − κ^{(t)}_{1} < δ, for some predefined number δ > 0. Zhao, Jaeger, and Thon (2009) showed that {κ^{(t)}_{1}}_{t=1,2,…} forms a decreasing sequence with κ^{(t)}_{1} ⩾ 0, and so the convergence of (*C*^{(t)}, *Q*^{(t)}) to a local optimum is guaranteed.

In sum, the error-controlling algorithm presented in Zhao, Jaeger, and Thon (2009) is actually an instantiation of the basic learning scheme in which the auxiliary matrices *C* and *Q* are optimized by the iterative procedure specified by equation 2.5. In the next section, we derive an improved version of EC in which the iterative procedure 2.5 is replaced by a globally optimal analytical solution to an optimization problem that is very similar to equation 2.4.

## 3. The Constructive Error-Controlling Algorithm

Although the basic learning scheme is constructive, the EC algorithm is not, since it involves an iterative procedure to optimize *C* and *Q*. In this section, a constructive method for computing the matrices *C* and *Q* is described, yielding an improved version of EC that, to our knowledge, is the first constructive learning algorithm for OOMs. To this end, we introduce another upper bound of the relative error in the estimated operators, one that is tighter than the bound 2.3. We state this upper bound as a proposition.

The identity in equation 3.2 follows from the constraint **1**^{⊺}_{m}*C* = **1**^{⊺}_{K}; using equation 2.3 and the definitions of the error matrices, one can also verify it directly. To establish inequality 3.1, we invoke a well-known matrix inequality (see, e.g., Golub & Van Loan, 1996): ‖*AB*‖ ⩽ ‖*A*‖_{2}‖*B*‖. It follows from equations 3.2 and 3.3 that the relative error is bounded by the corresponding product of spectral and Frobenius norms. But in proposition 3 of Zhao, Jaeger, and Thon (2009), we have already proven the two norm bounds required, so inequality 3.1 follows.

Note that for any matrix *A*, ‖*A*‖_{2} equals the largest singular value of *A*, while ‖*A*‖ is the square root of the sum of the squares of all singular values of *A*. So ‖*A*‖_{2} ⩽ ‖*A*‖, and inequality 3.1 indeed provides a tighter upper bound than inequality 2.3.
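The norm comparison behind this tightness argument is easy to check numerically; the Python/NumPy sketch below verifies on a random matrix that the spectral norm equals the largest singular value and never exceeds the Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 7))

spectral = np.linalg.norm(A, 2)        # ||A||_2: largest singular value
frobenius = np.linalg.norm(A, 'fro')   # ||A||: sqrt of the sum of squared entries

s = np.linalg.svd(A, compute_uv=False)
assert np.isclose(spectral, s[0])                      # largest singular value
assert np.isclose(frobenius, np.sqrt((s ** 2).sum()))  # root of square sum of singular values
assert spectral <= frobenius + 1e-12                   # ||A||_2 <= ||A||
```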

Minimizing this tighter bound leads to an optimization problem 3.4 for *C* and *Q* that is very similar to equation 2.4. Zhao, Jaeger, and Thon (2009) demonstrated that adding the extra constraint *C*V̂*Q* = *I _{m}* to problem 2.4 does not change its minimizer. Similarly, here we can prove that the optimization problem 3.4 is essentially equivalent to problem 3.5, in which the expression of κ_{2} has been simplified by using this constraint. To see this, for any minimizer (*C*, *Q*) of problem 3.4, we put *C*_{0} = *C* and *Q*_{0} = *Q*(*C*V̂*Q*)^{−1}; then *C*_{0}V̂*Q*_{0} = *I _{m}* and κ_{2}(*C*, *Q*) = κ_{2}(*C*_{0}, *Q*_{0}), which means (*C*_{0}, *Q*_{0}) also minimizes the quantity κ_{2} and, additionally, satisfies the constraint.

Before introducing the constructive method for minimizing equation 3.5, we make some remarks on problem 3.5 and on the normalized counting matrices obtained from the training sequence via equation 2.2. Some of these remarks have already been presented in Zhao, Jaeger, and Thon (2009); here we restate them more explicitly to fix the notation used later:

- • It is easy to see that in the learning equations, the normalized counting matrices can be replaced by the raw (unnormalized) counting matrices defined in equation 3.6. It is also clear that replacing V̂ by its raw counterpart in problem 3.5 does not change its minimizer (*C*, *Q*), although the minimal value of κ_{2} is now reduced by a factor of (*T* − 2*k*).
- • The order of the above raw counting matrices, *K* = ℓ^{k}, increases exponentially with *k*, which at first glance suggests that only small values of *k* are feasible in the basic learning scheme, hence that only short-term history-context effects of the target process can be learned, and that the optimization problem 3.5 quickly becomes computationally inaccessible. Nevertheless, first, equation 3.6 reveals that, given the training sequence, the element sums of the raw counting matrices are (*T* − 2*k*), so for large *k* these matrices are actually very sparse. Second, the learning equations reveal that discarding the zero rows (columns) in V̂, the corresponding rows (columns) in the Ŵ^{a}’s (we call the resulting matrices the *compressed counting matrices*), and the corresponding columns in *C* (rows in *Q*) will not change the estimated operators. Third, those “undiscarded” columns of *C* (rows of *Q*) can be efficiently optimized by solving a compressed version of equation 3.5, which will be introduced presently. Therefore, in practice we can set the length *k* of the basic strings large enough to capture long-term properties of the underlying process.
- • As Zhao, Jaeger, and Thon (2009) pointed out, in a variant method for fixing the basic strings, one can allow them to have different lengths. The only condition that has to be observed when selecting basic strings is that they should form a “cover” of some set *O*^{k}, in the sense that each string from *O*^{k} has a unique basic string as a prefix. Moreover, by using compact suffix trees (Gusfield, 1997), we can construct the compressed counting matrices from the training sequence in time linear in *T* (Ukkonen, 1995).
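The sparsity argument above suggests collecting the counts in a dictionary keyed by the string pairs that actually occur, rather than in dense ℓ^{k} × ℓ^{k} arrays. The following Python sketch illustrates this idea for the raw counts (the function name and window conventions are ours and may differ in small details from equation 2.2):

```python
from collections import defaultdict

def compressed_counts(seq, k):
    """Sparse raw counting statistics: only (indicative, characteristic)
    string pairs that actually occur in `seq` are stored, mimicking the
    compressed counting matrices (conventions here are illustrative)."""
    V = defaultdict(int)             # V[(b, c)]    ~ count of "b c"   in seq
    W = defaultdict(int)             # W[(a, b, c)] ~ count of "b a c" in seq
    T = len(seq)
    for i in range(T - 2 * k + 1):   # windows "b c" of length 2k
        b, c = seq[i:i + k], seq[i + k:i + 2 * k]
        V[(b, c)] += 1
    for i in range(T - 2 * k):       # windows "b a c" of length 2k + 1
        b = seq[i:i + k]
        a = seq[i + k]
        c = seq[i + k + 1:i + 2 * k + 1]
        W[(a, b, c)] += 1
    return V, W

seq = "abbaabababba" * 50            # toy training sequence over {a, b}
V, W = compressed_counts(seq, k=2)
# Only observed pairs are stored; with a 16-symbol alphabet and k = 5,
# dense matrices would have 16^5 x 16^5 cells, almost all of them zero.
print(len(V), sum(V.values()))
```

For real alphabets and large *k*, the suffix-tree construction cited above achieves the same counts in a single linear pass.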

Now let the compressed counting matrix Ṽ have size *N* × *M* (note that typically *N*, *M* ≪ *K*). By the above discussion, the learning equations can be evaluated from the compressed counting matrices together with the matrices C̃ and Q̃, where C̃ (Q̃, respectively) is formed from *C* (*Q*) by discarding its columns (rows) at positions corresponding to zero rows (zero columns) in V̂. It remains to find a way to compute C̃ and Q̃ that is efficient in memory space; equation 3.5 is too inefficient to be used directly to compute *C* and *Q* and then extract C̃ and Q̃. To simplify the notation, we move Ṽ to the upper-left block of V̂ (see equation 3.6) by reordering the basic strings, making the other entries of V̂ zero. Correspondingly, the matrices *C*, *Q* can now be written as *C* = [C̃, *C*_{0}] and *Q* = [Q̃; *Q*_{0}] (in Matlab's notation), with C̃ of size *m* × *N* and Q̃ of size *M* × *m*. The optimization problem 3.5 is therefore equivalent to a block-partitioned problem over (C̃, *C*_{0}, Q̃, *Q*_{0}).^{1} By proposition 2 (see below), we know the optimal *C*_{0} in closed form and that *Q*_{0} = 0, from which it immediately follows that the objective and the constraints reduce to the blocks C̃ and Q̃. We thus get the desired compressed version of problem 3.5:

### 3.1. An Analytical Solution to Problem 3.9.

Assume the compressed counting matrix Ṽ has rank *r*, with nonzero singular values *d*_{1} ⩾ *d*_{2} ⩾ ⋅ ⋅ ⋅ ⩾ *d*_{r} > 0, and set *D* = diag{*d*_{1}, *d*_{2}, …, *d*_{r}}. Then Ṽ has the compact SVD Ṽ = *L*_{1}*DR*^{⊺}_{1}, where *L*_{1} and *R*_{1} are tall matrices formed by extracting the first *r* columns from the orthogonal factors *L* and *R* of the full SVD of Ṽ, respectively. Writing *L* = [*L*_{1}, *L*_{2}] and *R* = [*R*_{1}, *R*_{2}], and utilizing the fact that ‖[*A*, *B*]‖^{2} = ‖*A*‖^{2} + ‖*B*‖^{2}, we expand the above optimization problem into problem 3.11, where *C _{i}* ≔ *CL _{i}* and *Q _{i}* ≔ *R*^{⊺}_{i}*Q* (*i* = 1, 2).

*In problem 3.11, let the matrices C_{1} and Q_{1} be fixed. Then and Q_{2} = 0 form a minimizer of the target κ^{2}_{2}.*

Define *l*^{⊺} ≔ **1**^{⊺}_{N}*L*_{1} and insert the minimizer of proposition 2 into problem 3.11; this reduces the problem to one over (*C*_{1}, *Q*_{1}) only, namely problem 3.12. Here we should point out that to get problem 3.12 from 3.10 requires only computing the compact SVD Ṽ = *L*_{1}*DR*^{⊺}_{1}. This is because we can compute the quantity ϕ from *L*_{1}, and we can retrieve the optimal *C* and *Q* from a solution (*C*_{1}, *Q*_{1}) to problem 3.12.

*In the optimization problem 3.12, if C_{1} is fixed, then Q_{1} = (C_{1}D)† is a minimizer of κ^{2}_{2}.*

See section A.2 for the proof. Assume *C*_{1}*D* has the compact SVD *C*_{1}*D* = *LSR*^{⊺}; then *C*_{1} = *LSR*^{⊺}*D*^{−1} and, by proposition 3, *Q*_{1} = *RS*^{−1}*L*^{⊺}. For this setting of (*C*_{1}, *Q*_{1}), the second constraint *C*_{1}*DQ*_{1} = *I _{m}* is automatically fulfilled, and the first constraint **1**^{⊺}_{m}*C*_{1} = *l*^{⊺} now becomes **1**^{⊺}_{m}*LSR*^{⊺} = *l*^{⊺}*D*, which implies that (1) the subspace spanned by the columns of *R* (the range space of *R*) contains the vector *Dl*, that is, *Rx* = *Dl* for some vector *x*; and (2) by multiplying by *R* on the right of this equality, **1**^{⊺}_{m}*LS* = *l*^{⊺}*DR*.
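Proposition 3 can be sanity-checked numerically. In the Python/NumPy sketch below, a random full-row-rank stand-in for *C*_{1} and a positive diagonal *D* are generated; setting *Q*_{1} to the pseudo-inverse of *C*_{1}*D* then fulfills the constraint *C*_{1}*DQ*_{1} = *I _{m}* automatically, and the pseudo-inverse agrees with the closed form *RS*^{−1}*L*^{⊺} from the compact SVD:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 3, 6

C1 = rng.standard_normal((m, r))              # stand-in for C_1 (full row rank)
D = np.diag(rng.random(r) + 0.5)              # positive diagonal matrix D

Q1 = np.linalg.pinv(C1 @ D)                   # Q_1 = (C_1 D)^+ as in proposition 3

# The constraint C_1 D Q_1 = I_m holds automatically for this choice:
assert np.allclose(C1 @ D @ Q1, np.eye(m))

# The pseudo-inverse equals R S^{-1} L^T for the compact SVD C_1 D = L S R^T:
L, s, Rt = np.linalg.svd(C1 @ D, full_matrices=False)
assert np.allclose(Q1, Rt.T @ np.diag(1.0 / s) @ L.T)
```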

For the compact SVD *C*_{1}*D* = *LSR*^{⊺}, let *r _{i}* be the *i*th column of *R*, put β_{i} = *r*^{⊺}_{i}*D*^{−2}*r _{i}*, and write *S* = diag{*s*_{1}, *s*_{2}, …, *s*_{m}} with *s*_{1} ⩾ *s*_{2} ⩾ ⋅ ⋅ ⋅ ⩾ *s*_{m} > 0. Then, by brute-force computation, we obtain an explicit expression of the target κ^{2}_{2} in terms of the *s _{i}*’s and β_{i}’s. Now assume the matrix *R*, which satisfies *R*^{⊺}*R* = *I _{m}* and **1**^{⊺}_{m}*LSR*^{⊺} = *l*^{⊺}*D*, is already known. Then the vector *l*^{⊺}*DR* ≕ *v*^{⊺} = [*v*_{1}, *v*_{2}, …, *v*_{m}] and the scalars β_{i} = *r*^{⊺}_{i}*D*^{−2}*r _{i}* are also known. Moreover, by the equality **1**^{⊺}_{m}*LS* = *l*^{⊺}*DR*, the vector *v* has the property ∑^{m}_{i=1}(*v _{i}*/*s _{i}*)^{2} = *m*, which is a constant independent of the choice of the matrix *R*.

For each *i*, let α_{i} = *s _{i}*/*s*_{m}. Then *s _{i}* = α_{i}*s*_{m} and α_{1} ⩾ α_{2} ⩾ ⋅ ⋅ ⋅ ⩾ α_{m} = 1. From the equality ∑^{m}_{i=1}(*v _{i}*/*s _{i}*)^{2} = *m*, we can express *s*_{m} through the α_{i}’s and *v _{i}*’s. Substituting this expression into the target, we immediately see that κ^{2}_{2} reaches its minimal value when all α_{i} = 1, that is, when *s _{i}* = *s*_{m} for all *i*. It then follows from equations 3.13 and 3.14 that the minimal value of κ^{2}_{2} admits the closed form stated in equation 3.16.

Since *l*^{⊺}*D*^{2}*l* is independent of the matrix *R*, equation 3.16 indicates that to minimize the target κ^{2}_{2}, the choice of *R* should make the sum ∑^{m}_{i=1}β_{i} as small as possible. By the definition of β_{i}, we know ∑^{m}_{i=1}β_{i} = tr(*R*^{⊺}*D*^{−2}*R*). We therefore get the following optimization problem, 3.18, to determine the matrix *R*: minimize tr(*R*^{⊺}*D*^{−2}*R*) subject to *R*^{⊺}*R* = *I _{m}* and *Dl* lying in the range space of *R*. Note that if *R* is a minimizer of problem 3.18, so is *RU*, where *U* can be any orthogonal matrix of order *m*. We thus assume, without loss of generality, that the last column of *R* is *r*_{m} = *Dl*/‖*Dl*‖, the unit vector in the direction of *Dl* (for vectors, the Frobenius norm equals the Euclidean norm, so we can abuse our notation ‖ · ‖), since *Dl* lies in the range space of *R*. Thus, the matrix *R* has the form *R* = [*HX*, *r*_{m}], where *H* is fixed and has columns forming a basis of the null space of *r*^{⊺}_{m}; that is, *H* can be any matrix with the properties *H*^{⊺}*H* = *I*_{r−1} and *r*^{⊺}_{m}*H* = 0 (see section A.3 for a simple construction of such a matrix *H*); and *X* is some (*r* − 1) × (*m* − 1) matrix satisfying *X*^{⊺}*X* = *I*_{m−1}. Substituting the expression *R* = [*HX*, *r*_{m}] into equation 3.18, we get the reduced problem of minimizing tr(*X*^{⊺}*H*^{⊺}*D*^{−2}*HX*) over all *X* with orthonormal columns. As is well known, this optimization problem has an analytical solution: the *i*th column of *X*, denoted *x*_{i}, is exactly the eigenvector of *H*^{⊺}*D*^{−2}*H* with respect to its *i*th smallest eigenvalue λ_{i}, which, surprisingly, is equal to β_{i}. This can be proven in one line: β_{i} = *r*^{⊺}_{i}*D*^{−2}*r*_{i} = *x*^{⊺}_{i}*H*^{⊺}*D*^{−2}*Hx*_{i} = λ_{i}*x*^{⊺}_{i}*x*_{i} = λ_{i}.
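The whole construction of *R* condenses into a few lines of code. In the Python/NumPy sketch below (with a random diagonal *D* and vector *l* standing in for the quantities derived from the data), *H* is built from a complete QR factorization of *r*_{m}, *X* from the eigenvectors of *H*^{⊺}*D*^{−2}*H* for the *m* − 1 smallest eigenvalues, and the result is checked to be orthonormal, to contain *Dl* in its range, and to satisfy β_{i} = λ_{i}:

```python
import numpy as np

rng = np.random.default_rng(3)
r, m = 6, 3                                   # stand-in dimensions, m <= r

d = np.sort(rng.random(r) + 0.5)[::-1]        # singular values d_1 >= ... >= d_r > 0
D = np.diag(d)
l = rng.standard_normal(r)                    # stand-in for the vector l

Dl = D @ l
r_m = Dl / np.linalg.norm(Dl)                 # last column: unit vector along D l

# H: orthonormal basis of the null space of r_m^T, via a complete QR of r_m.
Qfull, _ = np.linalg.qr(r_m.reshape(-1, 1), mode='complete')
H = Qfull[:, 1:]                              # r x (r-1): H^T H = I, r_m^T H = 0

M = H.T @ np.diag(d ** -2) @ H
lam, eigvecs = np.linalg.eigh(M)              # eigenvalues in ascending order
X = eigvecs[:, :m - 1]                        # eigenvectors of the m-1 smallest

R = np.column_stack([H @ X, r_m])

assert np.allclose(R.T @ R, np.eye(m))        # orthonormal columns
assert np.allclose(R @ (R.T @ Dl), Dl)        # D l lies in the range of R
beta = np.array([R[:, i] @ np.diag(d ** -2) @ R[:, i] for i in range(m - 1)])
assert np.allclose(beta, lam[:m - 1])         # beta_i equals lambda_i
```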

It remains to construct an orthogonal matrix *L* such that **1**^{⊺}_{m}*L* = *u*^{⊺}. With the matrix *R* constructed as above, we find that *v*^{⊺} = *l*^{⊺}*DR* = [0, …, 0, ‖*Dl*‖]; that is, *v _{i}* = 0 for *i* < *m* and *v*_{m} = ‖*Dl*‖. By the relation *u _{i}s _{i}* = *v _{i}* and equation 3.17, we know *u*^{⊺} = [0, …, 0, √*m*], so *L* can be any orthogonal matrix with **1**_{m}/√*m* as its last column. For instance, we can take for *L* the Householder reflection that maps the last standard basis vector to **1**_{m}/√*m*. A more general construction of such matrices *L* is presented in section A.3.

### 3.2. The Constructive Error-Controlling Algorithm.

The discussion of problem 3.9 sums up to a constructive procedure for computing the optimal C̃ and Q̃ that minimize the modeling error bound presented in inequality 3.1. Inserting this procedure into the basic learning scheme, we get the following constructive error-controlling (CEC) algorithm, in which some Matlab-style notation is employed:

The time complexity of the CEC algorithm is dominated by the collection of the compressed counting matrices, which costs time linear in *T*, and by the computation of the compact SVD of Ṽ, which, in the worst case where the matrix has full rank *r* = *N* (assuming *M* = *N*), amounts to *O*(*N*^{3}) flops (Golub & Van Loan, 1996). Thus, the total time complexity of CEC is about *O*(*T* + *N*^{3}). Here we recall that the EM algorithm for training HMMs and the ES algorithm have time complexity proportional to *nT*, where *n* is the number of EM or ES iterations. We therefore conclude that the CEC algorithm is faster than EM and ES, since *N* ≪ *T* is the typical case. The comparison between EC and CEC is more subtle. As analyzed in Zhao, Jaeger, and Thon (2009), the time complexity of EC is typically higher than that of CEC, so EC is slower than CEC, as shown by our simulations. However, in cases where large (full-rank) counting matrices are obtained, CEC may need more CPU time (mainly due to the SVD of Ṽ) than EC.

### 3.3. An Efficient Variant to CEC.

We now introduce an efficient variant of the CEC algorithm. Its time complexity can be reduced if we compute only the first *m* (the model dimension) singular triples of the matrix Ṽ. That is, instead of the compact SVD Ṽ = *L*_{1}*DR*^{⊺}_{1}, we use the *m* largest singular values and the corresponding left- and right-singular vectors of Ṽ to form the matrices *D*, *L*_{1}, and *R*_{1}, respectively, and then estimate the OOM as in algorithm 1. To distinguish the two versions of CEC, we call the efficient variant presented here the CEC-B algorithm and the original one described by algorithm 1 the CEC-A algorithm in the sequel.

The basic idea behind CEC-B is rather simple. The counting matrix Ṽ is in fact a statistical approximation of the corresponding probability matrix (up to a counting factor), which has rank *m*: recall that we have already assumed that the underlying process can be modeled by some *m*-dimensional OOM. Therefore, it is quite natural to first estimate the “true” counting matrix from Ṽ by rank-*m* truncation and then use this estimate to design the optimal *C* and *Q*, which is what CEC-B does.
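The rank-*m* truncation can be sketched as follows (Python/NumPy; the sizes and noise level are illustrative). Keeping only the *m* largest singular triples of a noisy matrix yields, by the Eckart-Young theorem, its best rank-*m* approximation, which is typically closer to the underlying "true" matrix than the noisy observation itself:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 40, 4

# A rank-m "true" matrix plus small noise, mimicking a counting matrix
# that approximates a rank-m probability matrix.
true = rng.standard_normal((N, m)) @ rng.standard_normal((m, N))
noisy = true + 0.01 * rng.standard_normal((N, N))

U, s, Vt = np.linalg.svd(noisy)
approx = U[:, :m] * s[:m] @ Vt[:m]       # keep only the m largest triples

assert np.linalg.matrix_rank(approx) == m
# Truncation removes the noise living outside the top-m singular subspace:
assert np.linalg.norm(approx - true) < np.linalg.norm(noisy - true)
```

In CEC-B the same *m* singular triples directly supply *D*, *L*_{1}, and *R*_{1}, so no separate truncation step is needed.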

By the above discussion, one immediately sees that CEC-A and CEC-B should have approximately the same performance, especially when the training data set is large (since then the counting matrix is close to its rank-*m* “true” counterpart). We would, however, emphasize again that CEC-B is computationally cheaper than CEC-A.

We conclude this section with an important remark on a painful issue in OOM theory: the negative probability problem (Jaeger, 2000b). It refers to the fact that the models learned by existing OOM learning algorithms, including the CEC algorithm presented in this letter, can produce negative “probabilities” for some sequences. For a long time, we and others attempted to find ways to eliminate this problem. Very recently, it was proven that it is undecidable whether a given candidate set of observable operators leads to negative values on some strings (Wiewiora, 2008). This finding does not preclude the possibility of finding nontrivial and practically useful sufficient conditions for nonnegativity. In practice, the current state of the art relies on heuristic methods that essentially “repair” the OOM state vector by small correction terms whenever it would yield negative “probability” values; these work well (see the one presented in appendix J of Jaeger et al., 2005).

## 4. Empirical Results

In this section, we assess the performance of CEC-A,B in comparison with the EC and ES algorithms on three sets of symbolic sequence modeling tasks. The first two are repeated from Zhao, Jaeger, and Thon (2009): learning models for a quantized logistic system and for a suite of seven partially observable Markov decision processes (POMDPs) that have been used frequently as benchmark tasks in the literature. The third is from Jaeger et al. (2005): training OOMs on Mark Twain's short story, “The One Million Pound Bank-Note.”

### 4.1. Modeling the Symbolic Logistic System.

In this experiment we consider the task of modeling a dynamical system that consists of a continuous-valued iterated map followed by a quantizer that maps real numbers to discrete symbols. The dynamics of the continuous-valued part is governed by the logistic map *x*_{n+1} = 4*x _{n}*(1 − *x _{n}*), with the initial value *x*_{0} arbitrarily selected from the interval (0, 1). As is well known, on its attractor set (0, 1), this system shows strongly chaotic behavior (with Lyapunov exponent λ = ln 2). The attractor set (0, 1) is divided into ℓ = 16 equidistant subintervals, yielding an alphabet of 16 symbols and a 16-output quantizer that converts the continuous-valued sequence (*x _{n}*) into a symbolic process (*X _{n}*).

_{n}*L*= 30,000, are created by running the quantized logistic system; (2) OOMs of dimension

*m*∈ {5, 10, …, 30, 40, …, 70} are then estimated from each such sequence, using the EC, ES, and CEC-A,B algorithms; (3) we compute the normalized log-likelihood (NLL) of these learned models on the other 19 sequences as follows: where ℓ is the alphabet size,

*S*is the test data set that consists of the other 19 (testing) sequences, and

*S*

^{#}is the number of sequences in

*S*, which takes the value 19 in this experiment. Here are the settings of our simulation:

- •
When evaluating test NLLs of learned OOMs, the heuristic method described in appendix J of Jaeger et al. (2005) is employed to address the negative probability problem, from which each of the EC, ES, and CEC-A,B algorithms suffers.

- •
  The length of the characteristic/indicative strings is set to *k* = 5.
- •
  As in Zhao, Jaeger, and Thon (2009), the ES algorithm is terminated after 10 iterations, and the estimated model with the highest training NLL is selected as the final output of ES. Unlike in Zhao, Jaeger, and Thon (2009), however, ES is now started with a more carefully chosen initial model, so that it is numerically stable for all model dimensions.
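For concreteness, the NLL computation can be sketched as follows (Python/NumPy; a minimal sketch under the assumption that equation 4.1 is the per-symbol base-ℓ log-likelihood averaged over the test sequences, with the OOM evaluated incrementally and renormalized to avoid underflow):

```python
import numpy as np

def nll(ops, w0, test_set, ell):
    """Average normalized log-likelihood of a learned OOM on `test_set`.
    `ops` maps each symbol to its estimated observable operator tau_a;
    `w0` is the estimated initial vector; `ell` is the alphabet size."""
    total = 0.0
    for seq in test_set:
        w = np.array(w0, dtype=float)
        loglik = 0.0
        for sym in seq:
            w = ops[sym] @ w
            p = w.sum()              # probability of the prefix seen so far
            loglik += np.log(p)
            w /= p                   # renormalize the state vector
        total += loglik / (len(seq) * np.log(ell))
    return total / len(test_set)

# Toy 1-dimensional OOM of a biased coin: tau_0 = 0.3, tau_1 = 0.7.
ops = {0: np.array([[0.3]]), 1: np.array([[0.7]])}
w0 = np.array([1.0])
val = nll(ops, w0, [[0, 1, 1, 0], [1, 1, 1]], ell=2)
```

On the toy coin model, `val` equals the average per-symbol base-2 log-probability of the two test sequences, which is necessarily negative.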

A reference value for the achievable test NLL is −*H*(*X _{n}*), where (*X _{n}*) is the underlying stochastic process that we want to model and *H*(*X _{n}*) is its entropy rate (under the base-ℓ logarithm), defined in terms of the (initial) probability distribution of the process introduced at the beginning of section 2. Specifically, it turns out that the quantized logistic process described above has entropy rate *H*(*X _{n}*) = 0.25 (Zhao, Jaeger, & Thon, 2009), meaning that the quantized logistic process is indeed a stochastic process, although its continuous-state motion is rather simple and deterministic (in the sense that we can accurately predict the next output *x*_{n+1} from the current one *x _{n}*). Thus, in this experiment, a test NLL of −*H*(*X _{n}*) = −0.25 represents the upper limit that a learning algorithm can reach.

For each of the 20 training sequences, OOMs of dimension *m* ∈ {5, 10, …, 30, 40, …, 70} were estimated by the EC, ES, and CEC-A,B algorithms, respectively. The test NLL of these learned models was then computed on the other 19 sequences, yielding a collection of NLL values for each algorithm and model dimension. We plot in Figure 1a the average test NLLs of each OOM learning algorithm versus model dimension, in Figure 1b the standard deviation of these test NLLs, and in Figure 2 the total CPU time needed for learning these models. All algorithms are programmed in Matlab (with C-mex functions embedded) and run on a Pentium-M 1.73 GHz laptop.

From the figures, we see that (1) CEC yields slightly more accurate models than EC and ES; (2) for large model dimensions, CEC is significantly faster than EC and ES; and (3) while CEC-B has nearly the same test NLLs as CEC-A (see the Figure 1 caption for more detail), it is two to three times faster than CEC-A. Another important observation is that for model dimensions *m* ⩾ 40, all models (under ES, EC, and CEC) level out in their NLLs, which reach approximately −0.261, −0.257, and −0.257, respectively, leaving only a small margin to the theoretical optimum of −0.25.

### 4.2. POMDP Benchmark Suite.

In the second experiment we compare the performance of ES, EC, and CEC on the task of learning models of seven POMDPs that were frequently used as benchmark tasks for assessing the performance of predictive state representations (PSRs) (Littman & Sutton, 2001) in the literature.

First, we point out the relationship between OOMs and PSRs. PSRs are a class of models for discrete, stochastic input-output systems, which generalize partially observable Markov decision processes (POMDPs) in much the same way as OOMs generalize HMMs. PSRs have been partly inspired by OOMs but have developed a representation format for input-output processes different from the format of input-output OOMs described in Jaeger (1998). Developing learning algorithms and specialized variants of PSRs is an active area of research (McCracken & Bowling, 2005; Wolfe, James, & Singh, 2005; Bowling, McCracken, James, Neufeld, & Wilkinson, 2006; Rudary & Singh, 2006; Wolfe, 2006; Wingate & Singh, 2007). Zhao, Jaeger, and Thon (2009) demonstrated that ES and EC strongly outperform known PSR learning algorithms; these results are not repeated here. We remark furthermore that extending the ES, EC, and CEC algorithms to input-output OOMs constitutes a current line of research in our group.

The seven target POMDPs are taken from a public Web source.^{2} They are Bridge, Cheese, Maze4x3, Network, Paint, Shuttle, and Tiger. We use the same experimental settings as Zhao, Jaeger, and Thon (2009) and for more detail refer readers to that article. We use notations common in the PSR literature:

- •
As is commonly done in the PSR field, the input policy is simply to choose, at each time step, an action *a* from the input alphabet, say *A*, according to the uniform random distribution (over *A*). So the training/test sequences have the form *a*^{1}*o*^{1}*a*^{2}*o*^{2} … *a*^{N}*o*^{N}, a sequence of alternating actions *a*^{i} ∈ *A* and observations *o*^{i} ∈ *O*.
For each of the POMDP domains, we train OOMs on (input-output) training sequences of varying lengths

*N*= 10^{3}–10^{7.3}using the CEC-A or B (or EC, ES) algorithm, as follows. We (naively) regard each combination (*a*) of an action^{i}o^{i}*a*and its immediate observation^{i}*o*as a single symbol^{i}*s*from the alphabet^{i}*S*=*A*×*O*and learn an OOM from the sequence*s*^{1}*s*^{2}…*s*(with the length of basic strings set to be^{N}*k*= 2). - • The quality of learned models is measured by their average one-step prediction error
*E*on testing sequences of length*N*= 10^{4}(the format that has been used for assessing the main PSR algorithms): where*P*is the correct probability (computed using the known underlying POMDP) and is the model prediction for the next observation given the testing history*h*=_{i}*a*^{1}*o*^{1}, …,*a*and the action^{i}o^{i}*a*^{i+1}. For learned OOMs (over*S*=*A*×*O*), we compute the above predicting probabilities by
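
As a concrete illustration of this setup, the following minimal Python sketch (not our actual implementation, which was written in Matlab; the function names and the integer encoding are purely illustrative) shows how action-observation pairs are merged into single symbols over *S* = *A* × *O*, and how a model's joint next-symbol probabilities for a fixed action are renormalized into observation predictions:

```python
import numpy as np

def combine_io(actions, observations, n_obs):
    """Encode each (action, observation) pair as one symbol from S = A x O.

    Illustrative encoding: s = a * |O| + o, with actions and observations
    given as integer indices."""
    return [a * n_obs + o for a, o in zip(actions, observations)]

def predict_obs(p_joint):
    """Renormalize a model's joint probabilities P(ao | h), for a fixed
    action a, into a conditional distribution P(o | h, a) over observations."""
    p = np.asarray(p_joint, dtype=float)
    return p / p.sum()
```

For example, with |*O*| = 3, the pair (*a*, *o*) = (1, 0) is encoded as symbol 3 under this (arbitrary) encoding.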

In Figure 3 we plotted the average one-step prediction error *E* (*y*-axis) of the ES, EC, and CEC-A,B algorithms on the seven POMDP domains versus the training sequence length (*x*-axis), and in Figure 4 the training CPU time consumed by these algorithms for each training length. These are the main findings:

- • The four OOM learning algorithms all show a near-log-linear relationship between the length *N* of the training sequences and the prediction error *E* (that is, log *E* ≈ α − β log *N*, or equivalently *E* ≈ *e*^{α}*N*^{−β} = *E*_{0}*N*^{−β} with *E*_{0}, β > 0) in all except the Cheese domain. This partly demonstrates that OOM learning is asymptotically correct (since *E* → 0 as *N* → ∞) and statistically efficient (since the convergence of *E* to 0 is fast), as claimed in our previous OOM references.

- • In the Cheese domain, the prediction error *E* does not decrease anymore when *N* ⩾ 10^{5}. One possible reason is that the basic string length *k* = 2 is too small to capture all relevant statistics. We therefore increased *k* by 1 and repeated the simulation. Figure 3h shows the empirical results, from which we see the desired property of OOM learning.

- • The two CEC algorithms and the EC algorithm attain nearly the same one-step prediction error. But for small training data sets, CEC is faster than EC, whereas for large data sets, the difference in speed becomes invisible. The reason is that when the training sequence length *T* is large (compared to the size *N* of the counting matrices), the time complexity of both CEC and EC is dominated by the same counting term, whereas when *T* is small, the time complexity of CEC is smaller than that of EC, which must additionally iterate.

- • Neither CEC nor EC outperforms ES in all domains. They win in the Network, Paint, and Tiger domains but are no match for ES in the Cheese and Maze4x3 domains. The four algorithms show essentially the same performance in the Shuttle domain. But both CEC and EC are about 10 times faster than ES, especially for large training data sets.

- • Bridge is an interesting domain, in which CEC/EC is inferior to ES when *N* ⩽ 10^{5}, but when *N* becomes larger (⩾ 3 × 10^{5}), CEC/EC starts to yield a smaller prediction error than ES (in fact, the other domains show a similar, though less obvious, phenomenon). This illustrates that asymptotically, CEC/EC converges faster than ES, but it also reveals that in some domains CEC/EC might not be as statistically efficient as ES when the training data set is small.

- • In the Bridge domain, the prediction error of EC jumps to a high level at *T* = 10^{7}, and ES shows a similar jump in the Paint domain at *T* = 10^{6}, whereas CEC always yields a smoothly decreasing prediction error. This partially illustrates that CEC is numerically more stable than EC and ES.
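
The near-log-linear relationship log *E* ≈ α − β log *N* reported above can be checked by an ordinary least-squares fit in log-log coordinates. The following Python sketch uses synthetic, noise-free data (illustrative values *E*_{0} = 1.0, β = 0.5, not our measured errors) to show the procedure:

```python
import numpy as np

# Synthetic, noise-free data: prediction errors following E = E0 * N**(-beta)
# with E0 = 1.0 and beta = 0.5 (illustrative values only).
N = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
E = 1.0 * N ** -0.5

# Fit log E = alpha - beta * log N by ordinary least squares;
# np.polyfit returns the slope first for a degree-1 fit.
slope, alpha = np.polyfit(np.log(N), np.log(E), 1)
beta = -slope  # the fitted slope is -beta
```

On measured errors, the fitted β quantifies the convergence rate of a learning algorithm in a given domain.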

It should be noted that in the Maze4x3 domain (see Figure 4c), the training CPU times for EC and CEC-A,B show a significant drop when the training data size increases from *T* = 10^{5.67} to *T* = 10^{6}. This is not a coincidence: we repeated the experiment 10 times with Matlab's rand() function initialized by different seeds and always obtained similar results. One possible reason is that for training sequences of length ⩾ 10^{6}, the counting matrices are already very near to the true probability matrices and so have (numerical) rank equal to the model dimension *m* = 11. This might speed up the SVD of the counting matrix (only the first 11 singular values and vectors need to be computed) and reduce the number of EC iterations needed for optimizing the two auxiliary matrices *C* and *Q*. In the ideal case, where the counting matrix has rank *m*, only one EC iteration is needed.
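
The rank argument can be illustrated numerically. The sketch below (hypothetical matrix sizes, not our experimental data) builds a matrix of numerical rank *m* = 11 and confirms that its singular spectrum collapses after the first *m* values, so a rank-*m* truncation, as computed by routines that return only the leading singular triplets, loses nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 11  # model dimension used in the Maze4x3 domain

# Hypothetical counting matrix of numerical rank m: a product of two
# random rank-m factors (the sizes are illustrative).
V = rng.standard_normal((200, m)) @ rng.standard_normal((m, 200))

U, s, Vt = np.linalg.svd(V, full_matrices=False)
V_m = U[:, :m] * s[:m] @ Vt[:m]  # rank-m truncation

# The spectrum collapses after the first m singular values, so the
# truncation reproduces V up to numerical noise.
print(s[m] / s[0] < 1e-8, np.allclose(V_m, V))
```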

### 4.3. Modeling Text Sources.

The third experiment models a text source: "The 1,000,000 Pound Bank-Note," a short story by Mark Twain. To simplify the task, we preprocessed the text string by deleting all special characters except the blank and changing capital letters to lowercase, which gave us an alphabet of 27 symbols: a, b, …, z plus the blank. We then split the resulting string sentence-wise, obtaining two substrings of roughly equal length (21,042 and 20,569 symbols, respectively) that were used as training and test sequences.
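
The preprocessing step can be sketched in a few lines of Python (illustrative only; our experiments were run in Matlab):

```python
import re

def preprocess(text):
    """Lowercase the text and delete every character other than the
    26 letters a-z and the blank, as described above."""
    return re.sub(r"[^a-z ]", "", text.lower())
```

Applying preprocess to a raw sentence yields a string over the 27-symbol alphabet.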

As before, we estimated OOMs of dimensions {5, 10, …, 50, 60, …, 100, 120, 150} from the training sequence with the EC, ES, and CEC-A,B algorithms and then computed the training and test NLLs of each learned OOM, as shown in Figures 5a and 5b. Here the length of characteristic or indicative strings is set to *k* = 2. We also plotted the training time of each learning algorithm versus the model dimension in Figure 6. From these figures, we see that OOMs learned by the four algorithms have nearly the same training and test NLLs (ES is a bit better for low-dimensional models but becomes worse as the model dimension increases). Among the four algorithms, CEC-B is the fastest: it is about 2 times faster than CEC-A, which is in turn 10 to 100 times faster than ES and EC.

## 5. Conclusion and Discussion

We have derived the constructive error-controlling algorithm (CEC-A), the first constructive algorithm for learning OOMs. We have also proposed an efficient variant of CEC-A, referred to as CEC-B. Our numerical experiments reveal:

- •
As expected, the two CEC algorithms improve over their predecessor, the EC algorithm, in most domains. CEC usually learns more accurate models than EC (and at least not worse ones), and when learning high-dimensional models, CEC is much faster than EC.

- •
Compared with the ES algorithm, both CEC algorithms are much faster (especially when learning models from large data sets), while still showing comparable test accuracy.

- •
For large data sets, the two CEC algorithms have nearly the same performance, but for small data sets, CEC-B is about two to three times faster than CEC-A.

We conclude with some open questions:

- •
Like the EC algorithm, the CEC algorithm currently takes the desired model dimension and a (fixed) basic string length as given parameters. Choosing these will have to be automated in the future. Also, the CEC algorithm should be generalized to be based, like the ES algorithm, on variable-length subsequence counting statistics instead of the fixed-length statistics of the current version.

- •
As Zhao, Jaeger, and Thon (2009) pointed out, a mathematical understanding of the relationships between the algebraic and the statistical aspects of EC/CEC versus ES, or of the algebraic working principle versus the statistical effects, is a theoretically intriguing question.

- •
As mentioned at the end of section 3, all OOM learning algorithms (and the OOM theory itself) suffer from the negative probability problem. As the undecidability result of Wiewiora (2008) shows, this is a difficult problem in the current linear framework of OOMs. In response to this issue, M.-J. Z. is developing and investigating nonlinear variants of OOMs that are free from this negativity problem by design (Zhao & Jaeger, 2007).

## Appendix A: Proofs and Algorithms

### A.1. Proof of Proposition 2.

Assume *C*_{1} and *Q*_{1} are fixed. Then, as *C*_{2} and *Q*_{2} are independent of each other, we can split problem 3.11 into two simpler problems: minimizing ‖*C*_{2}‖ subject to **1**_{m}^{⊺}*C*_{2} = **1**_{N}^{⊺}*L*_{2}, and minimizing ‖[*Q*_{1}; *Q*_{2}]‖_{2} over *Q*_{2}. Assume the *i*th column of *L*_{2} is *b*_{i} and denote by *y*_{i} the *i*th column of *C*_{2}. Then the first problem is equivalent to minimizing ‖*y*_{i}‖ subject to **1**_{m}^{⊺}*y*_{i} = **1**_{N}^{⊺}*b*_{i} (∀*i*), since ‖*C*_{2}‖^{2} = ∑_{i}‖*y*_{i}‖^{2}. By the Cauchy-Schwarz inequality, we know |**1**_{m}^{⊺}*y*_{i}| ⩽ ‖**1**_{m}‖ · ‖*y*_{i}‖, that is, ‖*y*_{i}‖ ⩾ |**1**_{N}^{⊺}*b*_{i}|/√*m*, with equality if and only if *y*_{i} = α**1**_{m}, which, together with the constraint **1**_{m}^{⊺}*y*_{i} = **1**_{N}^{⊺}*b*_{i}, implies that α = **1**_{N}^{⊺}*b*_{i}/*m*. It then follows that *y*_{i} = (**1**_{N}^{⊺}*b*_{i}/*m*)**1**_{m} is the unique minimizer of the first problem.

To prove that *Q*_{2} = 0 minimizes ‖[*Q*_{1}; *Q*_{2}]‖_{2}, we use an alternative definition of the spectral norm: ‖*A*‖_{2} = max{‖*A***x**‖ : ‖**x**‖ = 1}. Let **x** be such that ‖**x**‖ = 1 and ‖[*Q*_{1}; 0]‖_{2} = ‖[*Q*_{1}; 0]**x**‖. Then ‖[*Q*_{1}; *Q*_{2}]‖_{2}^{2} ⩾ ‖[*Q*_{1}; *Q*_{2}]**x**‖^{2} = ‖[*Q*_{1}**x**; *Q*_{2}**x**]‖^{2} ⩾ ‖[*Q*_{1}**x**; 0]‖^{2} = ‖[*Q*_{1}; 0]**x**‖^{2} = ‖[*Q*_{1}; 0]‖_{2}^{2}, and so ‖[*Q*_{1}; *Q*_{2}]‖_{2} ⩾ ‖[*Q*_{1}; 0]‖_{2}, which is what we wanted to prove.
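
The spectral-norm inequality established above, ‖[*Q*_{1}; *Q*_{2}]‖_{2} ⩾ ‖[*Q*_{1}; 0]‖_{2}, can be sanity-checked numerically, for example (illustrative Python with random matrices of arbitrary size):

```python
import numpy as np

rng = np.random.default_rng(1)
Q1 = rng.standard_normal((4, 6))
Q2 = rng.standard_normal((3, 6))

# Spectral norm of the stacked matrix [Q1; Q2] vs. the stack with Q2 = 0.
full = np.linalg.norm(np.vstack([Q1, Q2]), 2)
zeroed = np.linalg.norm(np.vstack([Q1, np.zeros_like(Q2)]), 2)
print(full >= zeroed)  # appending rows can never decrease the spectral norm
```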

### A.2. Proof of Proposition 3.

Assume *C*_{1}*D* has the compact SVD *C*_{1}*D* = *LSR*^{⊺}. Then *L*^{⊺}*L* = *I*_{m} and the constraint *C*_{1}*DQ*_{1} = *I*_{m} now reads *LSR*^{⊺}*Q*_{1} = *I*_{m}. It follows that *SR*^{⊺}*Q*_{1}*L* = *L*^{⊺}*LSR*^{⊺}*Q*_{1}*L* = *L*^{⊺}*L* = *I*_{m}, and so *R*^{⊺}*Q*_{1}*L* = *S*^{−1}, which in turn implies that *RR*^{⊺}*Q*_{1}*LL*^{⊺} = *RS*^{−1}*L*^{⊺} = (*C*_{1}*D*)^{†}. This identity, together with the facts that ‖*L*‖_{2} = ‖*R*‖_{2} = 1, ‖*A*‖_{2} = ‖*A*^{⊺}‖_{2}, and ‖*AB*‖_{2} ⩽ ‖*A*‖_{2} · ‖*B*‖_{2}, implies that ‖(*C*_{1}*D*)^{†}‖_{2} ⩽ ‖*Q*_{1}‖_{2}, which completes the proof.
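
Numerically, proposition 3 says that among all right inverses *Q*_{1} of *C*_{1}*D*, the pseudoinverse has the smallest spectral norm. A quick Python check (with a random full-row-rank matrix standing in for *C*_{1}*D*; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, N = 3, 8
A = rng.standard_normal((m, N))  # stands in for C1 @ D (full row rank)

# Build an arbitrary right inverse of A: the pseudoinverse plus a
# perturbation lying in the null space of A.
P = np.linalg.pinv(A)
W = rng.standard_normal((N, m))
Q1 = P + (np.eye(N) - P @ A) @ W

# Q1 is a right inverse, and the pseudoinverse has no larger spectral norm.
print(np.allclose(A @ Q1, np.eye(m)),
      np.linalg.norm(P, 2) <= np.linalg.norm(Q1, 2) + 1e-9)
```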

### A.3. Simple Construction of Two Orthogonal Matrices.

This section addresses two problems: (1) given two unit vectors **x** and **y**, how to find an orthogonal matrix *L* satisfying *L***y** = **x**; and (2) how to find a basis of the null space of **x**^{⊺}, that is, a matrix *H* satisfying *H*^{⊺}*H* = *I*_{n−1} and **x**^{⊺}*H* = 0. As one can easily see, *L* and *H* are not unique, so we add an extra condition on *L*: it should be as close to *I*_{n} as possible, in the sense that ‖*L* − *I*_{n}‖ is minimized. As *L*^{⊺}*L* = *I*_{n}, we have ‖*L* − *I*_{n}‖^{2} = 2tr(*I*_{n} − *L*). Thus, by brute force computation, *L* = *I*_{n} − (**x** + **y**)(**x** + **y**)^{⊺}/(1 + **x**^{⊺}**y**) + 2**x****y**^{⊺}. In the above formula, putting **y** = [0, …, 0, 1]^{⊺} and extracting the first (*n* − 1) columns of *L*, we get the matrix *H*.
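
This construction is easy to verify numerically. The following Python sketch implements the closest-rotation formula *L* = *I*_{n} − (**x** + **y**)(**x** + **y**)^{⊺}/(1 + **x**^{⊺}**y**) + 2**x****y**^{⊺} and checks that *L* is orthogonal, maps **y** to **x**, and yields the null-space basis *H*:

```python
import numpy as np

def closest_rotation(x, y):
    """Orthogonal L with L @ y = x, as close to the identity as possible:
    L = I_n - (x+y)(x+y)^T / (1 + x^T y) + 2 x y^T
    (x, y unit vectors, x != -y)."""
    n = len(x)
    u = x + y
    return np.eye(n) - np.outer(u, u) / (1.0 + x @ y) + 2.0 * np.outer(x, y)

rng = np.random.default_rng(3)
n = 5
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
y = np.zeros(n)
y[-1] = 1.0  # y = e_n, as in the construction of H

L = closest_rotation(x, y)
H = L[:, :n - 1]  # first n-1 columns of L: a basis of the null space of x^T

print(np.allclose(L.T @ L, np.eye(n)),  # L is orthogonal
      np.allclose(L @ y, x),            # L maps y to x
      np.allclose(x @ H, 0))            # columns of H are orthogonal to x
```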

## Acknowledgments

The results reported in this article were achieved when the first author worked under a grant (JA 1210/1-1 & 2) from the Deutsche Forschungsgemeinschaft (DFG). We also thank two anonymous reviewers for their very careful and constructive review.

## Notes

^{1}

For any matrix *A*, ‖*A*‖^{2} is the square sum of its entries, so ‖*A*‖^{2} = tr(*A*^{⊺}*A*).

^{2}

POMDP page, http://www.cs.brown.edu/research/ai/pomdp/, 1999.