## Abstract

Observable operator models (OOMs) are a class of models for stochastic processes that properly subsumes the class that can be modeled by finite-dimensional hidden Markov models (HMMs). One of the main advantages of OOMs over HMMs is that they admit asymptotically correct learning algorithms. A series of learning algorithms has been developed, with increasing computational and statistical efficiency, whose recent culmination was the error-controlling (EC) algorithm developed by the first author. The EC algorithm is an iterative, asymptotically correct algorithm that yields (and minimizes) an assured upper bound on the modeling error. The run time is faster by at least one order of magnitude than EM-based HMM learning algorithms and yields significantly more accurate models than the latter. Here we present a significant improvement of the EC algorithm: the constructive error-controlling (CEC) algorithm. CEC inherits from EC the main idea of minimizing an upper bound on the modeling error but is constructive where EC needs iterations. As a consequence, we obtain further gains in learning speed without loss in modeling accuracy.

## 1.  Introduction

Observable operator models (Jaeger, 2000b) are a mathematical model class for stochastic processes that generalizes hidden Markov models (HMMs) (Bengio, 1999) and can be represented in a structurally similar matrix formalism, with the crucial difference that the range of model parameters is extended from nonnegative to arbitrary real numbers. This relaxation on the parameter range endows OOMs with linear algebraic properties that are not available for HMMs. These properties engender a basic scheme for learning OOMs that is constructive and asymptotically correct. The statistical efficiency of this basic scheme, however, depends crucially on the design of two auxiliary matrices, which are now named the characterizer and the indicator (Zhao, Jaeger, & Thon, 2009) and will be denoted by C and Q, respectively, throughout this letter.

To optimize the design of the matrices C and Q is one of the major streams in OOM research. The early learning algorithms (Jaeger, 1998, 2000a, 2000b) relied on simple heuristics for creating C and Q, and the ensuing poor statistical efficiency made OOMs learned in this way no match for HMMs trained with the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). Kretzschmar (2003) identified the crucial role of the matrices C and Q in the basic scheme and provided a first learning algorithm whose rationale was to optimize their design. Since then, the importance and the challenge of optimizing C and Q have become increasingly better understood.

After Kretzschmar's lead work, two theoretically justified and computationally accessible optimality criteria for designing C and Q were worked out, resulting in the two current state-of-the-art instantiations of the basic learning scheme of OOMs: the efficiency-sharpening (ES) algorithm (Jaeger et al., 2005) and the error-controlling (EC) algorithm (Zhao, Jaeger, & Thon, 2009). Both ES and EC implement iterative schemes to optimize C and Q. Their computational and statistical efficiencies are by and large similar (and outperform EM-based HMM learning), with perhaps a small advantage for EC (Zhao, Jaeger, & Thon, 2009). The underlying ideas for the two are, however, interestingly and fundamentally different.

The perspective of ES is statistical. This algorithm aims at minimizing the effects of stochastic variation in C and Q. An algebraic characterization of minimal variance of the estimator is exploited to iteratively improve C and Q toward the goal of minimizing the variance of the overall algorithm.

The EC algorithm, adopting a complementary view, concentrates on algebraic properties of the learning equations. This algorithm optimizes C and Q by exploiting algebraic laws characterizing the condition of matrices, which in turn is related to an upper bound of the relative error in model parameters (or, for short, modeling error). As a side effect, this focus on matrix conditions affords a much improved numerical stability of EC over ES.

Although the basic OOM learning scheme is constructive, neither ES nor EC has this attractive property, for both employ an iterative procedure to compute the auxiliary matrices C and Q. In this letter, following the road of EC, we derive a new upper bound of modeling error that is tighter than the one used by the EC algorithm, and we propose a constructive method for minimizing this upper bound. Moreover, this constructive algorithm for optimizing C and Q is globally optimal, whereas the iterative methods employed by EC and ES are only locally optimal. Inserting this analytical minimizer into the basic learning scheme, we obtain an improved version of EC. The resulting constructive error-controlling (CEC) algorithm is the first learning algorithm for OOMs that is constructive (hence, fast), asymptotically correct (the hallmark of all OOM learning algorithms), and statistically efficient. This combination of properties recommends CEC, in our view, as a potentially eminently useful technique.

We perceive this work as a follow-up to Zhao, Jaeger, and Thon (2009). While self-contained, this letter is therefore more condensed in some expository parts than its precursor, and it may be helpful to consult the latter when more background or detail is desired. We first briefly reintroduce the EC algorithm in section 2, clarifying the problem to be addressed and fixing notation. Section 3 describes the derivation of CEC in detail, which includes a novel (tighter) upper bound of modeling error, an analytical minimizer of this error bound, and the overall CEC algorithm. We then assess the performance of the CEC algorithm in comparison with EC and ES on the same modeling tasks as those from Zhao, Jaeger, and Thon (2009) and document the empirical results in section 4. Finally, we summarize the letter and draw some conclusions in section 5.

## 2.  The Error-Controlling Algorithm

Let O = {a1, a2, …, a} be a finite set of observable values (an alphabet), and let be a stochastic process with all XnO and initial joint probabilities:
Following the notation convention of Zhao, Jaeger, and Thon (2009), we use lowercase letter with a bar to denote finite-length sequences of symbols from O(e.g., ) and O* to denote the set of all such finite sequences, including the empty sequence ϵ. The above family of joint probabilities can then be rewritten as , with the agreement that P(ϵ) = 1.
A (finite-dimensional) OOM for the process (Xn) is a triple of the m-dimensional Euclidean space , an O-indexed family {τa}aO of square matrices of order m (called observable operators), and an initial vector such that for any finite sequence , it holds that
2.1
where 1m denotes the m-dimensional column vector of units and the product of observable operators associated with the sequence (notice the reversal of order in indexing).

We consider the following OOM learning task: assuming the process (Xn) is stationary, ergodic, and has initial distribution determined by equation 2.1, the objective is to estimate an OOM of (Xn) from one of its finite initial realizations (say, ), which can be used to approximately reproduce the distribution of (Xn), that is, . A general constructive algorithm (the basic scheme) for this learning task is outlined as follows (Jaeger, 2000b; Jaeger et al., 2005):

1. For some sufficiently large , take two sets and of basic strings (’s are called indicative strings and ’s characteristic strings), which enumerate Ok, the set of all sequences over O of length k — so K = ℓk; and construct the K × K normalized counting matrices and (∀aO) with their (i, j)th entry determined by
where (T − 2k) is the normalization factor and (, respectively) is the concatenation of (and a, resp.) and ; for any , denotes the occurrence number of in and its occurrence number in s1s2sT−1.
2. Design two auxiliary matrices: the characterizer and the indicator , such that 1mC = 1K and the m × m matrix is invertible.

3. Obtain an (asymptotically correct) estimate of an OOM for the process (Xn) by computing observable operators by and an initial vector by solving the equations and .

We begin with a historical remark. Kretzschmar (2003) derived the learning equations and, noting the importance of the matrix in the above basic learning scheme, suggested that the design of C and Q should make the condition number minimal, where, for any matrix A, ‖A2 denotes its spectral norm (i.e., the largest singular value of A). However, this algebraic criterion, while apparently reasonable at first glance, is actually misguided in the sense that it is always possible to obtain κ = 1, the smallest possible value of κ, by first creating an arbitrary such that 1mC = 1K and that has full rank m, and then putting , the pseudo-inverse of : with C and Q constructed in this way, we know (the identity matrix of order m) and hence κ = 1.

An alternative algebraic criterion for optimizing the matrices C and Q, which does not fall prey to this fallacy on this direction, has been developed in our previous work. It is based on two findings. First, if the probability matrices (the “true” normalized counting matrices) and ’s,
are known, then the true OOM of the underlying process can be reconstructed using the equations (Jaeger et al., 2005). Second, as proven in proposition 3 of Zhao, Jaeger, and Thon (2009) the relative error in estimated observable operators and the statistical error in counting matrices and are governed by the inequality
2.3
where, for any matrix A, ‖A‖ denotes its Frobenius norm ; τ is the tall matrix formed by stacking all τa’s one above another: (in Matlab's notation); and , , and are tall matrices constructed, respectively, from ’s, ’s, and ’s by the same stacking scheme; and are the error matrices of and , respectively; and (recall that) ℓ is the alphabet size.
Inequality 2.3 is the theoretical foundation of the EC algorithm, from which one immediately sees that one possible way to control the modeling error is to minimize the quantity . This criterion, together with constraint 1mC = 1K, forms the optimization problem,
2.4
for obtaining optimal C and Q. In the EC algorithm, this problem is solved by a simple numerical method: one randomly creates an m × K matrix C = C(0) with column sums 1 and iteratively updates Q and C by
2.5
until some stopping criterion is met, say, κ(t−1)1 − κ(t)1 < δ, for some predefined number δ>0. Zhao, Jaeger, and Thon (2009) showed that {κ(t)1}t=1,2,… forms a decreasing sequence with κ(t)1 ⩾ 0, and so the convergence of (C(t), Q(t)) to a local optimum is guaranteed.

In sum, the error-controlling algorithm presented in Zhao, Jaeger, and Thon (2009) is actually an instantiation of the basic learning scheme in which the auxiliary matrices C and Q are optimized by the iterative procedure specified by equation 2.5. In the next section, we derive an improved version of EC in which the iterative procedure 2.5 is replaced by a globally optimal analytical solution to an optimization problem that is very similar to equation 2.4.

## 3.  The Constructive Error-Controlling Algorithm

Although the basic learning scheme is constructive, the EC algorithm is not, since it involves an iterative procedure to optimize C and Q. In this section, a constructive method for computing the matrices C and Q is described, yielding an improved version of EC that, to our knowledge, is the first constructive learning algorithm of OOMs. To this end, we introduce another upper bound of the relative error , which is tighter than equation 2.3. We state this upper bound as a proposition.

Proposition 1.
Following the notations that have been introduced, we have
3.1

Proof.
From the proof to proposition 3 of Zhao, Jaeger, and Thon (2009), we get
3.2
where (ℓ copies of C); using equation 2.3 and the definition of and , one can also directly verify this identity. To establish inequality 3.1, we invoke a well-known matrix inequality (see, e.g., Golub & Loan, 1996):
3.3
It follows from equations 3.2 and 3.3 that
But in proposition 3 of Zhao, Jaeger, and Thon (2009), we have already proven and , so inequality 3.1 follows.

Note that for any matrix A, ‖A2 equals the largest singular value of A, and ‖A‖ is the square root of the square sum of all singular values of A. So ‖A2 ⩽ ‖A‖, and equation 3.1 indeed provides a tighter upper bound of than equation 2.3.

Inequality 3.1 provides us with another algebraic criterion for computing the optimal C and Q that is very similar to equation 2.4:
3.4
Zhao, Jaeger, and Thon (2009) demonstrated that adding the extra constraint to problem 2.4 will not change its minimizer. Similarly, here we can prove the optimization problem 3.4 is essentially equivalent to
3.5
where the expression of κ2 has been simplified by using . To see this, for any minimizer (C, Q) of problem 3.4, we put C0 = C and , then and κ2(C, Q) = κ2(C0, Q0), which means (C0, Q0) also minimizes the quantity κ2 and, additionally, satisfies the constraint .

Before introducing the constructive method for minimizing equation 3.5, we make some remarks on problem 3.5 and the normalized counting matrices and obtained from the training sequence via equation 2.2. Note that some of these remarks have already been presented in Zhao, Jaeger, and Thon (2009). Here we restate them in a clearer way to clarify the meaning of some notations that will be used later:

• •
It is easy to see that in the learning equations , the normalized counting matrices can be replaced by the raw counting matrices and ’s, that is, , where
3.6
It is also clear that replacing by in problem 3.5 does not change its minimizer (C, Q), although the minimal value of κ2 is now reduced by a factor of (T − 2k).
• •

The order of the above raw counting matrices, K = ℓk, increases exponentially with k, which at first glance results in that only small values of k are feasible in the basic learning scheme, and hence only short-term history context effects of the target process can be learned, and the optimization problem 3.5 becomes computationally inaccessible quickly. Nevertheless, first, equation 3.6 reveals that, given the sequence , the element sums of and are (T − 2k), so for large k, the matrices and are actually very sparse; second, the learning equations reveal that, discarding zero rows (columns) in , the corresponding rows (columns) in ’s (we call the resulting matrices the compressed counting matrices and denote them by and ) and the corresponding columns in C (rows in Q) will not change the estimation ; third, those “undiscarded” columns of C (rows of Q), can be efficiently optimized by solving a compressed version of equation 3.5, which will be introduced presently. Therefore, in practice we can set the length k of basic strings to be large enough to capture long-term properties of the underlying process.

• •

As Zhao, Jaeger, and Thon (2009) pointed out, in a variant method to fix the basic strings and , one can admit them to have different length k. The only condition that has to be observed when selecting basic strings is that should form a “cover” of some set Ok, in the sense that for each string from Ok, there is a unique , which is a prefix of . Moreover, by using compact suffix trees (Gusfield, 1997), we can construct the compressed counting matrices and from the training sequence in linear time, that is, flops (Ukkonen, 1995).

Now assume the compressed counting matrices and are of size N × M (note that typically N, MK). By the above discussion, we know
3.7
where (resp. ) is formed from (resp. ) by discarding its columns (resp. rows) at positions corresponding to zero rows (resp. zero columns) in . It remains to find a way to compute and , which is efficient in memory space. Equation 3.5 is too inefficient to be directly used to compute C and Q and then extract and . To simplify the notation, we move to the upper-left area of (see equation 3.6) by reordering the basic strings and , making the other entries of be zero. Correspondingly, the matrices C, Q can now be written as and (in Matlab's notation), respectively, with and . The optimization problem 3.5 is therefore equivalent to1
3.8
By proposition 2 (see below), we know and Q0 = 0, which immediately follows that and that . We thus get the desired compressed version of problem 3.5:
3.9

### 3.1.  An Analytical Solution to Problem 3.9.

To simplify the notations, we first rewrite the problem but with the superscript dropped:
3.10
where is seen as some known matrix and , . Taking the (full-size) SVD of , say, , and putting and , we get a simplified form (the matrix now becomes the diagonal one ) of equation 3.10:
as applying a unitary transformation on a matrix will not change its spectral norm and Frobenius norm. Now assume that has rank r—where D = diag{d1, d2, …, dr} with d1d2 ⩾ ⋅ ⋅ ⋅ ⩾ dr>0. Then has the compact SVD , where L1 and R1 are tall matrices formed by extracting the first r columns from and , respectively. Writing and , and utilizing the fact that ‖[A, B]‖2 = ‖A2 + ‖B2, we expand the above optimization problem to
3.11
where CiCLi and QiRiQ (i = 1, 2).

Proposition 2.

In problem 3.11, let the matrices C1 and Q1 be fixed. Then and Q2 = 0 form a minimizer of the target κ22.

See section A.1 for the proof. According to the above proposition, we can rewrite problem 3.11 as
3.12
where l1NL1 and . Here we should point out that to get problem 3.12 from 3.10 requires only computing the compact SVD . This is because we can compute the quantity ϕ from L1 by
and retrieve the optimal C and Q from a solution (C1, Q1) to problem 3.12 by

Proposition 3.

In the optimization problem 3.12, if C1 is fixed, then Q1 = (C1D)† is a minimizer of κ22.

See section A.2 for the proof. Assume C1D has the compact SVD C1D = LSR, then C1 = LSRD−1 and, by proposition 3, Q1 = RS−1L. For this setting of (C1, Q1), the second constraint C1DQ1 = Im is automatically fulfilled, and the first constraint 1mC1 = l now becomes 1mLSR = lD, which implies that (1) the subspace (in ) spanned by columns of R (the range space of R) contains the vector Dl, that is, Rx = Dl for some ; and (2) by multiplying R on the right to this equality, 1mLS = lDR.

Furthermore, in the compact SVD C1D = LSR, let ri be the ith column of R and βi = riD−2ri and write S = diag{s1, s2, …, sm} with s1s2 ⩾ ⋅ ⋅ ⋅ ⩾ sm>0. Then, by brute force computation, we obtain
Now assume the matrix , which satisfies RR = Im and 1mLSR = lD, is already known. Then the vector lDRv = [v1, v2, …, vm] and the scalars βi = riD−2ri are also known. Moreover, by the equality 1mLS = lDR, the vector v has the property
3.13
which is a constant independent of the choice of the matrix R.
Writing 1mLu = [u1, u2, …, um], we get uisi = vi and so ui = vi/si from the constraint 1mLS = lDR = v. Since L is an orthogonal matrix, we know
3.14
Note that this is the only condition u should satisfy since we can choose L freely.
Summing up the above discussion, we get the expanded “scalar” version of problem 3.12:
3.15
For each i, let αi = si/sm. Then si = αism and α1 ⩾ α2 ⩾ ⋅ ⋅ ⋅ ⩾ αm = 1. From the equality ∑mi=1(vi/si)2 = m, we know . Therefore,
from which we immediately see that κ22 reaches its minimal value
3.16
when αi = 1—si = sm for all i. It then follows from equations 3.13 and 3.14 that
3.17
As the value of lD2l is independent of the matrix R, equation 3.16 indicates that to minimize the target κ22, the choice of R should make the sum ∑mi=1βi as small as possible. By the definition of βi, we know ∑mi=1βi = tr(RD−2R). We therefore get the following optimization problem to determine the matrix R:
3.18
Note that if R is a minimizer of problem 3.18, so is RU, where U can be any orthogonal matrix of order m. We thus assume, without loss of generality, that the last column of R is rm = Dl/‖Dl‖—the unit vector on the direction of Dl (for vectors, their Frobenius norm equals their Euclidean norm, so we can abuse our notation of ‖ · ‖)—since Dl lies in the range space of R. Thus, the matrix R has the form R = [HX, rm], where is fixed and has columns forming a basis of the null space of rm; that is, H can be any matrix with the properties HH = Ir−1 and rmH = 0 (see section A.3 for a simple construction of such a matrix H); and X is a some (r − 1) by (m − 1) matrix satisfying XX = Im−1. Substituting the expression R = [HX, rm] into equation 3.18, we get
As is well known, this optimization problem has an analytical solution: the ith column of X, denoted xi, is exactly the eigenvector of HD−2H with respect to its ith smallest eigenvalue λi, which, surprisingly, is equal to βi. This can be proven in one line: βi = riD−2ri = xiHD−2Hxi = λixixi = λi.
It now remains only to find an orthogonal matrix L such that 1mL = u. With the matrix R constructed as above, we find that
that is, vi = 0 for i < m and vm = ‖Dl‖. By the relation uisi = vi and equation 3.17, we know , so L can be any orthogonal matrix with as its last column. For instance, we can set
3.19
A more general construction of such matrices L is presented in appendix A.3.

### 3.2.  The Constructive Error-Controlling Algorithm.

The discussion on problem 3.9 sums up to a constructive procedure for evaluating the optimal and that minimizes the modeling error bound presented in inequality 3.1. Inserting this procedure into the basic learning scheme, we get the following constructive error-controlling (CEC) algorithm, in which some Matlab-style notation is employed:

The time complexity of the CEC algorithm is dominated by the collection of the compressed counting matrices, which costs flops, and the computation of the compact SVD , which, in the worst case where the matrix has full rank r = N (assume that M = N), amounts to flops (Golub & Loan, 1996). Thus, the total time complexity of CEC is about . Here we recall that the EM algorithm for training HMMs and the ES algorithm have time complexity , where n is the number of EM or ES iterations. We therefore conclude that the CEC algorithm is faster than EM and ES, since NT is the typical case. The comparision between EC and CEC is more sophisticated. As analyzed in Zhao, Jaeger, and Thon (2009), the time complexity of EC is about , which is typically higher than that of CEC, so EC is slower than CEC, as shown by our simulations. However, in cases where large (full-rank) counting matrices are obtained, CEC may need more CPU time (mainly due to the SVD of ) than EC.

### 3.3.  An Efficient Variant to CEC.

We now introduce an efficient variant to the CEC algorithm. The time complexity of the CEC algorithm can be reduced to if we compute only the first m (the model dimension) singular triples of the matrix . That is, instead of the compact SVD , we use the first (largest) m singular values and left- (right-) singular vectors of to form the matrix D, L1, and R1, respectively, and then estimate the OOM as in algorithm 1. To distinguish the two versions of CEC, we call the efficient variant presented here the CEC-B algrithm and the original one described by algorithm 1 the CEC-A algorithm in the sequel.

The basic idea behind CEC-B is rather simple. The counting matrix is in fact the statistical approximation of the probability matrix (up to a counting factor), which should have rank m. Recall that we have already assumed that the underlying process can be modeled by some m-dimensional OOM. Therefore, we can (and it is quite natural) first estimate (the “true” counting matrix) from and then use to design the optimal C and Q, which is what CEC-B does.

By the above discussion, one immediately sees that CEC-A and CEC-B should have approximately the same performance, especially when the training data set is large (since then ). We would, however, emphasize again that CEC-B is computationally cheaper than CEC-A.

We conclude this section with an important remark on a painful issue in OOM theory: the negative probability problem (Jaeger, 2000b). It refers to the fact that the model learned by existing OOM learning algorithms, including the CEC algorithm presented in this letter, can produce negative “probabilities” for some sequence . For a long time, we and others attempted to find ways to eliminate this problem. Very recently, it was proven that it is undecidable to determine whether a given candidate set of observable operators leads to negative values on some strings (Wiewiora, 2008). This finding does not preclude the possibility of finding nontrivial and practically useful sufficient conditions for nonnegativity. In practice, the current state of the art relies on heuristic methods that essentially “repair” the OOM state vector by small correction terms when it would give negative “probability” values. These work well (see the one presented in appendix J in Jaeger et al., 2005).

## 4.  Empirical Results

In this section, we assess the performance of CEC-A,B in comparison with the EC and ES algorithms on three sets of symbolic sequence modeling tasks: the first two are repeated from Zhao, Jaeger, and Thon (2009): learning models for a quantized logistic system and a suite of seven partially observable Markovian decision processes (POMDPs) that have been used frequently as benchmark tasks in the literature. The third is from (Jaeger et al., 2005): to train OOMs on Mark Twain's short story, “The One Million Pound Bank-Note.”

### 4.1.  Modeling the Symbolic Logistic System.

In this experiment we consider the task of modeling a dynamical system that consists of a continuous-valued iterated map followed by a quantizer that maps real numbers to discrete symbols. The dynamics of the continuous-valued procedure is governed by the logistic mapping xn+1 = 4xn(1 − xn), with initial values x0 arbitrary selected from the interval (0, 1). As is well known, in its attractor set (0, 1), this system shows strong chaotic behavior (with Lyapunov exponent λ = ln 2). The attractor set (0, 1) is divided into ℓ = 16 equidistant subintervals, yielding an alphabet of 16 symbols and a 16-output quantizer that converts the continuous-valued sequence (xn) into a symbolic process (Xn).

We follow the experimental settings of Zhao, Jaeger, and Thon (2009): (1) 20 sequences, each of which has length L = 30,000, are created by running the quantized logistic system; (2) OOMs of dimension m ∈ {5, 10, …, 30, 40, …, 70} are then estimated from each such sequence, using the EC, ES, and CEC-A,B algorithms; (3) we compute the normalized log-likelihood (NLL) of these learned models on the other 19 sequences as follows:
4.1
where ℓ is the alphabet size, S is the test data set that consists of the other 19 (testing) sequences, and S# is the number of sequences in S, which takes the value 19 in this experiment. Here are the settings of our simulation:
• •

When evaluating test NLLs of learned OOMs, the heuristic method described in appendix J of Jaeger et al. (2005) is employed to address the negative probability problem, from which each of the EC, ES, and CEC-A,B algorithms suffers.

• •

The length of characteristic/indicative strings is set to be k = 5.

• •

As in Zhao, Jaeger, and Thon (2009), the ES algorithm is terminated after 10 iterations, and the estimated model with the highest training NLL is selected as the final output of ES. But unlike as in Zhao, Jaeger, and Thon (2009), ES is now started with a more carefully chosen initial model, so that it is numerically stable for all model dimensions.

As Zhao, Jaeger, and Thon (2009) have pointed out, in practice, the quantity should assume values from −1 to −H(Xn), where (Xn) is the underlying stochastic process that we want to model and H(Xn) is its entropy rate (under the base-ℓ logarithm), defined by the formula
4.2
where, as introduced at the beginning of section 2, is the (initial) probability distribution of the process (Xn). Specifically, it turns out that the quantized logistic process described above has entropy rate H(Xn) = 0.25 (Zhao, Jaeger, & Thon, 2009), meaning that the quantized logistic process is indeed a stochastic process, although its continuous space motion is rather simple and deterministic (in the sense that we can accurately predict the next output xn+1 based on the current one xn). Thus, in this experiment, a test NLL of −H(Xn) = −0.25 represents the upper limit that a learning algorithm can reach.
In sum, here our data set contains 20 sequences of length 30,000, from each of which four OOMs of dimensions m ∈ {5, 10, …, 30, 40, …, 70} were estimated, by the EC, ES, and CEC-A,B algorithm, respectively. The test NLL of these learned models was then computed on the other 19 sequences. We therefore obtain
NLL values after the simulation. We plotted in Figure 1a the average test NLLs of each OOM learning algorithm versus model dimensions, in Figure 1b the standard deviation of these test NLLs, and in Figure 2 the total CPU time needed for learning these models. All algorithms are programmed in Matlab (with C-mex functions embedded) and implemented on a Pentium-M 1.73 GHz laptop.
Figure 1:

Test NLLs of learned OOMs on the quantized logistic system: (a) The average test NLLs. To make the difference more visible, the y-axis is computed by y = −0.256 − average-NLL and represented in logarithmic scale (note that the number −0.256 is a little smaller than the previously introduced upper limit of test NLLs −0.25). (b) The standard deviation of the test NLLs. Note that in this experiment (also in the second and the third experiments), models learned by CEC-A and CEC-B have nearly the same quality. The relative difference in test NLL or one-step prediction error (see equation 4.3) is less than 0.1% on all data sets. As a consequence, in Figures 1, 3, and 5, the curves for CEC-A and CEC-B are almost identical.

Figure 1:

Test NLLs of learned OOMs on the quantized logistic system: (a) The average test NLLs. To make the difference more visible, the y-axis is computed by y = −0.256 − average-NLL and represented in logarithmic scale (note that the number −0.256 is a little smaller than the previously introduced upper limit of test NLLs −0.25). (b) The standard deviation of the test NLLs. Note that in this experiment (also in the second and the third experiments), models learned by CEC-A and CEC-B have nearly the same quality. The relative difference in test NLL or one-step prediction error (see equation 4.3) is less than 0.1% on all data sets. As a consequence, in Figures 1, 3, and 5, the curves for CEC-A and CEC-B are almost identical.

Figure 2:

Training CPU time of the ES, EC and CEC-A,B algorithms on the quantized logistic system.

Figure 2:

Training CPU time of the ES, EC and CEC-A,B algorithms on the quantized logistic system.

From the figures, we see that (1) CEC yields slightly more accurate models than EC and ES; (2) for large model dimensions, CEC is significantly faster than EC and ES; and (3) while CEC-B has nearly the same test NLLs as CEC-A (see the Figure 1 caption for more detail), it is two to three times faster than CEC-A. Another important observation is that for model dimensions m ⩾ 40, all models (under ES, EC, and CEC) level out in their NLLs, which reach ≈−0.261, −0.257, −0.257, respectively, with a small margin toward the theoretical optimum of −0.25.

### 4.2.  POMDP Benchmark Suite.

In the second experiment we compare the performance of ES, EC, and CEC on the task of learning models of seven POMDPs that were frequently used as benchmark tasks for assessing the performance of predictive state representations (PSRs) (Littman & Sutton, 2001) in the literature.

First, we point out the relationship between OOMs and PSRs. PSRs are a class of models for discrete, stochastic input-output systems, which generalizes partially observable Markov decision processes (POMDPs) like OOMs generalize HMMs. PSRs have been partly inspired by OOMs but have developed a representation format for input-output processes different from the format of input-output OOMs described in Jaeger (1998). Developing learning algorithms and specialized variants of PSRs is an active area of research (McCracken & Bowling, 2005; Wolfe, James, & Singh, 2005; Bowling, McCracken, James, Neufeld, & Wilkinson, 2006; Rudary & Singh, 2006; Wolfe, 2006; Wingate & Singh, 2007). Zhao, Jaeger, and Thon (2009) demonstrated that ES and EC strongly outperform known PSR learning algorithms. These results are not repeated here. We remark furthermore that extending the ES, EC, and CEC algorithms to input-output OOMs constitutes a current line of research in our group.

The seven target POMDPs are taken from a public Web source.2 They are Bridge, Cheese, Maze4x3, Network, Paint, Shuttle, and Tiger. We use the same experimental settings as Zhao, Jaeger, and Thon (2009) and for more detail refer readers to that article. We use notations common in the PSR literature:

• •

As is commonly done in the PSR field, the input policy is simply to choose, at each time step, an action a from the input alphabet, say A, according to the uniform random distribution (over A). So the training/test sequences have the form a1o1a2o2, …, aNoN, a sequence of alternating actions aiA and observations oiO.

• •

For each of the POMDP domains, we train OOMs on (input-output) training sequences of varying lengths N = 103–107.3 using the CEC-A or B (or EC, ES) algorithm, as follows. We (naively) regard each combination (aioi) of an action ai and its immediate observation oi as a single symbol si from the alphabet S = A × O and learn an OOM from the sequence s1s2sN (with the length of basic strings set to be k = 2).

• •
The quality of learned models is measured by their average one-step prediction error E on testing sequences of length N = 104 (the format that has been used for assessing the main PSR algorithms):
4.3
where P is the correct probability (computed using the known underlying POMDP) and is the model prediction for the next observation given the testing history hi = a1o1, …, aioi and the action ai+1. For learned OOMs (over S = A × O), we compute the above predicting probabilities by
4.4

In Figure 3 we plotted the average one-step prediction error E (y-axis) of the ES, EC, and CEC-A,B algorithms on the seven POMDP domains versus training sequence length (x-axis), and in Figure 4 the training CPU time cost by these algorithms for each training length. These are the main findings:

• •

The four OOM learning algorithms all show a near-log-linear relationship between the length N of training sequences and the prediction error E—that is, log E ≈ α − βlog N, or, equivalently, EeαN−β = E0N−β (E0, β>0)—in all except the Cheese domain. This partly demonstrates that OOM learning is asymptotically correct (since E → 0 when N → ∞) and statistically efficient (since the convergence of E to 0 is fast), as has been claimed in our previous OOM references.

• •

In the Cheese domain, the prediction error E does not decrease anymore when N ⩾ 105. One possible reason is that the basic string length k = 2 is too small to capture all relevant statistics. We therefore increase the value of k by 1 and do the simulation again. Figure 3h shows the empirical results, from which we see the desired property of OOM learning.

• •

The two CEC algorithms and the EC algorithm have nearly the same one-step prediction error. But for a small training data set, CEC is faster than EC, whereas for a large data set, the difference in speed becomes invisible. The reason is that when the training sequence length T is large (compared to the size N of counting matrices), both CEC and EC have time complexity dominated by , and when T is small, CEC has the time complexity or , smaller than that of EC, which is .

• •

Neither CEC nor EC outperforms ES in all domains. They win in the Network, Paint, and Tiger domains but are no match for ES in the Cheese and Maze4x3 domains. The four algorithms show essentially the same performance in the Shuttle domain. But both CEC and EC are about 10 times faster than ES, especially for large training data sets.

• •

Bridge is an interesting domain, in which CEC/EC is inferior to ES when N ⩽ 105, but when N becomes larger (⩾3 × 105), CEC/EC starts to yield a smaller prediction error than ES (In fact, other domains show a similar phenomenon, which is, however, less obvious, as in the Bridge domain). This illustrates that asymptotically, CEC/EC converges faster than ES () but also reveals that in some domains, CEC/EC might not be as statistically efficient as ES when the training data set is small.

• •

In the Bridge domain, the prediction error of EC jumps to a high level at T = 107, and ES shows a similar jump in the Paint domain when T = 106, whereas CEC always get smoothly decreasing prediction error. This partially illustrates that CEC are numerically more stable than EC and ES.

Figure 3:

Average one-step prediction error of the ES, EC, and CEC-A,B algorithms on seven POMDP benchmark problems (x-axis: (log10 of) length of training sequences; y-axis: average one-step prediction error). Panel h is a rerun of panel b with different parameter settings. See the text.

Figure 3:

Average one-step prediction error of the ES, EC, and CEC-A,B algorithms on seven POMDP benchmark problems (x-axis: (log10 of) length of training sequences; y-axis: average one-step prediction error). Panel h is a rerun of panel b with different parameter settings. See the text.

Figure 4:

Training time of the ES, EC, and CEC-A,B algorithms on seven POMDP benchmark problems (x-axis: (log10 of) length of training sequences; y-axis: training CPU time). Panel h is a rerun of panel b with different parameter settings. See the text.

Figure 4:

Training time of the ES, EC, and CEC-A,B algorithms on seven POMDP benchmark problems (x-axis: (log10 of) length of training sequences; y-axis: training CPU time). Panel h is a rerun of panel b with different parameter settings. See the text.

It should be noted that in the Maze4x3 domain (see Figure 4c), the training CPU times for EC and CEC-A,B show a significant drop when the training data size increases from T = 105.67 to T = 106. This is not a coincidence, because we repeated the experiment 10 times, Matlab's rand() function initialized by different seeds and always got similar results. One possible reason for this is that for training sequences of length ⩾106, the counting matrices are already very near to the true probability matrices and so have (numerical) rank equal to the model dimension m = 11. This might speed up the SVD of the counting matrix —only the first 11 singular values and vectors need to be computed—and reduce the number of EC iterations needed for optimizing the two auxiliary matrices C and Q. In the ideal case where the counting matrix has rank m, only one EC iteration is needed.

### 4.3.  Modeling Text Sources.

The third experiment models a text source—“The 1,000,000 Pound Bank-Note,” a short story written by Mark Twain. To simplify the task, we preprocessed the text string by deleting all special characters except the blank, changing capital letters to small ones, which gave us an alphabet of 27 symbols: a, b, …, z plus the blank. We then sorted the resulting string sentence-wise, obtaining two substrings of roughly equal length (21,042 and 20,569 symbols, respectively) that were used as training and test sequences.

As before, we estimated OOMs of dimensions {5, 10, …, 50, 60, …, 100, 120, 150} from the training sequence by the EC, ES, and CEC-A,B algorithms, respectively, and then computed the training and test NLLs of each learned OOM, as shown in Figures 5a and 5b. Here the length of characteristic or indicative strings is set as k = 2. We also plotted the training time of each learning algorithm versus the model dimension in Figure 6. From these figures, we see that OOMs learned by the four algorithms have nearly the same training and test NLLs (ES is a bit better for low-dimensional models but becomes worse when model dimension increases). Among the four algorithms CEC-B is the fastest one: it is about 2 times faster than CEC-A, which is in turn 10 to 100 times faster than ES and EC.

Figure 5:

Training (a) and test (b) NLLs of learned OOMs in “The 1,000,000 Pound Bank-Note.”

Figure 5:

Training (a) and test (b) NLLs of learned OOMs in “The 1,000,000 Pound Bank-Note.”

Figure 6:

Training times of the ES, EC, and CEC-A,B algorithms on “The 1,000,000 Pound Bank-Note.”

Figure 6:

Training times of the ES, EC, and CEC-A,B algorithms on “The 1,000,000 Pound Bank-Note.”

## 5.  Conclusion and Discussion

We have derived the constructive error-controlling algorithm (CEC-A), the first constructive algorithm for learning OOMs. We have also proposed an efficient variant of CEC-A, referred to as CEC-B. Our numerical experiments reveal:

• •

As expected, the two CEC algorithms improve over their predecessor, the EC algorithm, in most domains. CEC usually learns more accurate models (at least not worse) than EC. When learning high-dimensional models, CEC are much faster than EC.

• •

Compared with the ES algorithm, both CEC algorithms are much faster (especially when learning models from large data sets), while still showing comparable test accuracy.

• •

For large data sets, the two CEC algorithms have nearly the same performance, but for small data sets, CEC-B is about two to three times faster than CEC-A.

We conclude with some open questions:

• •

Like the EC algorithm, currently the CEC algorithm is given the desired model dimension and a (fixed) basic string length as parameters. This will have to be automated in the future. Also, the CEC algorithm should be generalized to become, like the ES algorithm, based on variable-length subsequence counting statistics, instead of the fixed-length statistics of the current version.

• •

As Zhao, Jaeger, and Thon (2009) pointed out, a mathematical understanding of the relationships between the algebraic and the statistical aspects of EC/CEC versus ES, or of the algebraic working principle versus the statistical effects, is a theoretically intriguing question.

• •

As mentioned at the end of section 3, all OOM learning algorithms (and the OOM theory itself) suffer from the negative probability problem. As the undecidability result of Wiewiora (2008) shows, this is a difficult problem in the current linear framework of OOMs. In response to this issue, M.-J. Z. is developing and investigating nonlinear variants of OOMs that are free from this negativity problem by design (Zhao & Jaeger, 2007).

## Appendix A:  Proofs and Algorithms

### A.1.  Proof of Proposition 2.

If C1 and Q1 are fixed, then, as C2 and Q2 are independent of each other, we can split problem 3.11 into two simpler problems:
Assume the ith column of L2 is bi and denote by yi the ith column of C2. Then the above first problem is equivalent to , (∀ i), since ‖C22 = ∑iyi2. By the Cauchy-Schwarz inequality, we know |1myi| ⩽ ‖1m‖ · ‖yi‖, that is, ; with equality if and only if yi = α1m, which, together with the constraint 1myi = 1Nbi, implies that . It then follows that is the unique minimizer of the first problem.

To prove that Q2 = 0 minimizes ‖[Q1; Q2]‖2, we will use an alternative definition of spectral norm: ‖A2 = max{‖Ax‖ : ‖x‖ = 1}. Let x be such that ‖x‖ = 1 and ‖[Q1; 0]‖2 = ‖[Q1; 0]x‖, then ‖[Q1; Q2]‖22 ⩾ ‖[Q1; Q2]x2 = ‖[Q1x;  Q2x]‖2 ⩾  ‖[Q1x; 0]‖2 =  ‖[Q1; 0]x2 =  ‖[Q1; 0]‖22 and so ‖[Q1; Q2]‖2 ⩾ ‖[Q1; 0]‖2, which is what we want to prove.

### A.2.  Proof of Proposition 3.

Assume C1D has the compact SVD C1D = LSR. Then LL = Im and the constraint C1DQ1 = Im now reads LSRQ1 = Im. It follows that SRQ1L = LLSRQ1L = LL = Im, and so RQ1L = S−1, which in turn implies that RRQ1LL = RS−1L = (C1D)†. This identity, together with the facts that ‖L2 = ‖R2 = 1, ‖A2 = ‖A2, and ‖AB2 ⩽ ‖A2 · ‖B2, implies that ‖(C1D)†‖2 ⩽ ‖Q12, which completes the proof.

### A.3.  Simple Construction of Two Orthogonal Matrices.

Here we consider the following simple problems in linear algebra: given two vectors with unit norm, (1) how to (efficiently) construct an orthogonal matrix such that Lx = y and (2) how to find a basis of the null space of x, that is, to find a matrix satisfying HH = In−1 and xH = 0? As one can easily see, L and H are not unique, so we add an extra condition on L: it should be as close to In as possible, in the sense that ‖LIn‖ is minimized. As LL = In, we have ‖LIn2 = 2tr(InL). Thus, by brute force computation,
A.1
In the above formula, putting y = [0, …, 0, 1] and extracting the first (n − 1) columns of L, we get the matrix H.

## Acknowledgments

The results reported in this article were achieved when the first author worked under a grant (JA 1210/1-1 & 2) from the Deutsche Forschungsgemeinschaft (DFG). We also thank two anonymous reviewers for their very careful and constructive review.

## Notes

1

For any matrix A, ‖A2 is the square sum of its entries, so .

## References

Bengio
,
Y.
(
1999
).
Markovian models for sequential data
.
Neural Computing Surveys
,
2
,
129
162
.
Bowling
,
M.
,
McCracken
,
P.
,
James
,
M.
,
Neufeld
,
J.
, &
Wilkinson
,
D.
(
2006
).
Learning predictive state representations using non-blind policies
. In
Proceedings of the 23rd International Conference on Machine Learning (ICML)
(pp. 129–136)
.
New York
:
ACM Press
.
Dempster
,
A. P.
,
Laird
,
N. M.
, &
Rubin
,
D. B.
(
1977
).
Maximum likelihood from incomplete data via the EM-algorithm
.
Journal of the Royal Statistical Society
,
39
(
1
),
1
38
.
Golub
,
G. H.
, &
Loan
,
C. F. v.
(
1996
).
Matrix computations
(3rd ed.).
Baltimore, MD
:
Johns Hopkins University Press, 3rd edition
.
Gusfield
,
D.
(
1997
).
Algorithms on strings, trees, and sequences: Computer science and computational biology
.
Cambridge
:
Cambridge University Press
.
Jaeger
,
H.
(
1998
).
Discrete-time, discrete-valued observable operator models: A tutorial
. (GMD Rep. 42).
Sankt Augustin
:
GMD
.
Jaeger
,
H.
(
2000a
).
Modeling and learning continuous-valued stochastic processes with OOMs
(GMD Rep. 102).
Sankt Augustin
:
GMD
.
Jaeger
,
H.
(
2000b
).
Observable operator models for discrete stochastic time series
.
Neural Computation
,
12
(
6
),
1371
1398
.
Jaeger
,
H.
,
Zhao
,
M.-J.
,
Kretzschmar
,
K.
,
Oberstein
,
T. G.
,
Popovici
,
D.
, &
Kolling
,
A.
(
2005
).
Learning observable operator models via the ES algorithm
. In
S. Haykin, J. Principe, T. Sejnowski, & J. McWhirter (Eds.)
,
New directions in statistical signal processing: From systems to brains
.
Cambridge, MA
:
MIT Press
.
Kretzschmar
,
K.
(
2003
).
Learning symbol sequences with observable operator models (GMD Rep. 161)
.
Sankt Augustin
:
Fraunhofer Institute AIS
.
Littman
,
M. L.
, &
Sutton
,
R. S.
(
2001
).
Predictive representation of state
. In
T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.)
,
Advances in neural information processing systems, 14
(pp. 1555–1561)
.
Cambridge, MA
:
MIT Press
.
McCracken
,
P.
, &
Bowling
,
M.
(
2005
).
Online discovery and learning of predictive state representations
. In
Y. Weiss, B. Schölkopf, & J. Platt (Eds.)
,
Advances in neural information processing systems
,
18
(pp. 875–882)
.
Rudary
,
M.
, &
Singh
,
S.
(
2006
).
Predictive linear-gaussian models of controlled stochastic dynamical systems
. In
Proceedings of the 23rd International Conference on Machine Learning (ICML)
(pp. 777–784)
.
New York
:
ACM Press
.
Ukkonen
,
E.
(
1995
).
On-line construction of suffix trees
.
Algorithmica
,
14
(
3
),
249
260
.
Wiewiora
,
E. W.
(
2008
).
Modeling probability distributions with predictive state representation
.
Unpublished doctoral dissertation, University of California, San Diego
.
Wingate
,
D.
, &
Singh
,
S.
(
2007
).
On discovery and learning of models with predictive state representations of state for agents with continuous actions and observations
. In
Proceedings of the 6th International Conference on Autonomous Agents and Multiagent Systems (AAMAS)
.
New York
:
ACM Press
.
Wolfe
,
B.
(
2006
).
Predictive state representations with options
. In
Proceedings of the 23rd International Conference on Machine Learning (ICML)
(pp. 1025–1032)
.
New York
:
ACM Press
.
Wolfe
,
B.
,
James
,
M. R.
, &
Singh
,
S. P.
(
2005
).
Learning predictive state representations in dynamical systems without reset
. In
Proceedings of the 22nd International Conference on Machine Learning
(pp. 985–992)
.
New York
:
ACM Press
.
Zhao
,
M.-J.
, &
Jaeger
,
H.
(
2007
).
Norm observable operator models
(Tech. Rep. 8)
.
Bremen
:
Jacobs University
.
Zhao
,
M.-J.
,
Jaeger
,
H.
, &
Thon
,
M.
(
2009
).
A bound on modeling error in observable operator models and an associated learning algorithm
.
Neural Computation
,
21
(
9
),
2687
2712
.