Abstract

Optimal control theory and machine learning techniques are combined to formulate and solve in closed form an optimal control formulation of online learning from supervised examples with regularization of the updates. The connections with the classical linear quadratic gaussian (LQG) optimal control problem, of which the proposed learning paradigm is a nontrivial variation as it involves random matrices, are investigated. The obtained optimal solutions are compared with the Kalman filter estimate of the parameter vector to be learned. It is shown that the proposed algorithm is less sensitive to outliers than the Kalman estimate (thanks to the presence of the regularization term), thus providing estimates that are smoother with respect to time. The basic formulation of the proposed online learning framework refers to a discrete-time setting with a finite learning horizon and a linear model. Various extensions are investigated, including the infinite learning horizon and, via the so-called kernel trick, the case of nonlinear models.

1  Introduction

In recent years, the combination of techniques from the fields of optimization/control and machine learning has led to a successful interaction between the two disciplines. As an example, we mention reinforcement learning (Bertsekas, 1996b; Sutton & Barto, 1998). The cross-fertilization between these two fields shows itself in both directions.

1.1  Application of Machine Learning Techniques to Optimization/Control

Sparsity-inducing regularization techniques from machine learning have often been exploited to find suboptimal solutions to an initially unregularized optimization problem, having at the same time a sufficiently small number of nonzero arguments. For instance, the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) was applied in Gnecco, Morisi, and Bemporad (2015) to consensus problems and in Gallieri and Maciejowski (2012) to model predictive control (MPC).

Applications of machine learning techniques to control can be found in Suykens, Vandewalle, & De Moor (2001) and in the series of papers (Gnecco & Sanguineti, 2010; Gaggero, Gnecco, & Sanguineti, 2013, 2014), where least squares support vector machines (LS-SVMs) and one-hidden-layer perceptron neural networks, respectively, were applied to find suboptimal solutions to optimal control problems. Mesbahi and Egerstedt (2010) applied spectral graph theory methods, already exploited successfully in machine learning problems (Belkin & Niyogi, 2006), to the control of multiagent dynamical systems.

Least squares support vector machines and spectral graph theory also have been applied, respectively, to system identification (Lataire, Piga, & Tóth, 2014) and control of epidemics (Chakrabarti, Wang, Wang, Leskovec, & Faloutsos, 2008).

1.2  Application of Optimization/Control Techniques to Machine Learning

This is the direction we followed in this letter: we develop an approach that exploits, for machine learning, techniques from optimization and control (in particular, optimal control). We focus on the problem of estimating online an unknown parameter vector modeling a noisy linear relationship between input examples and their output labels. The emphasis is on linear regression, but an extension to nonlinear regression is also detailed. Moreover, the model can be applied to linear (respectively, nonlinear) binary classification by interpreting a suitable nonlinear transformation (e.g., a sigmoidal mapping) of the output as the a posteriori probability (given the input example) of belonging to one of the two classes.

Specifically, we propose and solve in closed form an optimal control formulation of online learning with supervised examples and a regularization of the updates. In the online framework, the examples become available one by one as time passes and the training of the learning machine is performed continuously. Online learning problems have been investigated in, for example, Zinkevich (2003), Ralaivola and d’Alché Buc (2005), Smale and Tao (2006), Ying and Pontil (2008), Liu, Park, Wang, & Príncipe (2009), and Shalev-Shwartz (2012), but without using an approach based on optimal control theory. As suggested by the preliminary results that we obtained in Gnecco, Bemporad, Gori, Morisi, and Sanguineti (2015), such an approach can provide a strong theoretical foundation for the choice of a specific online learning algorithm by selecting the updates of the parameter estimate as the outputs of a sequence of control laws that solve a suitable optimal control problem modeling online learning itself.1 A distinguishing feature of our study is that we derive online learning algorithms as closed-form optimal solutions to suitable online learning problems. In contrast, work in the literature typically proposes a certain algorithm and then investigates its properties, but does not analyze the optimality of such an algorithm with respect to a suitable online learning problem. A remarkable exception is Bertsekas (1996a), but it refers to a deterministic optimization problem and, different from our approach, does not contain any regularization of the updates of the parameter estimate.

In a nutshell, our contributions are the following:

  • We make the machine learning community aware of a point of view that until now might have been overlooked.

  • By exploiting such a viewpoint, we develop a novel machine learning paradigm, for which we provide closed-form solutions.

  • We make connections between our results and other machine learning algorithms.

1.3  The Adopted Learning Model

The learning model that we adopt can be considered a nontrivial variation (due to the presence of suitable random matrices) of the linear quadratic (LQ) and linear quadratic gaussian (LQG) optimal control problems, which we briefly summarize. The LQ problem (Bertsekas, 1995) consists of the minimization, with respect to a set of control laws (one for each decision stage), of a convex quadratic cost related to the control of a linear dynamical system; this cost is decomposed into the summation of several convex quadratic per-stage costs, associated with a priori given cost matrices. At each stage, a control law is applied. It is a function of an information vector, which collects all the information available to the controller up to that stage. More precisely, the information vector is formed by the sequence of controls applied to the dynamical system up to the previous stage and by the sequence of measures of the state of the dynamical system itself, acquired up to the current stage. A peculiarity of the LQ problem is that such measures are linearly related to the state, again through suitable a priori given measurement matrices. The measures may be corrupted by additive noise, with given covariance matrices. When all the noise vectors are gaussian, one obtains the LQG problem, for which closed-form optimal control laws in feedback form are known.2 They are computed by solving recursively suitable Riccati equations and applying the Kalman filter (Sutton, 1992) to estimate the current state of the dynamical system.
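For readers less familiar with this machinery, the following is a minimal sketch (in Python, with illustrative placeholder matrices, not the letter's own formulation) of the backward Riccati recursion that produces the feedback gains of a classical finite-horizon LQ regulator.

```python
import numpy as np

def lqr_gains(A, B, Q, R, Q_final, N):
    """Backward Riccati recursion for a finite-horizon LQ regulator.

    Returns the feedback gains K_0, ..., K_{N-1} such that, in the classical
    LQ setting with deterministic, a priori given matrices, the optimal
    control at stage k is u_k = -K_k x_k.
    """
    P = Q_final
    gains = []
    for _ in range(N):
        # Gain at the current stage, given the next-stage cost-to-go matrix P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update of the cost-to-go matrix.
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        gains.append(K)
    return gains[::-1]  # reorder so that gains[k] corresponds to stage k

# Illustrative placeholder system: a double integrator with quadratic costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = lqr_gains(A, B, Q=np.eye(2), R=np.array([[0.1]]), Q_final=np.eye(2), N=20)
print(K[0])
```

In the LQG case, by certainty equivalence, the same gains are applied to the Kalman filter estimate of the state rather than to the state itself.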

The main difference between the LQ/LQG problems and the proposed formulation of online learning with supervised examples and regularization is the following. In our approach, both the cost and measurement matrices are random, being associated with the input examples, which become available as time goes on. It is worth mentioning that the randomness of some matrices in the context of the LQ optimal control problem was considered also in Bertsekas (1995), but in a way not directly applicable to the online learning problem investigated in this letter (see remark 7 in section 3.2 for further details). First, we consider a linear relationship between the input examples and their labels, possibly corrupted by additive noise, and collect into the state vector both the current estimate of the parameter vector modeling the input-output relationship and the parameter vector itself, which is unknown. Then we relax the linearity assumption and address a more general nonlinear context. The goal consists of finding an optimal online estimate of the parameter vector, on the basis of the information associated with the incoming examples, modeled in the simplest case as independent and identically distributed random vectors.3

Each decision stage corresponds to the presentation of one example to the learning machine, whereas one part of the convex per-stage cost penalizes quadratically the difference between the observed output and the one predicted by the learning machine, using the current estimate of the parameter vector. Causality in the updates (i.e., their independence of future examples, which is important for a truly “online” framework) is preserved by constraining the updates to depend only on the history of the observations and updates up to the decision time, as in the LQ/LQG problems.

At each stage, the error on future examples is taken into account through the conditional expectation of the summation of the associated per-stage costs, conditioned on the current information vector. In the stationary case, the link between the examples used for training and the future examples lies only in their common generation model. In order to reduce the influence of outliers on the online estimate of the parameter vector, its smoothness with respect to time is also enforced through the presence of a suitable regularization term in the functional to be optimized, weighted by a positive regularization parameter.4 This represents the second part of the convex and quadratic per-stage cost. The optimal solution is obtained by applying dynamic programming (DP) and requires the solution of suitable Riccati equations. The difference already noted between the classical LQ/LQG problems and the proposed online learning framework (i.e., the random nature of some matrices) leads to two different forms for such equations, one for the backward and one for the forward phase of DP. When the optimization horizon is infinite, it is necessary to take into account the random nature of the matrices to perform a convergence analysis of the online estimate of the unknown parameter vector.

Table 1 provides the correspondence between the notations used for optimal control and the ones used for the proposed online learning framework.

Table 1:
Some Correspondences between the Machine Learning Terminology and the Optimal Control One.
Update Control 
Updating function Control function 
Problem OLL (Online Learning) Optimal control problem 
Learning horizon Optimization horizon 
Learning functional Cost functional 
Average learning functional Average cost functional 

1.4  Relationships with Other Machine Learning Techniques

The approaches to online learning most closely related to this work are Kalman filtering (Sutton, 1992; see also Bertsekas, 1995, appendix E) and its kernel version (Ralaivola & d’Alché Buc, 2005; Liu et al., 2009), in which, however, no penalization is made directly on the control (updating) variables. Indeed, one of our contributions consists in developing a theoretical framework in which such a penalization is taken into account and in providing closed-form results in most cases. Interestingly, the obtained solutions can be interpreted as smoothed versions (with respect to time) of the solution obtained by applying the Kalman filter alone. Most important, we show, both theoretically and numerically, that our solutions are less sensitive to outliers than the Kalman filter estimates. This is very useful, for example, when one wants to give more importance to a whole set of recently presented examples than to the current example, in order to obtain estimates that change more smoothly with respect to time (smoothness of an estimate is a desirable property, e.g., in applications to online system identification and control, in which one also has to control the system just identified). The issue of a smaller sensitivity to outliers is especially important in the case of deviations from the gaussianity assumptions (which cannot be excluded a priori in practical applications) or of a mismatch between the true covariance matrices of the noise and the ones used to update the Kalman filter estimates. Other motivations for the presence in our formulation of a penalization of the updating variables are the numerical benefits due to the regularization term (a typical way of dealing with ill-conditioning through regularization approaches) and the better behavior obtained when considering the extension of the online learning problem to the case of a time-varying parameter vector. All of these issues are discussed later in the letter. It is also worth mentioning that various forms of regularization of the Kalman filter (and, in an offline context, of the Kalman smoother) have been proposed in the recent literature (e.g., Jay, Duvaut, Darolles, & Gouriéroux, 2011; Wang, Li, & Rizos, 2012; Li, Gui, Gu, Han, & Du, 2014; Liu & Chang, 2016; Khan, Bouaynaya, & Fathallah-Shaykh, 2014), where, however, an optimal control approach was not used.

For each decision stage, the updating formula that provides the solution to the proposed learning paradigm is similar to the ones of other online estimates obtained through various machine learning techniques such as stochastic gradient descent. However, there is a substantial difference: we derive it as the optimal solution of an optimization problem modeling online learning, and this allows us to prove various interesting properties. We believe that this approach could also be fruitfully applied to other machine learning techniques used in online learning.

A number of extensions of the basic problem formulation are also described in some detail at the end of the letter, showing the generality of our theoretical framework, highlighting the potential of applying optimal control theory to machine learning, and providing hints to stimulate further interdisciplinary research on this topic. A connection with neuroscience is also discussed.

1.5  Organization of the Letter

Section 2 is a nontechnical overview of the main results derived in the letter, written to allow readers who are not familiar with optimal control but work in the field of machine learning to appreciate the nature of our approach and its contributions. At the same time, it provides a summary of the main results of the letter. Section 3 introduces and solves the proposed model of online learning as an LQ optimal control problem with random matrices and finite optimization horizon and provides closed-form expressions for the optimal solutions in the LQG case. Section 4 extends the analysis to the infinite-horizon framework. Section 5 investigates convergence properties of the algorithm, whereas section 6 compares the proposed online approach with average regret minimization and the Kalman filter estimates, both theoretically and numerically. Section 7 extends the analysis to nonlinear models (kernel methods). Other extensions are described in section 8. Section 9 is a conclusive discussion. To improve readability, most technical proofs are collected in the appendix.

2  Overview of the Main Results

In the following, we summarize the main results with links to the parts of the letter in which they are presented:

We derive closed-form optimal solutions for the proposed optimal control formulation of online learning and for some of its extensions. They are expressed in terms of two Riccati equations (see section 3), associated, respectively, with the backward phase of DP (to determine the gain matrix of the optimal controller) and with the determination of the gain matrix in the Kalman filter (in the case of gaussian random vectors). Different from the LQG problem, the two Riccati equations have different natures: one involves expectations of random matrices (so it may be called an average Riccati equation), whereas the other involves realizations of random matrices (so it may be called a stochastic Riccati equation). As a consequence, a specific analysis, detailed in the letter, is needed to study the properties of their solutions, which confirms that the proposed problem is not a trivial application of the LQG problem to an online learning framework.

We analyze both theoretically and numerically the role of the regularization parameter (see section 3.3).

In the infinite-horizon case, we investigate the existence of a stationary (and linear) optimal updating function (see section 4), and stability issues, that is, the convergence to 0 of the mean square error between the parameter vector and its online estimate when the number of examples goes to infinity (see section 5). In this context, another nontrivial difference with respect to the classical LQG problem arises. When computing certain expectations conditioned on the current information vector, one has to take into account that the information vector at a generic stage has additional components deriving from the realizations of the output measurement matrices up to the stage itself (as these are random matrices, associated with the input examples). As a consequence, the Kalman gain matrix, which is shown in the letter to be embedded in the optimal solution, is not only stage dependent but also a random matrix (although it becomes deterministic when conditioned on the input examples already presented to the learning machine). This motivates the investigation of issues such as its convergence in probability and the convergence of its expectation when the number of examples goes to infinity.

We discuss the connection of the proposed online learning framework with average regret minimization. We prove that the sequence of our estimates minimizes the average regret functional (see section 6.1).

We investigate the connections among our solution, the Kalman filter estimate, and stochastic gradient descent (see remark 3 and sections 6.2 to 6.4). We prove that our solution can be interpreted as a smoothed Kalman filter estimate, with a time-varying gain matrix, and we show that it outperforms the latter in terms of its larger smoothness (section 6.3; see also item e of section 8) and its smaller sensitivity to outliers (section 6.4).

We discuss cases in which the proposed solution can be computed efficiently (see, e.g., the comments presented after proposition 2, remark 10, and the numerical results reported in section 6.3).

We address the case of nonlinear input-output relationships, modeled using the kernel trick of kernel machines (see section 7). As is well known, the kernel trick is based on a preliminary (in general nonlinear) mapping of the input space to a higher-dimensional feature space, to which the original linear model is applied in a second step. The kernel trick, which consists of computing certain inner products in the auxiliary feature space through a suitable function called a “kernel,” can be applied in our context since we show that the optimal solution can be expressed in terms of inner products in the feature space that can be computed through a kernel.

We describe various other extensions (see section 8), such as the case of a time-varying parameter vector to be learned, the introduction of a discount factor, the inclusion of additional regularization terms, a continuous-time extension framework, and a possible extension of the problem formulation through techniques from robust estimation and control.

Table 2 collects some acronyms of frequent use in the letter.

Table 2:
Acronyms of Frequent Use.
Problem OLL  Online learning problem over finite horizon N and regularization parameter 
Problem OLL  Online learning problem over infinite horizon and with regularization parameter 
OLL estimate  Estimate obtained solving problem OLL or OLL 
LQ  Linear quadratic 
LQG  Linear quadratic gaussian 
LQR  Linear quadratic regulator 
ARE  Average Riccati equation 
SRE  Stochastic Riccati equation 
KF  Kalman filter 
MSE  Mean square error 

3  The Basic Case: Discrete Time, Finite Horizon, and Linear Model

For simplicity, we consider first a discrete time setting with a finite learning horizon and a linear model. Then we address the extensions to an infinite learning horizon and nonlinear models.

3.1  Problem Formulation

Assumption 1
At each time a learning machine can observe the supervised pair , where is a column vector and . The output yk is generated from the input xk according to the following linear model:
formula
3.1
where is a measurement noise and is a random column vector, unknown to the learning machine, and to be estimated by the learning machine itself by using the sequence of examples as they become available.
Assumption 2

The random variables w, , are mutually independent5 and (only for simplicity of notation and without any loss of generality) have mean 0. The random variables have the same finite variance , and each xk has finite covariance matrix .

Assumption 3
Starting from the initial estimate of w, at each time the learning machine builds an estimate of w, generated according to
formula
3.2
where uk is the update of the estimate of w at the time k (to be optimized according to a suitable optimality criterion, defined later on).
Remark 1.

It is important to observe that model 3.1 is time invariant in the sense that the same w is used to generate every yk, starting from every xk and every .6 So once a realization of the random vector w has been generated, this can be interpreted as a fixed vector, to be estimated by the learning machine using the online supervised examples.
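As a minimal illustration of assumptions 1 to 3, the sketch below (Python) generates supervised pairs from the linear model 3.1 and applies the generic update 3.2; the particular updating rule used here is a plain gradient-type step with a hypothetical step size, standing in for the optimal updates derived in section 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200
sigma_eps = 1.0                      # standard deviation of the measurement noise
w = rng.normal(0.0, 2.0, size=d)     # unknown parameter vector (a fixed realization)

w_hat = np.zeros(d)                  # initial estimate of w
step = 0.05                          # hypothetical step size (placeholder rule only)
for k in range(N):
    x_k = rng.uniform(-1.0, 1.0, size=d)        # input example
    y_k = w @ x_k + sigma_eps * rng.normal()    # output label, as in model 3.1
    # Placeholder update u_k (a gradient step on the squared prediction error);
    # the letter instead derives u_k by solving problem OLL.
    u_k = step * (y_k - w_hat @ x_k) * x_k
    w_hat = w_hat + u_k                         # update of the estimate, as in 3.2
print(np.linalg.norm(w - w_hat))
```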

To analyze the time evolution of the estimate, one has to consider the following dynamical system (see Alessandri, Sanguineti, & Maggiore, 2002 for a similar approach), with state vector and initial conditions and :
formula
3.3
together with the measures
formula
3.4
where .
Assumption 4
The update uk at the time k is chosen as a function , called the updating function, of the information vector Ik at the same time, which collects the “history” up to the time k and is defined as
formula
for and
formula

Hence, the update uk depends only on the sequence of examples observed up to the current stage k and on the sequence of previous updates uj (or, equivalently, since , on the sequence of previous updates of the estimate of w).

In our model, the updating functions uk are chosen in order to minimize a learning functional over a finite learning horizon, defined as follows:

Definition 1
(learning functional over horizon N). Let N be a positive integer, , , and
formula
3.5

We state the following online learning problem (in this letter, the symbol “” is used to denote optimality):

Problem OLL

(online learning over a finite horizon).  Given the finite learning horizon N, the examples generated at each time instant according to the model defined by assumptions 1 and 2, and the learning machine defined by assumption 3, find the finite sequence of optimal updating functions with the structure defined by assumption 4 that minimizes the learning functional 3.5.

Problem OLL can be considered a parameter identification problem or an optimal estimation problem, as the final goal consists of estimating optimally the parameter vector w relating the input examples to their outputs, given the current subsequence of examples and the adopted optimality criterion. It can also be considered an optimal control problem, interpreting the updating function uk as a control function for the dynamical system 3.3. Although this last interpretation may seem less natural than the first two, it is motivated by the fact that problem OLL and its variations presented later in the letter can be investigated using optimal control techniques, as is done in the following.

For every , we shall call online estimate (OLL estimate, for short).

Remark 2.

The term in the learning functional, equation 3.5, penalizes a large deviation of the learning machine estimate of the label yk from its best estimate (in a mean square sense) , which would have been obtained if w were known, whereas the term penalizes a large square of the norm of the update uk of the estimate of w, and is a regularization parameter that measures the trade-off between the two terms.

Remark 3.
“Shifted” Kalman filter estimates correspond to the limit case in the formulation of problem OLL. Indeed, in such a case, each term
formula
in equation 3.5 is minimized when is the conditional expectation of wk given , that is, when it is the Kalman filter estimate of wk at time (see, e.g., Bertsekas, 1995).7 It is worth mentioning that since the parameter vector to be learned is constant and the data generation model is described by equation 3.1, the specific Kalman estimation problem is equivalent to recursive least squares (see Sayed, 2003 for a proof of this equivalence).

In sections 6.3 and 6.4, we discuss some relationships of the proposed learning framework with the classical Kalman filter (Sutton, 1992). As will be shown by proposition 8 and by the numerical results in Figure 1, the presence of the regularization term in the learning functional 3.5 can make the resulting sequence of optimal estimates of w smoother with respect to the time index, and less sensitive to outliers, than the sequence of estimates obtained by using the classical Kalman filter, under gaussian assumptions on the random variables w and .

Remark 4.

The constraint that each update uk (hence, also each optimal updating function ) depends only on the sequence of examples observed up to the current stage k and on the sequence of previous updates uj implies that no future examples are taken into account to update the current estimate of w. Hence, the proposed solution is actually a model of online learning. Instead, batch learning corresponds to the case where one assumes that the whole sequence of examples is known to the learning machine, starting from time .

Remark 5.
An alternative definition of the learning functional can be obtained by replacing the term in 3.5 by , that is, by the square of the difference between the label estimated by the learning machine before measuring yk (but knowing xk) and the label yk generated by model 3.1 at the time k (note that, different from the term , both are observable at time k). However, by taking expectations and recalling that has mean 0 and is mutually independent from xk, wk, and , one obtains
formula
3.6
Hence, since the last term in equation 3.6 is a constant, the learning functionals 3.5 and 3.6 have the same sequence of optimal updating functions. It is worth noting that in both formulas 3.5 and 3.6, in order to generate the estimates , one uses only the probability distribution of wk conditioned on the already available observations.
The statement of problem OLL can be simplified by defining the learning error,
formula
which evolves according to
formula
3.7
where
formula
Of course, means . Moreover, since both and xk are known at time k, one can replace the measures yk by
formula
hence obtaining the measurement equation
formula
3.8
where , and has the same variance as . In this case, the “history” of the learning machine, measures, and past updates up to time k is collected in the new information vector , defined as
formula
for and
formula
There is a one-to-one correspondence between the information vectors Ik and . So, the optimization of the learning functional 3.5 assuming that the dynamical system evolves according to equation 3.3, the sequence of measures is provided by equation 3.4, and the updating functions uk have the form , is equivalent to the optimization of the following learning functional:
formula
3.9
assuming that the error vector evolves according to equation 3.7, the sequence of measures is provided by equation 3.8, and the update uk is now a function of the information vector . Such a problem is a nontrivial variation of the classical LQ problem (Bertsekas, 1995). Whereas in the latter, the matrices Ck and Qk are deterministic, in the proposed formulation of online learning, they are random, since they depend on the input examples xk. Another difference is that for , the information vector includes the realizations of the inputs xj, hence also of the matrices Cj and Qj.

3.2  Solution of the Finite Horizon Online Learning Problem

To solve problem OLL, we make extensive use of the concept of the cost-to-go function from the theory of dynamic programming (Bertsekas, 1995). In our context, the cost-to-go function at the time stage is defined as
formula
3.10
whereas
formula
3.11
Finally, the optimal value of the learning functional equation 3.9, is
formula
Under mild regularity conditions (see remark 6), the cost-to-go functions can be determined recursively by solving the Bellman equations,
formula
3.12
for .
Remark 6.

The regularity conditions mentioned above are satisfied, for example, in the case (studied in this letter) where the random vectors w and are gaussian. Indeed, in such a context, the optimal updating functions that will be provided by equation 3.13 are linear with respect to the information vector (Bertsekas, 1995; Bertsekas & Shreve, 1978). In the following, these mild regularity conditions will be implicitly assumed throughout the letter.

Equations 3.11 and 3.12 are similar to those for the cost-to-go functions in the LQ problem (see, e.g., Bertsekas, 1995), with the difference that in the present context, the matrices Qk and Ck are random. Moreover, both matrices become known to the learning machine at time k, as they can be derived from the information vector . In the following, we sometimes use the superscript "" not only for the optimal updating functions but also to denote vectors (e.g., and ek) evaluated when the sequence of optimal updating functions, equation 3.13, is applied.

Proposition 1
(optimal updating functions and average Riccati equation).  Let assumptions 1 to 4 be satisfied. Then the updating functions that solve problem OLL are given, for , by
formula
3.13
where
formula
3.14
and the matrices
formula
3.15
formula
3.16
and
formula
3.17
are symmetric positive-semidefinite. The recursions above are initialized by
formula
3.18
and
formula
3.19

Equation 3.17 can be called an average Riccati equation (ARE) since it contains the expectation term . In practice, it can be solved offline, as can the classical deterministic Riccati equation of the linear quadratic regulator (LQR) subproblem (Bertsekas, 1995), simply by replacing Qk (which is deterministic in the LQ problem) by . It is worth remarking that solving the ARE, equation 3.17, does not require the knowledge of future input examples and that all the matrices Lk in equation 3.14 have spectral radius8 strictly smaller than 1. Finally, the matrices Fk are reported in formula 3.16 because they are used to express (see formula A.18 in the appendix). They are also used in the infinite-horizon version of problem OLL (problem OLL), to reduce one part of the proof of proposition 4 in section 4 to the finite-horizon case.
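To illustrate the offline character of the ARE, the following sketch runs a backward recursion of the standard LQR form, under the assumption that the state and control matrices are identities, the per-stage cost matrix is the input covariance E[xk xk^T] (for zero-mean inputs), and the control weight is the regularization parameter times the identity; the exact matrices in equations 3.14 to 3.19 should be taken from proposition 1, so this only mirrors the structure described above.

```python
import numpy as np

def average_riccati_gains(Sigma_x, eps, N, P_final=None):
    """Offline backward recursion of standard LQR form.

    Sketch only: identity state/control matrices are assumed, the per-stage
    cost matrix is taken as E[x_k x_k^T] = Sigma_x, and the control weight is
    eps * I; the letter's equations 3.14-3.19 give the exact recursion.
    """
    d = Sigma_x.shape[0]
    P = np.zeros((d, d)) if P_final is None else P_final
    gains = []
    for _ in range(N):
        L = np.linalg.solve(P + eps * np.eye(d), P)    # gain matrix at the current stage
        P = Sigma_x + P - P @ L                        # Riccati update of P
        gains.append(L)
    return gains[::-1]

Sigma_x = np.diag([1.0, 0.5, 0.25])        # E[x_k x_k^T] for zero-mean inputs
gains = average_riccati_gains(Sigma_x, eps=1.0, N=50)
# Largest spectral radius among the gains: strictly smaller than 1, as remarked above.
print(max(np.max(np.abs(np.linalg.eigvals(L))) for L in gains))
```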

Due to equation 3.13, in order to generate the optimal update at time k, one has to compute . Let us now make the following additional assumption:

Assumption 5

(gaussian random variables).  The random variables w and are gaussian.

The next proposition shows that when the additional assumption 5 is satisfied, the optimal estimate of the proposed framework tracks the (usually time-varying) Kalman filter estimate. Indeed, inspection of its proof shows that
formula
is the Kalman filter (KF) estimate of the error vector at the time k, based on the information vector , which yields a Kalman filter recursion scheme.
In the following, we denote by
formula
the KF estimate of at the time k, based on the information vector Ik (or, equivalently, on the corresponding information vector ). Moreover, for let
formula
3.20
be the (conditional) covariance matrix9 of ek, conditioned on the information vector , and
formula
3.21
the (unconditional) covariance matrix10 of e0, which is equal to the (unconditional) covariance matrix of w.
Proposition 2
(optimal online estimate and stochastic Riccati equation).  Let assumptions 1 to 5 be satisfied. Then
formula
3.22
where, for
formula
3.23
formula
3.24
and, for
formula
3.25
with the initializations
formula
3.26
formula
3.27
and
formula
3.28

Equation 3.25 has the form of the Riccati equation of the well-known Kalman filter. Due to the stochastic nature of Ck, it can be called a stochastic Riccati equation (SRE). From a computational point of view, solving equation 3.25 is easy even in a high-dimensional setting (i.e., when the dimension d of the input space is large). Indeed, (which needs to be inverted in equation 3.25) is a scalar. Similarly, in formula 3.24, one has to invert the scalar . In other applications of the Kalman filter, one has to invert matrices. Moreover, unlike the average Riccati equation, the stochastic Riccati equation has to be solved online.
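The computational point above can be made concrete with a sketch of a Kalman-type update for the data model 3.1, in which only a scalar is inverted at each step; this mirrors the structure of equations 3.23 to 3.25 rather than reproducing them exactly.

```python
import numpy as np

def kf_scalar_update(w_kf, Sigma, x, y, sigma2_eps):
    """One Kalman-filter-type update for the linear model y = w^T x + noise.

    Only the scalar x^T Sigma x + sigma2_eps is inverted, so the update stays
    cheap even when the dimension d of x is large; the SRE of equation 3.25
    shares this scalar-inversion structure.
    """
    s = float(x @ Sigma @ x) + sigma2_eps    # scalar innovation variance
    H = (Sigma @ x) / s                      # gain vector (d components)
    w_kf = w_kf + H * (y - w_kf @ x)         # update of the estimate
    Sigma = Sigma - np.outer(H, x @ Sigma)   # Riccati-type covariance update
    return w_kf, Sigma

# Usage example on synthetic data.
rng = np.random.default_rng(1)
d, sigma2 = 5, 1.0
w = rng.normal(size=d)                       # unknown parameter vector
w_kf, Sigma = np.zeros(d), 4.0 * np.eye(d)   # prior mean and covariance of w
for _ in range(500):
    x = rng.uniform(-1.0, 1.0, size=d)
    y = w @ x + np.sqrt(sigma2) * rng.normal()
    w_kf, Sigma = kf_scalar_update(w_kf, Sigma, x, y, sigma2)
print(np.linalg.norm(w - w_kf))              # small after many examples
```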

Remark 7.

It is worth mentioning that Bertsekas (1995) also investigates an LQ optimal control problem with random matrices. In that case, however, there is perfect information on the state, and the randomness is limited to the system dynamics. For that problem, a suitable average Riccati equation is obtained, but no stochastic Riccati equation. Hence, that formulation, though it inspired the work presented in this letter, cannot be applied directly to our online learning framework.

Remark 8.
Equations 3.13 and 3.23 show that the classical separation principle of control and estimation holds also for problem OLL. More precisely, the problem reduces to two subproblems, which can be solved independently: the determination of the matrices Lk (solution of the LQR subproblem) and the determination of the Kalman gain matrices Hk (solution of the Kalman filter estimation subproblem). One might wonder why, in problem OLL, one gets, instead of the classical Riccati equation, two different kinds of equations for the two subproblems, namely the ARE, equation 3.17, and the SRE, equation 3.25, in spite of the well-known duality between the LQR subproblem and the Kalman filter estimation problem (Söderström, 2002). The reason is that when moving from the LQR subproblem to the Kalman filter estimation subproblem, the roles of the matrices
formula
in the primal problem (i.e., the LQR subproblem) are played, respectively, by the following matrices of the dual problem (i.e., the Kalman filter estimation problem):
formula
where is the covariance matrix of the system noise (a kind of noise that is not present in model 3.7); hence it is an all-zeros matrix. Now, in our specific context, in the primal problem the matrix Qk is random, whereas in the dual problem the matrix is deterministic. Similarly, in the primal problem the matrix Bk is deterministic, whereas in the dual problem the matrix is random. This lack of symmetry is the reason that the two Riccati equations, 3.17 and 3.25, have different forms.

The next proposition states some properties of the solution to the SRE. For two symmetric square matrices S1 and S2 of the same dimension, means that is symmetric and positive-semidefinite. When it is evident from the context, we use the symbol 0 to denote a matrix whose elements are all equal to 0.

Proposition 3

(properties of the solution to the SRE).  Let assumptions 1 to 5 be satisfied. Then

  i. formula

     (i.e., the sequence is "nonnegative" and monotonic "nonincreasing" in a generalized sense, according to ) for all the realizations of the random matrices and .

  ii. For all the realizations of these random matrices,
     formula
     and
     formula
  iii. There exists a symmetric and positive-semidefinite matrix such that
     formula
  iv. If all the input examples xk have a common probability distribution with bounded support and the same positive-definite covariance matrix in such a way that
     formula
     for all k, then with a priori probability 1, one has
     formula
     When equation 3.34 holds, then
     formula
  v. For every and all the realizations of the random matrices,
     formula

An intuitive explanation of the second bound in equation 3.30 is the following. When the time index moves from k to , the new information acquired at the time cannot deteriorate, on the average, the quality of the KF estimate, which is in accordance with its optimality properties (Bertsekas, 1995). Equations 3.29, 3.30, and 3.36 will be used, together with equations 3.34 and 3.35, in the convergence analysis of the proposed method for (see section 4 and the appendix).

Remark 9.

An important assumption that is needed in the proof of proposition 3.iv is that the common covariance matrix of the input examples is positive-definite. When this is not the case, with probability 1 all the input examples belong to a finite-dimensional subspace of ; hence, with probability 1, it is not possible to extract from the input-output pairs any information about the component of w that is orthogonal to that subspace, unless such a component is correlated with the projection of w on . However, one still has the convergence of both the KF estimate and the OLL estimate of w to the projection of w on , as can be shown by setting the problem directly on . Moreover, the possible absence of information in the data about the component of w that is orthogonal to has no negative consequences on the estimation process, in the sense that in order to compute for a possibly unseen input x, one needs, with probability 1, to know only the component of w that belongs to the subspace .

3.3  Role of the Regularization Parameter

Let us investigate the behavior of the optimal updating functions provided by equations 3.13 and 3.14 for the two limit cases, and , and for intermediate values of .

3.3.1  The Case

The penalization of the update uk in the learning functional, equation 3.9, becomes negligible, and one obtains from equation 3.14, and
formula
3.37
from equation 3.13. Hence, one gets (from equations 3.7 and 3.37),
formula
Equivalently, in terms of the unknown vector w and its optimal estimates , at the times k and , respectively, one obtains
formula
hence
formula
which is just the KF estimate of w at the time k, based on the information vector Ik.

3.3.2  The Case

The penalization of the update uk in the learning functional 3.9 becomes larger and larger. Indeed, for large enough, one obtains from equation 3.14, and
formula
from equation 3.13. Hence, one gets
formula
and
formula

3.3.3  Intermediate Values of

In this case, the estimate enjoys convergence properties similar to the ones of the KF estimate , as illustrated numerically in Figure 1. Moreover, is a smoothed version of the estimate . The sequence of estimates is smoother and less sensitive to outliers than the sequence of estimates , as a large change in the estimate when moving from to is penalized by the presence of the term in the cost functional 3.9. This can be seen also from formula 3.22, as equation 3.14 implies that all the eigenvalues of the symmetric matrix Lk are inside the unit circle. A deeper investigation of these three issues (convergence, smoothness, and sensitivity to outliers) is made in sections 5, 6.3, and 6.4, respectively.

Figure 1:

A comparison between the components of the OLL estimate and the KF estimate . A three-dimensional case has been considered, with the realization , and online examples have been used to train the learning machine. The input examples have been generated with components mutually independent and uniformly distributed in , whereas the covariance matrix of w has been chosen to be diagonal with diagonal entries equal to 4, and the variance of the measurement noise is equal to 1, as is the regularization parameter .


4  LQG Learning Over an Infinite Horizon

To address the infinite-horizon case, we remove the final-stage cost (or, equivalently, we assume with probability 1, hence also with probability 1) and let (the precise formulation is provided later in this section).

Assumption 6
(identical distributions of the input examples).  The random variables are identically distributed and have the same positive-definite covariance matrix,
formula
for every Moreover, the common probability distribution has bounded support.
Due to assumption 6, the analysis has some similarities with the one for the optimal solution to the LQG problem performed, for example, in Bertsekas (1995). We denote by a symmetric and positive-definite square root of . As one can check directly from the definitions of reachability and observability11 (Antsaklis & Michel, 2007), the pair
formula
is reachable,12 whereas the pair
formula
is observable. Hence, one can apply proposition 4.1 of Bertsekas (1995), from which it follows that the ARE, equation 3.17, admits a stationary solution , associated with the two stationary matrices
formula
4.2
and
formula
(see equation 3.16). Moreover, by reversing the time indices in equation 3.17 (i.e., setting and ), the solution of the ARE
formula
4.3
(which is equivalent to equation 3.17), converges to for any initialization of the positive-semidefinite matrix P0, still by Bertsekas (1995, proposition 4.1).

The (average) learning functional over infinite horizon is defined as follows:

Definition 2
(average learning functional over infinite horizon).  Let , and
formula
4.4

Problem OLL

(online learning over infinite horizon).  Given the examples generated at each time instant according to the model defined by assumptions 1 and 2, and the learning machine defined by assumption 3, find the infinite sequence of optimal updating functions with the structure defined by assumption 4 that minimizes the average learning functional 4.4.

As with problem OLL, for every we shall call online estimate (OLL estimate, for short). In the following, we consider directly the LQG case, with identical distributions for the input examples.

Proposition 4
(optimal updating functions and ARE).  Let assumptions 1 to 6 be satisfied. Then the updating functions that solve problem OLL are given, for , by
formula
4.5
where L is defined in equation 4.2.13
Remark 10.
A nice feature of the AREs, equation 3.17 (with ) and equation 4.3 is that their common stationary solution can be easily expressed in terms of the eigenvalues/eigenvectors of the matrix , which can be useful from a computational point of view. Indeed, let us express as
formula
where U is a basis of orthogonal unit-norm eigenvectors of (hence, ) and is a diagonal matrix collecting the corresponding positive eigenvalues. We recall that satisfies
formula
that is,
formula
4.6
Now, has the same eigenvectors as . Hence, and also have the same eigenvectors, so is expressed as
formula
4.7
where is a suitable diagonal matrix, with positive eigenvalues. Due to equation 4.6, the diagonal elements and are related by
formula
Hence, by the positiveness of , one gets
formula
Similarly, the stationary matrix L can be expressed as
formula
4.8
where the elements of the diagonal matrix are
formula
4.9
A particularly simple case occurs when the matrix is diagonal; hence, one can choose the matrices U and as the identity matrix I, so and L are also diagonal, by formulas 4.7 and 4.8. Moreover, if is proportional to the identity matrix I, and L are also proportional to I. Finally, similar remarks also hold in the finite-horizon case for the matrices and Lk in case the matrices commute (this happens, for example, when all the matrices are equal to ).
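The following sketch computes a stationary solution and the corresponding gain through the eigendecomposition described in remark 10. It assumes that, in the eigenbasis of the input covariance, the stationary ARE decouples into scalar fixed-point equations of the form sigma = p^2/(p + eps), with gain eigenvalues p/(p + eps), which is what the standard LQR recursion with identity dynamics yields; the letter's equations 4.6 to 4.9 should be consulted for the exact relations.

```python
import numpy as np

def stationary_solution(Sigma_x, eps, iters=200):
    """Stationary ARE solution and gain via the eigendecomposition of Sigma_x.

    Sketch only: it assumes that, in the eigenbasis of Sigma_x, the stationary
    equation decouples into the scalar fixed-point equations
        p = sigma + p - p**2 / (p + eps),
    i.e., sigma = p**2 / (p + eps), with gain eigenvalues p / (p + eps).
    """
    sigmas, U = np.linalg.eigh(Sigma_x)
    p = np.zeros_like(sigmas)
    for _ in range(iters):                   # scalar Riccati iterations (a contraction)
        p = sigmas + p - p**2 / (p + eps)
    l = p / (p + eps)                        # eigenvalues of the stationary gain L
    return U @ np.diag(p) @ U.T, U @ np.diag(l) @ U.T

Sigma_x = np.diag([1.0, 0.5, 0.25])
P_bar, L = stationary_solution(Sigma_x, eps=1.0)
print(np.diag(P_bar), np.diag(L))            # both diagonal here, since Sigma_x is diagonal
```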

5  Convergence Properties of the Online Learning Estimates in Terms of Mean Square Errors

Let us denote by
formula
the mean-square error of the KF estimate at time k. Different from the LQG case detailed in Bertsekas (1995), the expectation of is needed in the following, as depends on the sequence of random matrices . Under the assumptions of proposition 3.iv, this converges to the 0 matrix as k tends to (see formula 3.34).
Proposition 5

(convergence of the MSE of the KF estimate).  Let assumptions 1 to 6 be satisfied. Then the following hold:

  i. formula

  ii. For every
     formula
     where c1 is a positive constant such that with a priori probability 1. Moreover, .
  iii. formula
  iv. Every element of Hk converges to 0 also in probability, that is, for every ,
     formula

Note that the upper bound in proposition 5.ii provides, for the convergence to 0 of the mean square error of the KF estimate of w at the time k, a rate of order . As for proposition 5.iv, an intuitive explanation is the following. As the parameter w to be learned does not change in time, after a sufficiently large number of "good" examples, the learning machine has practically learned w, and future examples are practically not needed (of course, this holds in the case, considered so far, in which the parameter w does not change with time; see section 8 for a relaxation of this assumption).

Now, let
formula

denote the mean square error of the OLL estimate at time k. The next proposition provides a recursion to compute and bound from above and states its convergence to 0. We refer to remark 41 in the appendix for a possible way to derive estimates of the associated rate of convergence.

Let
formula
and denote by
formula
5.4
the (unconditional) covariance matrix of . Moreover, we denote by
formula
5.5

the (unconditional) covariance matrix of .

Proposition 6

(convergence of the MSE of the OLL estimate).  Let assumptions 1 to 6 be satisfied. Then the following hold:

  • formula
  • formula
  • Under the assumptions made in section 4, one has
    formula

6  Comparisons with Other Machine Learning Techniques

In this section, we present some connections and comparisons between the solutions to our learning paradigm and machine learning techniques such as average regret minimization (section 6.1), stochastic gradient descent (section 6.2), and Kalman-based estimates (sections 6.3 and 6.4).

6.1  Connections with Average Regret Minimization

As in the finite-horizon case, the minimization of the average learning functional, equation 4.4, is equivalent to the minimization of the alternative learning functional
formula
6.1
since equation 6.1 is just equal to
formula
6.2
(see formula 3.6). We now consider the limit case , denoting by the corresponding average learning functional. Then, observing that the following equality holds (due to the independence of the disturbance noises ),
formula
6.3
we obtain
formula
6.4
which can be interpreted as an average regret functional (Zinkevich, 2003). Under the assumptions of proposition 4 (with ), the sequence of KF estimates minimizes the functional , hence also the average regret functional, equation 6.4. Moreover, the minimum average regret is 0 because, by proposition 5.ii, one has
formula
6.5
then one gets also
formula
6.6

Nevertheless, the next proposition shows that, for any , the sequence of OLL estimates also minimizes the average regret functional 6.4.

Proposition 7.

For any , under the assumptions of proposition 4, the sequence of OLL estimates minimizes the average regret functional, equation 6.4.

6.2  Connections with KF and Stochastic Gradient Descent

At each time , our OLL estimate of the parameter vector w associated with the data generation model has the following recursive form (see, e.g., the statement of proposition 2):
formula
6.7
where Lk is a suitable square matrix and is the Kalman filter estimate of w at time k, based on the vector Ik that collects all the information available to the learning machine up to time k. Hence, it follows from equation 6.7 that our estimates are obtained from the Kalman filter estimates through an additional smoothing step. The form of equation 6.7 is similar to the ones of other online estimates obtained through various machine learning techniques, such as stochastic gradient descent (Spall, 2003). However, there is a substantial difference: we derive equation 6.7 as the optimal solution of a suitable optimal control/estimation problem, showing various interesting consequences of that in the letter, made possible by our use of optimal control/estimation techniques. We believe that this approach could also be fruitfully applied to other machine learning techniques used in online learning. Moreover, we offer a principled way to construct the matrix Lk in equation 6.7 as the solution of a suitable Riccati equation.
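The following sketch illustrates the smoothing interpretation of equation 6.7 on synthetic data with occasional outliers: a Kalman filter estimate is computed with the usual scalar-inversion update, and a smoothed estimate tracks it through a gain matrix. For simplicity the gain is a hypothetical constant matrix; in the letter, Lk is obtained from the average Riccati equation, so this only conveys the qualitative behavior.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, sigma2 = 3, 300, 1.0
w = rng.normal(0.0, 2.0, size=d)              # unknown parameter vector

w_kf, Sigma = np.zeros(d), 4.0 * np.eye(d)    # Kalman filter estimate and covariance
w_oll = np.zeros(d)                           # smoothed (OLL-type) estimate
L_gain = 0.2 * np.eye(d)                      # hypothetical constant gain; the letter derives L_k from the ARE

kf_steps, oll_steps = [], []
for k in range(N):
    x = rng.uniform(-1.0, 1.0, size=d)
    # Occasional outlier in the measurement noise.
    noise = 10.0 if k % 50 == 0 else np.sqrt(sigma2) * rng.normal()
    y = w @ x + noise

    # Kalman filter update (only a scalar is inverted).
    s = float(x @ Sigma @ x) + sigma2
    H = (Sigma @ x) / s
    w_kf_new = w_kf + H * (y - w_kf @ x)
    Sigma = Sigma - np.outer(H, x @ Sigma)

    # Smoothed update in the spirit of equation 6.7: move part of the way toward the KF estimate.
    w_oll_new = w_oll + L_gain @ (w_kf_new - w_oll)

    kf_steps.append(np.linalg.norm(w_kf_new - w_kf))
    oll_steps.append(np.linalg.norm(w_oll_new - w_oll))
    w_kf, w_oll = w_kf_new, w_oll_new

# The smoothed estimate changes much less abruptly at the outlier stages.
print(max(kf_steps), max(oll_steps))
```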

6.3  Outperformance with Respect to KF in Terms of Smoothness

For simplicity, we consider the finite-horizon case. The extension to the infinite-horizon case can be performed by a limit process, as in section 4. The next proposition shows that the OLL estimates are smoother than the KF estimates in the sense that the value of
formula
when and the updates are generated by equation 3.13 is smaller than or equal to the corresponding value obtained when the updates are generated by equation 3.37 with replaced by = (i.e., in the limit ). The limit problem obtained when tends to 0 is just the Kalman estimation problem, whose optimal sequence of updates is
formula
where
formula
Proposition 8.
Let assumptions 1 to 5 be satisfied. Then
formula
The simulation results shown in Figure 2, which refer to a setup similar to the one of Figure 1, are in line with the result of proposition 8. The figure suggests that for every , the stronger result,
formula
6.8
may also hold.
Figure 2:

A comparison between the empirical averages of the square l2-norms of the vectors of updates used to generate, respectively, the OLL estimates and the KF estimates . The parameters are the same as in Figure 1, apart from w, which is generated according to a gaussian distribution, with mean and covariance matrix . The empirical averages of the square l2-norms have been computed by considering 10,000 independent simulations.


Finally, Figures 3 and 4 show that both approaches are also suitable for parameter vectors of much larger dimension. Indeed, they refer to the case of and online examples. Figure 3 reports the l2-norm of the error vector associated with the KF estimate and with the OLL estimate, respectively, at the generic stage k, for a single realization of the sequence of the input-output examples. The running time of such a simulation (whose code was written in Matlab R2013, as for all of the other simulations) was about 29 seconds, on a notebook with a 1.40 GHz CPU and 4 GB of RAM. Figure 4 reports a similar comparison for the l2-norm of the update vector, for the same realization. It is worth noting that for a better comparison of the two figures, l2-norms and not square l2-norms have been considered therein (in Figure 2, the square l2-norm of the update vector was considered, for its relationship with proposition 8). To compare the two approaches, various choices for the regularization parameter have been considered in Figures 3 and 4 (see the captions for details). In more detail, Figure 3 shows that for this case of a time-invariant parameter vector, the KF estimate and the OLL estimates have a similar error (in a root mean square sense, slightly smaller for the KF estimate, which is, however, much less smooth with respect to the time index k). Moreover, the OLL estimates have much smaller norms of the update vectors (in Figure 4, they are practically negligible compared to the ones associated with the KF estimate, for all the values considered for the regularization parameter). In item e.3 of section 8, we show that for the case of a slowly time-varying parameter vector, the OLL estimate can achieve a much smaller error than the KF estimate, under a suitable periodic reinitialization of the matrices (see Figure 8).

Figure 3:

For a setup similar to the one of Figure 2 but with , online examples, a single realization of the sequence of input-output examples and three choices () for the regularization parameter: l2-norm of the error vector associated with the KF estimate and OLL estimate at the generic stage k. The square root of the mean square l2-norm of the error vector (averaged with respect to all the stages) is about 14.4134 for the KF estimate and, respectively, 14.6413, 14.8192, and 15.0421 for the OLL estimates with .


Figure 4:

For the same setup and the same realization of the input-output examples as in Figure 3: l2-norm of the update vector used to generate, respectively, the KF estimate and the OLL estimate at the generic stage k.


6.4  Outperformance with Respect to KF in Terms of Sensitivity to Outliers

Here we further compare numerically the KF and OLL estimates, now in terms of their different sensitivity to outliers. To this end, we periodically alter the output data perturbation model, choosing the disturbance to be equal to a positive constant z1 when k is a multiple of some positive integer Z and equal to a negative constant z2 otherwise. The two constants z1 and z2 are chosen in such a way that the empirical mean (over any time window of duration Z) of the s is 0 (i.e., the condition is imposed), and the empirical variance (over the same time window) is (i.e., the condition is imposed). Hence, and are obtained. Moreover, for a fair comparison with the KF estimate, this modified assumption on the output data perturbation model is not included in the optimization problem producing the OLL estimates.14 In other words, that knowledge is not provided to the learning machine.
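For reference, the following short computation derives the two constants under our reading of the two elided conditions, namely zero empirical mean and empirical variance equal to the nominal noise variance over each window of length Z; this is an assumption about the exact conditions, not a statement taken from the letter.

```python
import numpy as np

def outlier_constants(sigma_eps, Z):
    """Constants z1 > 0 and z2 < 0 of the periodic perturbation model.

    Assumed (hypothetical) reading of the two conditions:
      zero empirical mean over a window of length Z:  z1 + (Z - 1) * z2 = 0,
      empirical variance equal to sigma_eps**2:       (z1**2 + (Z - 1) * z2**2) / Z = sigma_eps**2.
    """
    z1 = sigma_eps * np.sqrt(Z - 1.0)
    z2 = -sigma_eps / np.sqrt(Z - 1.0)
    return z1, z2

z1, z2 = outlier_constants(sigma_eps=1.0, Z=10)
window = np.array([z1] + [z2] * 9)                  # one window of length Z
print(z1, z2, window.mean(), (window**2).mean())    # checks the two conditions
```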

The numerical results reported in Figure 5 show clearly the much smaller sensitivity to outliers of the OLL estimates with respect to the KF ones (details about the parameter choices are reported in the figure caption). This is ultimately due to the larger smoothness of the OLL estimates with respect to the time index.

Figure 5:

For a setup similar to the one of Figure 1, but choosing , , examples, the diagonal entries of equal to 10, and the disturbance equal to when k is a multiple of , otherwise equal to : comparison between the components of the OLL estimate and of the KF estimate .


An additional theoretical motivation for the smaller sensitivity to outliers of our OLL estimates is obtained by an inspection of formula 3.22 in proposition 2. Limiting the analysis, for simplicity, to the first OLL updates of the parameter vector, it follows by that formula and by that and , where is the first KF update, which is influenced only by the first example presented to the learning machine. Since (see formula 3.14), it is evident that the OLL estimate is less influenced by the presence of a possible outlier. Moreover, such an influence decreases as the regularization parameter increases, since, by formula 3.14, the larger is, the smaller is. Figure 6 confirms this advantage of the OLL estimates, showing that the l2-norms of the vectors of updates used to generate the OLL estimates are typically much smaller than the l2-norms of the vectors of updates used to generate the KF estimates. An additional significant advantage of the OLL estimates in the presence of time-varying parameter vectors is detailed in item e of section 8.

Figure 6:

For the example in Figure 5: comparison between the l2-norms of the vectors of updates used to generate, respectively, the OLL estimates and the KF estimates . (Compared to Figure 2, the l2-norm is considered here instead of its square for graphical reasons to reduce the difference in range between the two graphs.)

7  Nonlinear Models of Data Generation and Application of Kernel Methods

An interesting extension of the model investigated in the previous sections is obtained by mapping the input data xk preliminarily to another Euclidean space, then applying the model in the new input space. More precisely, one introduces a (possibly nonlinear) mapping into E, where E is a Euclidean space of dimension dE, possibly larger than d (or even infinite). Then equation 3.1 becomes
formula
7.1
where the parameter vector w now belongs to E. In this case, one can still apply all the techniques described in the letter taking E as the new input space. Of course, in doing this, the dimensions of some matrices would in general change (typically, they would increase); for instance, in case of a finite dE, the matrices , L, and would become matrices, whereas the Kalman gain matrix Hk would become a matrix. In case of an infinite-dimensional Hilbert space E, they would be replaced by suitable infinite-dimensional linear operators.

Interestingly, as we show below, when performing such an extension, one can apply the so-called kernel trick of kernel machines (Cristianini & Shawe-Taylor, 2000). More precisely, we show some circumstances under which, for every (possibly unseen) input x, one can express both the prediction associated with the KF estimate and the one associated with the OLL estimate in terms of inner products between the image of x under the mapping and the images of input examples xj already seen by the learning machine. Hence, if one is able to express such inner products in a simple way (e.g., through a symmetric kernel function), one can compute both predictions even without knowing explicitly the expression of the mapping.
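As an informal illustration of this point, the following sketch (ours, not the letter's) shows how such a prediction can be computed from kernel evaluations alone, once the estimate is known to be a linear combination of the mapped, already seen inputs; the coefficient names alphas and the Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, x_prime, gamma=0.5):
    # One possible kernel expressing the inner product of mapped inputs.
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def predict(alphas, seen_inputs, x, kernel=gaussian_kernel):
    # If the estimate is a linear combination of mapped seen inputs, its
    # prediction on a new x needs only kernel evaluations, never the mapping.
    return sum(a * kernel(x_j, x) for a, x_j in zip(alphas, seen_inputs))

seen_inputs = [np.array([0.0, 1.0]), np.array([1.0, -1.0])]
alphas = [0.3, -0.7]   # assumed given, e.g., by the coefficients in formulas 7.3-7.5
print(predict(alphas, seen_inputs, np.array([0.5, 0.5])))
```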

Remark 11.
As an example of a mapping and its associated kernel , we consider the case and the feature mapping , defined as
formula
where and are the two components of the vector x. Then, given any two input vectors , the inner product is expressed as
formula
which is the so-called homogeneous polynomial kernel (Cristianini & Shawe-Taylor, 2000) of order 2, whose evaluation involves only computations to be performed in the original input space .
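A minimal numerical check of this remark, under the assumption (ours) that the cross term of the feature mapping carries the standard factor sqrt(2):

```python
import numpy as np

def phi(x):
    # Explicit feature map whose inner products reproduce the order-2
    # homogeneous polynomial kernel on R^2 (sqrt(2) normalization assumed).
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def poly2_kernel(x, x_prime):
    # Evaluation performed entirely in the original input space R^2.
    return np.dot(x, x_prime) ** 2

x, x_prime = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(x_prime)), poly2_kernel(x, x_prime))  # both equal 1.0
```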

We first consider how to compute the kernel version of the KF estimate, then that of the OLL estimate. We make the following assumption.

Assumption 7
(covariance matrix of the measurement noise).  Let
formula
7.2
where and denotes the (matrix associated with the) identity operator on E.

The results presented in the next proposition for the kernel version of the KF estimate are essentially the same as the ones obtained in Liu et al. (2009, theorems 2 and 3), which also shows how to express the linear combinations inside such equations through an application of the matrix inversion lemma (see, e.g., Sayed, 2003). However, their extension to the kernel version of the OLL estimate, provided in proposition 34, is novel. In order to improve their readability, in the next formulas (equations 7.3–7.5 and 7.9–7.12), only the functional form of the right-hand side is provided.

Proposition 9.
Let assumptions 130 be satisfied for the kernel version of the KF estimate (i.e., with every xk replaced by ). Then, for every
formula
7.3
formula
7.4
and
formula
7.5
In case of a finite-dimensional space E, the convergence analysis is exactly the same as the one in proposition 25, and a similar (even though more technical) analysis is expected to hold for the infinite-dimensional case. Finally, in case the (matrix associated with the) covariance operator
formula
7.6
is only positive-semidefinite but not positive-definite, one could still follow remark 19 to prove the convergence of the estimate on the subspace on which the input data lie with probability 1. Such a subspace could be estimated, for example, by an application of kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998). Moreover, one could even redefine the problem, taking that subspace as the new input space, which makes the operator positive-definite when restricted to it.
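A possible, hedged way to carry out this estimation step is sketched below using scikit-learn's KernelPCA; the library choice, the use of the order-2 homogeneous polynomial kernel of remark 11 (kernel="poly", degree=2, gamma=1, coef0=0), and the synthetic data are our own illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
v = np.array([1.0, 2.0])
# Inputs confined to a line through the origin, so their images under the
# order-2 polynomial feature map span a one-dimensional subspace of E.
X = rng.standard_normal(200)[:, None] * v

kpca = KernelPCA(n_components=3, kernel="poly", degree=2, gamma=1.0, coef0=0.0)
coords = kpca.fit_transform(X)
# Only the first component has non-negligible variance, revealing the
# (approximate) subspace on which the mapped input data lie.
print(coords.var(axis=0))
```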

After dealing with the kernel version of the KF estimate of w, we now investigate the kernel version of its OLL estimate. We make the following assumption:

Assumption 8

(covariance operator).  Let one of the following hold:

  • The covariance operator has the form
    formula
    for some .
  • formula

    where lU is a given positive integer, and are some unsupervised examples (assumed here for simplicity to be available to the learning machine starting from the time ).

Remark 12.

Assumption 8.i refers to a particularly simple model for the covariance operator, which is relaxed in assumption 8.ii, where the operator is modeled by an empirical estimate obtained using the unsupervised examples.
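A hedged sketch of the empirical estimate underlying assumption 8.ii, written for the finite-dimensional case so that the operator can be formed explicitly (in the genuinely kernel case only inner products between mapped examples would be needed); the 1/lU normalization and the absence of centering are our assumptions.

```python
import numpy as np

def empirical_covariance(phi, X_unsup):
    # Empirical second-moment estimate built from l_U unsupervised examples
    # mapped through phi (a d_E x d_E matrix in the finite-dimensional case).
    Phi = np.array([phi(x) for x in X_unsup])   # l_U x d_E matrix of mapped examples
    return Phi.T @ Phi / len(X_unsup)

phi = lambda x: np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])
X_unsup = np.random.default_rng(0).standard_normal((100, 2))
print(empirical_covariance(phi, X_unsup).shape)  # (3, 3)
```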

Proposition 10.
i. Let assumptions 130, and 8.i be satisfied for the kernel version of the KF estimate (i.e., with every xk replaced by ). Then for every
formula
7.9
and
formula
7.10
ii. If, instead, assumption 8.ii is used, then for every
formula
7.11
and
formula
7.12
Remark 13.

A significant advantage of the representation 7.11 over the representations 7.4 and 7.9 is that the vector has dimension at most lU.

Remark 14.

More generally, the unsupervised examples in assumption 8.ii could be previously seen input data, preferably not used by the learning machine in combination with labels, to help avoid overtraining. So their number could grow as the learning machine acquires examples. Of course, after adding new empirical data to the estimate 7.8, one could also update the matrix Lk accordingly (or, in the infinite-horizon case, the stationary matrix L), as discussed in item d.1 of section 8.

We conclude this section by mentioning that, in the nonlinear case, unlike techniques such as the extended KF (Grewal & Andrews, 2001), the kernel version of the OLL estimate has the advantage of solving an optimal control (or optimal estimation) problem. Other approaches to online learning with kernels are described, for example, in Diethe and Girolami (2013).

8  Extensions

In this section, we illustrate some other extensions of the proposed OLL estimation scheme investigated in the letter.

8.1  Extension a: Nonzero Mean of xk

In this case, no significant change in the analysis is required. The only difference is that and are now correlation matrices instead of covariance matrices.

8.2  Extension b: Nonzero Mean of w

Propositions 25 and 26 still hold if , and the KF estimate and the OLL estimate are initialized, respectively, by
formula
and
formula
8.1
(notice that two different initialization indices have been used for the two estimates: the first subscript has been used to denote the a priori KF estimate, the one obtained before the presentation of the first example, whereas the subscript 0 has been used for the initialization of the OLL estimate).15 Indeed, in such a case, one obtains expressions similar to those in the appendix for the matrices in equation A.48 and for the matrices in equations 5.4, 5.5, A.65, and A.66, respectively, and the same equation, A.73, which is used there to obtain the convergence result, equation 5.6, through an analysis of the convergence as k tends to infinity.
Remark 15.

The case is important in practice, and, among other situations, it models the one in which, after a certain number of measurements, the time index k is shifted to the left (i.e., k is replaced by a shifted index or, equivalently, one reformulates problem OLL using the shifted time instead of 0 as the initial index in the summation of its objective, equation 3.5) and the knowledge derived from the previous estimates (i.e., the ones up to the time of the shift) is used to generate the initialization term. This is actually a posteriori knowledge, since it summarizes the knowledge deriving from the previous estimates, but it becomes the new a priori knowledge for the new problem with modified starting index in the summation.

Similar convergence results are obtained even if one replaces equation 8.1 with
formula
8.2
Indeed, in such a case, the convergence analysis made in the appendix is still valid. The only difference is that now one has a different initialization term, but in an infinite-horizon setting, that term also tends to 0 exponentially fast as k tends to infinity, due to equation A.64 in the appendix with Lj replaced by L, since the matrix I + L has spectral radius smaller than 1.

8.3  Extension c: Introduction of a Bias in the Model

Instead of the measurement equation 3.1, one could consider
formula
8.3
where b is an additional parameter to be learned using the sequence of examples available online. This case can be reduced to equation 3.1 by replacing the input vector xk with its augmented version (obtained by appending a constant component equal to 1) and the parameter vector w to be learned with its augmented version (obtained by appending the component b). As the last component of the new input vector has nonzero mean, one is also reduced to case a above. Moreover, the assumption of positive-definiteness of the correlation matrix of the augmented input vector is satisfied automatically if it is satisfied by the covariance matrix of xk.
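A minimal sketch of this reduction (ours, for illustration only): the affine model with bias b coincides with a linear model on the augmented input.

```python
import numpy as np

def augment(x):
    # Append a constant component equal to 1 to the input vector.
    return np.concatenate([x, [1.0]])

x_k = np.array([0.3, -1.2])
w, b = np.array([2.0, 0.5]), 1.5
w_aug = np.concatenate([w, [b]])           # augmented parameter vector
# The affine model w.x + b equals the linear model w_aug . augment(x).
print(np.dot(w, x_k) + b, np.dot(w_aug, augment(x_k)))
```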

8.4  Extension d: More Complex Models for the Measurement Errors

The measurement errors could have nonzero means, nonidentical distributions, and/or not be mutually independent. The first two cases can be dealt with in a straightforward way. Indeed, in the first case, one has only to subtract the expectation of the measurement noise from the measurement yk before presenting it as an input to the KF,16 while in the second case, one has to insert an additional time index k in the noise variance, using terms of that form in the Kalman filter recursion scheme 3.25 and in the Kalman gain matrix 3.24. Finally, in the correlated case, one could model the measurement noise as the output of an auxiliary uncontrolled linear dynamical system, which receives mutually independent noises as inputs. In this case, when the horizon tends to infinity, the convergence of the solution of the ARE to a stationary solution could be more difficult to prove (or such a convergence might not hold at all), since the reachability condition needed for the application of Bertsekas (1995, proposition 4.1) would be violated.

8.5  Extension e: Time-Varying Models for Some Parameters

When solving the “shifted version” of problem OLL that uses the shifted time as the initial index (see remark 37), one could exploit, for some of its parameters, time-varying models (which could also be estimated online), including the cases of:

  1. Slowly time-varying covariance matrices of the input examples xk

  2. Slowly time-varying variances of the measurement noises

  3. A slowly time-varying parameter vector in the data-generation model, equation 3.1.

About issue e.1, we notice that the covariance matrices may be estimated online from the already observed input data17 xk, not only when such covariance matrices do not depend on the time index (as assumed in the basic infinite-horizon version of the problem), but also under the assumption that they are slowly time varying (in such a case, one could give more weight in such estimates to the most recent input data using, for example, a forgetting factor; a minimal sketch of such an update is reported below). Now let us suppose that one is solving the infinite-horizon version of problem OLL under the assumptions of section 4, replacing the “common”18 covariance matrix of the future input data by its initial estimate. However, let us also suppose that the estimate of that covariance matrix derived from the previous input data becomes, at some time, significantly different from the initial estimate (due, e.g., to slow changes with respect to time of the covariance matrices, or simply due to a possibly bad initial estimate). Then, using the updated estimate as the model for the “common” covariance matrix of the future input data, one could also update the stationary matrix L of the proposed optimal controller, replacing the initial estimate by the updated one in the average Riccati equation, equation 3.17, looking for its (new) “stationary” solution, and hence deriving the (new) “stationary” matrix L from equation 4.2. Such new solutions will remain “stationary” as long as the future estimates of the “common” covariance matrix of the future input data do not change significantly with respect to the current one.
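A minimal sketch of the forgetting-factor update mentioned above; the update rule, the symbol lam, and all numerical values are our assumptions rather than prescriptions of the letter.

```python
import numpy as np

def update_covariance_estimate(Sigma_hat, x_k, lam=0.98):
    # Exponentially weighted estimate of the (slowly time-varying)
    # second-moment/covariance matrix of the inputs; lam in (0, 1) is the
    # forgetting factor giving more weight to recent inputs.
    return lam * Sigma_hat + (1.0 - lam) * np.outer(x_k, x_k)

rng = np.random.default_rng(0)
d = 3
Sigma_hat = np.eye(d)                       # initial estimate
for _ in range(500):
    x_k = rng.standard_normal(d)
    Sigma_hat = update_covariance_estimate(Sigma_hat, x_k)
print(Sigma_hat)                            # close to the identity for this data
```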

Concerning issue e.2, one may detect changes in the measurement-noise variance from the already observed input/output data, if one assumes that it is also slowly time varying. In particular, such changes would be easily detected if, inside that set, the learning machine has at its disposal some pairs with similar values of xk, or even several groups of such pairs, each characterized by similar values of xk. In this way, one could generate, inside the ith such group,
formula
the auxiliary data matrix , whose element is defined as
formula
and has a very small dependence on w, since
formula
and
formula
Then one could use such matrices to estimate the variance of the measurement noise, giving more weight to the most recent of such measures. The obtained estimate would then be used as the common variance in the shifted version of problem OLL that uses the shifted time as the initial index, presented in section 4.
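The following hedged sketch illustrates the idea for a single group of examples sharing (nearly) the same input: pairwise differences of the outputs depend only weakly on w, and their empirical spread estimates the noise variance. The factor 1/2 and the omission of the weighting of recent measures are our simplifying assumptions.

```python
import numpy as np

def noise_variance_from_group(y_group):
    # Var(eps_i - eps_j) = 2*sigma^2 for independent noises, hence the 1/2.
    y = np.asarray(y_group)
    diffs = y[:, None] - y[None, :]               # auxiliary matrix of output differences
    off_diag = diffs[~np.eye(len(y), dtype=bool)]
    return off_diag.var() / 2.0

rng = np.random.default_rng(0)
y_group = 3.7 + rng.normal(0.0, 1.0, size=50)     # same input (so same w.x), noise std 1
print(noise_variance_from_group(y_group))         # approximately 1.0
```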

Finally, about issue e.3, we show that a minor modification of the proposed OLL model can learn even a time-varying parameter vector w. We first focus on the case in which, at some time, the parameter vector w changes, then remains fixed. In this case, the mean square error of the estimate of the new parameter vector still converges to 0 as k tends to infinity (see formula 5.6, together with item b in this section). However, the trace of the error covariance matrix may already be extremely small at the time of the change (see equation 5.1, combined with Markov’s inequality), making the traces of the subsequent covariance matrices extremely small as well. Since such matrices are used to construct the Kalman gain matrices Hk, the convergence to 0 of the error with respect to the new parameter vector may be extremely slow, for both the KF and the OLL estimates. This issue could be solved in both cases by a reinitialization, at the time of the change, of the covariance matrix to the a priori covariance matrix. More generally, a periodic reinitialization could be used to track a periodically (or continuously) changing parameter vector w. In this case, the OLL estimate would have the advantage, with respect to the KF estimate, of changing more slowly in time, making it more suitable for a slowly time-varying parameter vector. The next figures provide more insight into this last issue.

Figures 7 and 8 refer to a parameter vector that changes periodically, the change in its components being small and random. In Figure 7, there is no reinitialization of any of the covariance matrices to the a priori one, and the convergence of both the KF estimate and the OLL estimate to the new parameter vector is slow. Figure 8 instead refers to the case in which (the change in the parameter vector being the same) there is a periodic reinitialization of the covariance matrices to the a priori one; this second period has been chosen to be different from the one according to which the parameter vector changes, just to avoid giving the learning machine the advantage of knowing when the parameter vector changes, thus modeling the more realistic situation in which this knowledge is not available to the machine. In this case, both estimates are able to track the time-varying parameter vector w in a better way, but the OLL estimate is smoother than the KF estimate due to the presence of the regularization parameter. So in this context, the OLL estimate is preferable to the KF estimate if the parameter vector changes slowly with time. Figures 9 and 10 show why this happens: when there is no reinitialization of the covariance matrices, the Frobenius norm19 of the Kalman gain matrix Hk is expected to be small for k large (see formulas 3.24 and 5.1), which is confirmed by Figure 9. Hence, although the norm of the error tends to increase when the parameter vector changes, the KF estimate of w at time k is not affected much by this change (see formula 3.23), and hence the OLL estimate also does not change much (see formula 3.22). Instead, a reinitialization tends to make the Frobenius norm of the Kalman gain matrix Hk much larger (which is confirmed by Figure 10), and this amplifies the effect of the larger error norm (due to the change in the parameter vector w) on the KF estimate of w at time k. For the OLL estimate, the change is expected to be smaller due to the smoothing effect of formula 3.22. So, like the KF estimate, the OLL estimate is also able to track the change in the parameter vector, but with smoother behavior.
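To illustrate the reinitialization mechanism, here is a hedged sketch of a Kalman-filter recursion for the static measurement model with a periodic reset of the error covariance to its a priori value P0; it is not the letter's exact recursion (in particular, the OLL smoothing of the updates via formula 3.22 is not reproduced), and the reset period and numerical values mirror the figures only loosely.

```python
import numpy as np

def kf_with_reset(X, y, P0, sigma2, reset_period):
    d = X.shape[1]
    w_hat, P = np.zeros(d), P0.copy()
    estimates = []
    for k, (x_k, y_k) in enumerate(zip(X, y), start=1):
        if k % reset_period == 0:
            P = P0.copy()                          # periodic reinitialization of the covariance
        H_k = P @ x_k / (x_k @ P @ x_k + sigma2)   # Kalman gain for a scalar measurement
        w_hat = w_hat + H_k * (y_k - x_k @ w_hat)  # measurement update of the estimate
        P = P - np.outer(H_k, x_k) @ P             # covariance update: (I - H_k x_k^T) P
        estimates.append(w_hat.copy())
    return np.array(estimates)

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(0.0, 1.0, size=300)
print(kf_with_reset(X, y, P0=64.0 * np.eye(2), sigma2=1.0, reset_period=120)[-1])
```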

Figure 7:

For the case of a time-varying parameter vector, a comparison between the components of the optimal estimate at the time k of the parameter vector w, obtained by solving problem OLL modeling online learning, and the corresponding components of the estimate at the time k obtained by applying the Kalman filter. A setting similar to the one of Figure 2 has been considered, but with online examples and with the parameter vector w randomly changed by a small amount every 100 online examples. The covariance matrix of the initial w has been chosen to be diagonal with diagonal entries equal to 64, the variance of the measurement noise has been chosen equal to 1, and the regularization parameter has been set to 30. No periodic reinitialization of any of the covariance matrices to the a priori one was performed.

Figure 8:

For the case of a time-varying parameter vector: a comparison similar to the one of Figure 7, with the same online examples and the same changes in w, but with a reinitialization of the covariance matrices to the a priori one performed every 120 online examples.

Figure 9:

Frobenius norm of the Kalman gain matrix Hk for the experiment shown in Figure 7 (highly dense regions correspond to oscillations in the Frobenius norm with respect to the time index k).

Figure 10:

Frobenius norm of the Kalman gain matrix Hk for the experiment shown in Figure 8.

Figures 11 and 12 also show that a periodic reinitialization of the covariance matrices to the a priori one does not significantly harm the tracking of a time-invariant parameter vector w in the case of the OLL estimate, whereas the KF estimate is affected more negatively. Again, this is due to the smoothing effect of formula 3.22.

Finally, various approaches to learning time-varying parameters online were presented in McGough (2003), Liu et al. (2009), and Chang (2014), in some cases in the context of state estimation of dynamical systems in the presence of outliers (Alessandri & Awawdeh, 2016; De Palma & Indiveri, 2016). As a possible extension, one could combine those approaches with the regularization of the updates included in our model, whose beneficial effects have been demonstrated also in this time-varying case.

Figure 11:

An experimental setting similar to the one of Figure 7 but with no change in the parameter vector.

Figure 12:

A comparison similar to the one of Figure 11, with the same online examples and with a reinitialization of the covariance matrices to the a priori one performed every 120 online examples.

8.6  Extension f: Insertion of a Discount Factor in the Problem

One may be interested in giving different weights to future expected errors, giving more importance to the present. This can be modeled by inserting a discount factor , and modifying the learning functional, equation 3.9, as follows:
formula
8.5

Then the resulting problem is just a variation, with random matrices, of the discounted LQ/LQG problem (see Davis & Vinter, 1985, sec. 6.3, for the version with deterministic matrices).

In practice, the modification 8.5 only slightly changes the Bellman equations for the cost-to-go functions, with the introduction of the discount factor . Particularly, the updates 3.14 to 3.16, and the ARE, equation 3.17, become, respectively,
formula
8.6
formula
8.7
formula
8.8
and
formula
8.9
whereas the SRE, equation 3.25, is not changed at all. For the infinite-horizon case, the learning functional is the limit of equation 8.5 as the horizon tends to infinity (so, no “average” learning functional similar to equation 4.4 is needed). Moreover, the stationary matrix
formula
8.10
associated with the stationary solution of the stationary version
formula
8.11
of the ARE, equation 8.9, is symmetric and such that has all its eigenvalues inside the unit circle, as it follows from the positive-definiteness of , which is derived directly from equation 8.11, assuming that is positive-definite, as in section 4. Concluding, these analogies with the undiscounted case would allow one, for example, to extend the properties stated in proposition 26 to the discounted case.

Regarding the modification 8.5, one can also observe that, as in the basic version of problem OLL, the past updates of the estimate of the parameter vector w are not modified when a new example arrives and that, in any case, such updates have no influence on the cost-to-go functions at later stages. Moreover, the discount factor should not be confused with a forgetting factor20 (which gives less importance to past errors in determining the current estimate).

8.7  Extension g: Introduction of Additional Regularizations on the Estimates of w

Given a sequence of additional regularization parameters, one could insert terms of the corresponding form in the per-stage cost of the learning functional, equation 3.5, hence replacing it by
formula
8.12
In this case, one would still obtain an ARE, but referred to the original dynamical system 3.3 instead of the reduced one, equation 3.7, due to the presence of the additional terms in the objective functional, equation 8.12. So the matrix Lk would have a larger size in this case. Instead, the SRE would be exactly the same as equation 3.25. A suitable choice for the sequence of the regularization parameters