## Abstract

It is well known in machine learning that models trained on a training set generated by a probability distribution function perform far worse on test sets generated by a different probability distribution function. In the limit, it is feasible that a continuum of probability distribution functions might have generated the observed test set data; a desirable property of a learned model in that case is its ability to describe most of the probability distribution functions from the continuum equally well. This requirement naturally leads to sampling methods from the continuum of probability distribution functions that lead to the construction of optimal training sets. We study the sequential prediction of Ornstein-Uhlenbeck processes that form a parametric family. We find empirically that a simple deep network trained on optimally constructed training sets using the methods described in this letter can be robust to changes in the test set distribution.

## 1 Introduction

The main problems in machine learning are density estimation, regression, and classification based on samples drawn according to an unknown but fixed probability distribution function $F$. To assess the quality of a machine learner, the notion of generalization was introduced, most prominently in statistical learning theory (Vapnik, 1998, 2013). Statistical learning theory describes conditions on the hypothesis space of the learning algorithm and the number of samples drawn from $F$ such that the empirical risk is close in probability to the expected risk. For generalization to be defined in this framework, it is crucial that the expected risk is calculated with respect to the same probability distribution function that generated the samples used for the evaluation of the empirical risk. A change in the probability distribution function cannot be directly incorporated into statistical learning theory.

Recent findings have shown, however, that even slight changes in the probability distribution function that generates the data (i.e., different distribution functions for the training or test set) lead to decreases in performance of the learned model (Recht, Roelofs, Schmidt, & Shankar, 2018). This problem can be partially circumvented by including data drawn from different possible probability distribution functions (which are allowed to possess different functional forms) in the training set, effectively demanding that a joint solution is found for all subproblems (Caruana, 1997). In the limit, it is possible that infinitely many probability distribution functions could have generated the data. One possible way of modeling the infinitely many data-generating probability distribution functions is by grouping them into a parametric family.

In this letter, we assume that the data-generating process is itself parametric. Data are then drawn from the whole parametric family: the task that a learning algorithm has to solve is to learn a model for the entire parametric family. Without further prior information on the specific probabilistic structure of the test set, it is a natural requirement to demand that a learned model is equally good for all members of the parametric family. The central question studied in this letter is therefore how training sets containing a finite number of samples can be constructed such that the training set represents the entire parametric family optimally. The tools needed for the analysis carried out in this letter mostly stem from information theory, specifically universal coding theory, and not from machine learning (Rissanen, 2007; Cover & Thomas, 2012).

For the sake of clarity and in order to derive quantitative statements, we focus on a specific stochastic process, the Ornstein-Uhlenbeck process. Being both a gaussian and Markovian process, this stochastic process is rich in structure while still being analytically tractable. Most of the results we present, however, apply to more general problem classes.

The problem of how to optimally sample from a parametric family is tightly connected to universal coding theory. Some universal coding inequalities described in section 2 directly correspond to the problem of sequential prediction in the case of an Ornstein-Uhlenbeck process as shown in section 3. The specific stochastic process chosen therefore yields a task (sequential prediction—having observed a time series up to sample $n$, sample $n+1$ is predicted) that directly corresponds to questions of how to sample a parametric family optimally in the sense of universal coding theory. The letter concludes by empirically studying the generalization behavior shown by deep networks trained on the Ornstein-Uhlenbeck parametric family in an autoregressive manner. We empirically find that a simple model trained on optimally constructed training sets generalizes better to changes in the test set distribution than if the model is trained on suboptimally generated training sets.

We use the following notation. Let $x^n = x_1, x_2, \ldots, x_n$ be a sequence of real-valued elements and $X^n = X_1, X_2, \ldots, X_n$ a sequence of random variables on $\mathbb{R}^n$. In this work, $X^n$ will denote strictly stationary stochastic processes. Define a set of probability density functions (PDFs) $P_\lambda$, $\lambda \in \Omega$, on $\mathbb{R}^n$ with $\Omega$ a compact subset of $\mathbb{R}^m$, assuming there are $m$ free parameters. $|\cdot|$ denotes the operation of taking the determinant of a square matrix. $\log \cdot$ is the natural logarithm.

## 2 Review on Universal Coding

The previous ideas, although formulated in terms of probabilities (equivalently, in terms of log-loss) can be directly applied to the case of sequential prediction under the mean squared error (MSE) loss, at least for the Gauss-Markov processes used in this letter. This idea is described in section 3.

## 3 Lower Bounds on the Sequential Prediction Error

In this section, we first introduce the parametric family studied in this letter. Thereafter, we derive lower bounds on the sequential prediction error under the MSE loss for different priors $w$ from which the strictly stationary sampled Ornstein-Uhlenbeck processes are drawn.

### 3.1 Some Results on the Ornstein-Uhlenbeck Process

### 3.2 Sampling the Ornstein-Uhlenbeck Process

Equation 3.12 is a quadratic approximation to equation 3.10; it corresponds to a Taylor expansion truncated after the second expansion coefficient. Equation 3.11, plugged into equation 2.4, yields Jeffreys' prior for the parametric family composed of sampled Ornstein-Uhlenbeck processes. Jeffreys' prior is shown in Figure 1 for $\delta = 10$.

### 3.3 Lower Bounds

In section 2, various lower bounds under log loss were discussed that pertain to representing a parametric family by some mixture source. Here we discuss lower bounds under MSE loss for the task of sequential prediction tailored to the sampled Ornstein-Uhlenbeck process.

By choosing $\sigma_\lambda^2 = \sigma^2 \frac{1 - e^{-2\theta\delta}}{2\theta}$ and $p = 1$ according to the Ornstein-Uhlenbeck process specifications, equation 3.6, the desired result is obtained. $\square$

If the prior $w$ is chosen as Jeffreys' prior, then the random coding capacity $C_R$ can be replaced by $C_n$ from equation 2.2 in the case of gaussian ARMA processes.

Theorem 1 is a generalization of a well-known lower bound obtained for a uniform prior $w$ (Rissanen, 1984). The greatest lower bound results from choosing Jeffreys' prior. In the case of a uniform prior $w$, the number of distinguishable distributions is proportional to $n^{m/2}$, provided that some parameter estimators exist that converge sufficiently fast (cf. Merhav & Feder, 1995). The conditions hold for the strictly stationary sampled Ornstein-Uhlenbeck process. In that case, $C_R$ in inequality 3.13 has to be replaced by $\frac{m}{2}\log n$, with $m = 2$ in our case on account of the number of free parameters in the Ornstein-Uhlenbeck parametric family. Note that if $w$ were chosen such that only one distribution could be effectively distinguished, the lower bound would be equal to $\sigma^2 \frac{1 - e^{-2\theta\delta}}{2\theta}$. The same lower bound would be reached if the two free parameters $\theta$ and $\mu$ were known and did not have to be estimated first. The second part of equation 3.13, $(1-\epsilon)\frac{2C_R}{n-p}$, hence measures the additional complexity of having unknown free parameters.
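As a rough numerical illustration of the two quantities discussed above, the following sketch evaluates the noise floor $\sigma^2 \frac{1 - e^{-2\theta\delta}}{2\theta}$ and the complexity term $(1-\epsilon)\frac{2C_R}{n-p}$ with $C_R$ replaced by $\frac{m}{2}\log n$. The function names, the choice $\sigma = 1$, $\theta = 1$, and $\epsilon = 0$ are assumptions made only for this example; how the two parts combine in inequality 3.13 is not reproduced here.

```python
import math

def one_step_noise_variance(theta, delta, sigma=1.0):
    """Noise floor sigma^2 * (1 - exp(-2*theta*delta)) / (2*theta).
    sigma = 1.0 is an assumed value, not taken from the text."""
    return sigma**2 * (1.0 - math.exp(-2.0 * theta * delta)) / (2.0 * theta)

def complexity_term(n, m=2, p=1, eps=0.0):
    """(1 - eps) * 2 * C_R / (n - p), with C_R replaced by (m/2) * log(n)
    as stated for the uniform prior. eps = 0.0 is an assumed value."""
    c_r = 0.5 * m * math.log(n)
    return (1.0 - eps) * 2.0 * c_r / (n - p)

# Illustrative values: delta = 10 and n = 500 follow the experimental setup,
# theta = 1.0 is a hypothetical member of the parametric family.
floor = one_step_noise_variance(theta=1.0, delta=10.0)
extra = complexity_term(n=500)
print(f"noise floor: {floor:.4f}, complexity term: {extra:.4f}")
```

For these values the complexity term is small compared with the noise floor, which matches the reading that it only accounts for the cost of not knowing $\theta$ and $\mu$.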

### 3.4 Distinguishability of Processes from the Ornstein-Uhlenbeck Parametric Family

## 4 Empirical Results with Deep Networks

The results described in sections 2 and 3 are intrinsic properties of parametric families. We first recapitulated general results of universal coding theory and then derived specific results for the Ornstein-Uhlenbeck parametric family. By an empirical analysis, we show in this section that the previous statements have repercussions for machine learning as well. The choice of the specific learning algorithm is to some extent arbitrary for this task. We have hence chosen standard RNN architectures with long short-term memory (LSTM) units (Hochreiter & Schmidhuber, 1997), as these are state of the art for time series prediction.

We first describe a constructive scheme to approximate the optimal solution from section 3.3 within the hypothesis space of an RNN. The approximation methods described in section 4.1 are used to verify that the chosen RNN architecture described in section 4.2 can in principle approximate closely the optimal solution. To carry out the approximations, the results from equation 3.5 and the appendix are required as the domain of the input to the RNN needs to be known.

### 4.1 Approximating the Optimal Solution through Explicit Construction

An RNN with a single hidden layer with LSTM units is used for the sequential prediction task. In order to approximate the solution based on the OLS equations discussed in section 3.3 (cf. Fuller & Hasza, 1980, for the OLS equations), each subexpression in the OLS equations is approximated through one of the units in the recurrent layer. In order to approximate the expression $x^2 + y$, for example, we first approximate $x$ and $y$ through two of the recurrent units, $x^2$ with another unit, and finally $x^2 + y$ with a fourth unit. The OLS equations contain both polynomial terms of second order as well as reciprocal terms.
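To make the structure of such a predictor concrete, here is a minimal sketch of a one-step-ahead OLS predictor for a first-order autoregression with intercept. It uses the textbook regression of $x_{t+1}$ on $x_t$ and is not necessarily the exact form of the equations in Fuller and Hasza (1980); the function name `ols_ar1_predict` is hypothetical. It shows where the second-order polynomial terms (sums of products) and the reciprocal term (division by the sum of squares) arise.

```python
import numpy as np

def ols_ar1_predict(x):
    """Fit x[t+1] = a + b * x[t] by ordinary least squares on the observed
    sequence and return the prediction for the next, unobserved sample.
    Needs at least three samples that are not all equal."""
    x = np.asarray(x, dtype=float)
    past, future = x[:-1], x[1:]
    past_centered = past - past.mean()
    # Second-order polynomial terms: sums of squares and cross products.
    sum_sq = np.dot(past_centered, past_centered)
    sum_cross = np.dot(past_centered, future - future.mean())
    # Reciprocal term: division by the sum of squares.
    b = sum_cross / sum_sq
    a = future.mean() - b * past.mean()
    return a + b * x[-1]

# Noise-free check sequence generated by x[t+1] = 1 + 0.5 * x[t].
sequence = [0.0, 1.0, 1.5, 1.75, 1.875]
prediction = ols_ar1_predict(sequence)
```

On the noise-free sequence the fit recovers the generating coefficients, so the prediction equals $1 + 0.5 \cdot 1.875$.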

Three main ideas are used for the approximation of the equations with the LSTM layer. The first idea is to rescale the input to the approximately linear region of the corresponding tanh/sigmoid nonlinearity. This step requires a careful analysis of the growth behavior of the individual terms in the OLS equations. Equation 3.5 provides an upper and lower bound within finite time intervals for the strictly stationary Ornstein-Uhlenbeck process, with $C \approx 1$ from numerical simulations. From this, as well as a more thorough analysis of the growth behavior of terms in the OLS equations detailed in the appendix, it is possible to obtain scaling factors that ensure that the rescaled input is within the linear region for some finite time horizon. The second idea is to approximate the multiplication operation required in the OLS equations by the use of Hadamard multiplication in the LSTM update equation for the cell state. The last idea is to approximate the division operation by first approximating the inverse of the divisor and then using the multiplication approximation to multiply the dividend and the inverse of the divisor. For the approximation of the inverse, we can either train a subnetwork to approximate the operation within our range of interest or we can use a constructive approximation scheme closely based on previous work (Jones, 1990).
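The first two ideas can be illustrated numerically. The sketch below is not the construction used in the letter; it only shows, under assumed names and an assumed scale factor of $10^{-2}$, that a rescaled tanh behaves like the identity and that the elementwise product of a sigmoid gate (with its constant offset of 0.5 removed) and a tanh candidate approximates a multiplication once the scaling is undone.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def approx_identity(x, scale=1e-2):
    """tanh(u) is close to u for small |u|, so shrinking the input,
    applying tanh, and undoing the scaling approximately returns x."""
    return np.tanh(scale * x) / scale

def approx_product(x, y, scale=1e-2):
    """Gate-times-candidate product as in an LSTM cell update.
    sigmoid(u) - 0.5 is close to u / 4 and tanh(v) is close to v for small
    inputs, so the elementwise product is close to scale**2 * x * y / 4."""
    gate = sigmoid(scale * x) - 0.5   # offset removal is an assumption here
    candidate = np.tanh(scale * y)
    return 4.0 * gate * candidate / scale**2

x = np.array([2.0, -1.5, 0.3])
y = np.array([-3.0, 2.0, 1.2])
identity_error = np.max(np.abs(approx_identity(x) - x))
product_error = np.max(np.abs(approx_product(x, y) - x * y))
```

Both errors shrink as the scale factor decreases, at the price of working with very small activations; this is why the admissible scaling depends on bounds for the process such as those of equation 3.5.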

### 4.2 Training on Jeffreys' Prior and Uniform Prior

To elucidate the importance of sampling of the parameter space on the performance of the RNN, we train two networks with the same configuration and training conditions: one where the process parameters are sampled according to Jeffreys' prior and the other where the sampling is carried out according to a uniform prior. We choose a network with a single layer of 100 units, followed by a linear transformation to a single dimension for the prediction. This network can approximate the optimal solution closely. The network is trained with stochastic gradient descent with a learning rate of 0.001 with early stopping. The range of the parameter $\mu $ for the process is $(-2,2)$, while the range for the parameter $\theta $ is $(0.01,3)$. The sampling interval $\delta $ is set to 10, while $n$ is arbitrarily set to 500.
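A training set for the uniform-prior case could be generated along the following lines. This is a sketch under stated assumptions: it uses the standard exact discretization of the Ornstein-Uhlenbeck process with transition variance $\sigma^2 \frac{1 - e^{-2\theta\delta}}{2\theta}$, draws the first sample from the stationary distribution, and fixes $\sigma = 1$, which the text does not specify. The function names are hypothetical. Sampling from Jeffreys' prior would require its density from section 3.2 and is not shown.

```python
import numpy as np

def sample_uniform_params(rng, mu_range=(-2.0, 2.0), theta_range=(0.01, 3.0)):
    """Draw (mu, theta) from the uniform prior on the stated ranges."""
    return rng.uniform(*mu_range), rng.uniform(*theta_range)

def simulate_sampled_ou(rng, mu, theta, n=500, delta=10.0, sigma=1.0):
    """Strictly stationary Ornstein-Uhlenbeck process observed every delta
    time units. sigma = 1.0 is an assumed value."""
    decay = np.exp(-theta * delta)
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    x = np.empty(n)
    # Start in the stationary distribution N(mu, sigma^2 / (2 * theta)).
    x[0] = rng.normal(mu, sigma / np.sqrt(2.0 * theta))
    for k in range(1, n):
        x[k] = mu + decay * (x[k - 1] - mu) + step_std * rng.normal()
    return x

rng = np.random.default_rng(0)
# 50 parameter draws, one realization of length n = 500 per draw.
train = [simulate_sampled_ou(rng, *sample_uniform_params(rng)) for _ in range(50)]
# Sequential prediction pairs: inputs x[:-1], targets x[1:].
pairs = [(seq[:-1], seq[1:]) for seq in train]
```

Each realization then provides the inputs and shifted targets for the autoregressive training described above.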

Both of the trained models are tested on sequences drawn from the two priors: Jeffreys' and uniform. The results for the case of 50 parameters sampled during training are shown in Table 1. The results are averaged over 5 draws of parameter sampling and 10 random initializations of the network for each draw.

| Train prior | Test prior: uniform | Test prior: Jeffreys' |
|---|---|---|
| Uniform | $2.91 \pm 0.4$ | $3.83 \pm 0.25$ |
| Jeffreys' | $2.94 \pm 0.32$ | $3.4 \pm 0.2$ |
| Optimal | 0.79 | 1.11 |

Note: “Optimal” is related to the lower bounds from section 3.3.

It is observed that with an increasing number of parameter samples drawn from the parameter space, the difference in the performance of the models trained on the two priors gets smaller. This can be seen in Figure 3, in which the performance of the models trained on stochastic process realizations drawn from the two priors (Jeffreys' and uniform) and tested on Jeffreys' prior is plotted against the number of stochastic process realizations drawn.

## 5 Discussion

Classical machine learning theory investigates the learnability of relationships from i.i.d. samples drawn from a fixed but unknown probability distribution, as alluded to in section 1. For the non-i.i.d. case, guarantees of the statistical learning theory type have been developed (cf. Kuznetsov & Mohri, 2015; McDonald, Shalizi, & Schervish, 2017, as well as references therein). In both settings, generalization is always understood to refer to the same distribution generating the training and test set.

If multiple distributions are to be learned, it is natural to require the model to do equally well on all of them. This requirement can be directly translated into the language of universal coding theory. The number $p$ of stochastic process realizations drawn independently according to some prior $w$ on the compact parameter space, as well as the length $n$ of each stochastic process realization, are, as is intuitively clear, crucial for any required theory of generalization in the parametric family context. In classical statistical learning theory, $n$, as well as the complexity of the hypothesis space, is the main focus of investigation. For finite $n$, only finitely many stochastic processes are distinguishable. Asymptotically in $n$, for the stochastic processes considered in this letter, the capacity-inducing prior will be given by Jeffreys' prior. Since the maximum number of distinguishable models is close to $e^{C_n}$, $p$ will have to be at least equal to $e^{C_n}$. In fact, since $C_n$ is in general growing with increasing $n$, the minimum number of required stochastic process realizations $p$ will depend on $n$. The dependence of $p$ on $n$ therefore implicitly reflects the fact that the number of distinguishable distributions in a parametric family grows with increasing $n$. Since the capacity-inducing prior $w^*$ is the prior under which the maximum number of distributions in the parametric family are distinguishable, it follows that $p$ adapted to this prior is sufficient for any other prior. Finding a $p$ adapted to $w^*$ is therefore a necessary requirement if one attempts to learn the entire parametric family. The empirical counterpart of this statement for the case of MSE loss is found in Figure 3 as well as Table 1. Training on stochastic process realizations drawn from Jeffreys' prior ensures that testing on a different prior (here the uniform prior was chosen) does not lead to an increased MSE loss. Training on the uniform prior and testing on Jeffreys' prior, however, leads to a marked increase in MSE loss.

The capacity used in the lower bound equation 2.1, as well as in the lower bound equation 3.13, is the capacity of the parametric family, not the capacity of the hypothesis space of the machine learner. Notions of capacity for the machine learner reflect the richness of the class of functions that such a learner can approximate. The capacity $C_n$ measures the richness of the parametric family.

Assume that it was known only that a set of observations could be modeled by a parametric family with $m$ free parameters, while the specific form of the parametric family was not known. In such a case, it would not be possible to obtain $p$ such that, uniformly for all possible parametric families with $m$ free parameters, $p$ would be sufficient to guarantee that any parametric family could be fully learned (in the sense that the solution found should be close to a mixture source induced by the capacity-achieving prior). If the form of the parametric family was not known, it seems reasonable to use stochastic process realizations drawn uniformly from the space of parameters. If the capacity-inducing prior, however, was very different from the uniform prior, then most of the obtained realizations from the uniform prior would not facilitate learning the parametric family fully. The ill-adapted sampling mechanism would prohibit an optimal learning of the parametric family. The testing error in Figure 3, with testing performed by drawing stochastic process realizations from Jeffreys' prior and training carried out by using either Jeffreys' or the uniform prior, converges to the same error for increasing $p$. This behavior is expected in view of the fact that the two priors are positive everywhere within the parameter space, as can be seen in Figure 1. A more subtle analysis of this fact can be carried out by noting that the numbers of distinguishable distributions under the two priors are not too different from one another, as discussed in section 3.3 for the parametric family considered in this letter.

Equation 3.13 provides a lower bound on the sequential prediction error for the MSE loss, assuming that the form of the parametric family was known. The empirical results obtained in section 4 do not require knowledge of the specific form. By the explicit construction detailed in section 4.1, we show that a solution close to an optimal solution lies in the hypothesis space of the chosen network architecture. It is hence guaranteed that the chosen deep network is in principle well specified. The results shown in Table 1 indicate that the empirical solution found by the network does not reach the lower bounds, here denoted by “Optimal”, implying that an inefficiency exists in the optimization procedure. A thorough analysis is outside the scope of this letter, however, as it would require an investigation of the loss landscape of the chosen deep network with stochastic process realizations drawn according to some prior $w$ as input, as well as of the optimization algorithm used.

Empirically, it was observed in the experiments that if one first trains the deep network with observations drawn from some prior $w_1$ until convergence and thereafter changes the prior to some $w_2$ and continues training, the previously found solution changes. This behavior is expected in view of the previous discussion, as a changed prior induces a different optimal solution. It follows that there is a close link between optimal solutions and the sampling of parameter space.

Most of the previous statements hold for more general families of distributions and not only for parametric families. Equation 2.1, as well as the statements on the capacity-achieving prior, hold in particular in more general contexts (Merhav & Feder, 1995). The simple form of the capacity, equation 2.2, as well as the fact that Jeffreys' prior is asymptotically capacity inducing, do not hold in a more general context. To achieve optimality, however, the sampling mechanism should still be matched to $w^*$.

## Appendix

## Acknowledgments

This work was partially supported by the European Union's Horizon 2020 research and innovation program under grant agreement 644732.