## Abstract

It is well known in machine learning that models trained on a training set generated by a probability distribution function perform far worse on test sets generated by a different probability distribution function. In the limit, it is feasible that a continuum of probability distribution functions might have generated the observed test set data; a desirable property of a learned model in that case is its ability to describe most of the probability distribution functions from the continuum equally well. This requirement naturally leads to sampling methods from the continuum of probability distribution functions that lead to the construction of optimal training sets. We study the sequential prediction of Ornstein-Uhlenbeck processes that form a parametric family. We find empirically that a simple deep network trained on optimally constructed training sets using the methods described in this letter can be robust to changes in the test set distribution.

## 1  Introduction

The main problems in machine learning are density estimation, regression, and classification based on samples drawn according to an unknown but fixed probability distribution function $F$. To assess the quality of a machine learner, the notion of generalization was introduced, most prominently in statistical learning theory (Vapnik, 1998, 2013). Statistical learning theory describes conditions on the hypothesis space of the learning algorithm and the number of samples drawn from $F$ such that the empirical risk is close in probability to the expected risk. For generalization to be defined in this framework, it is crucial that the expected risk is calculated with respect to the same probability distribution function that generated the samples used for the evaluation of the empirical risk. A change in the probability distribution function cannot be directly incorporated into statistical learning theory.

Recent findings have shown, however, that even slight changes in the probability distribution function that generates the data (i.e., different distribution functions for the training or test set) lead to decreases in performance of the learned model (Recht, Roelofs, Schmidt, & Shankar, 2018). This problem can be partially circumvented by including data drawn from different possible probability distribution functions (which are allowed to possess different functional forms) in the training set, effectively demanding that a joint solution is found for all subproblems (Caruana, 1997). In the limit, it is possible that infinitely many probability distribution functions could have generated the data. One possible way of modeling the infinitely many data-generating probability distribution functions is by grouping them into a parametric family.

In this letter, we assume that the data-generating process is itself parametric. Data are then drawn from the whole parametric family: the task that a learning algorithm has to solve is to learn a model for the entire parametric family. Without further prior information on the specific probabilistic structure of the test set, it is a natural requirement to demand that a learned model is equally good for all members of the parametric family. The central question studied in this letter is therefore how training sets containing a finite number of samples can be constructed such that the training set represents the entire parametric family optimally. The tools needed for the analysis carried out in this letter mostly stem from information theory, specifically universal coding theory, and not from machine learning (Rissanen, 2007; Cover & Thomas, 2012).

For the sake of clarity and in order to derive quantitative statements, we focus on a specific stochastic process, the Ornstein-Uhlenbeck process. Being both gaussian and Markovian, this process is rich in structure while remaining analytically tractable. Most of the results we present, however, apply to more general problem classes.

The problem of how to optimally sample from a parametric family is tightly connected to universal coding theory. Some universal coding inequalities described in section 2 directly correspond to the problem of sequential prediction in the case of an Ornstein-Uhlenbeck process as shown in section 3. The specific stochastic process chosen therefore yields a task (sequential prediction—having observed a time series up to sample $n$, sample $n+1$ is predicted) that directly corresponds to questions of how to sample a parametric family optimally in the sense of universal coding theory. The letter concludes by empirically studying the generalization behavior shown by deep networks trained on the Ornstein-Uhlenbeck parametric family in an autoregressive manner. We empirically find that a simple model trained on optimally constructed training sets generalizes better to changes in the test set distribution than if the model is trained on suboptimally generated training sets.

We use the following notation. Let $x^n = x_1, x_2, \ldots, x_n$ be a sequence of real-valued elements and $X^n = X_1, X_2, \ldots, X_n$ a sequence of random variables on $\mathbb{R}^n$. In this work, $X^n$ will denote strictly stationary stochastic processes. Define a set of probability density functions (PDFs) $\{P_\lambda, \lambda \in \Omega\}$ on $\mathbb{R}^n$, with $\Omega$ a compact subset of $\mathbb{R}^m$, assuming there are $m$ free parameters. $|\cdot|$ denotes the operation of taking the determinant of a square matrix. $\log(\cdot)$ is the natural logarithm.

## 2  Review on Universal Coding

We give a brief description of ideas from the universal coding literature that are crucial for this work. Assume that a family of PDFs $\{P_\lambda, \lambda \in \Omega\}$ on $\mathbb{R}^n$ and an observed sequence $x^n = x_1, x_2, \ldots, x_n$ (which is generated by one of the densities in the family) are given. If the specific PDF $P_\lambda$ generating $x^n$ is known, then the entropy rate $\lim_{n\to\infty} \frac{1}{n} E_\lambda\left[-\log P_\lambda\left(X^n\right)\right] = H_\lambda$, with $E_\lambda[\cdot]$ the expectation with respect to $P_\lambda$, corresponds to the best compression of the source. Such a compression statement follows from the asymptotic equipartition property (AEP; Cover & Thomas, 2012). For the sampled strictly stationary Ornstein-Uhlenbeck process, which is discussed in section 3 in more detail, the AEP holds (Barron, 1985). If $P_\lambda$ is not known, however, the question arises of whether it is still possible (asymptotically in $n$) to reach the entropy rate of the stochastic process, provided that the parametric family $\{P_\lambda, \lambda \in \Omega\}$ is known. Universal coding theory answers this question in the affirmative for a wide class of parametric families (Merhav & Feder, 1998). To show this, a mixture source $P\left(x^n\right) = \int_\Omega w(\lambda) P_\lambda\left(x^n\right) d\lambda$ is introduced, with $w$ a PDF (we do not consider cases in which $w$ might be discrete) on $\Omega$. This mixture source can then be used as a replacement for the unknown $P_\lambda$. A natural question associated with such a mixture source is how $w$ should be chosen. It is intuitively clear that mixture sources $P\left(x^n\right)$ set up by different $w$ will behave differently. It turns out that a particular choice of $w$ carries with it a notion of channel capacity. Let $\Lambda$ denote a random variable with PDF $w$ on $\Omega$. The parameters $\lambda$ indexing $P_\lambda$ are realizations of $\Lambda$. The prior $w^*$ that reaches channel capacity $C_n = \sup_w I_w\left(\Lambda; X^n\right)$, with channel input $\Lambda$ and channel output $X^n$, where $I_w\left(\Lambda; X^n\right)$ denotes the mutual information induced by $w(\lambda) P_\lambda\left(x^n\right)$, maximizes the mutual information between $\Lambda$ and $X^n$.
If $\Lambda$ is distributed as $w^*$, then observations $x^n$ generated by $P_\lambda$ contain the most information about the $m$ parameters in $\Omega$. Additionally, $w^*$ has the further property of being the prior that induces maximin redundancy (Merhav & Feder, 1998). The channel capacity $C_n$ is furthermore a lower bound on the Kullback-Leibler divergence between the true data-generating distribution $P_\lambda$ and any other PDF $Q\left(x^n\right)$ (Merhav & Feder, 1995):
$D\left(P_\lambda \,\|\, Q\right) > (1-\varepsilon) C_n.$
(2.1)
Inequality 2.1 holds for all $\varepsilon > 0$ and for all $\lambda \in \Omega$ except for some $\lambda$ in a subset $B \subset \Omega$ whose size under $w^*$ vanishes at an exponential rate with $C_n$. For $w = w^*$, $D\left(P_\lambda \,\|\, P^*\right) = C_n$, with $P^*$ the mixture source with capacity-achieving prior $w^*$. Hence, for $w^*$, nearly all sources $P_\lambda$ lie on or close to a hypersphere centered at $P^*$ with Kullback-Leibler divergence equal to $C_n$, as can be inferred from the previous discussion and inequality 2.1. It is crucial to emphasize that this statement holds only for the capacity-achieving prior $w^*$. Other mixture sources based on different priors $w$ will in general be closer to some subset of sources in the parametric family $\{P_\lambda, \lambda \in \Omega\}$ and have a Kullback-Leibler divergence larger than $C_n$ to other sources in the parametric family.
It is interesting to note that for the parametric family introduced in section 3 (sampled strictly stationary Ornstein-Uhlenbeck processes), an asymptotically accurate form of the channel capacity can be deduced (Rissanen, 1996),
$C_n = \frac{m}{2} \log \frac{n}{2\pi} + \log \int_\Omega \sqrt{\left|I(\lambda)\right|}\, d\lambda + o(1),$
(2.2)
with $o(1)$ tending to zero for $n \to \infty$ and $I(\lambda)$ the Fisher information matrix of the stochastic process,
$I_{ij}\left(\lambda^*\right) = \lim_{n\to\infty} \frac{1}{n} \frac{\partial^2}{\partial \lambda_i \partial \lambda_j} E_{\lambda^*}\left[-\log P_\lambda\left(X^n\right)\right]\Big|_{\lambda=\lambda^*},$
(2.3)
with $i$ and $j$ ranging from 1 to $m$ and $\lambda^*$ in $\Omega$.
An additional interpretation of $C_n$ can be given in terms of the number of distributions in $\{P_\lambda, \lambda \in \Omega\}$ that are distinguishable based on the observation of a sequence of length $n$ (Balasubramanian, 1996; Rissanen, 2007). It is intuitively clear that different sources in the parametric family $\{P_\lambda, \lambda \in \Omega\}$ are not necessarily distinguishable after observing $n$ samples. This notion can be made more precise by using the language of hypothesis testing. For the parametric family discussed in this letter, this analysis is described in section 3. Note that equation 2.2 is a consequence of choosing Jeffreys' prior in the mixture source $P\left(x^n\right)$, which is given by the following expression (Jeffreys, 1998),
$w_{\text{Jeffreys}}(\lambda) = \frac{\sqrt{\left|I(\lambda)\right|}}{\int_\Omega \sqrt{\left|I\left(\lambda'\right)\right|}\, d\lambda'},$
(2.4)
which is asymptotically equal to the capacity-achieving prior $w^*$ for the parametric family considered in this letter. The number of distinguishable distributions after observing a sequence of length $n$ is roughly equal to $e^{C_n}$. Since Jeffreys' prior, equation 2.4, is asymptotically capacity inducing, the maximal number of distinguishable distributions is reached for Jeffreys' prior. More precisely, if $\Lambda$ is distributed according to $w_{\text{Jeffreys}}$, then the sampled stochastic processes $P_\lambda$ are maximally distinguishable on average. Any other prior $w$ would (at least asymptotically) lead to a smaller number of distinguishable distributions. This argument can be strengthened by appealing to the analog of inequality 2.1 for arbitrary priors (Merhav & Feder, 1995). It can be shown that $D\left(P_\lambda \,\|\, Q\right)$ is larger than $(1-\varepsilon) C_R$, with $\varepsilon > 0$ and $C_R$ equal to the logarithm of the maximal number of random sources chosen under the prior $w$ that can be distinguished in the sense of having a bounded error probability (Merhav & Feder, 1995). $Q$ is an arbitrary distribution on $x^n$, as in inequality 2.1. The inequality holds again for all parameters $\lambda$ except in a set $B' \subset \Omega$ whose size measured by $w$ tends to zero for $n \to \infty$ under certain conditions.

The previous ideas, although formulated in terms of probabilities (equivalently, in terms of log-loss) can be directly applied to the case of sequential prediction under the mean squared error (MSE) loss, at least for the Gauss-Markov processes used in this letter. This idea is described in section 3.

## 3  Lower Bounds on the Sequential Prediction Error

In this section, we first introduce the parametric family studied in this letter. Thereafter, we derive lower bounds on the sequential prediction error under the MSE loss for different priors $w$ from which the strictly stationary sampled Ornstein-Uhlenbeck processes are drawn.

### 3.1  Some Results on the Ornstein-Uhlenbeck Process

The Ornstein-Uhlenbeck process is defined as
$dX_t = \theta\left(\mu - X_t\right)dt + \sigma\, dW_t,$
(3.1)
with $\theta > 0$, $\mu \in \mathbb{R}$, $t \geq 0$, $\sigma > 0$, and $W_t$ the standard Wiener process. For the process to be strictly stationary, the first value $x_0$ at time $t=0$ is drawn from a gaussian distribution with mean $\mu$ and variance $\frac{\sigma^2}{2\theta}$. In the strictly stationary case, the Ornstein-Uhlenbeck process can alternatively be written as
$X_t = \mu + \frac{\sigma}{\sqrt{2\theta}} e^{-\theta t} W\left(e^{2\theta t}\right),$
(3.2)
with $W\left(e^{2\theta t}\right)$ a time-scaled Wiener process. We next derive some bounds on the growth of strictly stationary Ornstein-Uhlenbeck processes. These bounds are needed in the explicit construction of the recurrent neural network (RNN) that implements the asymptotically optimal solution of the sequential prediction problem described in section 4.1. To understand the growth behavior of the strictly stationary Ornstein-Uhlenbeck process, the law of the iterated logarithm is invoked:
$\limsup_{t\to\infty} \frac{W_t}{\sqrt{2t\log\log t}} = 1 \quad \text{a.s.}$
(3.3)
By applying the law of the iterated logarithm to the time-scaled Wiener process, the denominator of equation 3.3 becomes $\sqrt{2e^{2\theta t}\log\log e^{2\theta t}}$, while the numerator is replaced by $W\left(e^{2\theta t}\right)$. Multiplying numerator and denominator by $\frac{\sigma}{\sqrt{2\theta}}e^{-\theta t}$ and simplifying the resulting denominator to $\frac{\sigma}{\sqrt{\theta}}\sqrt{\log 2\theta t}$, one can conclude the following about the second term of equation 3.2:
$\limsup_{t\to\infty} \frac{\frac{\sigma}{\sqrt{2\theta}}e^{-\theta t}W\left(e^{2\theta t}\right)}{\frac{\sigma}{\sqrt{\theta}}\sqrt{\log 2\theta t}} = 1 \quad \text{a.s.}$
(3.4)
For a finite $t>0$, there will in general exist a constant $C>0$ such that the strictly stationary Ornstein-Uhlenbeck process in $0,t$ will be almost surely contained within the interval
$\left(\mu - C\frac{\sigma}{\sqrt{\theta}}\sqrt{\log 2\theta t},\ \mu + C\frac{\sigma}{\sqrt{\theta}}\sqrt{\log 2\theta t}\right).$
(3.5)
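As a quick numerical sanity check, the following sketch simulates stationary Ornstein-Uhlenbeck paths via the exact discrete update and reports how many stay inside the envelope of expression 3.5 with $C = 1$ (the value suggested by simulation; all parameter values here are hypothetical illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter values (hypothetical, not taken from the letter).
mu, theta, sigma = 0.0, 0.5, 1.0
dt, T, n_paths = 0.01, 10.0, 200
n_steps = int(T / dt)

# Exact update of the strictly stationary OU process on a grid of width dt.
a = np.exp(-theta * dt)
innov_sd = np.sqrt(sigma**2 / (2 * theta) * (1 - a**2))
x = rng.normal(mu, np.sqrt(sigma**2 / (2 * theta)), size=n_paths)  # stationary start
max_dev = np.zeros(n_paths)
for _ in range(n_steps):
    x = a * x + mu * (1 - a) + innov_sd * rng.normal(size=n_paths)
    max_dev = np.maximum(max_dev, np.abs(x - mu))

# Envelope of expression 3.5 with C = 1.
envelope = (sigma / np.sqrt(theta)) * np.sqrt(np.log(2 * theta * T))
frac_inside = np.mean(max_dev <= envelope)
print(f"fraction of paths inside the C=1 envelope: {frac_inside:.2f}")
```

Since the containment statement is only asymptotic and $C \approx 1$ is an empirical value, the reported fraction is expected to be high but not exactly 1.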

### 3.2  Sampling the Ornstein-Uhlenbeck Process

We consider Ornstein-Uhlenbeck processes drawn from a parametric family. The two free parameters are $\mu \in [c,d]$, with $c,d \in \mathbb{R}$, $d > c$, and $\theta \in [a,b]$, with $a,b \in \mathbb{R}^+$, $b > a$. $\sigma \in \mathbb{R}^+$ is arbitrary but fixed. The uniformly sampled Ornstein-Uhlenbeck process amounts to an autoregressive AR(1) process,
$X_{n\delta} = e^{-\theta\delta} X_{(n-1)\delta} + \mu\left(1 - e^{-\theta\delta}\right) + \varepsilon_n,$
(3.6)
with $\varepsilon_n \sim \mathcal{N}\left(0, \frac{\sigma^2}{2\theta}\left(1-e^{-2\theta\delta}\right)\right)$ independent over time, $\delta > 0$ the distance between consecutive samples, and $X_{n\delta}$ the $n$th sample. $\left(X_\delta, X_{2\delta}, \ldots, X_{n\delta}\right)^\top$ is distributed according to a multivariate normal distribution with mean vector $(\mu, \mu, \ldots, \mu)^\top$ and covariance matrix:
$\Sigma = \frac{\sigma^2}{2\theta}\begin{pmatrix}1 & e^{-\theta\delta} & \cdots & e^{-\theta(n-1)\delta}\\ e^{-\theta\delta} & 1 & \cdots & e^{-\theta(n-2)\delta}\\ \vdots & \vdots & \ddots & \vdots\\ e^{-\theta(n-1)\delta} & e^{-\theta(n-2)\delta} & \cdots & 1\end{pmatrix}.$
(3.7)
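The AR(1) representation, equation 3.6, can be checked against the covariance matrix 3.7 by simulation. The following sketch, with hypothetical parameter values, compares the empirical covariance of many simulated realizations to the theoretical matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values for illustration.
mu, theta, sigma, delta, n = 1.0, 0.8, 1.0, 0.5, 6
a = np.exp(-theta * delta)
innov_sd = np.sqrt(sigma**2 / (2 * theta) * (1 - a**2))

# Draw many realizations of (X_delta, ..., X_{n delta}) via the AR(1) update, eq. 3.6.
n_reals = 200_000
x = rng.normal(mu, np.sqrt(sigma**2 / (2 * theta)), size=n_reals)  # stationary start
cols = []
for _ in range(n):
    x = a * x + mu * (1 - a) + innov_sd * rng.normal(size=n_reals)
    cols.append(x)
X = np.stack(cols, axis=1)  # shape (n_reals, n)

# Theoretical covariance matrix of eq. 3.7.
idx = np.arange(n)
Sigma = sigma**2 / (2 * theta) * np.exp(-theta * delta * np.abs(idx[:, None] - idx[None, :]))

max_err = np.max(np.abs(np.cov(X, rowvar=False) - Sigma))
print(f"max |empirical - theoretical| covariance entry: {max_err:.4f}")
```

The maximal entrywise deviation shrinks at the usual Monte Carlo rate as the number of realizations grows.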
We next derive the asymptotic Kullback-Leibler divergence between two strictly stationary Ornstein-Uhlenbeck processes as well as the Fisher information matrix of this stochastic process. Both are needed for the subsequent discussion of distinguishability, as well as for the explicit construction of Jeffreys' prior. The inverse of covariance matrix 3.7 is given by
$\Sigma^{-1} = \frac{2\theta}{\sigma^2\left(1-e^{-2\theta\delta}\right)}\begin{pmatrix}1 & -e^{-\theta\delta} & 0 & \cdots & 0\\ -e^{-\theta\delta} & 1+e^{-2\theta\delta} & -e^{-\theta\delta} & \cdots & 0\\ 0 & -e^{-\theta\delta} & 1+e^{-2\theta\delta} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & -e^{-\theta\delta}\\ 0 & 0 & 0 & \cdots & 1\end{pmatrix},$
(3.8)
which is a symmetric tridiagonal matrix. From equation 3.8, we obtain the determinant of $Σ$,
$\left|\Sigma\right| = \frac{1}{\left|\Sigma^{-1}\right|} = \frac{\sigma^{2n}\left(1-e^{-2\theta\delta}\right)^{n-1}}{(2\theta)^n}.$
(3.9)
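A small numeric check of equation 3.9 (with hypothetical parameter values) compares the determinant of the matrix in equation 3.7 with the closed form:

```python
import numpy as np

# Numeric check of the determinant formula, eq. 3.9, for hypothetical parameters.
theta, sigma, delta, n = 0.7, 1.3, 0.4, 8

idx = np.arange(n)
# Covariance matrix of eq. 3.7: Sigma_ij = sigma^2/(2 theta) * exp(-theta delta |i-j|).
Sigma = sigma**2 / (2 * theta) * np.exp(-theta * delta * np.abs(idx[:, None] - idx[None, :]))

det_numeric = np.linalg.det(Sigma)
det_closed = sigma ** (2 * n) * (1 - np.exp(-2 * theta * delta)) ** (n - 1) / (2 * theta) ** n
print(f"numeric: {det_numeric:.6e}, closed form: {det_closed:.6e}")
```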
The asymptotic Kullback-Leibler divergence is then equal to
$D\left(\mu_1,\theta_1 \,\|\, \mu_0,\theta_0\right) = \lim_{n\to\infty}\frac{1}{n} D\left(P_{\mu_1,\theta_1} \,\|\, P_{\mu_0,\theta_0}\right) = \frac{1}{2}\frac{\theta_0}{\theta_1}\frac{1-2e^{-\left(\theta_0+\theta_1\right)\delta}+e^{-2\theta_0\delta}}{1-e^{-2\theta_0\delta}} + \frac{\left(\mu_1-\mu_0\right)^2\theta_0}{\sigma^2\left(1-e^{-2\theta_0\delta}\right)}\left(1-e^{-\theta_0\delta}\right)^2 - \frac{1}{2} + \frac{1}{2}\log\frac{1-e^{-2\theta_0\delta}}{1-e^{-2\theta_1\delta}} + \frac{1}{2}\log\frac{\theta_1}{\theta_0}.$
(3.10)
Evaluating the Fisher information matrix, equation 2.3, for the strictly stationary sampled Ornstein-Uhlenbeck process, we find
$I\left(\mu^*,\theta^*\right) = \begin{pmatrix}\frac{\left(e^{2\theta^*\delta}-1-2\theta^*\delta\right)^2}{2\theta^{*2}\left(e^{2\theta^*\delta}-1\right)^2}+\frac{\delta^2}{e^{2\theta^*\delta}-1} & 0\\ 0 & \frac{2\theta^*}{\sigma^2}\frac{e^{\theta^*\delta}-1}{e^{\theta^*\delta}+1}\end{pmatrix},$
(3.11)
where we first differentiate with respect to $\theta$ and then with respect to $\mu$. Equation 3.10 can be locally approximated as follows:
$D\left(\mu_1,\theta_1 \,\|\, \mu_0,\theta_0\right) \approx \frac{1}{2}\begin{pmatrix}\theta_1-\theta_0 & \mu_1-\mu_0\end{pmatrix} I\left(\theta_0,\mu_0\right)\begin{pmatrix}\theta_1-\theta_0\\ \mu_1-\mu_0\end{pmatrix}.$
(3.12)

Equation 3.12 is a quadratic approximation to equation 3.10; it corresponds to a Taylor expansion truncated after the second-order term. Equation 3.11, plugged into equation 2.4, yields Jeffreys' prior for the parametric family composed of sampled Ornstein-Uhlenbeck processes. Jeffreys' prior is shown in Figure 1 for $\delta = 10$.
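The construction of Jeffreys' prior from equations 3.11 and 2.4 can be sketched numerically. Since the Fisher matrix is diagonal and both entries depend only on $\theta$, $\sqrt{|I|}$ is constant in $\mu$, so Jeffreys' prior is uniform in $\mu$; the sketch below (function names are ours) computes the normalized $\theta$-marginal on the parameter ranges used in section 4.2:

```python
import numpy as np

def fisher_ou(theta, sigma, delta):
    """Fisher information matrix of eq. 3.11, parameter order (theta, mu)."""
    e2 = np.exp(2 * theta * delta)
    i_theta = ((e2 - 1 - 2 * theta * delta) ** 2 / (2 * theta**2 * (e2 - 1) ** 2)
               + delta**2 / (e2 - 1))
    e1 = np.exp(theta * delta)
    i_mu = (2 * theta / sigma**2) * (e1 - 1) / (e1 + 1)
    return np.array([[i_theta, 0.0], [0.0, i_mu]])

# Parameter ranges from section 4.2 and delta = 10 as in Figure 1.
sigma, delta = 1.0, 10.0
thetas = np.linspace(0.01, 3.0, 400)

# sqrt(det I) depends only on theta, so Jeffreys' prior is uniform in mu
# and its theta-marginal is proportional to sqrt(det I(theta)).
dens = np.array([np.sqrt(np.linalg.det(fisher_ou(t, sigma, delta))) for t in thetas])
dtheta = thetas[1] - thetas[0]
jeffreys_theta = dens / (dens.sum() * dtheta)  # normalized theta-marginal

print("Jeffreys' theta-marginal peaks at theta =", round(float(thetas[np.argmax(dens)]), 3))
```

The resulting density concentrates on small values of $\theta$ and decays roughly like $\theta^{-1/2}$ for large $\theta\delta$, in line with Figure 1.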

Figure 1:

Jeffreys' and uniform prior for the Ornstein-Uhlenbeck process.


### 3.3  Lower Bounds

In section 2, various lower bounds under log loss were discussed that pertain to representing a parametric family by some mixture source. Here we discuss lower bounds under MSE loss for the task of sequential prediction tailored to the sampled Ornstein-Uhlenbeck process.

Theorem 1.
Consider gaussian ARMA processes with compact parameter space $\Omega \subset \mathbb{R}^m$, $m > 0$, and $p$ autoregressive terms, $p < n$. Given any prior $w$ on $\Omega$ with corresponding random coding capacity $C_R$ and any $\varepsilon > 0$, the following lower bound is valid for all parameters $\lambda$ except in a set $B' \subset \Omega$ whose size measured by $w$ tends to zero for $n \to \infty$:
$\frac{1}{n-p} E_\lambda\left[\sum_{i=p+1}^{n}\left(X_i - h_i\left(X^{i-1}\right)\right)^2\right] \geq \sigma^2(\lambda)\left(1 + (1-\varepsilon)\frac{2C_R}{n-p}\right),$
(3.13)
with $\sigma^2(\lambda)$ the variance of the stationary Wold decomposition of the stochastic process and $\hat{x}_i = h_i\left(x^{i-1}\right)$ any measurable prediction function.
Proof.
The random coding theorem (Merhav & Feder, 1995) holds for gaussian ARMA processes. In case $P_\lambda$ and $Q$ from inequality 2.1, as well as its extension to the random coding case, are both gaussian distributions, the random coding theorem leads directly to a lower bound on the MSE loss. $P_\lambda\left(x^n\right)$ is the probability of the data sequence $x^n$ induced by the gaussian ARMA model, while $Q\left(x^n\right)$ is obtained by converting the arbitrary prediction function $\hat{x}_i = h_i\left(x^{i-1}\right)$ into a PDF:
$Q\left(x_i \mid x^{i-1}\right) = \frac{1}{\sqrt{2\pi\sigma^2(\lambda)}}\, e^{-\frac{\left(x_i - h_i\left(x^{i-1}\right)\right)^2}{2\sigma^2(\lambda)}}.$
(3.14)
The prediction begins after observing $p$ initial values. We then find that
$E_\lambda\left[\log\frac{P_\lambda\left(X^n\right)}{Q\left(X^n\right)}\right] = -\frac{1}{2}(n-p) + \frac{1}{2\sigma^2(\lambda)} E_\lambda\left[\sum_{i=p+1}^{n}\left(X_i - h_i\left(X^{i-1}\right)\right)^2\right],$
(3.15)
which, upon rearranging and insertion into the random coding theorem and division by $n-p$, yields equation 3.13.$□$
Corollary 1.
For a strictly stationary sampled Ornstein-Uhlenbeck process with sampling interval $δ>0$, the following lower bound is obtained:
$\frac{1}{n-1} E_{\mu,\theta}\left[\sum_{i=2}^{n}\left(X_{i\delta} - h_i\left(X^{(i-1)\delta}\right)\right)^2\right] \geq \frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}\left(1 + (1-\varepsilon)\frac{2C_R}{n-1}\right).$
(3.16)
Proof.

By choosing $\sigma^2(\lambda) = \frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}$ and $p=1$ according to the Ornstein-Uhlenbeck process specification, equation 3.6, the desired result is obtained.$\square$

Remark 1.

If the prior $w$ is chosen as Jeffreys' prior, then the random coding capacity $C_R$ can be replaced by $C_n$ from equation 2.2 in the case of gaussian ARMA processes.

Theorem 1 is a generalization of a well-known lower bound obtained for a uniform prior $w$ (Rissanen, 1984). The greatest lower bound results from choosing Jeffreys' prior. In the case of a uniform prior $w$, the number of distinguishable distributions is proportional to $n^{m/2}$, provided that some parameter estimators exist that converge sufficiently fast (cf. Merhav & Feder, 1995). These conditions hold for the strictly stationary sampled Ornstein-Uhlenbeck process. In that case, $C_R$ in inequality 3.13 has to be replaced by $\frac{m}{2}\log n$, with $m=2$ in our case on account of the number of free parameters in the Ornstein-Uhlenbeck parametric family. Note that if $w$ were chosen such that only one distribution could be effectively distinguished, the lower bound would be equal to $\frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}$. The same lower bound would be reached if the two free parameters $\theta$ and $\mu$ were known and did not have to be estimated first. The second term in inequality 3.13, $(1-\varepsilon)\frac{2C_R}{n-p}$, hence measures the additional complexity of having unknown free parameters.

The lower bound in inequality 3.13 for Jeffreys' prior and the lower bound for the uniform prior can be reached asymptotically. By estimating the AR coefficients $\psi_1 = e^{-\theta\delta}$ and $\psi_2 = \mu\left(1-e^{-\theta\delta}\right)$ with ordinary least squares (OLS), which for the Ornstein-Uhlenbeck process coincides with a maximum likelihood (ML) estimation of the two parameters conditioned on the first observation, and using these estimates to predict the next sample, $\hat{x}_{i\delta} = \hat{\psi}_1 x_{(i-1)\delta} + \hat{\psi}_2$, the following error is obtained: $E_{\mu,\theta}\left[\left(X_{i\delta} - \hat{X}_{i\delta}\right)^2\right] = \frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}\left(1+\frac{2}{i}\right) + O\left(i^{-\frac{3}{2}}\right)$ (Fuller & Hasza, 1980, 1981). Summing the previous expression from $i=2$ to $n$ and dividing by $n-1$, one obtains
$\frac{1}{n-1}\sum_{i=2}^{n} E_{\mu,\theta}\left[\left(X_{i\delta} - \hat{X}_{i\delta}\right)^2\right] = \frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}\left(1 + \frac{2\left(H_n-1\right)}{n-1}\right) + O\left(\frac{H_n^{(3/2)}-1}{n-1}\right),$
(3.17)
with $H_i$ the $i$th harmonic number and $H_i^{(m)}$ the $i$th generalized harmonic number of order $m$. For $n \to \infty$, $H_n$ can be approximated by $\log n$, while the second term tends to zero. Hence, the lower bound in inequality 3.13 can be reached asymptotically in the Ornstein-Uhlenbeck case, as can be seen by inspecting the asymptotic behavior of the term $\frac{C_n}{n-1}$, with $C_n$ given by equation 2.2.
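The OLS scheme behind equation 3.17 can be sketched as follows; parameter values are hypothetical, and the average sequential prediction error should lie slightly above the innovation variance $\frac{\sigma^2\left(1-e^{-2\theta\delta}\right)}{2\theta}$, approaching it as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters; sigma is fixed as in section 3.2.
mu, theta, sigma, delta = 1.0, 0.5, 1.0, 1.0
a = np.exp(-theta * delta)
innov_var = sigma**2 / (2 * theta) * (1 - a**2)

def simulate(n):
    """One realization of the strictly stationary sampled OU process, eq. 3.6."""
    x = np.empty(n)
    x[0] = rng.normal(mu, np.sqrt(sigma**2 / (2 * theta)))
    for i in range(1, n):
        x[i] = a * x[i - 1] + mu * (1 - a) + rng.normal(0.0, np.sqrt(innov_var))
    return x

def sequential_ols_mse(x):
    """Predict x[i] from OLS estimates of (psi1, psi2) fit on x[:i]."""
    errs = []
    for i in range(10, len(x)):  # a few warm-up samples stabilize the estimates
        A = np.column_stack([x[:i - 1], np.ones(i - 1)])
        psi1, psi2 = np.linalg.lstsq(A, x[1:i], rcond=None)[0]
        errs.append((x[i] - (psi1 * x[i - 1] + psi2)) ** 2)
    return np.mean(errs)

n, n_trials = 200, 200
avg_mse = np.mean([sequential_ols_mse(simulate(n)) for _ in range(n_trials)])
print(f"sequential OLS MSE: {avg_mse:.4f}, innovation variance: {innov_var:.4f}")
```

The gap between the two printed numbers reflects the $\frac{2(H_n-1)}{n-1}$ term of equation 3.17, the price of estimating the two free parameters online.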

### 3.4  Distinguishability of Processes from the Ornstein-Uhlenbeck Parametric Family

We construct explicit regions of indistinguishability for the Ornstein-Uhlenbeck parametric family. If only a finite number of samples are given, then distinct strictly stationary Ornstein-Uhlenbeck processes will not be distinguishable if their parameters $\left(\theta_0,\mu_0\right)$ and $\left(\theta_1,\mu_1\right)$ are too close to one another in a suitable sense. To make this notion more precise, we construct regions of indistinguishability around $\left(\theta_0,\mu_0\right)$ such that, given $n$ samples, the process corresponding to parameters $\left(\theta_0,\mu_0\right)$ and a process corresponding to parameters drawn from the region of indistinguishability around $\left(\theta_0,\mu_0\right)$ will not be effectively distinguishable. The analysis is based on a related investigation of distinguishability for independent and identically distributed (i.i.d.) stochastic processes (Balasubramanian, 1996). Let us therefore assume that a realization of the random vector $\left(X_\delta,\ldots,X_{n\delta}\right)^\top$ has been observed. $P_{\theta_0,\mu_0}$ corresponds to the null hypothesis, while $P_{\theta_1,\mu_1}$ is the alternative hypothesis. The observed random vector is drawn from either $P_{\theta_0,\mu_0}$ or $P_{\theta_1,\mu_1}$. Assuming that the type 1 error probability $\alpha_n$ is bounded from above by a constant $\varepsilon \in (0,1)$, $\alpha_n \leq \varepsilon$, the minimum type 2 error probability,
$\beta_n^\varepsilon = \inf_{A_n \subseteq \mathbb{R}^n,\ \alpha_n \leq \varepsilon} \beta_n,$
(3.18)
with $A_n$ an acceptance region for the null hypothesis, is given asymptotically (via a generalized Stein's lemma; Vajda, 1989) as
$\lim_{n\to\infty} -\frac{1}{n}\log\beta_n^\varepsilon = D\left(\mu_1,\theta_1 \,\|\, \mu_0,\theta_0\right).$
(3.19)
For a fixed number of samples $n$, we then find the following region of indistinguishability around $\left(\theta_0,\mu_0\right)$,
$\frac{\kappa}{n} \geq D\left(\mu_1,\theta_1 \,\|\, \mu_0,\theta_0\right) \approx \frac{1}{2}\begin{pmatrix}\theta_1-\theta_0 & \mu_1-\mu_0\end{pmatrix} I\left(\theta_0,\mu_0\right)\begin{pmatrix}\theta_1-\theta_0\\ \mu_1-\mu_0\end{pmatrix},$
(3.20)
with $\kappa = -\log\beta^* + \log(1-\varepsilon)$ and $\beta^*$ a constant between 0 and 1. For sufficiently large $n$, $\beta^*$ will be smaller than $\beta_n^\varepsilon$, showing that the type 2 error will be greater than a certain constant. Expression 3.20 shows that the regions of indistinguishability around $\left(\theta_0,\mu_0\right)$ are ellipses whose major axes depend on the local value of the Fisher information matrix. Starting from such regions of indistinguishability, a covering of the parameter space can be carried out. An illustration of such a procedure is given in Figure 2 with parameters $\beta^* = 0.95$, $\varepsilon = 0.01$, and $\delta = 0.1$ for two different sequence lengths, $n=50$ and $n=100$.
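Because the Fisher matrix of equation 3.11 is diagonal, the semi-axes of these ellipses can be written down directly. The sketch below uses the Figure 2 values of $\beta^*$, $\varepsilon$, and $\delta$, with a hypothetical center $\theta_0$ and a hypothetical $\sigma$:

```python
import numpy as np

# Semi-axes of the indistinguishability ellipse of expression 3.20 around a
# center (theta0, mu0); beta*, epsilon, and delta follow Figure 2, while
# theta0 and sigma are hypothetical illustrative choices.
beta_star, eps, delta, sigma = 0.95, 0.01, 0.1, 1.0
theta0 = 1.0
kappa = -np.log(beta_star) + np.log(1 - eps)

def fisher_diag(theta, sigma, delta):
    # Diagonal entries of eq. 3.11, parameter order (theta, mu).
    e2 = np.exp(2 * theta * delta)
    i_theta = ((e2 - 1 - 2 * theta * delta) ** 2 / (2 * theta**2 * (e2 - 1) ** 2)
               + delta**2 / (e2 - 1))
    e1 = np.exp(theta * delta)
    i_mu = (2 * theta / sigma**2) * (e1 - 1) / (e1 + 1)
    return i_theta, i_mu

axes = {}
for n in (50, 100):
    i_theta, i_mu = fisher_diag(theta0, sigma, delta)
    # (1/2) v^T I v <= kappa/n gives semi-axis sqrt(2 kappa / (n I_kk)).
    axes[n] = (np.sqrt(2 * kappa / (n * i_theta)), np.sqrt(2 * kappa / (n * i_mu)))
    print(f"n={n}: semi-axes (theta, mu) = ({axes[n][0]:.3f}, {axes[n][1]:.3f})")
```

Doubling $n$ shrinks each semi-axis by a factor of $\sqrt{2}$, which is why the covering in Figure 2 becomes finer as the sequence length grows.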
Figure 2:

Coarse illustrative partition of parameter space by regions of indistinguishability.


## 4  Empirical Results with Deep Networks

The results described in sections 2 and 3 are intrinsic properties of parametric families. We first recapitulated general results of universal coding theory and derived specific results for the Ornstein-Uhlenbeck parametric family thereafter. By an empirical analysis, we show in this section that the previous statements have repercussions for machine learning as well. The choice of the specific learning algorithm is to some extent arbitrary for this task. We have hence chosen standard RNN architectures with long short-term memory (LSTM units; Hochreiter & Schmidhuber, 1997), as these are state of the art for time series prediction.

We first describe a constructive scheme to approximate the optimal solution from section 3.3 within the hypothesis space of an RNN. The approximation methods described in section 4.1 are used to verify that the chosen RNN architecture described in section 4.2 can in principle approximate closely the optimal solution. To carry out the approximations, the results from equation 3.5 and the appendix are required as the domain of the input to the RNN needs to be known.

### 4.1  Approximating the Optimal Solution through Explicit Construction

An RNN with a single hidden layer with LSTM units is used for the sequential prediction task. In order to approximate the solution based on the OLS equations discussed in section 3.3 (cf. Fuller & Hasza, 1980, for the OLS equations), each subexpression in the OLS equations is approximated through one of the units in the recurrent layer. In order to approximate the expression $x^2+y$, for example, we first approximate $x$ and $y$ through two of the recurrent units, $x^2$ with another unit, and finally $x^2+y$ with a fourth unit. The OLS equations contain both polynomial terms of second order as well as reciprocal terms.

Three main ideas are used for the approximation of the equations with the LSTM layer. The first idea is to rescale the input to the approximately linear region of the corresponding tanh/sigmoid nonlinearity. This step requires a careful analysis of the growth behavior of the individual terms in the OLS equations. Equation 3.5 provides an upper and lower bound within finite time intervals for the strictly stationary Ornstein-Uhlenbeck process, with $C≈1$ from numerical simulations. From this, as well as a more thorough analysis of the growth behavior of terms in the OLS equations detailed in the appendix, it is possible to obtain scaling factors that ensure that the rescaled input is within the linear region for some finite time horizon. The second idea is to approximate the multiplication operation required in the OLS equations by the use of Hadamard multiplication in the LSTM update equation for the cell state. The last idea is to approximate the division operation by first approximating the inverse of the divisor and then using the multiplication approximation to multiply the dividend and the inverse of the divisor. For the approximation of the inverse, we can either train a subnetwork to approximate the operation within our range of interest or we can use a constructive approximation scheme closely based on previous work (Jones, 1990).

### 4.2  Training on Jeffreys' Prior and Uniform Prior

To elucidate the importance of sampling of the parameter space on the performance of the RNN, we train two networks with the same configuration and training conditions: one where the process parameters are sampled according to Jeffreys' prior and the other where the sampling is carried out according to a uniform prior. We choose a network with a single layer of 100 units, followed by a linear transformation to a single dimension for the prediction. This network can approximate the optimal solution closely. The network is trained with stochastic gradient descent with a learning rate of 0.001 with early stopping. The range of the parameter $μ$ for the process is $(-2,2)$, while the range for the parameter $θ$ is $(0.01,3)$. The sampling interval $δ$ is set to 10, while $n$ is arbitrarily set to 500.
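One way to realize the sampling of process parameters according to Jeffreys' prior is rejection sampling against the uniform prior; since $\sqrt{|I|}$ from equation 3.11 depends only on $\theta$, $\mu$ can simply be drawn uniformly. The following sketch is our construction (the letter does not specify the sampling mechanism) and uses the parameter ranges of this section:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sketch: draw training-set parameters from Jeffreys' prior by
# rejection sampling against the uniform prior on the ranges of section 4.2.
theta_lo, theta_hi = 0.01, 3.0
mu_lo, mu_hi = -2.0, 2.0
sigma, delta = 1.0, 10.0

def sqrt_det_fisher(theta):
    # sqrt(det I) of eq. 3.11; it depends only on theta, so mu stays uniform.
    e2 = np.exp(2 * theta * delta)
    i_theta = ((e2 - 1 - 2 * theta * delta) ** 2 / (2 * theta**2 * (e2 - 1) ** 2)
               + delta**2 / (e2 - 1))
    e1 = np.exp(theta * delta)
    i_mu = (2 * theta / sigma**2) * (e1 - 1) / (e1 + 1)
    return np.sqrt(i_theta * i_mu)

grid = np.linspace(theta_lo, theta_hi, 1000)
bound = sqrt_det_fisher(grid).max()  # rejection-sampling envelope

def sample_jeffreys(p):
    """Draw p parameter pairs (theta, mu) from Jeffreys' prior."""
    thetas = []
    while len(thetas) < p:
        t = rng.uniform(theta_lo, theta_hi)
        if rng.uniform(0.0, bound) < sqrt_det_fisher(t):
            thetas.append(t)
    mus = rng.uniform(mu_lo, mu_hi, size=p)
    return np.array(thetas), mus

thetas, mus = sample_jeffreys(50)
print("mean theta under Jeffreys' prior:", round(float(thetas.mean()), 3))
```

Each accepted pair then parameterizes one stochastic process realization of length $n$ for the training set; the uniform-prior training set replaces the rejection step by a plain uniform draw of $\theta$.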

Both of the trained models are tested on sequences drawn from the two priors: Jeffreys' and uniform. The results for the case of 50 parameters sampled during training are shown in Table 1. The results are averaged over 5 draws of parameter sampling and 10 random initializations of the network for each draw.

Table 1:
Comparing the Performance (MSE) of Models Trained on the Two Priors and Tested on the Two Priors.

| Train Prior \ Test Prior | Uniform | Jeffreys' |
| --- | --- | --- |
| Uniform | $2.91 \pm 0.4$ | $3.83 \pm 0.25$ |
| Jeffreys' | $2.94 \pm 0.32$ | $3.4 \pm 0.2$ |
| Optimal | $0.79$ | $1.11$ |

Note: “Optimal” is related to the lower bounds from section 3.3.

It is observed that with an increasing number of parameter samples drawn from the parameter space, the difference in the performance of the models trained on the two priors gets smaller. This can be seen in Figure 3, in which the performance of the models trained on stochastic process realizations drawn from the two priors (Jeffreys' and uniform) and tested on Jeffreys' prior is plotted against the number of stochastic process realizations drawn.

Figure 3:

Comparing the performance (MSE) of the models trained on two priors, tested on Jeffreys' prior, with an increasing number of sampled parameters during training.


## 5  Discussion

Classical machine learning theory investigates the learnability of relationships from i.i.d. samples drawn from a fixed but unknown probability distribution, as alluded to in section 1. For the non-i.i.d. case, guarantees in the style of statistical learning theory have been developed (cf. Kuznetsov & Mohri, 2015; McDonald, Shalizi, & Schervish, 2017, as well as references therein). Generalization is always understood to refer to the same distribution generating the training and test sets.

If multiple distributions are to be learned, it is natural to require the model to do equally well on all of them. This requirement can be directly translated into the language of universal coding theory. The number $p$ of stochastic process realizations drawn independently according to some prior $w$ on the compact parameter space, as well as the length $n$ of each stochastic process realization, are, as is intuitively clear, crucial for any required theory of generalization in the parametric family context. In classical statistical learning theory, $n$, as well as the complexity of the hypothesis space, is the main focus of investigation. For finite $n$, only finitely many stochastic processes are distinguishable. Asymptotically in $n$, for the stochastic processes considered in this letter, the capacity-inducing prior will be given by Jeffreys' prior. Since the maximum number of distinguishable models is close to $e^{C_n}$, $p$ will have to be at least equal to $e^{C_n}$. In fact, since $C_n$ is in general growing with increasing $n$, the minimum number of required stochastic process realizations $p$ will depend on $n$. The dependence of $p$ on $n$ therefore implicitly reflects the fact that the number of distinguishable distributions in a parametric family grows with increasing $n$. Since the capacity-inducing prior $w^*$ is the prior under which the maximum number of distributions in the parametric family are distinguishable, it follows that $p$ adapted to this prior is sufficient for any other prior. Finding a $p$ adapted to $w^*$ is therefore a necessary requirement if one attempts to learn the entire parametric family. The empirical counterpart of this statement for the case of MSE loss is found in Figure 3 as well as Table 1. Training on stochastic process realizations drawn from Jeffreys' prior ensures that testing on a different prior (here the uniform prior was chosen) does not lead to an increased MSE loss.
Training on the uniform prior and testing on Jeffreys' prior, however, leads to a marked increase in MSE loss.

The capacity used in the lower bound equation 2.1, as well as in the lower bound equation 3.13, is the capacity of the parametric family, not the capacity of the hypothesis space of the machine learner. Notions of capacity for the machine learner reflect the richness of the class of functions that such a learner can approximate. The capacity $Cn$ measures the richness of the parametric family.

Assume that it were known only that a set of observations could be modeled by a parametric family with $m$ free parameters, while the specific form of the parametric family was not known. In such a case, it would not be possible to obtain a $p$ such that, uniformly for all possible parametric families with $m$ free parameters, $p$ would be sufficient to guarantee that any parametric family could be fully learned (in the sense that the solution found should be close to a mixture source induced by the capacity-achieving prior). If the form of the parametric family were not known, it seems reasonable to use stochastic process realizations drawn uniformly from the space of parameters. If the capacity-inducing prior, however, were very different from the uniform prior, then most of the obtained realizations from the uniform prior would not facilitate learning the parametric family fully. The ill-adapted sampling mechanism would prohibit an optimal learning of the parametric family. The testing error in Figure 3, with testing performed by drawing stochastic process realizations from Jeffreys' prior and training carried out by using either Jeffreys' or the uniform prior, converges to the same error for increasing $p$. This behavior is expected in view of the fact that the two priors are positive everywhere within the parameter space, as can be seen in Figure 1. A more subtle analysis of this fact can be carried out by noting that the numbers of distinguishable distributions under the two priors are not too different from one another, as discussed in section 3.3 for the parametric family considered in this letter.

Equation 3.13 provides a lower bound on the sequential prediction error for the MSE loss, assuming that the form of the parametric family is known. The empirical results obtained in section 4 do not require knowledge of the specific form. By the explicit construction detailed in section 4.1, we show that a solution close to an optimal solution lies in the hypothesis space of the chosen network architecture; it is hence guaranteed that the chosen deep network is in principle well specified. The results shown in Table 1 indicate that the empirical solution found by the network does not reach the lower bounds, denoted there by "Optimal," implying that an inefficiency exists in the optimization procedure. A thorough analysis is outside the scope of this letter, however, as it would require an investigation of the loss landscape of the chosen deep network with stochastic process realizations drawn according to some prior $w$ as input, as well as of the optimization algorithm used.

Empirically, it was observed in the experiments that if one first trains the deep network with observations drawn from some prior $w_1$ until convergence and thereafter changes the prior to some $w_2$ and continues training, the previously found solution changes. This behavior is expected in view of the preceding discussion, as a changed prior induces a different optimal solution. It follows that there is a close link between optimal solutions and the sampling of parameter space.

Most of the previous statements hold for more general families of distributions, not only for parametric families. Equation 2.1, as well as the statements on the capacity-achieving prior, holds in particular in more general contexts (Merhav & Feder, 1995). The simple form of the capacity, equation 2.2, and the fact that Jeffreys' prior is asymptotically capacity achieving do not, however, carry over to the more general setting. To achieve optimality, the sampling mechanism should still be matched to $w^*$.

## Appendix

We derive some results needed for the explicit construction of the RNN used to implement the asymptotically optimal solution for the sequential prediction of the sampled strictly stationary Ornstein-Uhlenbeck process. Let us study the time integral of the strictly stationary Ornstein-Uhlenbeck process:
$Y_t = \int_0^t X_s \, ds.$
(A.1)
$Y_t$ is a gaussian process, implying that it is fully characterized by its mean and covariance functions. For the mean as a function of $t$, one obtains
$E[Y_t] = E\left[\int_0^t \left(\mu + \frac{\sigma}{\sqrt{2\theta}}\, e^{-\theta s} W_{e^{2\theta s}}\right) ds\right] = \int_0^t E\left[\mu + \frac{\sigma}{\sqrt{2\theta}}\, e^{-\theta s} W_{e^{2\theta s}}\right] ds = \mu t,$
(A.2)
with the exchange of integration and expectation order justified by Fubini's theorem, while the covariance function is given by
$\mathrm{Cov}(Y_t, Y_s) = E[Y_t Y_s] - \mu^2 ts = E\left[\int_0^s \int_0^t X_a X_b \, da \, db\right] - \mu^2 ts = \frac{\sigma^2}{2\theta^3}\left(e^{-\theta s} + e^{-\theta t} - e^{-\theta |t-s|} + 2\theta \min(s,t) - 1\right).$
(A.3)
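As a numerical cross-check (not part of the derivation), the closed form in equation A.3 can be compared against a direct Riemann sum of the stationary covariance kernel $\frac{\sigma^2}{2\theta} e^{-\theta|a-b|}$ over $[0,t] \times [0,s]$; the parameter values and grid step below are arbitrary test choices.

```python
import numpy as np

# Numerical sanity check of equation A.3, the covariance of the
# integrated OU process; theta, sigma, s, t are arbitrary test values.
theta, sigma, s, t = 1.3, 0.7, 1.0, 2.0

def cov_closed_form(t, s):
    return (sigma**2 / (2 * theta**3)) * (
        np.exp(-theta * s) + np.exp(-theta * t)
        - np.exp(-theta * abs(t - s)) + 2 * theta * min(s, t) - 1
    )

# Midpoint Riemann sum of the OU covariance kernel over [0, t] x [0, s].
h = 1e-3
a = np.arange(h / 2, t, h)  # midpoints in [0, t]
b = np.arange(h / 2, s, h)  # midpoints in [0, s]
kernel = (sigma**2 / (2 * theta)) * np.exp(-theta * np.abs(a[:, None] - b[None, :]))
numeric = kernel.sum() * h * h

print(abs(numeric - cov_closed_form(t, s)))  # small discretization error only
```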
We next analyze the time integral of the squared strictly stationary Ornstein-Uhlenbeck process:
$Z_t = \int_0^t X_s^2 \, ds.$
(A.4)
The expectation of $Zt$ is given by
$E[Z_t] = E\left[\int_0^t \left(\mu^2 + \frac{2\mu\sigma}{\sqrt{2\theta}}\, e^{-\theta s} W_{e^{2\theta s}} + \frac{\sigma^2}{2\theta}\, e^{-2\theta s} W_{e^{2\theta s}}^2\right) ds\right] = \left(\mu^2 + \frac{\sigma^2}{2\theta}\right) t,$
(A.5)
while the covariance function is
$\mathrm{Cov}(Z_t, Z_s) = E\left[\int_0^s \int_0^t X_a^2 X_b^2 \, da \, db\right] - \left(\mu^2 + \frac{\sigma^2}{2\theta}\right)^2 ts = \frac{\sigma^4}{8\theta^4}\left(e^{-2\theta s} + e^{-2\theta t} - e^{-2\theta |t-s|} + 4\theta \min(t,s) - 1\right) + \frac{2\mu^2\sigma^2}{\theta^3}\left(e^{-\theta s} + e^{-\theta t} - e^{-\theta |t-s|} + 2\theta \min(t,s) - 1\right).$
(A.6)
$Z_t$ is not a gaussian process. We study sums of the form $\sum_{i=1}^n X_{(i-1)\delta}$ with a sampling interval $\delta$ and $X_t$ a strictly stationary Ornstein-Uhlenbeck process. $\left(X_0, X_\delta, \ldots, X_{(n-1)\delta}\right)^\top$ is distributed according to a multivariate normal distribution with mean vector $\left(\mu, \mu, \ldots, \mu\right)^\top$ and covariance matrix:
$\Sigma = \frac{\sigma^2}{2\theta} \begin{pmatrix} 1 & e^{-\theta\delta} & \cdots & e^{-\theta(n-1)\delta} \\ e^{-\theta\delta} & 1 & \cdots & e^{-\theta(n-2)\delta} \\ \vdots & \vdots & \ddots & \vdots \\ e^{-\theta(n-1)\delta} & e^{-\theta(n-2)\delta} & \cdots & 1 \end{pmatrix}.$
(A.7)
Hence it follows that the sum $\sum_{i=1}^n X_{(i-1)\delta}$ is distributed according to a gaussian distribution with mean $n\mu$ and variance:
$\mathrm{var}\left(\sum_{i=1}^n X_{(i-1)\delta}\right) = \frac{\sigma^2}{2\theta}\, \frac{2 e^{-\theta(n-1)\delta} - 2 e^{\theta\delta} + n\left(e^{2\theta\delta} - 1\right)}{\left(e^{\theta\delta} - 1\right)^2}.$
(A.8)
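Equation A.8 can be checked directly: the variance of the sum equals the sum of all entries of the covariance matrix $\Sigma$ of equation A.7. A sketch with arbitrary test values:

```python
import numpy as np

# Check of equation A.8 against the entrywise sum of the covariance
# matrix Sigma from equation A.7; theta, sigma, delta, n are arbitrary.
theta, sigma, delta, n = 0.8, 1.5, 0.25, 50

i = np.arange(n)
Sigma = (sigma**2 / (2 * theta)) * np.exp(-theta * delta * np.abs(i[:, None] - i[None, :]))
var_from_matrix = Sigma.sum()  # var of the sum = sum of all covariances

e = np.exp(theta * delta)
var_closed_form = (sigma**2 / (2 * theta)) * (
    2 * np.exp(-theta * (n - 1) * delta) - 2 * e + n * (e**2 - 1)
) / (e - 1) ** 2

print(np.isclose(var_from_matrix, var_closed_form))  # True
```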
Next, sums of the form $\sum_{i=1}^n X_{(i-1)\delta}^2$ are studied. We find $E\left[\sum_{i=1}^n X_{(i-1)\delta}^2\right] = n\left(\mu^2 + \frac{\sigma^2}{2\theta}\right)$ and
$\mathrm{var}\left(\sum_{i=1}^n X_{(i-1)\delta}^2\right) = \frac{\sigma^2}{4\theta^2}\left[2\sigma^2\, \frac{2 e^{-2\theta(n-1)\delta} - 2 e^{2\theta\delta} + n\left(e^{4\theta\delta} - 1\right)}{\left(e^{2\theta\delta} - 1\right)^2} + 8\theta\mu^2\, \frac{2 e^{-\theta(n-1)\delta} - 2 e^{\theta\delta} + n\left(e^{2\theta\delta} - 1\right)}{\left(e^{\theta\delta} - 1\right)^2}\right].$
(A.9)
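The closed form in equation A.9 can likewise be cross-checked: for jointly gaussian variables with common mean $\mu$ and covariances $C_{ij}$, Isserlis' theorem gives $\mathrm{Cov}(X_i^2, X_j^2) = 2C_{ij}^2 + 4\mu^2 C_{ij}$, so the variance of the sum of squares is the double sum of that expression over the entries of $\Sigma$. Parameter values below are arbitrary test choices.

```python
import numpy as np

# Check of equation A.9: Cov(X_i^2, X_j^2) = 2 C_ij^2 + 4 mu^2 C_ij for
# jointly gaussian X_i with common mean mu (Isserlis' theorem), summed
# over all pairs (i, j); mu, theta, sigma, delta, n are arbitrary.
mu, theta, sigma, delta, n = 0.4, 0.8, 1.5, 0.25, 50

i = np.arange(n)
C = (sigma**2 / (2 * theta)) * np.exp(-theta * delta * np.abs(i[:, None] - i[None, :]))
var_from_matrix = (2 * C**2 + 4 * mu**2 * C).sum()

def geometric_part(c):
    # Double sum of exp(-c * delta * |i - j|) over i, j = 0, ..., n - 1,
    # in the same closed form as equation A.8.
    e = np.exp(c * delta)
    return (2 * np.exp(-c * (n - 1) * delta) - 2 * e + n * (e**2 - 1)) / (e - 1) ** 2

var_closed_form = (sigma**2 / (4 * theta**2)) * (
    2 * sigma**2 * geometric_part(2 * theta) + 8 * theta * mu**2 * geometric_part(theta)
)

print(np.isclose(var_from_matrix, var_closed_form))  # True
```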
Given that $\sum_{i=1}^n X_{(i-1)\delta}$ is a gaussian random variable, $\left(\sum_{i=1}^n X_{(i-1)\delta}\right)^2$ will follow a noncentral $\chi^2$ distribution. $\left(\sum_{i=1}^n X_{(i-1)\delta}\right)^2 \Big/ \left(\frac{\sigma^2}{2\theta}\, \frac{2 e^{-\theta(n-1)\delta} - 2 e^{\theta\delta} + n\left(e^{2\theta\delta} - 1\right)}{\left(e^{\theta\delta} - 1\right)^2}\right)$ is hence distributed as
$\chi^2\left(1,\; \frac{n^2\mu^2}{\frac{\sigma^2}{2\theta}\, \frac{2 e^{-\theta(n-1)\delta} - 2 e^{\theta\delta} + n\left(e^{2\theta\delta} - 1\right)}{\left(e^{\theta\delta} - 1\right)^2}}\right).$
(A.10)
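A small Monte Carlo experiment (with arbitrary test values, not from the letter's experiments) can check equation A.10 through the first moment of the noncentral $\chi^2$ distribution, $E[\chi^2(1, \lambda)] = 1 + \lambda$:

```python
import numpy as np

# Monte Carlo check of equation A.10: the square of the sum, normalized
# by its variance (equation A.8), should have mean 1 + noncentrality,
# where the noncentrality is (n * mu)^2 / var; parameters are arbitrary.
mu, theta, sigma, delta, n = 0.4, 0.8, 1.5, 0.25, 20

i = np.arange(n)
Sigma = (sigma**2 / (2 * theta)) * np.exp(-theta * delta * np.abs(i[:, None] - i[None, :]))
var_sum = Sigma.sum()  # variance of the sum, as in equation A.8
noncentrality = (n * mu) ** 2 / var_sum

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.full(n, mu), Sigma, size=200_000)
normalized_square = samples.sum(axis=1) ** 2 / var_sum

# Empirical mean vs. the noncentral chi-square mean 1 + noncentrality.
print(abs(normalized_square.mean() / (1 + noncentrality) - 1) < 0.05)  # True
```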

## Acknowledgments

This work was partially supported by the European Union's Horizon 2020 research and innovation program under grant agreement 644732.

## References

Balasubramanian, V. (1996). A geometric formulation of Occam's razor for inference of parametric distributions. arXiv:9601001.

Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon-McMillan-Breiman theorem. Annals of Probability, 13(4), 1292–1303.

Caruana, R. (1997). Multitask learning. Machine Learning, 28(1), 41–75.

Cover, T. M., & Thomas, J. A. (2012). Elements of information theory. Hoboken, NJ: Wiley.

Fuller, W. A., & Hasza, D. P. (1980). Predictors for the first-order autoregressive process. Journal of Econometrics, 13(2), 139–157.

Fuller, W. A., & Hasza, D. P. (1981). Properties of predictors for autoregressive time series. Journal of the American Statistical Association, 76(373), 155–161.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Jeffreys, H. (1998). The theory of probability. New York: Oxford University Press.

Jones, L. K. (1990). Constructive approximations for neural networks by sigmoidal functions. Proceedings of the IEEE, 78(10), 1586–1589.

Kuznetsov, V., & Mohri, M. (2015). Learning theory and algorithms for forecasting non-stationary time series. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 541–549). Red Hook, NY: Curran.

McDonald, D. J., Shalizi, C. R., & Schervish, M. (2017). Nonparametric risk bounds for time-series forecasting. Journal of Machine Learning Research, 18(32), 1–40.

Merhav, N., & Feder, M. (1995). A strong version of the redundancy-capacity theorem of universal coding. IEEE Transactions on Information Theory, 41(3), 714–722.

Merhav, N., & Feder, M. (1998). Universal prediction. IEEE Transactions on Information Theory, 44(6), 2124–2147.

Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2018). Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv:1806.00451.

Rissanen, J. (1984). Universal coding, information, prediction, and estimation. IEEE Transactions on Information Theory, 30(4), 629–636.

Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1), 40–47.

Rissanen, J. (2007). Information and complexity in statistical modeling. New York: Springer.

Vajda, I. (1989). Theory of statistical inference and information. Amsterdam: Kluwer.

Vapnik, V. (1998). Statistical learning theory. Hoboken, NJ: Wiley.

Vapnik, V. (2013). The nature of statistical learning theory. New York: Springer.