## Abstract

Surprise-based learning allows agents to rapidly adapt to nonstationary stochastic environments characterized by sudden changes. We show that exact Bayesian inference in a hierarchical model gives rise to a surprise-modulated trade-off between forgetting old observations and integrating them with the new ones. The modulation depends on a probability ratio, which we call the Bayes Factor Surprise, that tests the prior belief against the current belief. We demonstrate that in several existing approximate algorithms, the Bayes Factor Surprise modulates the rate of adaptation to new observations. We derive three novel surprise-based algorithms, one in the family of particle filters, one in the family of variational learning, and one in the family of message passing, that have constant scaling in observation sequence length and particularly simple update dynamics for any distribution in the exponential family. Empirical results show that these surprise-based algorithms estimate parameters better than alternative approximate approaches and reach levels of performance comparable to computationally more expensive algorithms. The Bayes Factor Surprise is related to but different from the Shannon Surprise. In two hypothetical experiments, we make testable predictions for physiological indicators that dissociate the Bayes Factor Surprise from the Shannon Surprise. The theoretical insight of casting various approaches as surprise-based learning, as well as the proposed online algorithms, may be applied to the analysis of animal and human behavior and to reinforcement learning in nonstationary environments.

## 1 Introduction

Animals, humans, and similarly reinforcement learning agents may safely assume that the world is stochastic and stationary during some intervals of time interrupted by change points. The position of leaves on a tree, a stock market index, or the time it takes to travel from A to B in a crowded city is often well captured by stationary stochastic processes for extended periods of time. Then sudden changes may happen, such that the distribution of leaf positions becomes different due to a storm, the stock market index is affected by the enforcement of a new law, or a blocked road causes additional traffic jams. The violation of an agent's expectation caused by such sudden changes is perceived by the agent as surprise, which can be seen as a measure of how much the agent's current belief differs from reality.

Surprise, with its physiological manifestations in pupil dilation (Nassar et al., 2012; Preuschoff, t Hart, & Einhauser, 2011) and EEG signals (Mars et al., 2008; Modirshanechi, Kiani, & Aghajan, 2019; Ostwald et al., 2012), is believed to modulate learning, potentially through the release of specific neurotransmitters (Gerstner, Lehmann, Liakoni, Corneil, & Brea, 2018; Yu & Dayan, 2005) so as to allow animals and humans to adapt quickly to sudden changes. The quick adaptation to novel situations has been demonstrated in a variety of learning experiments (Behrens, Woolrich, Walton, & Rushworth, 2007; Glaze, Kable, & Gold, 2015; Heilbron & Meyniel, 2019; Nassar et al., 2012; Nassar, Wilson, Heasly, & Gold, 2010; Yu & Dayan, 2005). The bulk of computational work on surprise-based learning can be separated into two groups. Studies in the field of computational neuroscience have focused on biological plausibility with little emphasis on the accuracy of learning (Behrens et al., 2007; Bogacz, 2017; Faraji, Preuschoff, & Gerstner, 2018; Friston, 2010; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017; Nassar et al., 2012; Nassar, Wilson, Heasly, & Gold, 2010; Ryali, Reddy, & Yu, 2018; Schwartenbeck, FitzGerald, Dolan, & Friston, 2013; Yu & Dayan, 2005), whereas exact and approximate Bayesian online methods (Adams & MacKay, 2007; Fearnhead & Liu, 2007) for change point detection and parameter estimation have been developed without any focus on biological plausibility (Aminikhanghahi & Cook, 2017; Cummings, Krehbiel, Mei, Tuo, & Zhang, 2018; Lin, Sharpnack, Rinaldo, & Tibshirani, 2017; Masegosa et al., 2017; Wilson, Nassar, & Gold, 2010).

In this work, we take a top-down approach to surprise-based learning. We start with a generative model of change points similar to the one that has been the starting point of multiple experiments (Behrens et al., 2007; Findling, Chopin, & Koechlin, 2019; Glaze et al., 2015; Heilbron & Meyniel, 2019; Nassar et al., 2012, 2010; Yu & Dayan, 2005). We demonstrate that Bayesian inference on such a generative model can be interpreted as modulation of learning by surprise; we show that this modulation leads to a natural definition of surprise that is different from but closely related to the Shannon Surprise (Shannon, 1948). Moreover, we derive three novel approximate online algorithms with update rules that inherit the surprise-modulated adaptation rate of exact Bayesian inference. The overall goal of this article is to give a Bayesian interpretation for surprise-based learning in the brain and to find approximate methods that are computationally efficient and biologically plausible while maintaining a high level of learning accuracy. As a by-product, our approach provides theoretical insights on commonalities and differences among existing surprise-based and approximate Bayesian approaches. Importantly, our approach makes specific experimental predictions.

In section 2, we introduce the generative model and we present our surprise-based interpretation of Bayesian inference and our three approximate algorithms. Next, we use simulations to compare our algorithms with existing ones on two different tasks inspired by and closely related to real experiments (Behrens et al., 2007; Mars et al., 2008; Nassar et al., 2012, 2010; Ostwald et al., 2012). At the end of section 2, we formalize two experimentally testable predictions of our theory and illustrate them with simulations. A brief review of related studies as well as a few directions for further work are supplied in section 3.

## 2 Results

Given a sequence of observations $y1:t$, the agent's belief $\pi (t)(\theta )$ about the parameter $\Theta t$ at time $t$ is defined as the posterior probability distribution $P(\Theta t=\theta |y1:t)$. In the online learning setting studied here, the agent's goal is to update the belief $\pi (t)(\theta )$ to the new belief $\pi (t+1)(\theta )$, or an approximation thereof, upon observing $yt+1$.

A simplified real-world example of such an environment is illustrated in Figure 1B. Imagine that every day, a friend meets you at the coffee shop, starting after work from her office (see Figure 1B, left). To do so, she needs to cross a river via a bridge. The time of arrival of your friend ($yt$) exhibits some variability due to various sources of stochasticity (e.g., traffic and your friend's daily workload), but it has a stable average over time ($\theta t$). However, if a new bridge is opened, your friend arrives earlier, since she no longer has to take a detour (see Figure 1B, right). The moment of opening the new bridge is indicated by $ct+1=1$ in our framework and the sudden change in the average arrival time of your friend by a sudden change from $\theta t$ to $\theta t+1$. Even without any explicit discussion with your friend about this situation and only by observing her actual arrival time, you can notice the abrupt change and hence adapt your schedule to the new situation.

### 2.1 Online Bayesian Inference Modulated by Surprise

Note that if $\pi $ is the exact Bayesian belief defined as above in equation 2.5, then $P(y;\pi (t'))=P(Yt'+1=y|y1:t',ct'+1=0)$. In section 2.2 we will also use $P(y;\pi ^(t'))$ for an arbitrary $\pi ^(t')$. Two particularly interesting cases of equation 2.6 are $P(yt+1;\pi (t))$, the probability of a new observation $yt+1$ with the current belief $\pi (t)$, and $P(yt+1;\pi (0))$, the probability of a new observation $yt+1$ with the prior belief $\pi (0)$.

This definition of surprise measures how much more probable the current observation is under the naive prior $\pi (0)$ relative to the current belief $\pi (t)$ (see section 3 for further interpretation). This probability ratio is the Bayes factor (Efron & Hastie, 2016; Kass & Raftery, 1995) that tests the prior belief $\pi (0)$ against the current belief $\pi (t)$. We emphasize that our definition of surprise is not arbitrary, but essential in order to write the exact inference in equation 2.5 on the generative model in the compact recursive form indicated in the proposition that follows. Moreover, as we show later, this term can be identified in multiple learning algorithms (among them, Nassar et al., 2010, 2012), but it has never been interpreted as a surprise measure. In the following sections we establish the generality of this computational mechanism and identify it as a common feature of many learning algorithms.

$\pi B(t+1)(\theta )$ describes the incorporation of the new information into the current belief via Bayesian updating.

Using the above definitions and equation 2.5, we have for the generative model of Figure 1A and equations 2.1 to 2.4 the following proposition:

As a conclusion, our first main result is that a split as in equation 2.12 with a weighting factor (“adaptation rate” $\gamma $) as in equation 2.9 is exact and always possible for the class of environments defined by our hierarchical generative model. This surprise modulation gives rise to specific testable experimental predictions discussed later.

### 2.2 Approximate Algorithms Modulated by Surprise

Despite the simplicity of the recursive formula in equation 2.10, the updated belief $\pi (t+1)$ is generally not in the same family of distributions as the previous belief $\pi (t)$, for example, the result of averaging two normal distributions is not a normal distribution. Hence it is in general impossible to find a simple and exact update rule for, say, some sufficient statistic. As a consequence, the memory demands for $\pi (t+1)$ scale linearly in time, and updating $\pi (t+1)$ using $\pi (t)$ needs $O(t)$ operations. In the following sections, we investigate three approximations, algorithms 1 to 3, that have simple update rules and finite memory demands, so that the updated belief remains tractable over a long sequence of observations.

As our second main result, we show that all our three approximate algorithms inherit the surprise-modulated adaptation rate from the exact Bayesian approach, equations 2.9 and 2.12. The first algorithm adapts an earlier algorithm of surprise minimization learning (SMiLe, Faraji et al., 2018) to variational learning. We refer to our novel algorithm as Variational SMiLe and abbreviate it as VarSMiLe (see algorithm 1). The second algorithm is based on message passing (Adams & MacKay, 2007) restricted to a finite number of messages $N$. We refer to this algorithm as MP$N$ (see algorithm 2). The third algorithm uses the ideas of particle filtering (Gordon, Salmond, & Smith, 1993) for an efficient approximation for our hierarchical generative model. We refer to our approximate algorithm as Particle Filtering with $N$ particles and abbreviate it by pf$N$ (see algorithm 3). All algorithms are computationally efficient, have finite memory demands, and are biologically plausible; Particle Filtering has possible neuronal implementations (Huang & Rao, 2014; Kutschireiter, Surace, Sprekeler, & Pfister, 2017; Legenstein & Maass, 2014; Shi & Griffiths, 2009), MP$N$ can be seen as a greedy version of pf$N$ without sampling, and Variational SMiLe may be implemented by neo-Hebbian (Gerstner et al., 2018; Lisman, Grace, & Duzel, 2011) update rules. Simulation results show that the performance of the three approximate algorithms is comparable to and more robust across environments than other state-of-the-art approximations.

#### 2.2.1 Variational SMiLe Rule: Algorithm 1

where $\gamma t+1=\gamma SBF(yt+1;\pi ^(t)),m$ is given by equation 2.9 with a free parameter $m>0$, which can be tuned to each environment. By doing so, we still have the explicit trade-off between two terms as in equation 2.10, but in the logarithms, yet an advantageous consequence of averaging over logarithms is that if the likelihood function $PY$ is in the exponential family and the initial belief $\pi (0)$ is its conjugate prior, then $\pi ^(t+1)$ and $\pi (0)$ are members of the same family. In this particular case, we arrive at a simple update rule for the parameters of $\pi ^(t+1)$ (see algorithm 1 for pseudocode and section 4 for details). As it is common in variational approaches (Beal, 2003), the price of this simplicity is that except for the trivial cases of $pc=0$ and $pc=1$, there is no evidence other than simulations that the update rule of equation 2.13 will end up at an approximate belief close to the exact Bayesian belief.

where the bound $Bt+1\u22080,DKL[\pi ^B(t+1)(\theta )||P(\theta |yt+1)]$ is a decreasing function of the Bayes Factor Surprise $SBF(yt+1;\pi ^(t))$ (see section 4 for proof), and $P(\theta |yt+1)$ is given by equation 2.11.

Because of the similarity of the constraint optimization problem in equations 2.14 and 2.15 to the Surprise Minimization Learning rule, (SMiLe) (Faraji et al., 2018), we call this algorithm the Variational Surprise Minimization Learning rule, or for short, the Variational SMiLe rule. The differences between SMiLe and variational SMiLe are discussed in section 4.

Our variational method, and particularly its surprise-modulated adaptation rate, is complementary to earlier studies (Masegosa et al., 2017; Özkan, Šmídl, Saha, Lundquist, & Gustafsson, 2013) in machine learning that assumed different generative models and used additional assumptions and different approaches for deriving the learning rule.

#### 2.2.2 Message-Passing $N$: Algorithm 2

For a hierarchical generative model similar to ours, a message-passing algorithm has been used to perform exact Bayesian inference (Adams & MacKay, 2007), where the algorithm's memory demands and computational complexity scale linearly in time $t$. In this section, we first explain the idea of the message-passing algorithm of Adams and MacKay (2007) and its relation to our proposition. We then present our approximate version of this algorithm, which has a constant (in time) computational complexity and memory demands.

The computational complexity and memory requirements of the complete message-passing algorithm increase linearly with time $t$. To deal with this issue and to have a constant computation and memory demands over time, we implemented a message-passing algorithm of the form of equations 2.17 to 2.19, but with a fixed number $N$ of particles, chosen as those with the highest weights $wt(k)$. Therefore, our second algorithm adds a new approximation step to the full message-passing algorithm of Adams and MacKay (2007): Whenever $t>N$, after adding the new particle with the weight as in equation 2.18 and updating the previous weights as in equation 2.19, we discard the particle with the smallest weight (i.e., set its weight equal to 0) and renormalize the weights. By doing so, we always keep the number of particles with nonzero weights equal to $N$. Note that for $t\u2264N$, our algorithm is exact and identical to the message-passing algorithm of Adams and MacKay (2007). We call our modification of the message-passing algorithm of Adams and MacKay (2007) “Message Passing $N$ (MP$N$).”

To deal with the computational complexity and memory requirements, one may alternatively keep only the particles with weights greater than a cut-off threshold (Adams & MacKay, 2007). However, such a constant cut-off leads to a varying number (smaller or equal to $t$) of particles in time. Our approximation MP$N$ can therefore be seen as a variation of the thresholding algorithm in Adams and MacKay (2007) with fixed number of particles $N$, and hence a variable cut-off threshold. The work of Fearnhead and Liu (2007) follows the same principle as Adams and MacKay (2007), but employs stratified resampling to eliminate particles with negligible weights in order to reduce the total number of particles. Their resampling algorithm involves solving a complicated nonlinear equation at each time step, which makes it unsuitable for a biological implementation. In addition, we experienced that in some cases, the small errors introduced in the resampling step of the algorithm of Fearnhead and Liu (2007) accumulated and led to worse performance than our MP$N$ algorithm, which simply keeps the $N$ particles with the highest weight at each time step.

For the case where the likelihood function $PY(y|\theta )$ is in the exponential family and $\pi (0)$ is its conjugate prior, the resulting algorithm of MP$N$ has a simple update rule for the belief parameters (see algorithm 2 and section 4 for details). For comparison, we also implemented in our simulations the full message-passing algorithm of Adams and MacKay (2007) with an almost zero cut-off (machine precision), which we consider as our benchmark “Exact Bayes,” as well as the stratified optimal resampling algorithm of Fearnhead and Liu, which we call ``SOR*N*.''

#### 2.2.3 Particle Filtering: Algorithm 3

Interestingly, the above formulas entail the same surprise modulation and the same trade-off as proposed by the proposition's equation 2.10. For the weight update, there is a trade-off between an exact Bayesian update and keeping the value of the previous time step, controlled by a adaptation rate modulated exactly in the same way as in equation 2.10. Note that in contrast to equation 2.10, the trade-off for the particles' weights is not between forgetting and integrating, but between maintaining the previous knowledge and integrating. However, the change probability (see equation 2.23) for sampling is equal to the adaptation rate and is an increasing function of surprise. As a result, although the weights are updated less for surprising events, a higher surprise causes a higher probability for change, indicated by $ct+1(i)=1$, which implies forgetting, because for a particle $i$ with $ct+1(i)=1$, the associated belief $\pi ^i(t+1)=P(\Theta t+1=\theta |ct+1(i)=1,c1:t(i),y1:t+1)$ is equal to $P(\Theta t+1=\theta |ct+1(i)=1,yt+1)=P(\theta |yt+1)$ (see Figure 1A), which is equivalent to a reset of the belief as in equation 2.12. In other words, while in MP$N$ and the exact Bayesian inference in the proposition's equation 2.10, the trade-off between integration and reset is accomplished by adding at each time step a new particle with weight $\gamma t+1$, in Particle Filtering, it is accomplished via sampling. As a conclusion, the above formulas are essentially the same as the update rules of MP$N$ (see equations 2.19 and 2.18) and have the same spirit as the recursive update of the proposition's equation 2.10.

Equations 2.21 and 2.23 can be applied to the case where the likelihood function $PY(y|\theta )$ is in the exponential family and $\pi (0)$ is its conjugate prior. The resulting algorithm (algorithm 3) has a particularly simple update rule for the belief parameters (see section 4 for details).

The theory of particle filter methods is well established (Doucet et al., 2000; Gordon et al., 1993; Särkkä, 2013). Particle filters in simpler (Brown & Steyvers, 2009) or more complex (Findling et al., 2019) forms have also been employed to explain human behavior. Here we derived a simple particle filter for the general case of generative models of equations 2.2 to 2.4. Our main contribution is to show that the use of the optimal proposal distribution in this particle filter leads to a surprise-based update scheme.

#### 2.2.4 Surprise-Modulation as a Framework for Other Algorithms

Other existing algorithms (Adams & MacKay, 2007; Faraji et al., 2018; Fearnhead & Liu, 2007; Nassar et al., 2012, 2010) can also be formulated in the surprise-modulation framework of equations 2.9 and 2.12 (see section 4). Moreover, in order to allow for a transparent discussion and for fair comparisons in simulations, we extended the algorithms of Nassar et al. (2012, 2010) to a more general setting. Here we give a brief summary of the algorithms we considered. A detailed analysis is provided in section 4.5.

The algorithms of Nassar et al. (2012, 2010) were originally designed for a gaussian estimation task (see section 2.3 for details of the task) with a broad uniform prior. We extended them to the more general case of gaussian tasks with gaussian priors, and we call our extended versions Nas10$*$ and Nas12$*$ for Nassar et al. (2010) and Nassar et al. (2012), respectively (for a performance comparison between our extended algorithms and their original versions see Figures 12 and 13). Both algorithms have the same surprise-modulation as in our proposition (see equation 2.10). There are multiple interpretations of the approaches of Nas10$*$ and Nas12$*$ and links to other algorithms. One such link we identify is in relation to Particle Filtering with a single particle (pf1). More specifically, one can show that pf1 behaves in expectation similar to Nas10$*$ and Nas12$*$ (see section 4 and the appendix).

### 2.3 Simulations

With the goal of gaining a better understanding of different approximate algorithms, we evaluated the departure of their performance from the exact Bayesian algorithm in terms of mean squared error (MSE) of estimating $\Theta t$ (see section 4), on two tasks inspired by and closely related to real experiments (Behrens et al., 2007; Maheu, Dehaene, & Meyniel, 2019; Mars et al., 2008; Nassar et al., 2012, 2010; Ostwald et al., 2012): a gaussian and a categorical estimation task.

We compared our three novel algorithms VarSMiLe, Particle Filtering (pf$N$, where $N$ is the number of particles), and Message Passing with finite number of particles $N$ (MP$N$) to the online exact Bayesian Message Passing algorithm (Adams & MacKay, 2007) (Exact Bayes), which yields the optimal solution with $\Theta ^t=\Theta ^tOpt$. Furthermore, we included in the comparison the stratified optimal resampling algorithm (Fearnhead & Liu, 2007) (SOR$N$, where $N$ is the number of particles), our variant of Nassar et al. (2010) (Nas10$*$) and Nassar et al. (2012) (Nas12$*$), the Surprise-Minimization Learning algorithm of Faraji et al. (2018) (SMiLe), as well as a simple Leaky Integrator (Leaky; see section 4).

The algorithms Exact Bayes and SOR*N* come from the field of change point detection, and whereas the former has high memory demands, the latter has the same memory demands as our algorithms pf*N* and MP*N*. The algorithms Nas10$*$, Nas12$*$, and SMiLe, on the other hand, come from the human learning literature and are more biologically oriented.

#### 2.3.1 Gaussian Estimation Task

The task is a generalized version of the experiment of Nassar et al. (2012, 2010). The goal of the agent is to estimate the mean $\theta t=\mu t$ of observed samples, which are drawn from a gaussian distribution with known variance $\sigma 2$: $yt+1|\mu t+1\u223cN(\mu t+1,\sigma 2)$. The mean $\mu t+1$ is itself drawn from a gaussian distribution $\mu t+1\u223cN(0,1)$ whenever the environment changes. In other words, the task is a special case of the generative model of equations 2.2 to 2.4, with $\pi (0)(\mu t)=N(\mu t;0,1)$ and $PY(yt|\mu t)=N(yt;\mu t,\sigma 2)$. An example of the task is shown in Figure 2A.

We simulated the task for all combinations of $\sigma \u2208{0.1,0.5,1,2,5}$ and $pc\u2208{0.1,0.05,0.01,0.005,0.001,0.0001}$. For each combination of $\sigma $ and $pc$, we first tuned the free parameter of each algorithm—$m$ for SMiLe and Variational SMiLe, the leak parameter for the Leaky Integrator, and the $pc$ of Nas10$*$ and Nas12$*$—by minimizing the MSE on three random initializations of the task. For the Particle Filter (pf$N$), the Exact Bayes, the MP$N$, and the SOR$N$, we empirically checked that the true $pc$ of the environment was indeed the value that gave the best performance, and we used this value for the simulations. We evaluated the performance of the algorithms on 10 different random task instances of $105$ steps each for $pc\u2208{0.1,0.05,0.01,0.005}$ and $106$ steps each for $pc\u2208{0.001,0.0001}$ (in order to sample more change points). Note that the parameter $\sigma $ is not estimated, and its actual value is used by all algorithms except the Leaky Integrator.

In Figure 2B we show the $MSE[\Theta ^t|Rt=n]$ in estimating the parameter after $n$ steps since the last change point, for each algorithm, computed over multiple changes, for two exemplar task settings. The Particle Filter with 20 particles (pf20), the VarSMiLe, and the Nas12$*$ have an overall performance very close to that of the Exact Bayes algorithm ($MSE[\Theta ^tOpt|Rt=n]$), with much lower memory requirements. VarSMiLe sometimes slightly outperforms the other two early after an environmental change (see Figure 2B, right), but shows slightly higher error values at later phases. The MP$N$ algorithm is the closest one to the optimal solution (i.e., Exact Bayes) for low $\sigma $ (see Figure 2B, left), but its performance is much worse for the case of high $\sigma $ and low $pc$ (see Figure 2B, right). For SOR$N$ we observe a counter intuitive behavior in the regime of low $\sigma $: the inclusion of more particles leads to worse performance (see Figure 2B, left). At higher $\sigma $ levels, the performance of SOR20 is close to optimal and better than the MP20 in later time steps. This may be due to the fact that the MP$N$ discards particles in a deterministic and greedy way (i.e., the one with the lowest weight), whereas for the SOR$N$, there is a component of randomness in the process of particle elimination, which may be important for environments with higher stochasticity.

For the Leaky Integrator, we observe a trade-off between good performance in the transient phase and the stationary phase; a fixed leak value cannot fulfill both requirements. The SMiLe rule, by construction, never narrows its belief $\pi ^(\theta )$ below some minimal value, which allows it to have a low error immediately after a change but leads later to high errors. Its performance deteriorates for higher $\sigma $ (see Figure 2B, right). The Nas10$*$ performs well for low but not for higher values of $\sigma $. Despite the fact that a Particle Filter with one particle (pf1) is in expectation similar to Nas10$*$ and Nas12$*$ (see section 4), it performs worse than these two algorithms on trial-by-trial measures. Still, it performs better than the MP1 and identically to the SOR1.

In Figure 3A, we have plotted the average of $MSE[\Theta ^tOpt]$ of the Exact Bayes algorithm over the whole simulation time for each of the considered $\sigma $ and $pc$ levels. The difference between the other algorithms and this benchmark is called $\Delta MSE[\Theta ^t]$ (see section 4) and is plotted in Figures 3C to 3F. All algorithms except for the SOR20 have lower average error values for low $\sigma $ and low $pc$ than high $\sigma $ and high $pc$. The Particle Filter pf20 and the Message Passing MP20 have the smallest difference from the optimal solution. The average error of MP20 is higher than that of pf20 for high $\sigma $ and low $pc$, whereas pf20 is more robust across levels of environmental parameters. The worst-case performance for pf20 is $\Delta MSE[\Theta ^t]=0.033$ for $\sigma =5$ and $pc=0.0001$, and for SOR20, it is $\Delta MSE[\Theta ^t]=0.061$ for $\sigma =0.1$ and $pc=0.1$. The difference between these two worst-case scenarios is significant ($p-value=2.79\xd710-6$, two-sample $t$-test, 10 random seeds for each algorithm). Next in performance are the algorithms, Nas12$*$ and VarSMiLe. VarSMiLe exhibits its largest deviation from the optimal solution for high $\sigma $ and low $pc$, but is still more resilient compared to the MP$N$ algorithms for this type of environment. Among the algorithms with only one unit of memory demands—pf1, MP1, SOR1, VarSMiLe, SMiLe, Leaky, Nas10$*$ and Nas12$*$—the winners are VarSMiLe and Nas12$*$. The SOR20 has low error overall but unexpectedly high error for environmental settings that are presumably more relevant for biological agents (intervals of low stochasticity marked by abrupt changes). The simple Leaky Integrator performs well at low $\sigma $ and $pc$ but deviates more from the optimal solution as these parameters increase (see Figure 3F). The SMiLe rule performs best at lower $\sigma $, that is, more deterministic environments.

A summary graph, where we collect the $\Delta MSE[\Theta ^t]$ across all levels of $\sigma $ and $pc$, is shown in Figure 4. We can see that pf20, Nas12$*$, and VarSMiLe give the lowest worst case (lowest maximum value) $\Delta MSE[\Theta ^t]$ and are statistically better than the other eight algorithms (the error bars indicate the standard error of the mean across the ten random task instances).

#### 2.3.2 Categorical Estimation Task

The task is inspired by the experiments of Behrens et al. (2007), Maheu et al. (2019), Mars et al. (2008), and Ostwald et al. (2012). The goal of the agent is to estimate the occurrence probability of five possible states. Each observation $yt+1\u2208{1,\u2026,5}$ is drawn from a categorical distribution with parameters $\theta t+1=pt+1$, that is, $yt+1|pt+1\u223cCat(yt+1;pt+1)$. When there is a change $Ct+1=1$ in the environment, the parameters $pt+1$ are drawn from a Dirichlet distribution $Dir(s\xb71)$, where $s\u2208(0,\u221e)$ is the stochasticity parameter. In relation to the generative model of equations 2.2 to 2.4, we thus have $\pi (0)(pt)=Dir(pt;s\xb71)$ and $PY(yt|pt)=Cat(yt;pt)$. An illustration of this task is depicted in Figure 5A.

We considered the combinations of stochasticity levels $s\u2208{0.01$, 0.1, 0.14, 0.25, 1, 2, 5$}$ and change probability levels $pc\u2208{0.1$, 0.05, 0.01, 0.005, 0.001, 0.0001$}$. The algorithms of Nassar et al. (2012, 2010) were specifically developed for a gaussian estimation task and cannot be applied here. All other algorithms were first optimized for each combination of environmental parameters before an experiment starts, and then evaluated on 10 different random task instances, for $105$ steps each for $pc\u2208{0.1$, 0.05, 0.01, 0.005$}$ and $106$ steps each for $pc\u2208{0.001$, 0.0001$}$. The parameter $s$ is not estimated, and its actual value is used by all algorithms except the Leaky Integrator.

#### 2.3.3 Summary of Simulation Results

In summary, our simulation results of the two tasks collectively suggest that our Particle Filtering (pf$N$) and Message Passing (MP$N$) algorithms achieve a high level of performance, very close to the one of biologically less plausible algorithms with higher (Exact Bayes) and same (SOR$N$) memory demands. Moreover, their behavior is more consistent across tasks. Finally, among the algorithms with memory demands of one unit, VarSMiLe performs best.

#### 2.3.4 Robustness against Suboptimal Parameter Choice

In all algorithms we considered, the environment's hyperparameters are assumed to be known. We can distinguish between two types of hyperparameters in our generative model: the parameters of the likelihood function (e.g., $\sigma $ in the gaussian task), and the $pc$ and the parameters of the conjugate prior (e.g., $s$ in the categorical task). Hyperparameters of the first type can be added to the parameter vector $\theta $ and be inferred with the same algorithm. However, learning the second type of hyperparameters is not straightforward. By assuming that these hyperparameters are learned more slowly than $\theta $, one can fine-tune them after each $n$ (e.g., 10) change points, while change points can be detected by looking at the particles for the Particle Filter and at the peaks of surprise values for VarSMiLe. Other approaches to hyperparameter estimation can be found in Doucet and Tadić (2003), George and Doss (2017), Liu and West (2001), and Wilson et al. (2010).

When the hyperparameters are fixed, a mismatch between the assumed values and the true values is a possible source of errors. In this section, we investigate the robustness of the algorithms to a mismatch between the assumed and the actual probability of change points. To do so, we first tuned each algorithm's parameter for an environment with a change probability $pc$ and then tested the algorithms in environments with different change probabilities while keeping the parameter fixed. For each new environment with a different change probability, we calculated the difference between the MSE of these fixed parameters and the optimal MSE, that is, the resulting MSE for the case that the Exact Bayes' parameter is tuned for the actual $pc$.

More precisely, if we denote as $MSE[\Theta ^t;pc',pc]$ the MSE of an algorithm with parameters tuned for an environment with $pc'$, applied in an environment with $pc$, we calculated the mean regret, defined as $MSE[\Theta ^t;pc',pc]-MSE[\Theta ^tOpt,pc]$, over time; note that the second term is equal to $MSE[\Theta ^t;pc,pc]$ when the algorithm Exact Bayes is used for estimation. The lower the values and the flatter the curve of the mean regret, the better the performance and the robustness of the algorithm in the face of lacking knowledge of the environment. The slope of the curve indicates the degree of deviations of the performance as we move away from the optimally tuned parameter. We ran three random (and same for all algorithms) tasks initializations for each $pc$ level.

In Figure 8 we plot the mean regret for each algorithm for the gaussian task for four pairs of $s$ and $pc'$ levels. For $\sigma =0.1$ and $pc'=0.04$ (see Figure 8A), the Exact Bayes and the MP20 show the highest robustness (smallest regret) and are closely followed by the pf20, VarSMiLe, and Nas12$*$ (note the regret's small range of values). The lower the actual $pc$, the higher the regret, but the changes are still very small. The curves for the SMiLe and the Leaky Integrator are also relatively flat, but the mean regret is much higher. The SOR20 is the least robust algorithm.

Similar observations can be made for $\sigma =0.1$ and $pc'=0.004$ (see Figure 8B). In this case, the performance of all algorithms deteriorates strongly when the actual $pc$ is higher than the assumed one.

However, for $\sigma =5$ (see Figures 8C and 8D), the ranking of algorithms changes. The SOR20 is very robust for this level of stochasticity. The pf20 and MP20 perform similarly for $pc=0.04$, but for lower $pc'$, the pf20 is more robust and the MP20 exhibits high fluctuations in its performance. The Nas12$*$ is quite robust at this $\sigma $ level. Overall for Exact Bayes, SOR20, pf20, VarSMiLe, and Nas12$*$, a mismatch of the assumed $pc$ from the actual one does not deteriorate the performance dramatically for $\sigma =5$, $pc'=0.004$ (see Figure 8D). The SMiLe and the Leaky Integrator outperform the other algorithms for higher $pc'$ if $pc<pc'$ (see Figure 8C). A potential reason is that the optimal behavior for the Leaky Integrator (according to the tuned parameters) is to constantly integrate new observations into its belief (i.e., to act like a Perfect Integrator) regardless of the $pc'$ level. This feature makes it blind to the $pc$ and therefore very robust against the lack of knowledge of it (see Figure 8C).

### 2.4 Experimental Prediction

It has been experimentally shown that some important behavioral and physiological indicators statistically correlate with a measure of surprise or a prediction error. Examples of such indicators are the pupil diameter (Joshi & Gold, 2020; Nassar et al., 2012; Preuschoff et al., 2011), the amplitude of the P300, N400, and MMN components of EEG (Kopp & Lange, 2013; Lieder, Daunizeau, Garrido, Friston, & Stephan, 2013; Mars et al., 2008; Meyniel, Maheu, & Dehaene, 2016; Modirshanechi et al., 2019; Musiolek, Blankenburg, Ostwald, & Rabovsky, 2019; Ostwald et al., 2012), the amplitude of MEG in specific time windows (Maheu et al., 2019), BOLD responses in fMRI (Konovalov & Krajbich, 2018; Loued-Khenissi, Pfeuffer, Einhäuser, & Preuschoff, 2020), and reaction time (Huettel, Mack, & McCarthy, 2002; Meyniel et al., 2016). The surprise measure is usually the negative log probability of the observation, known as Shannon Surprise (Shannon, 1948), and denoted here as $SSh$. However, as we show in this section, as long as there is an uninformative prior over observations, Shannon Surprise $SSh$ is just an invertible function of our modulated adaptation rate $\gamma $ and hence an invertible function of the Bayes Factor Surprise $SBF$. Thus, based on the results of previous works (Meyniel et al., 2016; Modirshanechi et al., 2019; Nassar et al., 2012, 2010; Ostwald et al., 2012), which always used uninformative priors, one cannot determine whether the physiological and behavioral indicators we have noted correlate with $SSh$ or $SBF$.

In this section, we first investigate the theoretical differences between the Bayes Factor Surprise $SBF$ and Shannon Surprise $SSh$. Then, based on their observed differences, we formulate two experimentally testable predictions, with a detailed experimental protocol. Our predictions make it possible to discriminate between the two measures of surprise and determine whether physiological or behavioral measurements are signatures of $SBF$ or of $SSh$.

#### 2.4.1 Theoretical Difference between $SBF$ and $SSh$

*difference*between the Shannon Surprise of this observation under the current and the prior beliefs. Note that if the prior is uninformative then the second term in equation 2.25 is a constant with respect to observations, and $\gamma t+1$ is an invertible function of $SSh$, thus indistinguishable from $SSh$. We next exploit differences between $SBF$ and $SSh$ to formulate our experimentally testable predictions.

#### 2.4.2 Experimental Protocol

Consider the variant of the gaussian task of Nassar et al. (2012, 2010), which we used in our simulations: $PY(y|\theta )=N(y;\theta ,\sigma 2)$ and $\pi (0)(\theta )=N(\theta ;0,1)$. Human subjects are asked to predict the next observation $yt+1$ given what they have observed so far, that is, $y1:t$. The experimental procedure is as follows:

Fix the hyperparameters $\sigma 2$ and $pc$.

At each time $t$, show the observation $yt$ (produced in the aforementioned way) to the subject, and measure a physiological or behavioral indicator $Mt$, for example, pupil diameter (Nassar et al., 2012, 2010).

At each time $t$, after observing $yt$, ask the subject to predict the next observation $y^t+1$ and his or her confidence $Ct$ about that prediction.

#### 2.4.3 Prediction 1

Given an absolute prediction value $y^>0$, an absolute prediction error $\delta >0$, a confidence value $C>0$, and a sign bias $s\u2208{-1,1}$, we can compute the average of $Mt$ over time for the time points with $|y^t|\u2248y^$, $|\delta t|\u2248\delta $, $Ct\u2248C$, and $st=s$, which we denote as $M\xaf1(y^,\delta ,s,C)$; the index 1 stands for experimental prediction 1. The approximation notation $\u2248$ is used for continuous variables instead of equality due to practical limitations (i.e., for obtaining adequate number of samples for averaging). Note that for our theoretical proofs, we use equality, but in our simulation, we include the practical limitations of a real experiment, and hence, use an approximation. The formal definitions can be found in section 4. It is worth noting that the quantity $M\xaf1(y^,\delta ,s,C)$ is model independent; its calculation does not require any assumption on the learning algorithm the subject may employ. Depending on whether the measurement $M\xaf1(y^,\delta ,s,C)$ reflects $SSh$ or $SBF$, its relationship to the defined four variables ($y^$, $\delta $, $s$, $C$) is qualitatively and quantitatively different.

In order to prove and illustrate our prediction, we consider each subject as an agent enabled with one of the learning algorithms that we discussed. Similar to the above, given an absolute prediction $\theta ^>0$ (corresponding to the subjects' absolute prediction $y^$), an absolute prediction error $\delta >0$, a standard deviation $\sigma C$ (corresponding to the subjects' confidence value $C$), and a sign bias $s\u2208{-1,1}$, we can compute the average Shannon Surprise $SSh(yt;\pi ^(t-1))$ and the average Bayes Factor Surprise $SBF(yt;\pi ^(t-1))$ over time, for the time points with $|\theta ^t-1|\u2248\theta ^$, $|\delta t|\u2248\delta $, $\sigma ^t\u2248\sigma C$, and $st=s$, which we denote as $S\xafSh(\theta ^,\delta ,s,\sigma C)$ and $S\xafBF(\theta ^,\delta ,s,\sigma C)$ respectively. We can show theoretically (see section 4) and in simulations (see Figure 10B and section 4) that for any value of $\theta ^$, $\delta $, and $\sigma C$, we have $S\xafSh(\theta ^,\delta ,s=+1,\sigma C)>S\xafSh(\theta ^,\delta ,s=-1,\sigma C)$ for the Shannon Surprise, and exactly the opposite relation, $S\xafBF(\theta ^,\delta ,s=+1,\sigma C)<S\xafBF(\theta ^,\delta ,s=-1,\sigma C)$, for the Bayes Factor Surprise. Moreover, this effect increases with increasing $\delta $.

It should be noted that such an effect is due to the essential difference of $SSh$ and $SBF$ in using the prior belief $\pi (0)(\theta )$. Our experimental prediction is theoretically provable for the cases that each subject's belief $\pi ^(t)$ is a gaussian distribution, which is the case if they employ VarSMiLe, Nas10$*$, Nas12$*$, pf1, MP1, or Leaky Integrator as their learning rule (see section 4). For the cases that different learning rules (e.g., pf20) are used, where the posterior belief is a weighted sum of gaussians, the theoretical analysis is more complicated, but our simulations show the same results (see Figure 10B and section 4). Therefore, independent of the learning rule, we have the same experimental prediction on the manifestation of different surprise measures on physiological signals, such as pupil dilation. Our first experimental prediction can be summarized as the set of hypotheses shown in Table 1.

#### 2.4.4 Prediction 2

Our second prediction follows the same experimental procedure as the one for the first prediction. The main difference is that for the second prediction, we need to fit a model to the experimental data. Given one of the learning algorithms, the fitting procedure can be done by tuning the free parameters of the algorithm with the goal of minimizing the mean squared error between the model's prediction $\theta ^t$ and a subject's prediction $y^t+1$ (similar to Nassar et al., 2012, 2010) or with the goal of maximizing the likelihood of subject's prediction $\pi ^(t)(y^t+1)$. Our prediction is independent of the learning algorithm, but in an actual experiment, we recommend using model selection to find the model that fits the human data best.

Hypothesis . | Prediction . |
---|---|

The indicator reflects $SBF$ | $\Delta M\xaf1(\theta ^,\delta ,C)<0$ and $\u2202\Delta M\xaf1(\theta ^,\delta ,C)\u2202\delta <0$ |

The indicator reflects $SSh$ | $\Delta M\xaf1(\theta ^,\delta ,C)>0$ and $\u2202\Delta M\xaf1(\theta ^,\delta ,C)\u2202\delta >0$ |

The prior is not used for inference | $\Delta M\xaf1(\theta ^,\delta ,C)=0$ |

Hypothesis . | Prediction . |
---|---|

The indicator reflects $SBF$ | $\Delta M\xaf1(\theta ^,\delta ,C)<0$ and $\u2202\Delta M\xaf1(\theta ^,\delta ,C)\u2202\delta <0$ |

The indicator reflects $SSh$ | $\Delta M\xaf1(\theta ^,\delta ,C)>0$ and $\u2202\Delta M\xaf1(\theta ^,\delta ,C)\u2202\delta >0$ |

The prior is not used for inference | $\Delta M\xaf1(\theta ^,\delta ,C)=0$ |

Note: $\Delta M\xaf1(\theta ^,\delta ,C)$ stands for $M\xaf1(\theta ^,\delta ,s=+1,C)-M\xaf1(\theta ^,\delta ,s=-1,C)$.

Having a fitted model, we can compute the probabilities $P(yt+1;\pi ^(t))$ and $P(yt+1;\pi ^(0))$. For the case that these probabilities are equal, $P(yt+1;\pi ^(t))=P(yt+1;\pi ^(0))=p$, the Bayes Factor Surprise $SBF$ is equal to 1, independent of the value of $p$ (see equation 2.7). However, the Shannon Surprise $SSh$ is equal to $-logp$ and varies with $p$. Figure 11A shows a schematic for the case that both current and prior beliefs are gaussian distributions. Two cases for which we have $P(yt+1;\pi ^(t))=P(yt+1;\pi ^(0))=p$, for two different $p$ values, are marked by black dots at the intersections of the curves.

Given a probability $p>0$, we can compute the average of $Mt$ over time for the time points with $P(yt+1;\pi ^(t))\u2248p$ and $P(yt+1;\pi ^(0))\u2248p$, which we denote as $M\xaf2(p)$; the index 2 stands for experimental prediction 2. Analogous to the first prediction, the approximation notation $\u2248$ is used due to practical limitations. Then, if $M\xaf2(p)$ is independent of $p$, its behavior is consistent with $SBF$, whereas if it decreases by increasing $p$, it can be a signature of $SSh$. Our second experimental prediction can be summarized as the two hypotheses shown in Table 2. Note that in contrast to our first prediction, with the assumption that the standard deviation of the prior belief is fitted using the behavioral data, we do not consider the hypothesis that the prior is not used for inference, because this is indistinguishable from a very large variance of the prior belief.

In order to illustrate the possible results and the feasibility of the experiment, we ran a simulation and computed $S\xafBF(p)$ and $S\xafSh(p)$ for the time points with $P(yt+1;\pi ^(t))\u2248p$ and $P(yt+1;\pi ^(0))\u2248p$ (see section 4 for details). The results of the simulation are shown in Figure 11B.

## 3 Discussion

We have shown that performing exact Bayesian inference on a generative world model naturally leads to a definition of surprise and a surprise-modulated adaptation rate. We have proposed three approximate algorithms (VarSMiLe, MP$N$, and pf$N$) for learning in nonstationary environments, which all exhibit the surprise-modulated adaptation rate of the exact Bayesian approach and are biologically plausible. Empirically we observed that our algorithms achieve levels of performance comparable to approximate Bayesian methods with higher memory demands (Adams & MacKay, 2007) and are more resilient across different environments compared to methods with similar memory demands (Faraji et al., 2018; Fearnhead & Liu, 2007; Nassar et al., 2012, 2010).

Hypothesis . | Prediction . |
---|---|

The indicator reflects $SBF$ | $\u2202M\xaf2(p)\u2202p=0$ |

The indicator reflects $SSh$ | $\u2202M\xaf2(p)\u2202p<0$ |

Hypothesis . | Prediction . |
---|---|

The indicator reflects $SBF$ | $\u2202M\xaf2(p)\u2202p=0$ |

The indicator reflects $SSh$ | $\u2202M\xaf2(p)\u2202p<0$ |

Learning in a volatile environment has been studied for a long time in the fields of Bayesian learning, neuroscience, and signal processing. In the following, we discuss the biological relevance of our work, and we briefly review some of the previously developed algorithms, with particular focus on the ones that have studied environments that can be modeled with a generative model similar to the one in Figure 1. We then discuss further our results and propose directions for future work on surprise-based learning.

### 3.1 Biological Interpretation

Humans are able to quickly adapt to changes (Behrens et al., 2007; Nassar et al., 2012, 2010), but human behavior is also often observed to be suboptimal, compared to the normative approach of exact Bayesian inference (Glaze et al., 2015; Mathys, Daunizeau, Friston, & Stephan, 2011; Nassar et al., 2010; Prat-Carrabin, Wilson, Cohen, & Da Silveira, 2020; Wilson, Nassar, & Gold, 2013). In general, biological agents have limited resources and possibly inaccurate assumptions about hyperparameters, yielding suboptimal behavior, as we also see with our algorithms whose accuracies degrade with a suboptimal choice of hyperparameters. Performance also deteriorates with a decreasing number of particles in the sampling-based algorithms, which might be another possible explanation of suboptimal human behavior. Previously, Particle Filtering has been shown to explain the behavior of human subjects in changing environments: Daw and Courville (2008) use a single particle, Brown and Steyvers (2009) use a simple heuristic form of particle filtering based on direct simulation, Findling et al. (2019) combine Particle Filtering with a noisy inference, and Prat-Carrabin et al. (2020) use it for a task with temporal structure.

At the level of neuronal implementation, we do not propose a specific suggestion. However, there are several hypotheses about neural implementations of related particle filters (Huang & Rao, 2014; Kutschireiter et al., 2017; Legenstein & Maass, 2014; Shi & Griffiths, 2009) on which, a neural model of pf$N$ and—its greedy version—MP$N$ could be based. In a similar spirit, the updating scheme of Variational SMiLe may be implemented in biological neural networks (for distributions in the exponential family).

Our theoretical framework for modulation of learning by the Bayes Factor Surprise $SBF$ is related to the body of literature on neo-Hebbian three-factor learning rules (Frémaux & Gerstner, 2016; Gerstner et al., 2018; Lisman et al., 2011), where a third factor indicating reward or surprise enables or modulates a synaptic change or a belief update (Yu, 2012; Yu & Dayan, 2005). We have shown how Bayesian or approximate Bayesian inference naturally leads to such a third factor that modulates learning via the surprise modulated adaptation rate $\gamma (SBF,m)$. This may offer novel interpretations of behavioral and neurophysiological data and help in understanding how three-factor learning computations may be implemented in the brain.

### 3.2 Related Work

#### 3.2.1 Exact Bayesian Inference

As already described in section 2.2.2, for the generative model in Figure 1, it is possible to find an exact online Bayesian update of the belief using a message-passing algorithm (Adams & MacKay, 2007). The space and time complexity of the algorithm increases linearly with $t$, which makes it unsuitable for an online learning setting. However, approximations like dropping messages below a certain threshold (Adams & MacKay, 2007) or stratified resampling (Fearnhead & Liu, 2007) allow reducing the computational complexity. The former has a variable number of particles in time, and the latter needs solving a complicated nonlinear equation at each time step in order to reduce the number of particles to $N$ (called SOR$N$ in section 2).

Our message-passing algorithm with a finite number of particles (messages) $N$ (MP$N$; see algorithm 3) is closely related to these algorithms and can be seen as a biologically more plausible variant of the other two. All three algorithms have the same update rules given by equations 2.19 and 2.18. Hence the algorithms of both Adams and MacKay (2007) and Fearnhead and Liu (2007) have the same surprise modulation as our MP$N$. The difference lies in their approaches to eliminate less “important” particles.

In the literature of switching state-space models (Barber, 2012), the generative models of the kind in Figure 1 are known as “reset models,” and the message-passing algorithm of Adams and MacKay (2007) is known to be the standard algorithm for inference over these models (Barber, 2012). See Barber (2006, 2012), and Ghahramani & Hinton (2000) for other variations of switching state-space models and examples of approximate inference over them.

#### 3.2.2 Leaky Integration and Variations of Delta Rules

In order to estimate some statistics, leaky integration of new observations is a particularly simple form of a trade-off between integrating and forgetting. After a transient phase, the update of a leaky integrator takes the form of a delta rule that can be seen as an approximation of exact Bayesian updates (Heilbron & Meyniel, 2019; Meyniel et al., 2016; Ryali et al., 2018; Yu & Cohen, 2009). This update rule was found to be biologically plausible and consistent with human behavioral data (Meyniel et al., 2016; Yu & Cohen, 2009). However, Behrens et al. (2007) and Heilbron and Meyniel (2019) demonstrated that in some situations, the exact Bayesian model is significantly better than leaky integration in explaining human behavior. The inflexibility of leaky integration with a single, constant leak parameter can be overcome by a weighted combination of multiple leaky integrators (Wilson et al., 2013), where the weights are updated in a similar fashion as in the exact online methods (Adams & MacKay, 2007; Fearnhead & Liu, 2007), or by considering an adaptive leak parameter (Nassar et al., 2012, 2010). We have shown that the two algorithms of Nassar et al. (2012, 2010) can be generalized to gaussian prior beliefs (Nas10$*$ and Nas12$*$). Our results show that these algorithms also inherit the surprise modulation of the exact Bayesian inference. Our surprise-dependent adaptation rate $\gamma $ can be interpreted as a surprise-modulated leak parameter.

#### 3.2.3 Other Approaches

Learning in the presence of abrupt changes has also been considered without explicit assumptions about the underlying generative model. One approach uses a surprise-modulated adaptation rate (Faraji et al., 2018) similar to equation 2.9. The Surprise-Minimization Learning (SMiLe) algorithm of Faraji et al. (2018) has an updating rule similar to the one of VarSMiLe (see equations 2.14 and 2.15). The adaptation rate modulation, however, is based on the Confidence Corrected Surprise (Faraji et al., 2018) rather than the Bayes Factor Surprise, and the trade-off in its update rule is between resetting and staying with the latest belief rather than between resetting and integrating (see section 4).

Other approaches use different generative models, such as conditional sampling of the parameters also when there is a change (Glaze et al., 2015; Yu & Dayan, 2005), a deeper hierarchy without fixed change probability $pc$ (Wilson et al., 2010), or drift in the parameters (Gershman, Radulescu, Norman, & Niv, 2014; Mathys et al., 2011). A recent work shows that inference on a simpler version the generative model of Figure 1, with no change points but with a noisy inference style, can explain human behavior well even when the true generative model of the environment is different and more complicated (Findling et al., 2019). They develop a heuristic approach to add noise in the inference process of a Particle Filter. Their algorithm can be interpreted as a surprise-modulated Particle Filter, where the added noise scales with a measure of surprise—conceptually equivalent to Bayesian surprise (Itti & Baldi, 2006; Schmidhuber, 2010; Storck, Hochreiter, & Schmidhuber, 1995). Moreover, another recent work (Prat-Carrabin et al., 2020) shows that approximate sampling algorithms (like Particle Filtering) can explain human behavior better than their alternatives in tasks closely related to the generative model of Figure 1. The signal processing literature provides further methods to address the problem of learning in nonstationary environments with abrupt changes; see Aminikhanghahi and Cook (2017) for a review, and Cummings et al. (2018), Lin et al. (2017), Masegosa et al. (2017), and Özkan et al. (2013) for a few recent examples.

### 3.3 Surprise Modulation as a Generic Phenomenon

Learning rate modulation similar to the one in equation 2.9 has been previously proposed in the neuroscience literature with either heuristic arguments (Faraji et al., 2018) or Bayesian arguments for a particular experimental task, for example, when samples are drawn from a gaussian distribution (Nassar et al., 2012, 2010). The fact that the same form of modulation is at the heart of Bayesian inference for our relatively general generative model, that it is derived without any further assumptions, and is not a priori defined, is in our view an important contribution to the field of adaptive learning algorithms in computational neuroscience.

Furthermore, the results of our three approximate methods (Particle Filtering, Variational SMiLe, and Message Passing with a fixed $N$ number of messages) as well as some previously developed ones (Adams & MacKay, 2007; Fearnhead & Liu, 2007; Nassar et al., 2012, 2010) demonstrate that the surprise-based modulation of the learning rate is a generic phenomenon. Therefore, regardless of whether the brain uses Bayesian inference or an approximate algorithm (Bogacz, 2017, 2019; Findling et al., 2019; Friston, 2010; Gershman, 2019; Gershman et al., 2014; Mathys et al., 2011; Nassar et al., 2012, 2010; Prat-Carrabin et al., 2020), the notion of Bayes Factor Surprise and the way it modulates learning (see equations 2.12 and 2.9) look generic.

The generality of the way surprise should modulate learning depends on an agent's inductive biases about its environment and is directly associated with the assumed generative model of the world. The generative model we considered in this work involves abrupt changes. However, one can think of other realistic examples, where an improbable observation does not indicate a persistent change but a singular event or an outlier, similar to d'Acremont and Bossaerts (2016) and Nassar, Bruckner, and Frank (2019). In such situations, the belief should not be changed, and surprise should attenuate learning rather than accelerate it. Interestingly, we can show that exact and approximate Bayesian inference on such a generative model naturally leads to a surprise-modulated adaptation rate $\gamma (SBF,m)$, with the same definition of $SBF$, where the trade-off is not between integrating and resetting, but between integrating and ignoring the new observation.^{1}

An aspect that the generative model we considered does not capture is the potential return to a previous state of the environment rather than a change to a completely new situation. If in our example of Figure 1B, the bridge with the shortest path is temporarily closed for repairs, your friend would again have to take the longer detour; therefore, her arrival times will return to their previous values (i.e. increase). In such cases, an agent should infer whether the surprising observation stems from a new hidden state or from an old state stored in memory. Relevant generative models have been studied in Fox, Sudderth, Jordan, and Willsky (2011), Gershman, Monfils, Norman, and Niv (2017), Gershman et al. (2014) and are out of the scope of this article.

### 3.4 Bayes Factor Surprise as a Novel Measure of Surprise

In view of a potential application in the neurosciences, a definition of surprise should reflect how unexpected an event is and should modulate learning. Surprising events indicate that our belief is far from the real world and suggest updating our model of the world, or, for a large surprise, simply forgetting it. Forgetting is the same as returning to the prior belief. However, an observation $yt+1$ can be unexpected under both the prior $\pi (0)$ and the current beliefs $\pi (t)$. In these situations, it is not obvious whether forgetting helps. Therefore, the modulation between forgetting or not should be based on a comparison between the probability of an event under the current belief $P(yt+1;\pi (t))$ and its probability under the prior belief $P(yt+1;\pi (0))$.

The definition of the Bayes Factor Surprise $SBF$ as the ratio of $P(yt+1;\pi (t))$ and $P(yt+1;\pi (0))$ exploits this insight. The Bayes Factor Surprise appears as a modulation factor in the recursive form of the exact Bayesian update rule for a hierarchical generative model of the environment. When two events are equally probable under the prior belief, the one that is less expected under the current belief is more surprising, satisfying the first property. At the same time, when two events are equally probable under the current belief, the one that is more expected under the prior belief is more surprising, signaling that forgetting may be beneficial.

We formally discussed the connections between the Bayes Factor Surprise and Shannon Surprise (Shannon, 1948) and showed that they are closely linked. We showed that the modulated adaptation rate ($\gamma $) used in (approximate) Bayesian inference is a function of the difference between the Shannon Surprise under the current and the prior beliefs, but cannot be expressed solely by the Shannon Surprise under the current one. Our formal comparisons between these two different measures of surprise lead to specific experimentally testable predictions.

The Bayesian Surprise $SBa$ (Itti & Baldi, 2006; Schmidhuber, 2010; Storck et al., 1995) and the Confidence Corrected Surprise $SCC$ (Faraji et al., 2018) are two other measures of surprise in neuroscience. The learning modulation derived in our generative model cannot be expressed as a function of $SBa$ and $SCC$. However, one can hypothesize that $SBa$ is computed after the update of the belief to measure the information gain of the observed event and is therefore not a good candidate for online learning modulation. The Confidence Corrected Surprise $SCC$ takes into account the shape of the belief and therefore includes the effects of confidence, but it does not consider any information about the prior belief. Hence, a result of $M\xaf1(\theta ^,\delta ,s=+1,C)=M\xaf1(\theta ^,\delta ,s=-1,C)$ in our first experimental prediction would be consistent with the corresponding behavioral or physiological indicator reflecting the $SCC$.

### 3.5 Difference in Shannon Surprise, an Alternative Perspective

### 3.6 Future Directions

A natural continuation of our study is to test our experimental predictions in human behavior and physiological signals in order to investigate which measures of surprise are used by the brain. In a similar direction, our approximate learning algorithms can be evaluated on human behavioral data from experiments that use a similar generative model (Behrens et al., 2007; Glaze et al., 2015; Heilbron & Meyniel, 2019; Nassar et al., 2012, 2010; Wilson et al., 2013; Yu & Dayan, 2005) in order to assess if our proposed algorithms achieve similar or better performance in explaining data.

Finally, our methods can potentially be applied to model-based reinforcement learning in nonstationary environments. In recent years, there has been growing interest in adaptive or continually learning agents in changing environments in the form of continual learning and meta-learning (Lomonaco, Desai, Culurciello, & Maltoni, 2019; Traoré et al., 2019). Many continual learning model-based approaches make use of some procedure to detect changes (Lomonaco et al., 2019; Nagabandi et al., 2018). Integrating $SBF$ and a learning rate $\gamma (SBF)$ into a reinforcement learning agent would be an interesting future direction.

## 4 Methods

### 4.1 Proof of the proposition

### 4.2 Derivation of the Optimization-Based Formulation of VarSMiLe: Algorithm 1

To derive the optimization-based update rule for the Variational SMiLe rule and the relation of the bound $Bt+1$ with surprise, we used the same approach used in Faraji et al. (2018).

#### 4.2.1 Derivation of the Update Rule

#### 4.2.2 Proof of the Claim That $B$ Is a Decreasing Function of Surprise

### 4.3 Derivations of Message Passing $N$: Algorithm 2

For the sake of clarity and coherence, we repeat here some steps performed in section 2.

The MP$N$ algorithm uses equations 4.22, 4.28, and 4.29 for computing the belief for $t\u2264N$, which is same as the exact Bayesian inference. For $t>N$, it first updates the weights in the same fashion as equations 4.28 and 4.29, keeps the greatest $N$ weights, and sets the rest weights equal to 0. After normalizing the new weights, it uses equation 4.22 (but only over the particles with nonzero weights) to compute the belief $\pi ^(t)$. (For the particular case of exponential family, see algorithm 2 for the pseudocode.)

### 4.4 Derivation of the Weight Update for Particle Filtering: Algorithm 3

### 4.5 Surprise-Modulation as a Framework for Other Algorithms

#### 4.5.1 SMiLe Rule

Note that this becomes equal to $P(\theta |yt+1)$ if the prior belief $\pi (0)$ is a uniform distribution (see equation 2.11).

Although equation 4.45 looks very similar to equation 2.13, it signifies a trade-off between the latest belief $\pi ^(t)$ and the belief updated by only the most recent observation $P\u02dc(\theta |yt+1)$, that is, a trade-off between adherence to the current belief and reset. While SMiLe adheres to the current belief $\pi ^(t)$, Variational SMiLe integrates the new observation with the current belief to get $\pi ^B(t)$, which leads to a trade-off similar to the one of the exact Bayesian inference (see equations 2.12 and 2.10).

#### 4.5.2 Nassar's Algorithm

We next extend their approach to a more general case where the prior is a gaussian distribution with arbitrary variance, $\mu t+1\u223cN(\mu 0,\sigma 02)$. We then discuss the relation of this method to Particle Filtering. A performance comparison between our extended algorithms Nas10$*$ and Nas12$*$ and their original versions Nas10 and Nas12 is depicted in Figures 12 and 13.

#### 4.5.3 Nas10$*$ and Nas12$*$ Algorithms

We can see that the updated mean is a weighted average, with surprise-modulated weights, between integrating the new observation with the current mean $\mu ^t$ and integrating it with the prior mean $\mu 0$, in the same spirit as the other algorithms we considered here. Equation 4.49 can also be seen as a surprise-modulated weighted sum of two delta rules: one including a prediction error between the new observation and the current mean $(yt+1-\mu ^t)$ and one including a prediction error between the observed sample and the prior mean $(yt+1-\mu 0)$.

In Nassar et al. (2012, 2010), the true new mean after a change point is drawn from a uniform distribution with a range of values much larger than the width of the gaussian likelihood. Their derivations implicitly approximate the uniform distribution with a gaussian distribution with $\sigma 0\u226b\sigma $. Note that if $\sigma 0\u226b\sigma $, then $\rho \u21920$, so that the first term of equation 4.50 disappears, and $\alpha t+1=1+\gamma t+1rt1+rt$. This results in the delta rule of the original algorithm in equations 4.47 and 4.48, with $\gamma t+1=\Omega t+1$.

All of the calculations so far were done by assuming that $rt$ is known. However, for the case of a nonstationary regime with a history of change points, the time interval $rt$ is not known. Nassar et al. (2012, 2010) used the expected time interval $r^t$ as an estimate. We make a distinction here between Nassar et al. (2012) and Nassar et al. (2010):

In Nassar et al. (2010) $r^t$ is calculated recursively on each trial in the same spirit as equation 2.10: $r^t+1=(1-\gamma t+1)(r^t+1)+\gamma t+1$, that is, at each time step, there is a probability $(1-\gamma t+1)$ that $r^t$ increments by 1 and a probability $\gamma t+1$ that it is reset to 1. So $r^t+1$ is the weighted sum of these two outcomes. Hence, equation 4.50 combined with the expected time interval $r^t$ constitutes a generalization of the update rule of Nassar et al. (2010) for the case of gaussian prior $N(\mu 0,\sigma 02)$. We call this algorithm Nas10$*$ (see the appendix for pseudocode).

In Nassar et al. (2012), the variance $\sigma ^t+12=Var[\mu t+1|y1:t+1]$ is estimated given $\mu ^t$, $r^t$, and $\sigma ^t2$. Based on this variance, $r^t+1=\sigma 2\sigma ^t+12-\sigma 2\sigma 02$ is computed. The derivation of the recursive computation of $\sigma ^t+12$ for the case of gaussian priors can be found in the appendix. We call the combination of equation 4.50 with this way of computing the expected time interval $r^t$ Nas12$*$ (see the appendix for pseudocode). These two versions of calculating $r^t$ in Nassar et al. (2010) and Nassar et al. (2012) give different results, and we compare our algorithms with both Nas10$*$ and Nas12$*$ in our simulations. Note that, as discussed in section 2.1, the posterior belief at time $t+1$ does not generally belong to the same family of distributions as the belief of time $t$. However, we therefore approximate for both algorithms the posterior belief $P(\theta |y1:t+1)$ by a gaussian.

#### 4.5.4 Nassar's Algorithm and Particle Filtering with One Particle

*in expectation*, the updated belief will be

*expected value*of $R^t+1$ is

*expected*$r^t$, whereas in Particle Filtering with one particle, the belief is updated using the

*sampled*$r^t$.

In summary, the two methods will give different estimates on a trial-per-trial basis, but the same result in expectation. The pseudocode for Particle Filtering with one particle for the particular case of the gaussian estimation task is in the appendix.

### 4.6 Application to the Exponential Family

The pseudocode for Variational SMiLe, MP$N$, and Particle Filtering can be seen in algorithms 1, 2, and 3, respectively.

### 4.7 Simulation Task

In this section, we first argue why the mean squared error is a proper measure for comparing different algorithms with each other, and then we explain the version of Leaky integrator that we used for simulations.

#### 4.7.1 Mean Squared Error as an Optimality Measure

**Proof for the Algorithms without Sampling.**Consider $\Theta ^tOpt=fOpt(Y1:t)$. Then, for any other arbitrary estimator $\Theta ^t=f(Y1:t)$ (except for the ones with sampling), we have

$\u25a1$

**Proof for the Algorithms with Sampling.**For particle filtering (and any kind of estimator with sampling), the estimator is not a deterministic function of observations $Y1:t$. Rather, the estimator is a function of observations as well as a set of random variables (samples) drawn from a distribution that is also a function of observations $Y1:t$. In our case, the samples are the sequence of hidden states $C1:t$. The estimator can be written as

$\u25a1$

#### 4.7.2 Leaky Integration

### 4.8 Derivation of the Formula Relating Shannon Surprise to the Modulated Learning Rate

### 4.9 Experimental Predictions

#### 4.9.1 Setting

#### 4.9.2 Theoretical Proofs for Prediction 1

**Proof of the Second Inequality for $SBF$.**Let us define the variables

**Proof of the Second Inequality for $SSh$.**Using the functions we defined for the previous proof and after computing the partial derivative of equation 4.78, we have

#### 4.9.3 Simulation Procedure for Prediction 1

In order to relax the main assumption of our theoretical proofs (the belief is always a gaussian distribution), to include the practical difficulties of a real experiment (e.g., to use $|\delta t|\u2248\delta $ instead of $|\delta t|=\delta $), and to have an estimation of the effect size, we also performed simulations for our first experimental prediction.

For each simulated subject, the procedure of our simulation was as follows:

We fixed the hyperparameters $\sigma 2$ and $pc$ for producing samples.

We selected a learning algorithm (e.g., pf20) and fixed its corresponding tuned parameters (based on our simulations in section 2).

We applied the learning algorithm over a sequence of observations $y1:T$. Note that in a real experiment, this step can be done through a few episodes, which makes it possible to have a long sequence of observations (i.e., large $T$).

At each time $t$, we saved the values $yt$, $\theta ^t$, $\sigma ^t$, $\delta t$, $st$, $SSh(yt;\pi ^(t-1))$, and $SBF(yt;\pi ^(t-1))$.

#### 4.9.4 Simulation Procedure for Prediction 2

For our second prediction, the theoretical proof is trivial. However, in order to have a setting similar to a real experiment (e.g., to use $P(yt+1;\pi ^(t))\u2248p$ instead of $P(yt+1;\pi ^(t))=p$), and to have an estimation of the effect size, we used simulations also for our second experimental predictions.

## Appendix

### A.1 Modified Algorithm of Nassar et al. (2012, 2010): Adaptation for Gaussian Prior

#### A.1.1 Recursive Update of the Estimated Mean for Gaussian Prior

An estimate of $\theta $ is its expected value $E(\theta |y1:t+1)=\mu B,t+1$.

#### A.1.2 Recursive Update of the Estimated Variance for Gaussian Prior

### A.2 Implementation of Nas10$*$, Nas12$*$, and Particle Filtering with One Particle for the Gaussian Estimation Task

We provide here the pseudocode for the algorithms Nas10$*$, Nas12$*$, and Particle Filtering with one particle for the gaussian estimation task (see algorithms A1, A2, and A3). Observations are drawn from a gaussian distribution with known variance and unknown mean: $yt+1|\mu t+1\u223cN(\mu t+1,\sigma 2)$ and $\theta t=\mu t$. When there is a change, the parameter $\mu $ is also drawn from a gaussian distribution, $\mu \u223cN(\mu 0,\sigma 02)$. All three algorithms estimate the expected $\mu t+1$ (i.e., $\mu ^t+1$) upon observing a new sample $yt+1$.

Note that the pseudocode for pf1 provided here is a translation of algorithm 3 to the case of a single particle (where there are no weights to calculate) and for the gaussian distribution as a particular instance of the exponential family.

## Notes

^{1}

Liakoni, V., Lehmann, M. P., Modirshanechi, A., Brea, J., Herzog, M. H., Gerstner, W., & Preuschoff, K. *Dissociating brain regions encoding reward prediction error and surprise*. Manuscript in preparation.

## Acknowledgments

We thank both reviewers for their constructive and helpful comments. This research was supported by the Swiss National Science Foundation (200020 184615) and the European Union Horizon 2020 Framework Program under grant agreement 785907 (Human Brain Project, SGA2).

## References

*20*(pp.