## Abstract

Active inference offers a first principle account of sentient behavior, from which special and important cases—for example, reinforcement learning, active learning, Bayes optimal inference, Bayes optimal design—can be derived. Active inference finesses the exploitation-exploration dilemma in relation to prior preferences by placing information gain on the same footing as reward or value. In brief, active inference replaces value functions with functionals of (Bayesian) beliefs, in the form of an expected (variational) free energy. In this letter, we consider a sophisticated kind of active inference using a recursive form of expected free energy. Sophistication describes the degree to which an agent has beliefs about beliefs. We consider agents with beliefs about the counterfactual consequences of action for states of affairs *and* beliefs about those latent states. In other words, we move from simply considering beliefs about “what would happen if I did that” to “what I would *believe about* what would happen if I did that.” The recursive form of the free energy functional effectively implements a deep tree search over actions and outcomes in the future. Crucially, this search is over sequences of belief states as opposed to states per se. We illustrate the competence of this scheme using numerical simulations of deep decision problems.

## 1 Introduction

In theoretical neurobiology, active inference has proved useful in providing a generic account of motivated behavior under ideal Bayesian assumptions, incorporating both epistemic and pragmatic value (Da Costa, Parr, Sajid et al., 2020; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017). This account is often portrayed as being based on first principles because it inherits from the statistical physics of random dynamical systems at nonequilibrium steady state (Friston, 2013; Hesp, Ramstead et al., 2019; Parr, Da Costa, & Friston, 2020). Active inference does not pretend to replace existing formulations of sentient behavior; it just provides a Bayesian mechanics from which most (and, arguably, all) normative optimization schemes can be derived as special cases. Generally these special cases arise when ignoring one sort of uncertainty or another. For example, if we ignore uncertainty about (unobservable) hidden states that generate (observable) outcomes, active inference reduces to conventional schemes like optimal control theory and reinforcement learning. While the latter schemes tend to focus on the maximization of value as a function of hidden states per se, active inference optimizes a functional^{1} of (Bayesian) beliefs about hidden states. This allows it to account for uncertainties surrounding action and perception in a unified, Bayes-optimal fashion.

Most current applications of active inference rest on the selection of policies (i.e., ordered sequences of actions, or open-loop policies, in which the sequence of future actions is fixed in advance rather than conditioned on future observations) that minimize a functional of beliefs called *expected free energy* (Da Costa, Parr, Sajid, et al., 2020; Friston, FitzGerald et al., 2017). This approach clearly has limitations, in the sense that one has to specify a priori allowable policies, each of which represents a possible path through a deep tree of action sequences. This formulation limits the scalability of the ensuing schemes because only a relatively small number of policies can be evaluated (Tschantz, Baltieri, Seth, & Buckley, 2019). In this letter, we consider active inference schemes that enable a deep tree search over all allowable sequences of action into the future. Because this involves a recursive evaluation of expected free energy—and implicit Bayesian beliefs—the resulting scheme has a sophisticated aspect (Costa-Gomes, Crawford, & Broseta, 2001; Devaine, Hollard, & Daunizeau, 2014): rolling out beliefs about beliefs.

*Sophistication* is a term from the economics literature and refers to having beliefs about one's own or another's beliefs. For instance, in game theory, an agent is said to have a level of sophistication of 1 if she has beliefs about her opponent, 2 if she has beliefs about her opponent's beliefs about her strategy, and so forth. Most people have a level of sophistication greater than two (Camerer, Ho, & Chong, 2004).

According to this view, most current illustrations of active inference can be regarded as unsophisticated or naive, in the sense that they consider only beliefs about the consequences of action, as opposed to the consequences of action for beliefs. In what follows, we try to unpack this distinction intuitively and formally using mathematical and numerical analyses. We also take the opportunity to survey the repertoire of existing schemes that fall under the Bayesian mechanics of active inference, including expected utility theory (Von Neumann & Morgenstern, 1944), Bayesian decision theory (Berger, 2011), optimal Bayesian design (Lindley, 1956), reinforcement learning (Sutton & Barto, 1981), active learning (MacKay, 1992), risk-sensitive control (van den Broek, Wiegerinck, & Kappen, 2010), artificial curiosity (Schmidhuber, 2006), intrinsic motivation (Oudeyer & Kaplan, 2007), empowerment (Klyubin, Polani, & Nehaniv, 2005), and the information bottleneck method (Tishby, Pereira, & Bialek, 1999; Tishby & Polani, 2010).

Sophisticated inference recovers Bayes-adaptive reinforcement learning (Åström, 1965; Ghavamzadeh, Mannor, Pineau, & Tamar, 2016; Ross, Chaib-draa, & Pineau, 2008) in the zero temperature limit. Both approaches perform belief-state planning, in which the agent maximizes an objective function by taking into account how it expects its own beliefs to change in the future (Duff, 2002), thereby evincing a degree of sophistication. The key distinction is that Bayes-adaptive reinforcement learning considers arbitrary reward functions, while sophisticated active inference optimizes an expected free energy that can be motivated from first principles. While both can be specified for particular tasks, the expected free energy additionally mandates that the agent seek out information about the world (Friston, 2013, 2019) beyond what is necessary for solving a particular task (Tishby & Polani, 2010). This allows sophisticated inference to account for artificial curiosity (Lindley, 1956; Oudeyer & Kaplan, 2007; Schmidhuber, 1991) that goes beyond reward seeking to the gathering of evidence for an agent's existence (i.e., its marginal likelihood). This is sometimes referred to as self-evidencing (Hohwy, 2016).

The basic distinction between sophisticated and unsophisticated inference was briefly introduced in appendix 6 of Friston, FitzGerald et al. (2017). As outlined in this appendix, there is a sense in which unsophisticated formulations, which simply sum the expected free energy over future time steps based on current beliefs about the future, can be thought of as selecting policies that optimize a path integral of the expected free energy. In contrast, sophisticated schemes take account of the way in which the free energy changes as alternative paths are pursued and beliefs updated. This can be thought of as an expected path integral.

This distinction is subtle but can lead to fundamentally different kinds of behavior. A simple example illustrates the difference. Consider the following three-armed bandit problem—with a twist. The right and left arms increase or decrease your winnings. However, you do not know which arm is which. The central arm does not affect your winnings but tells you which arm pays off. Crucially, once you have committed to either the right or the left arm, you cannot switch to the other arm. This game is engineered to confound agents whose choice behavior is based on Bayesian decision theory. This follows because the expected payoff is the same for every sequence of moves. In other words, choosing the right or left arm—for the first and subsequent trials—means you are equally likely to win or lose. Similarly, choosing the middle arm (or indeed doing nothing) has the same Bayesian risk or expected utility.

However, an active inference agent, who is trying to minimize her expected free energy,^{2} will select actions that minimize the risk of losing and resolve her uncertainty about whether the right or left arm pays off. This means that the center arm acquires epistemic (uncertainty-resolving) affordance and becomes intrinsically attractive. On choosing the central arm—and discovering which arm holds the reward—her subsequent choices are informed, in the sense that she can exploit her knowledge and commit to the rewarding arm. In this example, the agent has resolved a simple exploration-exploitation dilemma^{3} by resolving ambiguity as a prelude to exploiting updated beliefs about the consequences of subsequent action. Note that once the central arm has been selected, there is no ambiguity in play, and its epistemic affordance disappears. Note further that all three arms initially have some epistemic affordance; however, the right and left arms are less informative if the payoff is probabilistic.

The key move behind this letter is to consider a sophisticated agent who evaluates the expected free energy of each move recursively. Simply choosing the central arm to resolve uncertainty does not, in and of itself, mean an epistemic action was chosen in the service of securing future rewards. In other words, the central arm is selected because all the options had the same Bayesian risk^{4} while the central arm had the greatest epistemic affordance.^{5} Now consider a sophisticated agent who imagines what she will do after acting. For each plausible outcome, she can work out how her beliefs about hidden states will be updated and evaluate the expected free energy of the subsequent move under each action and subsequent outcome. By taking the average over both, she can evaluate the expected free energy of the second move that is afforded by the first. If she repeats this process recursively, she can effectively perform a deep tree search over all ordered sequences of actions and their consequences.
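Schematically, this recursive evaluation can be written as follows. This is a sketch in standard active inference notation rather than the letter's exact equation: the one-step risk and ambiguity terms are left unspecified, and the inner expectations average over plausible outcomes and the actions they would license:

$$
G(u_\tau \mid o_{\le \tau}) \;=\; \underbrace{\text{risk}(u_\tau) + \text{ambiguity}(u_\tau)}_{\text{one-step expected free energy}} \;+\; \mathbb{E}_{Q(o_{\tau+1} \mid u_\tau)}\, \mathbb{E}_{Q(u_{\tau+1} \mid o_{\tau+1})}\!\left[ G(u_{\tau+1} \mid o_{\le \tau+1}) \right]
$$

Unrolling this recursion to the policy horizon yields the deep tree search over actions and outcomes described in the text.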

Heuristically, the unsophisticated agent simply chooses the central arm because she knows it will resolve uncertainty about states of affairs. Conversely, the sophisticated agent follows through—on this resolution of ambiguity—in terms of its implications for subsequent choices. In this instance, she knows that only two things can happen if she chooses the central arm: either the right or left arm will be disclosed as the payoff arm. In either case, the subsequent choice can be made unambiguously to minimize risk and secure her reward. The average expected free energy of these subsequent actions will be pleasingly low, making a choice of the central arm more attractive than its expected free energy would otherwise suggest. This means the sophisticated agent is more confident about her choices because she has gone beyond forming beliefs about the consequences of action to consider the effects of action on subsequent beliefs and the (epistemic) actions that ensue. The remainder of this letter unpacks this recursive kind of planning, using formal analysis and simulations.

This letter is intended to introduce a sophisticated scheme for active inference and provide some intuition as to how it works in practice. We validate this scheme by reproducing simulation results from previous formulations of active inference in a simple and a more complex navigation task. This is not intended as proof of the superiority of sophisticated inference over existing schemes, which we assess in a companion paper (Da Costa, Sajid, et al., 2020), but to demonstrate noninferiority in some illustrative settings. Note that it is possible to show that on reward maximization tasks, sophisticated active inference can significantly outperform unsophisticated schemes, as it accommodates the backward induction algorithm as a special case.

This paper has four sections. Section 2 provides a brief overview of active inference in terms of free energy minimization and the various schemes that can be used for implementation. This section starts with the basic imperative to optimize Bayesian beliefs about latent or hidden states of the world in terms of approximate Bayesian (i.e., variational) inference (Dayan, Hinton, Neal, & Zemel, 1995). It then goes on to cast planning as inference (Attias, 2003; Botvinick & Toussaint, 2012) as the minimization of an expected free energy under allowable sequences of actions or policies (Friston, FitzGerald et al., 2017). The foundations of expected free energy are detailed in an appendix from two complementary perspectives, the second of which is probably more fundamental as it rests on the first-principle account mentioned above (Friston, 2013, 2019; Parr et al., 2020). The third section considers sophisticated schemes using a recursive formulation of expected free energy. Effectively, this enables the efficient search of deep policy trees (that entail all possible outcomes under each policy or path). This search is efficient because only paths that have a sufficiently high predictive posterior probability need to be evaluated. This restricted tree search is straightforward to implement in the present setting because we are propagating beliefs (i.e., probabilities) as opposed to value functions. The fourth section provides some illustrative simulations that compare sophisticated and unsophisticated agents in the three-armed bandit (or T-maze paradigm) described above. It also considers deeper problems, using navigation and novelty seeking as an example. We conclude with a brief summary of what sophisticated inference brings to the table.

## 2 Active Inference and Free Energy Minimization

Most of the active inference literature concerns itself with partially observable Markov decision processes. In other words, it considers generative models of discrete hidden states and observable outcomes, with uncertainty about the (likelihood) mapping between hidden states and outcomes and (prior) probability transitions among hidden states. Crucially, sequential policy selection is cast as an inference problem by treating sequences of actions (i.e., policies) as random variables. Planning then simply entails optimizing posterior beliefs about the policies being pursued and selecting an action from the most likely policy.

On this view, there are just two sets of unknown variables: hidden states and policies. Belief distributions over this bipartition can then be optimized with respect to an evidence bound in the usual way, using an appropriate mean-field approximation (Beal, 2003; Winn & Bishop, 2005). In this setup, we can associate *perception* with the optimization of posterior beliefs about *hidden states*, while *action* follows from planning based on posterior beliefs about *policies*. Implicit in this formulation is a generative model: a probabilistic specification of the joint probability distribution over policies, hidden states, and outcomes. This generative model is usually factorized into the likelihood of outcomes, given hidden states, the conditional distribution over hidden states, given policies, and priors over policies. In active inference, the priors over policies are determined by their expected free energy, noting that this energy, which depends on future courses of action, furnishes an empirical prior over subsequent actions.
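Under this factorization, the joint distribution takes the familiar form below. This is a sketch of the usual partially observed Markov decision process model, not the letter's exact specification:

$$
P(o_{1:T}, s_{1:T}, \pi) \;=\; P(\pi)\, P(s_1) \prod_{\tau=1}^{T} P(o_\tau \mid s_\tau) \prod_{\tau=2}^{T} P(s_\tau \mid s_{\tau-1}, \pi)
$$

Here, $P(o_\tau \mid s_\tau)$ is the likelihood, $P(s_\tau \mid s_{\tau-1}, \pi)$ the policy-conditioned transitions, and $P(\pi)$ the prior over policies furnished by expected free energy.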

In brief, given some prior beliefs about the initial and final states of some epoch of active inference, the game is to find a posterior belief distribution over policies that brings the initial distribution as close as possible to the final distribution, given observations. This objective can be achieved by optimizing posterior beliefs about hidden states and policies with respect to a variational bound on (the logarithm of) the marginal likelihood of the generative model (i.e., log evidence). This evidence bound is known as a variational free energy or (negative) evidence lower bound. In what follows, we offer an overview of the formal aspects of this enactive kind of inference.
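The evidence bound in question has the standard form below (a sketch in the notation of Table 1): the variational free energy is the divergence between the approximate and exact posteriors minus the log evidence, and is therefore an upper bound on surprisal:

$$
F \;=\; \mathbb{E}_{Q(s,\pi)}\!\left[ \ln Q(s,\pi) - \ln P(o, s, \pi) \right] \;=\; D_{KL}\!\left[ Q(s,\pi) \,\Vert\, P(s,\pi \mid o) \right] - \ln P(o) \;\ge\; -\ln P(o)
$$

Minimizing $F$ with respect to $Q$ thus renders the variational distribution an approximate posterior while making free energy an approximation to (negative) log evidence.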

### 2.1 Discrete State-Space Models

The generative model furnishes a variational free energy of each policy $F(\pi)$ that depends on priors over state transitions, and an expected free energy of each policy $G(\pi)$ that underwrites priors over policies. The priors over policies, $\ln P(\pi) = -E(\pi) - G(\pi)$, ensure that the expected free energy at time $\tau$ (i.e., the policy horizon) is minimized. Here, $E(\pi)$ represents an empirical prior that is usually conditioned on hidden states at a higher level in deep (i.e., hierarchical) generative models. Note that outcomes at the horizon are random variables with a likelihood distribution, whereas outcomes in the past are realized variables. The distributions indicated by $Q$ are variational distributions that take on different interpretations throughout this letter, depending on where we stand in time: they are posterior probabilities when conditioned on data that have already been observed, but play the role of (empirical) priors when referring to observations that have yet to be made.

The first equality shows that the variational free energy, expected under the posterior over policies, plays the role of an accuracy, while the complexity of posterior beliefs about policies is the divergence from prior beliefs.^{7} In other words, variational free energy scores the evidence for a particular policy that accrues from observed outcomes. The priors over policies also have the form of a free energy. For interested readers, the appendix provides a fairly comprehensive motivation of this functional form, from complementary perspectives. In addition, Table 1 provides a glossary of variables used in this letter. We now consider the role of free energy in exact, approximate, and amortized inference, before turning to active inference and policy selection.

### 2.2 Perception as Inference

Perception corresponds to inverting the generative model: optimizing posterior beliefs about hidden states by a gradient descent on variational free energy (Beal, 2003; Friston, Parr, & de Vries, 2017; Parr, Markovic, Kiebel, & Friston, 2019). In this form, the free energy gradients constitute a prediction error: the difference between the posterior surprisal and its predicted value.

Table 1: Glossary of variables and notation.

| Notation | Variable |
|---|---|
| $P(\cdot)$ | Probability distribution |
| $Q(\cdot)$ | Variational posterior or empirical prior distribution |
| $F$ | Variational free energy |
| $G$ | Expected free energy |
| $u_\tau$ | Action at time $\tau$ |
| $o=(o_1,o_2,\ldots,o_\tau,\ldots)$ | Observations |
| $s=(s_1,s_2,\ldots,s_\tau,\ldots)$ | Hidden (latent) states |
| $\pi$ | Policy (sequence of actions) |
| $\mathbf{s}_\tau^\pi$ | Expectation of state at time $\tau$ under $Q(s_\tau \mid \pi)$ |
| $\mathbf{s}_\tau^u$ | Expectation of state at time $\tau$ under $Q(s_\tau \mid u_\tau)$ |
| $\mathbf{v}_\tau^\pi$ | Log expectation of state at time $\tau$ under $Q(s_\tau \mid \pi)$ |
| $\mathbf{o}_\tau^u$ | Expectation of observation at time $\tau$ under $Q(o_\tau \mid u_{<\tau})$ |
| $\mathbf{u}_\tau^o$ | Expectation of action at time $\tau$ under $Q(u_\tau \mid o_\tau)$ |
| $\mathbf{A}$ | Parameters of categorical likelihood distribution |
| $\mathbf{B}$ | Parameters of categorical transition probabilities |
| $\mathbf{C}$ | Parameters of categorical prior preferences |
| $\mathbf{D}$ | Parameters of categorical initial state probabilities |
| $\mathbf{H}$ | Conditional entropy of likelihood distribution |
| $a, \mathbf{a}$ | Prior and posterior Dirichlet parameters for $\mathbf{A}$ |
| $b, \mathbf{b}$ | Prior and posterior Dirichlet parameters for $\mathbf{B}$ |
| $d, \mathbf{d}$ | Prior and posterior Dirichlet parameters for $\mathbf{D}$ |
| $Cat(\cdot)$ | Categorical probability distribution |
| $Dir(\cdot)$ | Dirichlet probability distribution |
| $\mathbb{E}_P[\cdot]$ | Expectation under the subscripted probability distribution |
| $H[\cdot]$ | Shannon entropy of a probability distribution |
| $D_{KL}[\cdot \Vert \cdot]$ | Kullback-Leibler divergence between probability distributions |
| $\psi(\cdot)$ | Digamma function |
| $\sigma(\cdot)$ | Softmax (normalized exponential) function |


### 2.3 Planning as Inference

Planning as inference rests on minimizing two functionals: the variational free energy and the expected free energy. This can be expressed in terms of a generalized free energy that includes the parameters of the generative model (e.g., the likelihood parameters, $A$).


The equivalence between the expected free energy as shown in equations 2.6 and 2.7 rests on a mean-field assumption that equates the variational posterior for states and parameters with the product of their marginal posteriors. This means that policy selection minimizes risk and ambiguity. Risk, in this setting, is simply the difference between predicted and prior beliefs about final states. In other words, policies will be deemed more likely if they bring about states that conform to prior preferences. In the optimal control literature, this part of expected free energy underwrites KL control (Todorov, 2008; van den Broek et al., 2010). In economics, it leads to risk-sensitive policies (Fleming & Sheu, 2002). Ambiguity reflects the uncertainty about future outcomes, given hidden states. Minimizing ambiguity therefore corresponds to choosing future states that generate unambiguous and informative outcomes (e.g., switching on a light in the dark).
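The risk-plus-ambiguity decomposition described here is commonly written in the following form (a sketch for a single future time step $\tau$, in the notation of Table 1; summing over the policy horizon is implicit):

$$
G(\pi) \;=\; \underbrace{D_{KL}\!\left[ Q(s_\tau \mid \pi) \,\Vert\, P(s_\tau) \right]}_{\text{risk}} \;+\; \underbrace{\mathbb{E}_{Q(s_\tau \mid \pi)}\, H\!\left[ P(o_\tau \mid s_\tau) \right]}_{\text{ambiguity}}
$$

The first term penalizes predicted states that diverge from preferred (prior) states; the second penalizes states whose outcomes are uninformative.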


Intrinsic value is also known as intrinsic motivation in neurorobotics (Barto et al., 2013; Oudeyer & Kaplan, 2007; Ryan & Deci, 1985), the value of information in economics (Howard, 1966), salience in the visual neurosciences, and (rather confusingly) Bayesian surprise in the visual search literature (Itti & Baldi, 2009; Schwartenbeck, Fitzgerald, Dolan, & Friston, 2013; Sun, Gomez, & Schmidhuber, 2011). In terms of information theory, intrinsic value is mathematically equivalent to the expected mutual information between hidden states in the future and their consequences, consistent with the principles of minimum redundancy or maximum efficiency (Barlow, 1961, 1974; Linsker, 1990). Finally, from a statistical perspective, maximizing intrinsic value (i.e., salience and novelty) corresponds to optimal Bayesian design (Lindley, 1956) and machine learning derivatives, such as active learning (MacKay, 1992). On this view, active learning is driven by novelty—namely, the information gain afforded to beliefs about model parameters, given future states and their outcomes. Heuristically, this curiosity resolves uncertainty about “what would happen if I did that?” (Schmidhuber, 2010). Figure 1 illustrates the compass of expected free energy, in terms of its special cases, ranging from optimal Bayesian design through to Bayesian decision theory.
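The same functional can be rearranged into extrinsic and intrinsic value. Schematically, for a single future time step (a sketch of the standard form, not the letter's exact equation):

$$
G(\pi) \;=\; -\underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\!\left[ \ln P(o_\tau) \right]}_{\text{extrinsic value}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\, D_{KL}\!\left[ Q(s_\tau \mid o_\tau, \pi) \,\Vert\, Q(s_\tau \mid \pi) \right]}_{\text{intrinsic value (information gain)}}
$$

Minimizing $G$ therefore maximizes both expected log preferences and the expected information gain about hidden states, which underwrites the epistemic behavior discussed above.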

## 3 Sophisticated Inference

So far, we have considered generative models of policies—namely, a fixed number of ordered action sequences. These generative models can be regarded as placing priors over actions that stipulate a small number of allowable action sequences. In what follows, we consider more general models, in which the random variables are actions at each point in time, such that policies become a prior over transitions among action or control states. If we relax this prior, such that successive actions are conditionally independent, we can simplify belief updating, and implicit planning, at the expense of having to consider a potentially enormous number of policies.

The simplification afforded by assuming actions are conditionally independent follows because both actions and states become Markovian. This means we can use belief propagation (Winn & Bishop, 2005; Yedidia, Freeman, & Weiss, 2005) to update posterior beliefs about hidden states and actions, given each new observation. In other words, we no longer need to evaluate the posterior over hidden states in the past to evaluate a posterior over policies. Technically, this is because policies introduced a semi-Markovian aspect to belief updating by inducing conditional dependencies between past and future hidden states. The upshot of this is that one can use posterior beliefs from the previous time step as empirical priors for hidden states and actions at the subsequent time step. This is formally equivalent to the forward pass in the forward-backward algorithm (Ghahramani & Jordan, 1997), where the empirical prior over hidden states depends on the preceding (i.e., realized) action. Put simply, we are implementing a Bayesian filtering scheme in which observations are generated by action at each time step. Crucially, the next action is sampled from an empirical prior based on (a free energy functional of) posterior beliefs about the current hidden state.
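The forward pass described above reduces to a simple filtering step: the previous posterior is propagated through the action-conditioned transition matrix to form an empirical prior, which is then updated by the likelihood of the new observation. A minimal sketch with NumPy (the function name and toy numbers are our own, for illustration):

```python
import numpy as np

def filter_step(q_s, A, B_u, obs):
    """One step of Bayesian filtering: propagate beliefs through the
    action-conditioned transition, then condition on the new observation."""
    prior = B_u @ q_s            # empirical prior for the next hidden state
    posterior = A[obs] * prior   # Bayes rule, using the likelihood row for obs
    return posterior / posterior.sum()

# toy example: two states, two outcomes, a single 'stay' action
A = np.array([[0.9, 0.1],        # P(o | s): rows index outcomes
              [0.1, 0.9]])
q = filter_step(np.array([0.5, 0.5]), A, np.eye(2), obs=0)
```

Starting from flat beliefs, observing outcome 0 concentrates the posterior on the state that most plausibly generated it.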

At this point, the cost of the Markovian assumption arises: if we choose a policy horizon that is too far into the future, the number of policies could be enormous. In other words, we could effectively induce a deep tree search over all possible sequences of future actions that would be computationally prohibitive. However, we can now turn to sophisticated schemes to finesse the combinatorics. This rests on the straightforward observation that if we propagate beliefs and uncertainty into the future, we only need to evaluate policies or paths that have a nontrivial likelihood of being pursued. This selective search over plausible paths is constrained at two levels. First, by propagating probability distributions, we can restrict the search over future outcomes—for any given action at any point in the future—that have a nontrivial posterior probability (e.g., greater than 1/16). Similarly, we only need to evaluate those policies that are likely to be pursued—namely, those with an expected free energy that renders their prior probability nontrivial (e.g., greater than 1/16).

This deep search involves evaluating all actions under all plausible outcomes so that one can perform counterfactual belief updating at each point in time (given all plausible outcomes). However, it is not necessary to evaluate outcomes per se; it is sufficient to evaluate distributions over outcomes, conditioned on plausible hidden states. This is a subtle but important aspect of finessing the combinatorics of belief propagation into the future and rests on having a generative model (that generates outcomes).
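The restricted tree search described in this section can be sketched as a short recursion: for each action, evaluate one-step risk and ambiguity; then, for each plausible outcome (above the pruning threshold), perform a counterfactual belief update and recurse, averaging the next-step expected free energy under the softmax posterior over actions. This is a minimal illustration under our own naming and conventions, not the reference implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def efe_recursive(q_s, A, B, log_C, depth, threshold=1/16):
    """Recursively evaluate expected free energy G over actions.
    q_s: posterior over states; A: likelihood P(o|s); B: dict of
    action-conditioned transitions; log_C: log preferences over outcomes."""
    n_actions = len(B)
    G = np.zeros(n_actions)
    for u in range(n_actions):
        qs_next = B[u] @ q_s                 # predicted states under action u
        qo_next = A @ qs_next                # predicted outcomes
        # risk: divergence of predicted from preferred outcomes
        risk = qo_next @ (np.log(qo_next + 1e-16) - log_C)
        # ambiguity: expected conditional entropy of the likelihood
        H = -(A * np.log(A + 1e-16)).sum(axis=0)
        G[u] = risk + H @ qs_next
        if depth > 1:
            for o, p_o in enumerate(qo_next):
                if p_o > threshold:          # prune implausible outcomes
                    q_post = A[o] * qs_next  # counterfactual belief update
                    q_post = q_post / q_post.sum()
                    G_next = efe_recursive(q_post, A, B, log_C, depth - 1, threshold)
                    # average next-step EFE under the posterior over actions
                    G[u] += p_o * (softmax(-G_next) @ G_next)
    return G

# toy model: two states, two outcomes, 'stay' and 'switch' actions
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
B = {0: np.eye(2), 1: np.array([[0., 1.], [1., 0.]])}
log_C = np.log(np.array([0.7, 0.3]))         # mild preference for outcome 0
G = efe_recursive(np.array([0.5, 0.5]), A, B, log_C, depth=2)
```

The pruning threshold confines the recursion to plausible branches, so the cost grows with the number of likely paths rather than with all possible paths.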

Put simply, an unsophisticated creature entertains beliefs of the form *if I did that, I would find out about this*. A sophisticated creature additionally believes that

*if I found that out, I would then do this*. An intuitive example is deciding whether to check the news, look at the weather forecast, read a novel, or go for a walk. The first two options might offer similar information gain and would appeal equally to an unsophisticated agent. Without knowing the weather, the latter two are hard to disambiguate, given a preference for walking in the sun and for reading indoors when it rains. A more sophisticated agent will find the weather forecast more salient than the news: knowing the weather determines whether the next action will be to go for a walk or to stay in and read, given that the preferred option is more likely to be chosen once the weather is known.

This sort of approach to evaluating a tree of possible policies, using a recursive form for the expected free energy, has been suggested by others (Çatal, Verbelen, Nauta, Boom, & Dhoedt, 2020; Çatal, Wauthier, et al., 2020), who have applied it in the context of robot vision and navigation. The distinction between this and the formulation presented here is the sophisticated aspect: here, each additional step into the future evaluates the expected free energy in terms of the beliefs anticipated at that time point, as opposed to beliefs held (at the present) about that time point. Despite this difference, the similarities in these approaches speak to the feasibility of scaling sophisticated inference to high-dimensional problems.

Having established the formal basis of sophisticated planning, in terms of belief propagation, we now turn to some illustrative examples to show how it works in practice.

## 4 Simulations

In this section, we provide some simulations to compare sophisticated and unsophisticated schemes on the three-armed bandit task described in section 1. Here, we frame this paradigm in terms of a rat foraging in a three-arm T-maze, where the right and left upper arms are baited with rewards and punishments, and the bottom arm contains an instructional cue indicating whether the bait is likely to be on the right or left. In these examples, cue validity was 95%. The details of this setup have been described elsewhere (Friston et al., 2016; Friston, FitzGerald et al., 2017). In brief, the generative model comprises a likelihood mapping between hidden states and outcomes and probability transitions among states. Here, there are two outcome modalities. The first reports the experience of the rat in terms of its location (with distinct outcomes at the instructional cue location: *right* versus *left*). The second registers rewarding outcomes, with three levels (*none*, *reward*, and *punishment*—for example, foot shock). There are two hidden state factors: the rat's location (with four possibilities) and the latent context (i.e., whether the rewarding arm is on the right or the left). With these hidden states and outcomes, we specify the generative model in terms of the following:

- The sensory mapping **A**, which maps from the two hidden state factors (location and context) to each of the two sensory modalities (location and reward).
- The transition matrices **B**, which govern how states at one time point map onto the next, given a particular action $(u_t)$. The transitions among locations are action dependent, with four actions (moving to one of the four locations), while the context does not change during any particular trial (i.e., there are no context transitions within trials).
- The cost vectors **C** for each hidden state factor, which also specify the agent's preferences for each outcome modality. The latter allows for an alternative formulation that we discuss below.
- The priors over initial states, **D**.
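The model components listed above can be sketched directly as arrays. The structure follows the description in the text, but the reward contingencies and preference values below are illustrative numbers of our own choosing, not the letter's exact parameters:

```python
import numpy as np

# Hidden state factors: location (centre, left, right, cue) and context (2).
# Outcome modalities: location and reward (none, reward, punishment).
n_loc, n_ctx = 4, 2

# B: action-dependent location transitions. Action u moves the rat to
# location u from anywhere; the context never changes within a trial.
B_loc = np.stack([np.tile(np.eye(n_loc)[:, [u]], (1, n_loc)) for u in range(n_loc)])
B_ctx = np.eye(n_ctx)

# A (reward modality): indexed [outcome, location, context].
A_reward = np.zeros((3, n_loc, n_ctx))
A_reward[0, [0, 3], :] = 1.0                    # centre and cue arm: no reward
A_reward[1, 1, 0] = A_reward[1, 2, 1] = 0.98    # baited arm: mostly reward
A_reward[2, 1, 0] = A_reward[2, 2, 1] = 0.02
A_reward[2, 1, 1] = A_reward[2, 2, 0] = 0.98    # wrong arm: mostly punishment
A_reward[1, 1, 1] = A_reward[1, 2, 0] = 0.02

# C: log preferences over reward outcomes (prefer reward, avoid punishment)
C_reward = np.array([0.0, 3.0, -6.0])

# D: start at the centre location, with a uniform prior over the context
D_loc = np.eye(n_loc)[0]
D_ctx = np.ones(n_ctx) / n_ctx
```

Each likelihood and transition array is column stochastic (probabilities over outcomes or next states sum to one), which the belief updates in section 3 rely on.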

In the following simulations, the rat experienced 32 trials, each comprising two moves with three outcomes, including an initial outcome that located the rat at the start (i.e., center) location. The rat encountered the first trial with ambiguous prior beliefs about the context, that is, the reward was equally likely to be right or left.


Here, **H** is the conditional entropy of the likelihood distribution. The sufficient statistics are the parameters of the categorical distributions in equation 3.2, where model parameters are usually hyperparameterized in terms of the concentration parameters of Dirichlet distributions (denoted by capital and lowercase bold variables, respectively).
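Schematically, this hyperparameterization reads as follows (a sketch in the notation of Table 1, not a reproduction of equation 3.2):

$$
P(\mathbf{A}) = Dir(a), \qquad P(\mathbf{B}) = Dir(b), \qquad P(\mathbf{D}) = Dir(d), \qquad P(o_\tau \mid s_\tau, \mathbf{A}) = Cat(\mathbf{A})
$$

The Dirichlet form is conjugate to the categorical likelihoods, which is what makes learning by the accumulation of concentration parameters (below) straightforward.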

While crude, this works under the assumption that if one policy is 16 times less likely than the alternatives, given how far it has been evaluated, it is unlikely to be redeemed by evaluating it further. As such, there are savings to be had in not doing so. If there were no constraints on computational resources (temporal or thermodynamic), the pruning threshold could be set to zero, ensuring an exhaustive evaluation of all possible policies. The principles that underwrite sophisticated inference do not depend on this specific implementational detail, and alternative methods could be used.
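As a minimal sketch of this pruning heuristic (the function name and interface are illustrative, not the paper's implementation):

```python
import numpy as np

def prune_policies(neg_efe, threshold=1/16):
    """Retain policies whose likelihood, relative to the best policy
    evaluated so far, exceeds the pruning threshold."""
    odds = np.exp(neg_efe - np.max(neg_efe))   # relative (unnormalized) odds
    return odds >= threshold

# A policy more than 16 times (about 2.77 nats) less likely than the
# best is dropped from further evaluation.
G = np.array([-1.0, -1.5, -4.0])               # negative expected free energies
keep = prune_policies(G)                       # -> [True, True, False]
```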

Other approaches to searching through policy trees include schemes like Thompson sampling (Ortega & Braun, 2010; Osband, Van Roy, Russo, & Wen, 2019; Thompson, 1933), which sample from the posterior probability for states and select policies that maximize preferences given this sample. Like the threshold we have selected, this simplifies the search through alternative policies by using samples in place of evaluating the full posterior probabilities. With enough exposure to a task, Thompson sampling ensures that the full space of plausible policies is attempted, possibly finding “optimal” policies that are discounted by early pruning under our approach. In our setting, Thompson sampling would not be appropriate because our focus is on inference (selecting the best policy within a trial) as opposed to learning a policy over many exposures to a trial. Having said this, it is worth highlighting that action selection using the sophisticated inference scheme involves sampling from the posterior distribution over actions—subject to some temperature parameter. While this parameter is typically very large so that the maximum a posteriori action is chosen, this could be relaxed to ensure the occasional selection of unlikely actions, in the spirit of Thompson sampling.

**d**). In these generative models, learning is straightforward and involves the accumulation of posterior concentration parameters (Friston et al., 2016). For example, to learn the likelihood mapping and initial hidden states, we have:^{14}
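The accumulation of concentration parameters described here can be sketched as follows, assuming a single outcome modality and hidden factor (cf. note 13); the function name and learning rate are illustrative:

```python
import numpy as np

def update_dirichlet(a, o, s, lr=1.0):
    """Add the outer product of the observed outcome (one-hot vector o)
    and the posterior over hidden states (s) to the concentration
    parameters a of a likelihood mapping."""
    return a + lr * np.outer(o, s)

a = np.ones((3, 4))                  # flat prior: 3 outcomes, 4 states
o = np.eye(3)[1]                     # outcome 1 was observed
s = np.array([0.7, 0.2, 0.1, 0.0])   # posterior over hidden states
a = update_dirichlet(a, o, s)

# Initial-state priors are learned analogously, by adding the posterior
# over the initial state to the concentration parameters d.
```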

### 4.1 Exploration and Exploitation in a T-Maze

**C**) of $-$2 and 2, respectively.^{15}

In these and subsequent simulations, actions were selected as the most likely (maximum a posteriori) action. Therefore, all subsequent simulations are deterministic realizations of Bayes-optimal behavior based on expected free energy. The simulations start with a sophisticated agent with a planning horizon of two (corresponding to the depth of action sequences considered into the future). In other words, it accumulates the expected free energy for all plausible paths until the end of each trial. This enables confident and definitive epistemic policy selection that gives way to exploitation, once the rat realizes the reward is always located in the left arm.

If we compare this performance with that of an unsophisticated rat, which looks just one move ahead, we see a similar behavior. However, there are two differences. First, the rat is less confident about its behavior because it does not evaluate the consequences of its actions in terms of belief updating. Although it finds the instructional cue more attractive, in virtue of its epistemic affordance, it is still partially compelled to remain at the central location, which ensures that it will avoid aversive outcomes. Because the unsophisticated agent underestimates the epistemic affordance of the instructional cue, it paradoxically performs better in terms of suspending its information foraging earlier and switching to exploitative behavior a few trials before the sophisticated agent (but see below).

For completeness, we show the results of an unsophisticated agent, whose behavior is predicated on Bayesian risk, that is, with no epistemic value in play. As might be anticipated, this agent exposes itself to Bayesian risk, forgoing a visit to the right or left arm, in a way that is precluded by agents who minimize expected free energy. Here, the starting and instructional cue locations are equally attractive. When the rat is lucky enough to select the lower arm, it knows what to do; however, it has no sense that this is the right kind of behavior. After a sufficient number of trials, it realizes that the reward is always on the left-hand side and starts to respond in an exploitative fashion, albeit with relatively low confidence. These results highlight the distinction between sophisticated and unsophisticated agents who predicate their policy selection on expected free energy and between unsophisticated agents using expected free energy with and without epistemic affordance.

### 4.2 Deep Planning and Navigation

The simulations show that a sophisticated belief-updating scheme enables more confident and nuanced policy selection, which translates into more efficient exploitative behavior. To illustrate how this scheme scales up to deeper policy searches, we revisit a problem that has been previously addressed using a bespoke prior, based on the graph Laplacian (Kaplan & Friston, 2018). This problem was previously framed in terms of navigation to a target location in a maze. Here, we forgo any special priors to see if the sophisticated scheme could handle deep tree searches that underwrite paradoxical behaviors, like moving away from a target to secure it later (see the mountain car problem). Crucially, in this instance, there was no ambiguity about the hidden states. However, there was ambiguity or uncertainty about the likelihood mapping that determines whether a particular location should be occupied. In other words, this example uses a more conventional foraging setup in which the rat has to learn about the structure of the maze while simultaneously pursuing its prior preferences to reach a target location. Here, exploratory behavior is driven by the intrinsic value or information gain afforded to beliefs about parameters of the likelihood model (as opposed to hidden states). Colloquially, one can think of this as epistemic affordance that is underwritten by novelty as opposed to salience (Barto et al., 2013; Parr & Friston, 2019a; Schwartenbeck et al., 2019). Having said this, we anticipated that exactly the same kind of behavior would arise and that the sophisticated scheme would be able to plan to learn and then exploit what it has learned.

In this paradigm, a rat has to navigate an 8 × 8 grid maze, where each location may or may not deliver a mildly aversive stimulus (e.g., a foot shock). Navigation is motivated by prior preferences to occupy a target location—here, the center. In the simulations below, the rat starts at the entrance to the maze and has a prior preference for safe outcomes (cost of $-$1) and against aversive outcomes (cost of $+$1). Prior preferences for location depend on the distance from the current position to the target location. The generative model for this setup is simple: there was one hidden factor with 64 states corresponding to all possible locations. These hidden states generate safe or aversive (somatosensory) outcomes, depending on the location. In addition, (exteroceptive) cues are generated that directly report grid location. The five allowable actions comprise one step in any direction or staying put.
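A minimal sketch of the action-dependent transitions for this maze follows, assuming zero-indexed coordinates and the convention that moves off the grid leave the rat in place (the text does not specify boundary handling, so that convention is our assumption):

```python
import numpy as np

W = H = 8                              # 8 x 8 grid, 64 locations
n_states = W * H
moves = {"stay": (0, 0), "up": (0, -1), "down": (0, 1),
         "left": (-1, 0), "right": (1, 0)}

# B[next, current, action]: each of the five actions moves the rat one
# step (or not at all); moves off the grid leave it where it is.
B = np.zeros((n_states, n_states, len(moves)))
for u, (dx, dy) in enumerate(moves.values()):
    for x in range(W):
        for y in range(H):
            nx = min(max(x + dx, 0), W - 1)
            ny = min(max(y + dy, 0), H - 1)
            B[ny * W + nx, y * W + x, u] = 1.0
```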

Figure 5 shows the results of typical simulations when increasing the planning horizon from 1 through to 4. The key point here is that there is a critical horizon, which enables our subject to elude local minima of expected free energy as it pursues its goal. In these simulations, our subject was equipped with full knowledge of the aversive locations and simply planned a route to its target location. However, relatively unsophisticated agents get stuck on the other side of aversive barriers that are closest to the target location. In other words, they remain in locations in which the expected free energy of leaving is always greater than staying put (Cohen, McClure, & Yu, 2007). This can happen when the planning horizon is insufficient to enable the rat to contemplate distal (and potentially preferable) outcomes (as seen in the lower left and middle panels of Figure 5). However, with a planning horizon of 4 (or more), these local minima are vitiated, and the rat easily plans—and executes—the shortest path to the target. In these simulations, the total number of moves was eight, which is sufficient to reach the target via the shortest path. This sort of behavior is reminiscent of the prospective planning required to solve things like the mountain car problem. In other words, the path of least expected free energy can often involve excursions through state (and belief) space that point away from the ultimate goal.
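To give a flavor of how the planning horizon enters the recursion, here is a deliberately simplified Python sketch of a depth-limited search over negative expected free energy. It retains only the pragmatic (preference) term and the pruning of implausible branches; it omits the epistemic terms and the conditioning of future beliefs on future outcomes that characterize full sophistication, so it is a caricature of the scheme, not the scheme itself:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def plan(belief, horizon, A, B, C, threshold=1/16):
    """Depth-limited recursion over negative expected free energy.
    belief: distribution over hidden states; A: likelihood (outcomes x
    states); B: transitions (states x states x actions); C: preferences
    over outcomes (in nats). Returns one value per action."""
    n_u = B.shape[-1]
    nG = np.zeros(n_u)
    for u in range(n_u):
        s_next = B[:, :, u] @ belief        # predictive beliefs about states
        o_next = A @ s_next                 # predictive beliefs about outcomes
        nG[u] = o_next @ C                  # pragmatic value (prior preference)
    if horizon > 1:
        odds = np.exp(nG - nG.max())
        for u in range(n_u):
            if odds[u] >= threshold:        # prune implausible branches
                s_next = B[:, :, u] @ belief
                nG_next = plan(s_next, horizon - 1, A, B, C, threshold)
                # accumulate the next step's value, averaged under the
                # predicted distribution over the next action
                nG[u] += softmax(nG_next) @ nG_next
    return nG
```

Even in this toy form, increasing `horizon` lets distal preferences outweigh proximal ones, which is the mechanism by which local minima of expected free energy are escaped.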

To aid with intuition as to the evaluation of alternative policies, we explicitly evaluated some of the policies that could be chosen with a planning horizon of two. Assuming the maze layout is known, there is little uncertainty to resolve, and preferences (i.e., costs) will be the primary determinant of behavior. Starting from the maze entrance (2,8), the options are shown in Table 2.

Here, we can see that when we consider only the first step, there is a cost of $+$2.6 associated with choosing up and a cost of $+$6.0 for choosing left. Remembering that cost is formulated as a negative log probability, this means up is about 30 times more likely than left, suggesting that we need not evaluate policies starting with left (which fall below the 1/16 threshold) any further. Inspection of the options for the second step of these policies, and comparison with those for the policies starting with up, suggests the cost incurred at the first step cannot be compensated for at the second.

For all policies surviving the 1/16 threshold, we then have to consider the next step. For the example in Table 2, we could do this simply by taking the total cost of each second-step action, using a softmax operator as in equation 4.1 to compute the relative probability of each action, and averaging the cost incurred under these probabilities. Adding this to the cost from the first step, and repeating for all policies not eliminated by the 1/16 threshold, we arrive at the (log) probability distribution over the first action—here, favoring up.

**Table 2.** Costs (in nats) for each candidate first action from the maze entrance at (2,8), together with the costs of the second-step options that follow it.

| Step 1 Action | Square Color (nats) | Target Proximity (nats) | Step 2 Action | Square Color (nats) | Target Proximity (nats) |
|---|---|---|---|---|---|
| Stay at (2,8) | $-$1 | $+$4.2 | Up to (2,7) | $-$1 | $+$3.6 |
| | | | Down to (2,8) | $-$1 | $+$4.2 |
| | | | Left to (1,8) | $+$1 | $+$5.0 |
| | | | Right to (3,8) | $+$1 | $+$3.6 |
| | | | Stay at (2,8) | $-$1 | $+$4.2 |
| Up to (2,7) | $-$1 | $+$3.6 | Up to (2,6) | $+$1 | $+$2.2 |
| | | | Down to (2,8) | $-$1 | $+$4.2 |
| | | | Left to (1,7) | $+$1 | $+$4.4 |
| | | | Right to (3,7) | $-$1 | $+$2.8 |
| | | | Stay at (2,7) | $-$1 | $+$3.6 |
| Down to (2,8) | $-$1 | $+$4.2 | Up to (2,7) | $-$1 | $+$3.6 |
| | | | Down to (2,8) | $-$1 | $+$4.2 |
| | | | Left to (1,8) | $+$1 | $+$5.0 |
| | | | Right to (3,8) | $+$1 | $+$3.6 |
| | | | Stay at (2,8) | $-$1 | $+$4.2 |
| Left to (1,8) | $+$1 | $+$5.0 | Up to (1,7) | $+$1 | $+$4.4 |
| | | | Down to (1,8) | $+$1 | $+$5.0 |
| | | | Left to (1,8) | $+$1 | $+$5.0 |
| | | | Right to (2,8) | $-$1 | $+$4.2 |
| | | | Stay at (1,8) | $+$1 | $+$5.0 |
| Right to (3,8) | $+$1 | $+$2.8 | Up to (3,7) | $-$1 | $+$2.8 |
| | | | Down to (3,8) | $+$1 | $+$3.6 |
| | | | Left to (2,8) | $-$1 | $+$4.2 |
| | | | Right to (4,8) | $-$1 | $+$4.2 |
| | | | Stay at (3,8) | $+$1 | $+$2.8 |
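As a worked example of this averaging, consider the policies starting with up. Using the second-step costs read off Table 2 (square color plus target proximity), a short Python sketch of the calculation reads:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Costs (square color + target proximity, in nats) for the five
# second-step options after "up to (2,7)", taken from Table 2:
# up (2,6): 1 + 2.2; down (2,8): -1 + 4.2; left (1,7): 1 + 4.4;
# right (3,7): -1 + 2.8; stay (2,7): -1 + 3.6.
step2_cost = np.array([3.2, 3.2, 5.4, 1.8, 2.6])

p = softmax(-step2_cost)             # lower cost -> higher probability
expected_step2 = p @ step2_cost      # average cost under these probabilities
total_up = 2.6 + expected_step2      # add the first-step cost of "up"
```

This yields a total of roughly 5.0 nats for policies beginning with up, which can then be compared with the corresponding totals for the other first actions surviving the 1/16 threshold.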


Finally, to simulate curiosity under a task set, we reinstated prior preferences about location. In this simulation, the rat has to resolve the dual imperative to satisfy its curiosity, while at the same time realizing preferences for being at the center of the maze. In other words, it has to contextualize its goal-seeking behavior in relation to what it knows about how to realize those goals. Figure 7 shows the results of a simulation in which the rat was given five exposures to the maze, each comprising eight moves with a planning horizon of four. Within four exposures, it has learned what it needs to learn—about the aversive locations—to plan the shortest path to its target location and execute that path successfully (dotted black line in the left panel of Figure 7). In contrast to Figure 6, the exploration is now limited to preferred locations with precise likelihood mappings that are sufficient to encompass the shortest path (compare the left panels of Figures 6 and 7).

This completes our numerical analyses, in which we have looked at deep policy searches predicated on expected free energy, where expected free energy supplements Bayesian risk with epistemic affordance in terms of either salience (resolving uncertainty about hidden states) or novelty (resolving uncertainty about hidden model parameters).

## 5 Conclusion

This letter has described a recursive formulation of expected free energy that effectively instigates a deep tree search for planning as inference. The ensuing planning is sophisticated, in the sense that it entails beliefs about beliefs—in virtue of accumulating predictive posterior expectations of expected free energies down plausible paths. In other words, instead of just propagating beliefs about the consequences of successive actions, the scheme simulates belief updating in the future, based on preceding beliefs about the consequences of action. This scheme was illustrated using a simple T-maze problem and a navigation problem that required a deeper search.

In section 1, we noted that active inference may be difficult to scale, although remarkable progress has been made in this direction recently using amortized inference and sampling. For example, Ueltzhöffer (2018) parameterized both the generative model and approximate posterior with function approximators, using evolutionary schemes to minimize variational free energy when gradients were not available. Similarly, Millidge (2019) amortized perception and action by learning a parameterized approximation to expected free energy. Çatal et al. (2019) focused on learning prior preferences, using a learning-from-example approach. Tschantz et al. (2019) extended previous point-estimate models to include full distributions over parameters. This allowed them to apply active inference to continuous control problems (e.g., the mountain car problem, the inverted pendulum task, and a challenging hopper task) and demonstrate an order of magnitude increase in sampling efficiency relative to a strong model-free baseline (Lillicrap et al., 2015). (See Tschantz et al., 2019, for a full discussion and a useful deconstruction of active inference, in relation to things like model-based reinforcement learning; Schrittwieser et al., 2019.)

Note that the navigation example is an instance of planning to learn. As such, it solves the kinds of problems that reinforcement learning and its variants usually address. In other words, we were able to solve a learning problem from first (i.e., variational) principles without recourse to backward induction or other (belief-free) schemes like Q-learning, SARSA, or successor representations (e.g., Dayan, 1993; Gershman, 2017; Momennejad et al., 2017; Russek, Momennejad, Botvinick, Gershman, & Daw, 2017). This is potentially important because predicating an optimization scheme on inference, as opposed to learning, endows it with a context sensitivity that eludes many learning algorithms (Daw, Gershman, Seymour, Dayan, & Dolan, 2011). In other words, because there are probabilistic representations of time-sensitive hidden states (and implicit uncertainty about those states), behavior is motivated by resolving uncertainty about the context in which an agent is operating. This may be the kind of (Bayesian) mechanics that licenses the notion of competent schemes that can both learn to plan and plan to learn.

The current formulation of active inference does not call on sampling or matrix inversions; the Bayes optimal belief-updating deals with uncertainty in a deterministic fashion. Conceptually, this reflects the difference between the stochastic aspects of random dynamical systems and the deterministic behavior of the accompanying density dynamics, which describe the probabilistic evolution of those systems (e.g., the Fokker-Planck equation). Because active inference works in belief spaces, that is, on statistical manifolds (Da Costa, Parr, Sengupta, et al., 2020), there is no need for sampling or random searches; the optimal paths are instead evaluated by propagating beliefs or probability distributions into the future to find the path of least variational free energy (Friston, 2013).

In the setting of deep policy searches, this approach has the practical advantage of terminating searches over particular paths when they become implausible. For example, in the navigation example, there were five actions and 64 hidden states, leading to a large number of potential paths (1.0486 × 10^{10} for a planning horizon of four and 1.0737 × 10^{15} for a planning horizon of six). However, only a tiny fraction of these paths is actually evaluated—usually several hundred, which takes a few hundred milliseconds on a personal computer. Given reasonably precise beliefs about current states and state transitions, only a small number of paths are eligible for evaluation, which leads us to our final comment on the scalability of active inference.
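For the record, the quoted figures are consistent with counting (actions × states)^horizon action-outcome paths:

```python
# 5 actions and 64 hidden states give 320 action-outcome pairs per step,
# so an exhaustive search grows as (5 * 64) ** horizon:
actions, states = 5, 64
paths_4 = (actions * states) ** 4    # 10,485,760,000      ~ 1.0486e10
paths_6 = (actions * states) ** 6    # 1.073741824e15      ~ 1.0737e15
```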

### 5.1 Limitations

In one sense, we have addressed scaling through the computational efficiency afforded by belief propagation using a sophisticated scheme. However, we have illustrated this scheme only on rather trivial problems. In principle, one can scale up the dimensionality of state spaces (and outcomes) with a degree of impunity. This follows from the fact that the number of plausible states (and transitions) can be substantially constrained, using the right kind of generative model—one that leverages factorizations and sparsity. For example, the factorization between hidden states and actions used above rests on the implicit assumption that every action is allowed from every state. This is a strong assumption but perfectly apt for many generative models.

One could also call on a related symmetry—namely, a hierarchical separation of temporal scales in deep models, where one Markov decision process is placed on top of another (Friston, Rosch, et al., 2017; George & Hawkins, 2009; Hesp, Smith, et al., 2019; Rikhye et al., 2019). In these models, transitions at the higher level usually unfold at a slower timescale than the level below. This engenders semi-Markovian dependencies that can generate complicated and structured behaviors. In this setting, one could consider hidden states at higher levels that generate the initial and final states of the level below. Policy optimization within each level, using a sophisticated scheme, could then realize the trajectory between the initial states (i.e., empirical priors over initial states) and final states (i.e., priors that determine the cost function and subsequent empirical priors over action).

Finally, it should be noted that in many applications, the states and actions of real-world processes are continuous, which presents a further scaling challenge for discrete state-space models. However, it is possible to combine sophisticated (discrete) schemes with continuous models, provided one uses the appropriate message passing between the continuous and discrete levels. For example, Friston, Parr, et al. (2017) used a Markov decision process to drive continuous eye movements. Indeed, it would be interesting to revisit simulations of saccadic searches using sophisticated inference, especially in the context of reading.

## Appendix: Expected Free Energy

This appendix considers two lemmas that underwrite expected free energy from two complementary perspectives. The first is based on a generative model that combines the principles of optimal Bayesian design (Lindley, 1956) and decision theory (Berger, 2011), while the second is based on a principled account of self-organization (Friston, 2019; Parr et al., 2020). Finally, we consider several corollaries that speak to the notions of active inference (Friston et al., 2015), empowerment (Klyubin, Polani, & Nehaniv, 2005), information bottlenecks (Tishby et al., 1999), self-organization (Friston, 2013), and self-evidencing (Hohwy, 2016). In what follows, $Q(o_\tau, s_\tau, \pi)$ denotes a predictive distribution over future variables and policies, conditioned on initial observations, while $P(o_\tau, s_\tau, \pi)$ denotes a generative model—that is, a marginal distribution over final states and policies. For simplicity, we omit model parameters and assume policies start from the current time point, allowing us to omit the variational free energy from the generalized free energy (since observational evidence is the same for all policies).

### A.1 Objective

Our objective is to establish a generalized free energy functional that can be minimized with respect to a posterior over policies, noting that this posterior is necessary to marginalize the joint posterior over hidden states and policies to infer hidden states. To comply with Bayesian decision theory, generalized free energy can be constructed to place an upper bound on Bayesian risk, which corresponds to the divergence between the predictive distribution over outcomes and prior preferences. In other words, Bayesian risk is the expected surprisal or negative log evidence. Confusingly, Bayesian risk and expected risk are two different quantities. The former is the expected surprisal, while the latter is a KL-divergence between predicted and preferred outcomes (or states). To comply with optimal Bayesian design, one can specify priors over policies that lead to states with a precise likelihood mapping to observable outcomes.
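The relationship between the two quantities can be made explicit with a standard identity (written here in the notation of this appendix): expected surprisal is the KL divergence plus the entropy of the predictive distribution,

```latex
\underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[-\ln P(o_\tau)\big]}_{\text{Bayesian risk (expected surprisal)}}
\;=\;
\underbrace{D_{\mathrm{KL}}\big[Q(o_\tau \mid \pi)\,\big\|\,P(o_\tau)\big]}_{\text{expected risk}}
\;+\;
\underbrace{H\big[Q(o_\tau \mid \pi)\big]}_{\text{predictive entropy}}
```

so the two coincide only when the predictive entropy can be neglected.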

The generalized free energy^{16} is an upper bound on risk, under a generative model whose priors over policies lead to states with precise likelihoods:

Here, policies are treated as random variables, which means planning as inference (Attias, 2003; Botvinick & Toussaint, 2012) becomes belief updating under optimal Bayesian design priors (Lindley, 1956; MacKay, 1992). One might ask what licenses these priors above. Although they can be motivated in terms of information gain (see equation A.4), there is a more straightforward motivation that arises as a steady-state solution. We now turn to this complementary perspective that inherits from the Bayesian mechanics described in Friston (2019). Here, we are interested in situations when the predictive distribution attains its steady-state or target distribution.

It may seem odd to predicate optimal behavior on a steady-state distribution. However, the fact that action and its consequences can be expressed probabilistically implies the existence of a (steady-state) joint distribution that does not change over time. In what follows, we use the existence of this steady-state distribution to express the posterior over policies as a functional of the distribution over other variables, given a particular policy. This functional is expected free energy. This represents a deflationary approach to optimality, in the sense that optimal policies are just those that underwrite a steady state. The question now is, What kind of steady state are we interested in?

We will make a distinction between simple and general steady states in terms of the degeneracy (i.e., many-to-one mapping) of policies to any final state. Simple steady states are characterized by a unique path of least action from some initial observations to a final state. This would be appropriate for describing classical systems, such as a pendulum or planetary bodies. Conversely, a general steady state allows for multiple paths from initial observations to the final state, which means the entropy or uncertainty about which path was actually taken is high. We will be particularly interested in the autonomous behavior of systems whose steady state is maintained by multiple (degenerate) paths or policies. The ensuing distinction can be characterized by a scalar quantity corresponding to the relative entropy or precision of policies and outcomes, conditioned on final states (and initial observations). This scalar $\beta \u22650$ is not a free parameter; it just characterizes the kind of steady state at hand. Note that in this setup, the notion of optimality is replaced by (or reduces to) the existence of a steady state, which may or may not be simple.

### A.2 Objective

We seek distributions over policies that afford steady-state solutions, that is, when the final distribution does not depend on initial observations. Such solutions ensure that on average, stochastic policies lead to a steady-state or target distribution specified by the generative model. These solutions exist in virtue of conditional independencies, where the hidden states provide a Markov blanket (cf. information bottleneck) that separates policies from outcomes. In other words, policies cause final states that cause outcomes. Put simply, policies influence outcomes, but only via hidden states. We will see below that there is a family of such solutions, where the Bayes optimality solution above is a special (canonical) case. In what follows, $Q(o_\tau, s_\tau, \pi) := P(o_\tau, s_\tau, \pi \mid o_{\leq t})$ can be read as a posterior distribution, given initial conditions.

This solution corresponds to standard stochastic control, variously known as KL control or risk-sensitive control (van den Broek et al., 2010). In other words, one picks policies that minimize the divergence between the predictive and target distributions. In different sorts of systems, the relationship between the entropies ($\beta$) may differ, and different values of this parameter may be appropriate for describing those kinds of systems. More generally (i.e., $\beta > 0$), policies are more likely when they lead to states with a precise likelihood mapping. One perspective on the distinction between simple and general steady states is in terms of conditional uncertainty about policies. For example, simple (i.e., $\beta = 0$) steady states preclude uncertainty about which policy led to a final state. This would be appropriate for describing classical systems (that follow a unique path of least action), where it would be possible to infer which policy had been pursued given the initial and final outcomes. Conversely, in general steady-state systems (e.g., mice and men), simply knowing that "you are here" does not tell me "how you got here," even if I knew where you were this morning. Put another way, there are many paths or policies open to systems that attain a general steady state.

The treatment in Friston (2019) effectively turns the steady-state lemma on its head by assuming the steady-state in equation A.5 is stipulatively true—and then characterizes the ensuing self-organization in terms of Bayes optimal policies. In active inference, we are interested in a certain class of systems that self-organize to general steady states: those that move through a large number of probabilistic configurations from their initial state to their final (steady) state. In terms of information geometry, this means that the information distance between any initial and the final (steady) state is large. In the current setting, we could replace information distance (Crooks, 2007; Kim, 2018) by information gain (Lindley, 1956; MacKay, 1992; Still & Precup, 2012). That is, we are interested in systems that attain steady state (i.e., target distributions) with policies associated with a large information gain.^{17} Although not pursued here, general steady states with precise likelihood mappings have precise Fisher information matrices and information geometries that distinguish general forms of self-organization from simple forms (Amari, 1998; Ay, 2015; Caticha, 2015; Ikeda, Tanaka, & Amari, 2004; Kim, 2018). This perspective can be unpacked in terms of information theory with the following corollaries, which speak to active inference, empowerment, information bottlenecks, self-organization, and self-evidencing.

**Corollary** (Active Inference). If a system attains a general steady state, then by the Bayes optimality lemma, it will appear to behave in a Bayes optimal fashion in terms of both optimal Bayesian design (i.e., exploration) and Bayesian decision theory (i.e., exploitation). Crucially, the loss function defining Bayesian risk is the negative log evidence for the generative model entailed by an agent. In short, systems (i.e., agents) that attain general steady states will look as if they are responding to epistemic affordances (Parr & Friston, 2017).


## Software Note

Although the generative model changes from application to application, the belief updates described in this letter are generic and can be implemented using standard routines (here, spm_MDP_VB_XX.m). These routines are available as Matlab code in the SPM academic software: http://www.fil.ion.ucl.ac.uk/spm/. The simulations in this letter can be reproduced (and customized) via a graphical user interface by typing $>>$ DEM and selecting the appropriate **(T-maze or Navigation) demo**.

## Notes

^{1}

Technically, a functional is defined as a function whose arguments (in this case, beliefs about hidden states) are themselves functions of other arguments (in this case, observed outcomes generated by hidden states).

^{2}

Expected free energy can be read as risk plus ambiguity: *risk* is taken here to be the relative entropy (i.e., KL divergence) between predicted and preferred outcomes, while *ambiguity* is the conditional entropy (i.e., conditional uncertainty) about outcomes given their causes.
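In symbols, one common way of writing this decomposition for a single future time point, with $G(\pi)$ the expected free energy of policy $\pi$, is:

```latex
G(\pi)
\;=\;
\underbrace{D_{\mathrm{KL}}\big[Q(o_\tau \mid \pi)\,\big\|\,P(o_\tau)\big]}_{\text{risk}}
\;+\;
\underbrace{\mathbb{E}_{Q(s_\tau \mid \pi)}\Big[H\big[P(o_\tau \mid s_\tau)\big]\Big]}_{\text{ambiguity}}
```

In practice, these terms are summed over the future time points of a policy.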

^{3}

Exploration here has been associated with the resolution of ambiguity or uncertainty about hidden states, namely, the context in which the agent is operating (i.e., left or right arm payoff). More conventional formulations of exploration could remove the prior belief that the right and left arms have a complementary payoff structure, such that the agent has to learn the probabilities of winning and losing when selecting either arm. However, exactly the same principles apply: the right and left arms now acquire an epistemic affordance in virtue of resolving uncertainty about the contingencies that underlie payoffs as opposed to hidden states. We will see how this falls out of expected free energy minimization later.

^{4}

Bayesian risk is taken to be negative expected utility, that is, expected loss under some predictive posterior beliefs (about hidden states).

^{5}

Epistemic affordance is taken to be the information gain or relative entropy of predictive beliefs (about hidden states) before and after an action.
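These two quantities (Bayesian risk as negative expected utility, and epistemic affordance as expected information gain) can be sketched together for a toy two-state model; the likelihood `A`, beliefs `Qs`, and utilities `U` are hypothetical:

```python
import numpy as np

A  = np.array([[0.9, 0.2],   # P(o | s): hypothetical likelihood mapping
               [0.1, 0.8]])
Qs = np.array([0.5, 0.5])    # predictive beliefs about hidden states
U  = np.array([1.0, 0.0])    # utility assigned to each outcome

Qo = A @ Qs                  # predictive distribution over outcomes
bayes_risk = -float(Qo @ U)  # Bayesian risk: negative expected utility

# Epistemic affordance: expected KL between predictive beliefs about hidden
# states after and before each possible observation
gain = 0.0
for o, po in enumerate(Qo):
    post = A[o] * Qs / po                     # Bayes rule: Q(s | o)
    gain += po * np.sum(post * np.log(post / Qs))
```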

^{6}

Note that both $F[Q(s,\pi )]$ and $F(\pi )$ depend on present and past observations. However, this dependence is typically left implicit, a convention we adhere to in this letter.

^{7}

Generally log evidence is accuracy minus complexity, where accuracy is the expected log likelihood and complexity is the KL divergence between posterior and prior beliefs.
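When the posterior is exact, this decomposition holds with equality. A minimal numerical check, using a hypothetical two-state prior and likelihood:

```python
import numpy as np

prior = np.array([0.5, 0.5])   # P(s): hypothetical prior over hidden states
lik   = np.array([0.9, 0.2])   # P(o* | s) for the outcome actually observed

evidence = float(lik @ prior)          # P(o*) = sum_s P(o* | s) P(s)
post = lik * prior / evidence          # exact posterior P(s | o*)

accuracy   = float(post @ np.log(lik))                    # E[log P(o* | s)]
complexity = float(np.sum(post * np.log(post / prior)))   # KL[post || prior]
# For the exact posterior: log evidence = accuracy - complexity
```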

^{8}

Where $v$ can be thought of as transmembrane voltage or depolarization and $s$ corresponds to the average firing rate of a neuronal population (Da Costa, Parr, Sengupta, & Friston, 2020).

^{9}

Surprisal is the self-information or negative log probability of outcomes (Tribus, 1961).

^{10}

The empirical free energy is usually based on inferences at a higher level in a hierarchical generative model. For details on hierarchical generative models, see Friston, Rosch, Parr, Price, and Bowman (2017).

^{12}

Because the expected evidence bound cannot be less than zero, the expected free energy of a policy is always greater than the negative of the extrinsic value (i.e., expected log evidence) plus the intrinsic value (i.e., expected information gain).

^{13}

We have suppressed any tensor notation here by assuming there is only one outcome modality and one hidden factor. In practice, this assumption can be satisfied by working with the Kronecker tensor product of hidden factors. This ensures exact Bayesian inference, because conditional dependencies among hidden factors are evaluated.
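A sketch of this joint (Kronecker) representation, with two hypothetical hidden factors:

```python
import numpy as np

# Two hypothetical hidden factors: context (2 states) and location (3 states)
q_context  = np.array([0.7, 0.3])
q_location = np.array([0.2, 0.5, 0.3])

# Beliefs over the Kronecker product space cover all six joint states, so
# subsequent inference can represent conditional dependencies between the
# factors rather than forcing them to remain independent
q_joint = np.kron(q_context, q_location)
```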

^{14}

Note that in order to accumulate beliefs about the context from trial to trial, it is necessary to carry over posterior beliefs about context from one trial as prior beliefs for the next (in the form of Dirichlet concentration parameters). For consistency with earlier formulations of this paradigm, we carry over the beliefs about the initial state on the previous trial that are evaluated using a conventional backwards pass—namely, the normalized likelihood of any given initial state, given subsequent observations—and probability transitions based on realized action.
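A minimal sketch of carrying posterior beliefs over as Dirichlet concentration parameters (the counts and posterior below are hypothetical, not taken from the letter):

```python
import numpy as np

d = np.ones(2)                    # flat Dirichlet counts over initial states

# After each trial, add the posterior over the initial state (e.g., from a
# backwards pass) to the counts, so that this trial's posterior becomes an
# empirical prior for the next trial
posterior_initial = np.array([0.9, 0.1])
d = d + posterior_initial

prior_next = d / d.sum()          # expected initial-state prior, next trial
```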

^{15}

Because costs are specified in terms of self-information or surprisal, they have meaningful and quantitative units. For example, a differential cost of three natural units corresponds to an odds ratio of about 20:1 (since $e^{3} \approx 20$) and reflects a strong preference for one state or outcome over another. This is the same interpretation given to Bayes factors in statistics (Kass & Raftery, 1995). Here, the difference between reward and punishment was four natural units.
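The correspondence between natural units and odds is a one-line calculation:

```python
import math

odds_three_nats = math.exp(3)   # ~ 20.1, i.e., odds of roughly 20:1
odds_four_nats  = math.exp(4)   # ~ 54.6, the reward-punishment gap here
```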

^{16}

Equation A.1 follows from equation 3.1 when treating $F(\pi )$ and $E(\pi )$ as constants, that is, when ignoring past observations and empirical priors over policies.

^{17}

Note that a divergence such as information gain is not a measure of distance. The information distance (a.k.a. information length) can be regarded as the accumulated divergences along a path on a statistical manifold from the initial location to the final location.
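As an illustrative sketch (not from the letter), accumulating local divergences along a fine path of Bernoulli distributions yields a path length, whereas the single endpoint divergence does not behave like a distance:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D[p || q]."""
    return float(np.sum(p * np.log(p / q)))

# Path of Bernoulli distributions from p = 0.2 to p = 0.8 in small steps
ps = np.linspace(0.2, 0.8, 200)
length = 0.0
for a, b in zip(ps[:-1], ps[1:]):
    p, q = np.array([a, 1 - a]), np.array([b, 1 - b])
    length += np.sqrt(2 * kl(p, q))   # local divergence ~ squared line element

direct = kl(np.array([0.2, 0.8]), np.array([0.8, 0.2]))  # endpoint divergence
```

As the step size shrinks, the accumulated length approaches the Fisher information length between the endpoints, which the endpoint KL divergence alone does not recover.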

## Acknowledgments

K.J.F. was funded by the Wellcome Trust (088130/Z/09/Z). L.D. is supported by the Fonds National de la Recherche, Luxembourg (13568875). C.H. was funded by a Research Talent Grant (406.18.535) of the Netherlands Organisation for Scientific Research. We have no disclosures or conflicts of interest.

## References
