Whence the Expected Free Energy?

The Expected Free Energy (EFE) is a central quantity in the theory of active inference. It is the quantity that all active inference agents are mandated to minimize through action, and its decomposition into extrinsic and intrinsic value terms is key to the balance of exploration and exploitation that active inference agents evince. Despite its importance, the mathematical origins of this quantity and its relation to the Variational Free Energy (VFE) remain unclear. In this paper, we investigate the origins of the EFE in detail and show that it is not simply "the free energy in the future". We present a functional that we argue is the natural extension of the VFE, but which actively discourages exploratory behaviour, thus demonstrating that exploration does not directly follow from free energy minimization into the future. We then develop a novel objective, the Free-Energy of the Expected Future (FEEF), which possesses both the epistemic component of the EFE and an intuitive mathematical grounding as the divergence between predicted and desired futures.

[...] advantage over other formulations that typically encourage exploration by adding ad hoc exploratory terms to their loss function (Burda et al., 2018; Mohamed & Rezende, 2015; Oudeyer & Kaplan, 2009; Pathak, Agrawal, Efros, & Darrell, 2017). While the EFE is often described as a straightforward extension of the free energy principle that can account for prospective policies, and is typically expressed in similar mathematical form (Parr & Friston, 2017b), its origin remains obscure. Minimization of the EFE is sometimes motivated by a reductio ad absurdum argument following from the FEP (K. Friston et al., 2015), in that agents are driven to minimize the VFE, and therefore the only way they can act is to minimize their free energy into the future. Since the future is uncertain, however, they must instead minimize the expected free energy. Central to this logic is the formal identification of the VFE with the EFE.

In this paper, we set out to investigate the origin of the EFE and its relation to the VFE. We provide a broader perspective on this question, showing that the EFE is not the only way to extend the VFE to account for action-conditioned futures. We derive an objective which we believe to be a more natural analogue of the VFE, which we call the Free Energy of the Future (FEF), and make a detailed side-by-side comparison of the two functionals. Crucially, we show that the FEF actively discourages information-seeking behaviour, thus demonstrating that epistemic terms do not necessarily arise simply from extending the VFE into the future. We then investigate the origin of the epistemic term of the EFE, and show that the EFE is just the FEF minus the expected information gain, which provides a straightforward perspective on the relation between the two functionals.

We then propose our own mathematically principled starting point for action selection under active inference: the divergence between desired and expected futures. From it we derive a novel bound, which we call the Free-Energy of the Expected Future (FEEF) and which is closely related to the generalized free energy. We show that this objective has a natural interpretation in terms of the divergence between a veridical and a biased generative model, that it allows use of the same functional for both inference and policy selection, and that it naturally decomposes into an extrinsic value term and an epistemic action term. It thus maintains the attractive exploratory properties of EFE-based active inference while also possessing a mathematically principled starting point and an intuitive interpretation.

The Variational Free Energy
The Variational Free Energy (VFE) is a core quantity in variational inference and constitutes a tractable bound on both the negative log model evidence and the KL divergence between the approximate and true posteriors (Beal et al., 2003; Blei, Kucukelbir, & McAuliffe, 2017; Fox & Roberts, 2012; Wainwright, Jordan, et al., 2008). For an in-depth motivation of the VFE and its use in variational inference, see Appendix A.

The VFE, defined at time $t$ and denoted by $F_t$, is given by

$$F_t = D_{KL}\big[Q(x_t|o_t;\phi)\,\|\,p(o_t,x_t)\big]. \tag{1}$$

The agent receives observations $o_t$ and must infer the values of hidden states $x_t$. The agent assumes that the environment evolves according to a Markov process, so that the distribution over states at the current time-step depends only on the state at the previous time-step, and the observation generated at the current time-step depends only on the state at the current time-step. Under these Markov assumptions, a distribution over a trajectory of states and observations factorises as

$$p(o_{0:T}, x_{0:T}) = p(x_0)\prod_{t=1}^{T} p(o_t|x_t)\,p(x_t|x_{t-1}).$$

In this paper, we also consider inference over future states and observations which have yet to be observed. Such future variables are denoted $o_\tau$ or $x_\tau$, where $\tau > t$. To avoid dealing with infinite sums, agents only consider futures up to some finite time horizon, denoted $T$. $Q(x_t|o_t;\phi)$ denotes an approximate posterior density parametrised by $\phi$ which, during the course of variational inference, is fit as closely as possible to the true posterior.

Note that there is a slight difference in notation here compared to that usually used in variational inference. Normally the approximate posterior is written as $Q(x_t;\phi)$, without the dependence on $o$ made explicit. This is because the variational posterior is not a direct function of observations, but rather the result of an optimization process which depends on the observations. Here, we make the dependence on $o$ explicit to keep a clear distinction between the variational posterior $Q(x_t|o_t;\phi)$, obtained through optimization of the variational parameters $\phi$, and the variational prior $Q(x_t) = \mathbb{E}_{p(x_t|x_{t-1})}[Q(x_{t-1}|o_{t-1};\phi)]$, obtained by mapping the previous posterior through the transition dynamics.

Throughout this paper, we assume that inference is occurring in a discrete-time Partially-Observed Markov Decision Process (POMDP). This is to ensure compatibility with the EFE formulation later on, which is also situated within discrete-time POMDPs.¹

The utility of the VFE for inference comes from the fact that the VFE is equal to the divergence between the approximate and true posteriors up to a constant, $F_t = D_{KL}[Q(x_t|o_t;\phi)\,\|\,p(x_t|o_t)] - \ln p(o_t)$, where the log evidence $\ln p(o_t)$ does not depend on $\phi$. Since $-\ln p(o_t) \geq 0$ for discrete observations, it follows that $F_t \geq D_{KL}[Q(x_t|o_t;\phi)\,\|\,p(x_t|o_t)]$. Thus, minimizing $F_t$ with respect to the parameters of the variational distribution makes $Q(x_t|o_t;\phi)$ a good approximation of the true posterior.
One can also motivate the VFE as a technique to estimate model evidence. Log model evidence is a key quantity in Bayesian inference but is often intractable, meaning it cannot be computed directly. Intuitively, the log model evidence scores the likelihood of the data under a model, and thus provides a direct measure of the quality of a model. Under the free energy principle, minimizing the negative log model evidence (or surprisal) is the ultimate goal of self-organising systems (K. Friston & Ao, 2012a, 2012b; K. Friston et al., 2006). The VFE provides an upper bound on the negative log model evidence. This can be shown by importance sampling the model evidence with respect to the approximate posterior, and applying Jensen's inequality:

$$-\ln p(o_t) = -\ln \int dx_t\, Q(x_t|o_t;\phi)\,\frac{p(o_t,x_t)}{Q(x_t|o_t;\phi)} \;\leq\; -\int dx_t\, Q(x_t|o_t;\phi)\ln\frac{p(o_t,x_t)}{Q(x_t|o_t;\phi)} = D_{KL}\big[Q(x_t|o_t;\phi)\,\|\,p(o_t,x_t)\big] = F_t.$$

Since the VFE is an upper bound on the surprisal, as the VFE is minimized it becomes an increasingly accurate estimate of the surprisal. To get a feel for the properties of the VFE, we showcase the following decomposition:

$$F_t = \underbrace{-\mathbb{E}_{Q(x_t|o_t;\phi)}\big[\ln p(o_t|x_t)\big]}_{\text{Negative Accuracy}} + \underbrace{D_{KL}\big[Q(x_t|o_t;\phi)\,\|\,p(x_t)\big]}_{\text{Complexity}}. \tag{2}$$

This decomposition is the one typically used to compute the VFE in practice and has a straightforward interpretation.

¹ It is important to note that the original FEP was formulated in continuous time with generalised coordinates (K. Friston et al., 2006), where the hidden states are augmented with their temporal derivatives up to theoretically infinite order. The generalised coordinates mean that the agent is effectively performing variational inference over a Taylor-expanded future trajectory instead of a temporally-instant hidden state. Action is derived by minimizing the gradients of the instantaneous VFE with respect to action, which requires the use of a forward model. More recent work on active inference and the FEP returns to the continuous-time formulation (K. Friston, 2019; Parr et al., 2020), and the conclusions drawn in this paper may look different in the continuous-time domain.
Specifically, minimizing the negative accuracy (and thus maximizing accuracy) ensures that the observations are as likely as possible under the states, x t , predicted by the variational posterior while simultaneously minimizing the complexity term, which is a KL divergence between the variational posterior and the prior. Thus the goal is to keep the posterior as close to the prior as possible while still maximizing accuracy. Effectively, the complexity term acts as an implicit regulariser, reducing the risk of overfitting to any specific observation.
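The accuracy/complexity decomposition and the bound on surprisal can be checked numerically on a small discrete model. The following is a minimal sketch, not part of the original derivation: the two-state prior, the likelihood values, and the helper names `kl` and `vfe` are all illustrative assumptions.

```python
import numpy as np

def kl(q, p):
    """D_KL[q || p] for two discrete distributions on the same support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

def vfe(q, prior, lik_o):
    """F = -E_q[ln p(o|x)] + D_KL[q || p(x)] for one received observation o.

    lik_o[i] = p(o | x=i) evaluated at the observation actually received.
    """
    accuracy = float(np.sum(q * np.log(lik_o)))
    complexity = kl(q, prior)
    return -accuracy + complexity

# Hypothetical two-state model (values chosen only for illustration).
prior = np.array([0.7, 0.3])            # p(x)
lik_o = np.array([0.2, 0.9])            # p(o | x) for the observed o
evidence = float(prior @ lik_o)         # p(o)
posterior = prior * lik_o / evidence    # exact p(x | o)
surprisal = -np.log(evidence)

f_exact = vfe(posterior, prior, lik_o)  # Q equal to the true posterior
f_prior = vfe(prior, prior, lik_o)      # Q left at the prior (no inference)
```

With the approximate posterior set to the exact posterior, the VFE coincides with the surprisal; with any other choice, the gap is the KL divergence between the approximate and true posteriors.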

The Expected Free Energy
While variational inference as presented above only allows us to perform inference at the current time given observations, it is possible to extend the formalism to allow for inference over actions or policies in the future.
To achieve this extension, a variational objective is required which can be minimized contingent upon future states and policies, which will allow the problem of adaptive action selection to be reformulated as a process of variational inference. To do this, the formalism must be extended in two ways. First, the generative model is augmented to include actions $a_\tau$ and policies, which are sequences of actions $\pi = [a_1, a_2, \ldots, a_T]$. The action taken at the current time can affect future states, and thus future observations. In order to transform action selection into an inference problem, policies are treated as an inferred distribution $Q(\pi)$ which is optimised to meet the agent's goals. The second extension required is to translate the notion of an agent's goals into this probabilistic framework. Active inference encodes an agent's goals as a desired distribution over observations $\tilde{p}(o_{\tau:T})$.² This distribution is then incorporated into a biased generative model of the world, $\tilde{p}(o_\tau, x_\tau) \approx \tilde{p}(o_\tau)Q(x_\tau|o_\tau)$,³ where we have additionally made the assumption that the true posterior can be well approximated with the variational posterior, $p(x_\tau|o_\tau) \approx Q(x_\tau|o_\tau)$, which simply states that the variational inference procedure was successful.⁴ Active inference proceeds by inferring a variational policy distribution $Q(\pi)$ that maximizes the evidence for this biased generative model. Intuitively, this approach turns the action selection problem on its head. Instead of saying "I have some goal, what do I have to do to achieve it?", the active inference agent asks "Given that my goals were achieved, what would have been the most probable actions that I took?"
A further complication of extending VFE into the future comes from future observations. While agents have access to current observations (or data) for planning problems, they must also reason about unknown future observations. This is dealt with by taking the expectation of the objective with respect to predicted observations o τ drawn from the generative model.
In the active inference framework, the goal is to infer a variational distribution over both hidden states and policies that maximally fits a biased generative model of the future. The framework defines the variational objective function to be minimized, the Expected Free Energy, from time $\tau$ until the time horizon $T$, which is denoted $G$:

$$G = \mathbb{E}_{Q(o_{\tau:T},\, x_{\tau:T},\, \pi)}\big[\ln Q(x_{\tau:T}, \pi) - \ln \tilde{p}(o_{\tau:T}, x_{\tau:T})\big].$$

A temporal mean-field factorisation of the approximate posterior and of the generative model is assumed, such that $Q(x_{\tau:T}, \pi) \approx Q(\pi)\prod_{\tau}^{T} Q(x_\tau|\pi)$ and $\tilde{p}(o_{\tau:T}, x_{\tau:T}) \approx \prod_{\tau}^{T} \tilde{p}(o_\tau, x_\tau)$. This factorisation neatly severs the temporal dependencies between time-steps. Given these assumptions, inferring the optimal $Q(\pi)$ turns out to be relatively straightforward:

$$G = \mathbb{E}_{Q(\pi)}\Big[\ln Q(\pi) + \sum_{\tau}^{T} G_\tau(\pi)\Big] = D_{KL}\Big[Q(\pi)\,\Big\|\,\exp\Big(-\sum_{\tau}^{T} G_\tau(\pi)\Big)\Big],$$

where $G_\tau(\pi) = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)]$ is defined to be the EFE for a single time-step $\tau$. From the KL-divergence above, it follows that the optimal variational policy distribution $Q^*(\pi)$ is determined by the path integral into the future of the expected free energies for each individual time-step:

$$Q^*(\pi) = \sigma\Big(-\sum_{\tau}^{T} G_\tau(\pi)\Big),$$

where $\sigma(x)$ is a softmax function. This implies that to infer the optimal policy distribution it suffices to minimize the sum of expected free energies for each time-step into the future. Inference proceeds by using the generative model to roll out predicted futures, computing the EFE of those futures, and then selecting policies which minimize the sum of the expected free energies. Since, under temporal mean-field assumptions, trajectories decompose into a sum over time-steps, it is sufficient for the rest of the paper to consider only a single time-step $\tau$.

³ Some more recent work (K. Friston, 2019) prefers an alternative factorisation of the biased generative model in terms of an unbiased likelihood and a biased prior state distribution, $\tilde{p}(o_\tau, x_\tau) = p(o_\tau|x_\tau)\tilde{p}(x_\tau)$. This leads to a different decomposition of the EFE in terms of risk and ambiguity (see Appendix B), but one which is mathematically equivalent to the factorisation described here.

⁴ For additional information on the effect of this assumption, see Appendix D.
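The policy-selection rule above can be sketched in a few lines. This is an illustrative example only: the array `G` of per-time-step expected free energies is made up, and the function name `policy_posterior` is a hypothetical helper rather than anything defined in the active inference literature.

```python
import numpy as np

def policy_posterior(G):
    """Q*(pi) = softmax(-sum_tau G_tau(pi)).

    G has shape (n_policies, horizon); entry [i, k] is the expected free
    energy of policy i at the k-th future time-step.
    """
    logits = -G.sum(axis=1)            # negative path integral per policy
    logits = logits - logits.max()     # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical EFE values for three policies over a three-step horizon.
G = np.array([[1.0, 0.5, 0.2],
              [0.8, 0.9, 1.1],
              [2.0, 0.1, 0.1]])
q_pi = policy_posterior(G)
```

The policy with the smallest summed EFE (here the first row) receives the highest posterior probability; the remaining policies are not excluded but are down-weighted exponentially in their summed EFE.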
To gain an intuition for the EFE, we showcase the following decomposition:

$$G_\tau(\pi) = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)\big] \approx \underbrace{-\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]}_{\text{Extrinsic Value}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Epistemic Value}}. \tag{3}$$

While the EFE admits many decompositions (see Appendix B for a comprehensive overview), the one presented in Equation 3 is perhaps the most important because it separates the EFE into an extrinsic, goal-directed term and an intrinsic, information-seeking term. The first term requires agents to maximize the likelihood of the desired observations $\tilde{p}(o_\tau)$ under beliefs about the future. It thus directs an agent to act to maximize the probability of its desires occurring in the future. It is called the extrinsic value term since it is the term in the EFE which accounts for the agent's preferences.

The second term in Equation 3 is the expected information gain, which is often termed the 'epistemic value' since it quantifies the amount of information gained by visiting a specific state. Since the information gain enters the EFE with a negative sign, minimizing the EFE as a whole mandates maximizing the expected information gain. This drives the agent to maximize the divergence between its posterior and prior beliefs, thus inducing the agent to take actions which maximally inform its beliefs and reduce uncertainty. It is the combination of extrinsic and intrinsic value terms which underlies active inference's claim to have a principled approach to the exploration-exploitation dilemma (K. Friston et al., 2015, 2017a). The idea of maximizing expected information gain or "Bayesian surprise" (Itti & Baldi, 2009) to drive exploratory behaviour has been argued for in neuroscience (Baldi & Itti, 2010; Ostwald et al., 2012), and has been regularly proposed in reinforcement learning (Houthooft et al., 2016; Still & Precup, 2012; Sun, Gomez, & Schmidhuber, 2011; Tschantz, Millidge, Seth, & Buckley, 2020). It is important to note, however, that in these prior works information gain has often been proposed as an ad hoc addition to an already existing objective function, with only the intuitive justification of boosting exploration. In contrast, expected information gain falls naturally out of the EFE formalism, arguably lending the formalism a degree of theoretical elegance.
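To make the two terms concrete, the sketch below evaluates the extrinsic and epistemic parts of the single-step EFE for a discrete model. It assumes, as an illustration rather than something stated in the text, that predicted observations are generated as Q(o, x | π) = p(o | x) Q(x | π); the likelihood matrix `A`, the flat preferences, and the two predicted state distributions are made-up values.

```python
import numpy as np

def efe_terms(A, qx, pref):
    """Single-step EFE terms for a discrete model.

    A[o, x] = p(o | x)   likelihood (assumed strictly positive)
    qx[x]   = Q(x | pi)  predicted states under a policy
    pref[o] = p~(o)      desired distribution over observations
    Returns (extrinsic, info_gain); the EFE is extrinsic - info_gain.
    """
    joint = A * qx                   # Q(o, x | pi), shape (n_obs, n_states)
    qo = joint.sum(axis=1)           # Q(o | pi)
    post = joint / qo[:, None]       # Q(x | o) by Bayes' rule
    extrinsic = -float(np.sum(qo * np.log(pref)))
    info_gain = float(np.sum(joint * (np.log(post) - np.log(qx))))
    return extrinsic, info_gain

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pref = np.array([0.5, 0.5])          # flat preferences: extrinsic terms tie
uncertain = np.array([0.5, 0.5])     # policy leading to uncertain states
confident = np.array([0.9, 0.1])     # policy leading to near-certain states

ext_u, ig_u = efe_terms(A, uncertain, pref)
ext_c, ig_c = efe_terms(A, confident, pref)
G_u, G_c = ext_u - ig_u, ext_c - ig_c
```

With flat preferences the extrinsic terms are equal, so the EFE ranks the two policies purely by information gain: the policy whose predicted states are more uncertain, and whose observations are therefore more informative, has the lower EFE.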

Origins of the EFE
Given the centrality of the EFE to the active inference framework, it is important to explore the origin and nature of this quantity. The EFE is typically motivated through a reductio ad absurdum argument (K. Friston et al., 2015).⁵ The logic is as follows. Agents have prior beliefs over policies that drive action selection. By the FEP, all states of an organism, including those determining policies, must change so as to minimize free energy.
Thus, the only self-consistent prior belief over policies is that the agent will minimize free energy into the future through its policy selection process. If the agent did not have such a prior belief, then it would select policies which did not minimize the free energy into the future and would thus not be a free-energy minimizing agent. This logic requires a well-defined notion of the free energy of future states and observations given a specific policy. The active inference literature implicitly assumes that the EFE is the natural functional which fits this notion (K. Friston et al., 2015, 2017b). In the following section, we argue that the EFE is not in fact the only functional which can quantify the notion of the free energy of policy-conditioned futures, and we propose a different functional, the Free Energy of the Future, which we argue is a more natural extension of the VFE to account for future states.

The Free Energy of the Future
We argue that the natural extension of the free energy into the future must possess direct analogs to the two crucial properties of the VFE. First, it must be expressible as a KL-divergence between a posterior and a generative model, such that minimizing it causes the variational density to better approximate the true posterior. Second, it must bound the log model evidence of future observations. Bounding the log model evidence (or surprisal) is vital since the surprisal is the core quantity which, under the FEP, all systems are driven to minimize. If the VFE extended into the future failed to bound the surprisal, then minimizing this extension would not necessarily minimize surprisal, and thus any agent which minimized such an extension would be in violation of the FEP. Here, we present a functional which we claim satisfies these desiderata: the Free Energy of the Future (FEF).
We wish to derive an expression for variational free energy at some future time τ that is conditioned on some policy π.
In other words, we wish to quantify the free energy that will occur at some future time point, given some sequence of actions. Here, we derive a form of the 'variational free energy of the future', denoted $FEF_\tau(\pi)$, by keeping the same terms as the VFE (Equation 1), but conditioning the variational distributions on our policy of interest and rewriting for the future time-point $\tau$. Additionally, since observations in the future are unknown, we must evaluate our free energy under the expectation of our beliefs about future observations, as in the EFE. We thus define:

$$FEF_\tau(\pi) = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|o_\tau) - \ln \tilde{p}(o_\tau, x_\tau)\big] = \mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,\tilde{p}(o_\tau, x_\tau)\big].$$

Since this equation is simply the (expected) KL-divergence between the variational posterior and the generative model, it satisfies the first desideratum. We next investigate the properties of the FEF by showcasing one key decomposition. As with the VFE, we can split the FEF into an energy and an entropy term, or into an accuracy and a complexity term, which correspond to the extrinsic and epistemic action terms in the EFE:

$$FEF_\tau(\pi) \approx \underbrace{-\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln \tilde{p}(o_\tau|x_\tau)\big]}_{\text{Negative Accuracy}} + \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Complexity}}.$$

Unlike in the EFE, however, the expected information gain (complexity term) enters the FEF with a positive sign, while in the EFE it is negative.

⁵ An alternative motivation exists which situates the expected free energy in terms of a non-equilibrium steady state distribution (K. Friston, 2019; Parr, 2019). This argument reframes everything in terms of a Gibbs free energy, from which the EFE can be derived as a special case. The problem then becomes one of motivating the Gibbs free energy as an objective function.
Since we wish to minimize both the FEF and the EFE, the FEF mandates us to minimize the information gain while the EFE requires us to maximize it. An FEF agent thus tries to maximize its reward while trying to explore as little as possible. While this sounds surprising, it is in fact directly analogous to the complexity term in the VFE (equation 2), which mandates maximizing the likelihood of an observation, while also keeping the posterior as close as possible to the prior.

Bounds on the Expected Model Evidence
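We next show how the FEF can be derived as a bound on the expected model evidence, satisfying the second desideratum. We define the expected model evidence to be a straightforward extension of the model evidence to unknown future states. The expected log model evidence for a trajectory from the current time-step $t$ to some time horizon $T$ is:

$$\mathbb{E}_{Q(o_{t:T}|\pi)}\big[\ln \tilde{p}(o_{t:T})\big].$$

This objective states that we wish to maximize the probability of being in a desired trajectory $\tilde{p}(o_{t:T})$, expected under the distribution of our beliefs about our likely future trajectories $Q(o_{t:T}|\pi)$. Given a Markov generative model, and assuming that the approximate posterior factorises as $Q(x_{1:T}|o_{1:T}) = \prod_{t}^{T} Q(x_t|o_t)$, the expected model evidence factorises across time-steps, so it suffices to show the derivation for a single time-step $\tau > t$ (see Appendix C for a full trajectory derivation). We further define [...]. We therefore take the expected model evidence for a single time-step, and show that the FEF is a bound on this quantity:

$$-\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] = -\mathbb{E}_{Q(o_\tau|\pi)}\Big[\ln \int dx_\tau\, Q(x_\tau|o_\tau)\frac{\tilde{p}(o_\tau, x_\tau)}{Q(x_\tau|o_\tau)}\Big] \leq -\mathbb{E}_{Q(o_\tau|\pi)}\Big[\int dx_\tau\, Q(x_\tau|o_\tau)\ln\frac{\tilde{p}(o_\tau, x_\tau)}{Q(x_\tau|o_\tau)}\Big] = \mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,\tilde{p}(o_\tau, x_\tau)\big] = FEF_\tau(\pi).$$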
Crucially, this is an upper bound on the negative expected log model evidence, which can be tightened by minimizing the FEF. By contrast, returning to the EFE, we see below that since KL divergences are always $\geq 0$, the expected information gain is always non-negative, and so the EFE is a lower bound on the negative expected log model evidence:

$$-\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] \;\geq\; -\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Expected Information Gain}} \approx G_\tau(\pi).$$

Since the expected information gain is an expected KL divergence, it must be $\geq 0$, and thus the negative expected information gain must be $\leq 0$. Since minimizing the EFE minimizes the negative information gain (thus maximizing the positive information gain), we can see that minimizing the EFE actually drives it further from the expected model evidence.⁶ We further investigate the EFE and its properties as a bound in Appendix D. Additionally, in Appendix E we review other attempts in the literature to derive the EFE as a bound on the expected model evidence and discuss their shortcomings.

⁶ There is a slight additional subtlety here involving the fact that there is also a posterior approximation error term which is positive. In general the EFE functions as an upper bound when the posterior error is greater than the information gain and a lower bound when the posterior error is smaller. Since the goal of variational inference is to minimize posterior error, and EFE agents are driven to maximize expected information gain, we expect the former condition to occur rarely. For more detail see Appendix D.

The EFE and the FEF
To get a stronger intuition for the subtle differences between the EFE and the FEF, we present a detailed side-by-side comparison of the two functionals:

$$FEF_\tau(\pi) = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|o_\tau) - \ln \tilde{p}(o_\tau, x_\tau)\big],$$
$$G_\tau(\pi) = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)\big].$$
While the two formulations might initially look very similar, the key difference is the variational term. The FEF, analogously to the VFE, measures the difference between a variational posterior $Q(x_\tau|o_\tau)$ and the generative model $\tilde{p}(o_\tau, x_\tau)$. The EFE, on the other hand, measures the difference between a variational prior $Q(x_\tau|\pi)$ and the generative model. It is this difference which makes the EFE not a straightforward extension of the VFE to future time-steps, and which underwrites its unique epistemic value term.
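We now demonstrate that both the EFE and the FEF can be decomposed into an expected likelihood, associated with extrinsic value, and an expected KL-divergence between a variational posterior and a variational prior, associated with epistemic value. We factorise the generative model in the FEF into the (biased) likelihood and a variational prior, and factorise the generative model in the EFE into an approximate posterior and a (biased) marginal:

$$FEF_\tau(\pi) \approx \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|o_\tau) - \ln \tilde{p}(o_\tau|x_\tau) - \ln Q(x_\tau|\pi)\big],$$
$$G_\tau(\pi) \approx \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau) - \ln Q(x_\tau|o_\tau)\big].$$

The variational prior and variational posterior can then be combined in both the FEF and the EFE to form epistemic terms. Crucially, the epistemic value term is positive in the FEF and negative in the EFE, meaning that the FEF penalizes epistemic behaviour whereas the EFE promotes it:

$$FEF_\tau(\pi) \approx \underbrace{-\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln \tilde{p}(o_\tau|x_\tau)\big]}_{\text{Extrinsic Value}} + \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Epistemic Value}},$$
$$G_\tau(\pi) \approx \underbrace{-\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]}_{\text{Extrinsic Value}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Epistemic Value}}. \tag{5}$$

Equation 5 demonstrates that the FEF and EFE can be decomposed in similar fashion. We note that the extrinsic value term is a likelihood for the FEF and a marginal for the EFE. The most important difference, however, lies in the sign of the epistemic value term. Since we wish to minimize both the FEF and the EFE, the FEF mandates us to minimize information gain while the EFE requires us to maximize it. An FEF agent thus tries to maximize its extrinsic value while trying to explore as little as possible. A key question then arises: where does the positive information gain in the EFE come from?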
While this difference in the sign of the expected information gain term may speak to some deep connection between the two quantities, here we offer a pragmatic perspective on the matter. We show that a possible route to the EFE is simply that it is the FEF minus the expected information gain. This implies that the epistemic value term of the EFE arises not from some connection to variational inference but is present by construction:

$$FEF_\tau(\pi) - \mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big] = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|o_\tau) - \ln \tilde{p}(o_\tau, x_\tau)\big] - \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|o_\tau) - \ln Q(x_\tau|\pi)\big] = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)\big] = G_\tau(\pi).$$

While this proof illustrates the relation between the EFE and the FEF, it is theoretically unsatisfying as an account of the origin of the EFE. A large part of the appeal of the EFE is that it purports to show that epistemic value arises 'naturally' out of minimizing free energy into the future. In contrast, here we have shown that minimizing free energy into the future requires no commitment to exploratory behaviour. While this does not question the usefulness of an information gain term for exploration, or the use of the EFE as a loss function, it does raise questions about the mathematically principled nature of the objective. It is thus not straightforward to see why agents are directly mandated by the FEP to minimize the EFE specifically, as opposed to some other free-energy functional. While this fact may at first appear concerning, we believe it ultimately enhances the power of the formalism by licensing the extension of active inference to encompass other objective functions in a principled manner (Biehl, Guckelsberger, Salge, Smith, & Polani, 2018). In the following section, we propose an alternative objective to the EFE, which results in the same information-seeking epistemic value term, but derives it in a mathematically principled and intuitive way as a bound on the divergence between expected and desired futures.
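The relation between the two functionals, and the bounding property of the FEF, can be verified numerically from their definitions. The sketch below uses randomly generated discrete distributions; the sizes, the random seed, and the assumption that the predictive joint factorises as Q(o, x | π) = p(o | x) Q(x | π) are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    """Strictly positive array normalised to sum to one."""
    w = rng.random(shape) + 0.1
    return w / w.sum()

n_obs, n_states = 3, 4
A = rng.random((n_obs, n_states)) + 0.1
A /= A.sum(axis=0, keepdims=True)          # p(o | x): columns sum to one
qx = random_dist(n_states)                 # Q(x | pi)
p_tilde = random_dist((n_obs, n_states))   # biased joint p~(o, x)

joint = A * qx                             # Q(o, x | pi)
qo = joint.sum(axis=1)                     # Q(o | pi)
post = joint / qo[:, None]                 # Q(x | o)

# Definitions of the two functionals and of the expected information gain.
fef = np.sum(joint * (np.log(post) - np.log(p_tilde)))
efe = np.sum(joint * (np.log(qx) - np.log(p_tilde)))
info_gain = np.sum(joint * (np.log(post) - np.log(qx)))

# Negative expected log evidence under the biased model's marginal p~(o).
neg_log_evidence = -np.sum(qo * np.log(p_tilde.sum(axis=1)))
```

For any choice of distributions, `efe` equals `fef - info_gain` up to floating-point error, and `fef` is never smaller than `neg_log_evidence`, consistent with the FEF being an upper bound on the negative expected log model evidence.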

Free Energy of the Expected Future
In this section, we propose a novel objective functional which we call the Free-Energy of the Expected Future (FEEF), which possesses the same epistemic value term as the EFE while additionally possessing a more naturalistic and intuitive grounding. We begin with the intuition that, to act adaptively, agents should act so as to minimize the difference between what they predict will happen and what they desire to happen. Put another way, adaptive action for an agent consists of forcing reality to unfold according to its preferences. We can mathematically formulate this objective as the KL divergence between expected and desired trajectories of future observations. We thus wish to choose the policy that minimizes this divergence:

$$\pi^* = \arg\min_\pi D_{KL}\big[Q(o_{t:T}|\pi)\,\|\,\tilde{p}(o_{t:T})\big].$$
While this KL divergence is difficult to minimize directly⁷ due to the marginal densities, it is possible to derive the FEEF as a tractable bound on this divergence. For a single time-step,

$$D_{KL}\big[Q(o_\tau|\pi)\,\|\,\tilde{p}(o_\tau)\big] \;\leq\; D_{KL}\big[Q(o_\tau, x_\tau|\pi)\,\|\,\tilde{p}(o_\tau, x_\tau)\big] = FEEF_\tau(\pi).$$

The FEEF can be interpreted as the divergence between a veridical and a biased generative model, and thus furnishes a direct intuition of the goals of a FEEF-minimizing agent. The divergence objective compels the agent to bring the biased and the veridical generative model into alignment. Since the predictions of the biased generative model are heavily biased towards the agent's a priori preferences, the only way to achieve this alignment is to act so as to make the veridical generative model predict desired outcomes in line with the biased generative model. The FEEF objective encompasses the standard active inference intuition of an agent acting through biased inference to maximize the accuracy of a biased model.

However, the maintenance of two separate generative models (one biased and one veridical) also helps finesse the conceptual difficulty of how the agent manages to make accurate posterior inferences and future predictions about complex dynamics if all it has access to is a biased generative model. It seems straightforward that the biased model would also bias these crucial parts of inference, which need to be unimpaired for the scheme to function at all. However, by keeping both a veridical generative model (the same one used at the present time and learnt through environmental interactions) and a biased generative model (created by systematically biasing a temporary copy of the veridical model), we elegantly separate the need for both veridical and biased inferential components for future prediction.⁸

Similarly to the EFE, the FEEF objective can be decomposed into an extrinsic and an intrinsic term. Writing the veridical predictive joint as $Q(o_\tau, x_\tau|\pi) = p(o_\tau|x_\tau)Q(x_\tau|\pi)$ and the biased model as $\tilde{p}(o_\tau, x_\tau) \approx \tilde{p}(o_\tau)Q(x_\tau|o_\tau)$, we compare this directly to the EFE decomposition:

$$FEEF_\tau(\pi) \approx \underbrace{\mathbb{E}_{Q(x_\tau|\pi)} D_{KL}\big[p(o_\tau|x_\tau)\,\|\,\tilde{p}(o_\tau)\big]}_{\text{Extrinsic Value}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Intrinsic Value}},$$
$$G_\tau(\pi) \approx \underbrace{-\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]}_{\text{Extrinsic Value}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)} D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]}_{\text{Intrinsic Value}}.$$

The first thing to note is that the intrinsic value terms of the FEEF and the EFE are identical, such that FEEF-minimizing agents will necessarily show identical epistemic behaviour to EFE-minimizing agents.
Unlike the EFE, however, the FEEF also possesses a strong naturalistic grounding as a bound on a theoretically relevant quantity. The FEEF can maintain both its information-maximizing imperative and its theoretical grounding since it is derived from the minimization of a KL divergence rather than the maximization of a log model evidence.
The key difference with the EFE lies in the likelihood term. While the EFE simply tries to maximize the expected evidence of the desired observations, the FEEF minimizes the KL divergence between the likelihood of observations predicted under the veridical generative model⁹ and the marginal likelihood of observations under the biased generative model. This difference is effectively equivalent to an additional entropy term for the likelihood of the veridical generative model, $H[p(o_\tau|x_\tau)]$, subtracted from the EFE. The extrinsic value term thus encourages the agent to choose its actions such that its predictions over states lead to observations which are close to its preferred observations, while also trying to move to states in which the entropy over observations is maximized, thus leading the agent towards states where the generative model is less certain about the likely outcome. In effect, the FEEF possesses another exploratory term, in addition to the information gain, which the EFE lacks.
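The claim that the FEEF equals the EFE minus an expected likelihood entropy can be illustrated on a small discrete example. This is a sketch under stated assumptions: the likelihood `A`, the predicted state distribution `qx`, and the preferences `pref` are made-up values, and both objectives are evaluated in their decomposed (extrinsic minus intrinsic) forms.

```python
import numpy as np

A = np.array([[0.9, 0.5],
              [0.1, 0.5]])             # p(o | x), columns sum to one
qx = np.array([0.3, 0.7])              # Q(x | pi)
pref = np.array([0.8, 0.2])            # p~(o), desired observations

joint = A * qx                         # Q(o, x | pi)
qo = joint.sum(axis=1)                 # Q(o | pi)
post = joint / qo[:, None]             # Q(x | o)
info_gain = np.sum(joint * (np.log(post) - np.log(qx)))

# EFE: expected negative log preference minus information gain.
efe = -np.sum(qo * np.log(pref)) - info_gain

# FEEF: expected KL between the likelihood and the preferences,
# E_{Q(x)} D_KL[p(o|x) || p~(o)], minus the same information gain.
feef_extrinsic = np.sum(joint * (np.log(A) - np.log(pref)[:, None]))
feef = feef_extrinsic - info_gain

# Expected entropy of the likelihood, E_{Q(x)} H[p(o | x)].
expected_entropy = -np.sum(joint * np.log(A))
```

Here `feef` matches `efe - expected_entropy` to floating-point precision, so a FEEF-minimizing agent is rewarded, relative to an EFE-minimizing one, for visiting states whose observations are more variable.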
Another important advantage of the FEEF is that it is mathematically equivalent to the VFE in the present time with a current observation. This is because when we have a real observation, the distribution over the possible observations collapses to a delta distribution, so that the outer expectation has no effect. Simultaneously, the distribution over desired observations also collapses to a delta at the real observation, given that (barring counterfactual reasoning capability) one cannot usefully desire things to be other than how they are at the present moment. This means that theoretically we can consider an agent to be both inferring and planning using the same objective, which is not true of the EFE. The EFE does not reduce to the VFE when observations are known, and thus requires a separate objective function to be minimized for planning compared to inference. Because of this, it is actually possible to argue that the FEEF is mandated by the free-energy principle. On this view there is no distinction between present and future inference, and both follow from minimizing the same objective but under different informational constraints.

⁹ The term 'veridical' needs some contextualising. We simply mean that the model is not biased towards the agent's desires. The veridical generative model is not required to be a perfectly accurate map of the agent's entire world, only of action-relevant sub-manifolds of the total space.
Since the FEEF and the EFE are identical in their intrinsic value term, and share deep similarities in their extrinsic term, we believe that the FEEF can serve as a relatively straightforward "plug-in replacement" for the EFE for many active inference agents. Moreover, it has a much more straightforward intuitive basis than the EFE, is arguably a better continuation of the VFE into the future, and possesses a strong naturalistic grounding as a bound on the divergence between predicted and desired futures.

Discussion
We believe it is valuable at this point to step back from the morass of various free-energies and take stock of what has been achieved. Firstly, we have shown that it is not possible to directly derive epistemic value from variational inference objectives which serve as a bound on model evidence. However, it is possible to derive epistemic value terms from divergences between expected and desired states. A deep intuitive understanding of why this is the case is an interesting avenue for future work. The intuition behind the FEEF as a divergence between desired and expected future observations is also similar to probabilistic formulations of the reinforcement learning problem (Attias, 2003; Kappen, 2005; Levine, 2018; Toussaint, 2009), which typically try to minimize the divergence between a controlled trajectory and an optimal trajectory (Kappen, 2007; E. A. Theodorou & Todorov, 2012; Williams, Aldrich, & Theodorou, 2017). These schemes also obtain some degree of (undirected) exploratory behaviour through their objective functionals, which contain entropy terms, and the FEEF can be seen as a way of extending these schemes to partially-observed environments. Understanding precisely how active inference and the free-energy principle relate mathematically to such schemes is another fruitful avenue for future work.
It seems intuitive that a Bayes-optimal solution to the exploration-exploitation dilemma should arise directly out of the formulation of reward maximization as inference, given that sources of uncertainty are correctly quantified. However, in this paper, we have shown that merely quantifying uncertainty in states and observations through mean-field-factorised time-steps is insufficient to derive such a principled solution to the dilemma, as seen by the exploration-discouraging behaviour of the FEF. We therefore believe that deriving Bayes-optimal exploration policies in the context of active learning (where we have to select actions that give us the most information now to use in the future to maximize rewards) is likely to require both modelling multiple interconnected time-steps and modelling the mechanics of learning with parameters and update rules, while correctly quantifying the uncertainties therein. This is beyond the scope of this paper, but is a very interesting avenue for future work.
The comparison of the FEEF and the EFE also raises an interesting philosophical point about the number and types of generative models employed in the active-inference formalism. One interpretation of the FEEF is in terms of two generative models, but other interpretations are possible, such as in terms of a single unbiased generative model and a simple density of desired states and observations. It is also important to note that, due to requiring different objective functions for inference and planning, the EFE formulation also appears to implicitly require two generative models: the generative model of future states, and the generative model of states in the future (K. Friston et al., 2015). While the mathematical formalism is relatively straightforward, the philosophical question of how to translate the mathematical objects into ontological objects called 'generative models' is unclear, and progress on this front would be useful in determining the philosophical status, and perhaps even the neural implementation, of active inference.
The implications of our results for studies of active inference are varied. Nothing in what we have shown argues directly against the use of the EFE as an objective for an active inference agent. However, we believe we have shown that the EFE is not necessarily the only, or even the natural, objective function to use. We thus follow Biehl et al. (2018) in encouraging experimentation with different objective functions for active inference. We especially believe that our objective, the FEEF, has promise due to its intuitive interpretation, largely equivalent terms to the EFE, its straightforward use of two generative models rather than just a single biased one, and its close connections to similar probabilistic objectives used in variational reinforcement learning, while also maintaining the crucial epistemic properties of the EFE. Moreover, while in this paper we have argued for the FEF instead of the EFE as a direct extension of the VFE into the future, the question of exactly which functional (if any) is, in fact, mandated by the free-energy principle remains open. We believe that elucidating the exact constraints which the free-energy principle places upon a theory of variational action, and understanding more deeply the relations between the various free-energies, could shed light on deep questions regarding notions of Bayes-optimal epistemic action in self-organising systems.
Finally, it is important to note that although in this paper we have solely been concerned with the EFE and active inference in discrete-time POMDPs, the original intuitions and mathematical framework of the free-energy principle arose out of a continuous-time formulation, deeply interwoven with concerns from information theory and statistical physics (K. Friston, 2019; K. Friston & Ao, 2012b; K. Friston et al., 2006; Parr et al., 2020). As such, there may be deep connections between the EFE, FEF, and log model evidence which exist only in the continuous-time limit, and which furnish a mathematically principled origin of epistemic action.

Conclusion
In this paper, we have examined in detail the nature and origin of the EFE. We have shown that it is not a direct analog of the VFE extended into the future. We then derived a novel objective, the FEF, which we claimed is a more natural extension, and showed that it lacks the beneficial epistemic value term of the EFE. We then proved that this term arises in the EFE directly as a result of its non-standard definition, since the EFE can be expressed as just the FEF minus the expected information gain. Taking this into account, we then proposed another objective, the Free Energy of the Expected Future (FEEF), which attempts to get the best of both worlds by preserving the desirable information-seeking properties of the EFE, while also maintaining a mathematically principled origin.

A Variational Inference
To motivate the variational free-energy, and variational inference more generally, we set up a standard inference problem. Let us say we are an agent that exists in a partially observed world. We have some observation $o_t$, and from this we wish to infer the hidden state of the world $x_t$. That is, we want to compute the posterior $p(x_t|o_t)$. While we do not know this posterior directly, we do possess a generative model of the world. This is a model that maps from hidden states to observations. Mathematically, we possess $p(o_t, x_t) = p(o_t|x_t)p(x_t)$. Since computing the true posterior exactly is likely intractable, the strategy in variational inference is to try to approximate this density with a tractable one, $Q(x_t|o_t; \phi)$, which we postulate and thus have full control over. While the true posterior might be arbitrarily complex, we might, for instance, define $Q(x_t|o_t; \phi)$ to be a Gaussian distribution: $Q(x_t|o_t; \phi) = \mathcal{N}(x_t; \mu_\phi, \sigma_\phi)$. Given that we have this variational density $Q$, parametrised by some parameters $\phi$, the goal is to adjust the parameters to make $Q$ as close as possible to the true posterior $p(x_t|o_t)$. Mathematically speaking, this means we want to minimize, with respect to $\phi$,

$$D_{KL}[Q(x_t|o_t; \phi) \,\|\, p(x_t|o_t)],$$

where $D_{KL}[Q\|P]$ is the Kullback-Leibler divergence. This initially doesn't seem to have bought us much. We wish to minimize the divergence between the variational density $Q$ and the true posterior $p(x_t|o_t)$. However, by assumption, we do not know the true posterior. So how can we possibly minimize this divergence if we do not know one of the parts? This is where we use the key trick of variational inference. By Bayes' theorem we know that $p(x_t|o_t) = \frac{p(o_t|x_t)p(x_t)}{p(o_t)}$, and we can thus substitute this into the KL divergence term.
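The substitution can be written out step by step. The following is a reconstruction from the surrounding description rather than a verbatim copy of the original derivation; the step labels are chosen to line up with the commentary on steps 2 to 5, with the left-hand side counting as step 1.

```latex
\begin{align}
D_{KL}\big[Q(x_t|o_t;\phi) \,\|\, p(x_t|o_t)\big]
  &= D_{KL}\left[Q(x_t|o_t;\phi) \,\Big\|\, \frac{p(o_t|x_t)\,p(x_t)}{p(o_t)}\right]
     && \text{(step 2)} \\
  &= \mathbb{E}_{Q(x_t|o_t;\phi)}\left[\ln \frac{Q(x_t|o_t;\phi)\,p(o_t)}{p(o_t, x_t)}\right]
     && \text{(step 3)} \\
  &= \mathbb{E}_{Q(x_t|o_t;\phi)}\left[\ln \frac{Q(x_t|o_t;\phi)}{p(o_t, x_t)}\right]
     + \mathbb{E}_{Q(x_t|o_t;\phi)}\big[\ln p(o_t)\big]
     && \text{(step 4)} \\
  &= D_{KL}\big[Q(x_t|o_t;\phi) \,\|\, p(o_t, x_t)\big] + \ln p(o_t)
     && \text{(step 5)}
\end{align}
```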
In step 2 we have applied Bayes' theorem to the posterior. In step 3 we have simply used the definition of the KL divergence, $D_{KL}[Q\|P] = \mathbb{E}_Q[\ln \frac{Q}{P}]$. In step 4 we have then applied the property of logs that $\ln(a \cdot b) = \ln(a) + \ln(b)$. In step 5 we then recognise that the remaining first term is now a KL divergence between the variational posterior and the generative model. We also recognise that since the $\ln p(o_t)$ term has no dependence on $x_t$ or $\phi$, the expectation $\mathbb{E}_{Q(x_t|o_t;\phi)}[\ln p(o_t)]$ reduces to the $\ln p(o_t)$ term alone. It is important to note that the KL term in equation 6 is now between two things we can actually compute: the variational posterior, which we control, and the generative model, which we assume that we know. The remaining $\ln p(o_t)$ term is called the log model evidence, and it cannot be computed in general. However, since it is not affected by the parameters $\phi$ of the variational density, it does not affect the minimization and so can be ignored for the purposes of the minimization process. We can thus write out what we have derived as

$$D_{KL}[Q(x_t|o_t;\phi) \,\|\, p(x_t|o_t)] = D_{KL}[Q(x_t|o_t;\phi) \,\|\, p(o_t, x_t)] + \ln p(o_t).$$

Since $p(o_t) \leq 1$ and hence $\ln p(o_t) \leq 0$, this implies that the KL divergence between the variational density and the generative model is always greater than or equal to the KL divergence between the true and variational posteriors. This first KL divergence is computable and we call it the variational free-energy $F$. Since $F$ is an upper bound on the divergence between the true posterior and the variational posterior, which is what we really want to minimize, then if we minimize $F$, we are constantly pushing that bound lower and thus largely minimizing the divergence between the true and variational posterior. As an additional bonus, when the true and variational posteriors are approximately equal, $D_{KL}[Q(x_t|o_t;\phi) \,\|\, p(x_t|o_t)] \approx 0$ and so $F \approx -\ln p(o_t)$, which means that the final value of the variational free-energy is equal to the negative log model evidence.
Since the log model evidence is a very useful quantity to compute for Bayesian model selection, it effectively means that once we have finished fitting our model, we are automatically left with a measure of how good our model is.
In effect, the variational free energy is useful because it has two properties. The first is that it is an upper bound on the divergence between the true and approximate posterior. By adjusting our approximate posterior to minimize this bound, we drive it closer to the true posterior, thus achieving more accurate inference. Secondly, the variational free-energy is an upper bound on the negative log model evidence. This is an important term which scores the likelihood of the data observed given your model and so can be used in Bayesian model selection.
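Both properties can be verified on a toy discrete model. The sketch below is a minimal illustration with hypothetical numbers: a three-state prior, the likelihood of a single fixed observation under each state, and an arbitrary variational density `q`.

```python
import numpy as np

def kl(q, p):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def variational_free_energy(q_x, joint_ox):
    """F = E_Q[ln Q(x) - ln p(o, x)] for one fixed observation o."""
    return float(np.sum(q_x * (np.log(q_x) - np.log(joint_ox))))

# Hypothetical toy generative model (illustrative values only).
prior = np.array([0.5, 0.3, 0.2])        # p(x)
lik_o = np.array([0.9, 0.2, 0.1])        # p(o | x) for the observed o
joint = lik_o * prior                    # p(o, x)
evidence = joint.sum()                   # p(o)
posterior = joint / evidence             # p(x | o)

q = np.array([0.6, 0.3, 0.1])            # an arbitrary variational density
F = variational_free_energy(q, joint)

# F equals the posterior divergence minus the log model evidence ...
assert np.isclose(F, kl(q, posterior) - np.log(evidence))
# ... so it upper-bounds the negative log model evidence,
assert F >= -np.log(evidence)
# and the bound is tight when Q is the true posterior.
assert np.isclose(variational_free_energy(posterior, joint), -np.log(evidence))
```

In practice the posterior is not available, which is exactly why $F$ is minimized instead; it is only computed here to confirm the identities.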
The log model evidence takes on an additional importance in terms of the free-energy principle, since the negative log model evidence $-\ln p(o_t)$ is surprisal, which all agents, it is proposed, are driven to minimize (K. Friston, 2010; K. Friston & Ao, 2012b; K. Friston et al., 2006). This is because the expected surprisal (the expected negative log model evidence) is the entropy of observations, the minimisation of which is postulated as a necessary condition for any self-sustaining organism to maintain itself as a unique system. Free-energy minimization comes about since the VFE is, as we have seen, a tractable bound on the negative log model evidence, or surprisal.
The VFE can be decomposed in three principal ways, each of which showcases a different facet of the objective.
In the first, entropy-energy decomposition, we simply split the KL divergence using the properties of logarithms so that the numerator of the fraction becomes the entropy term and the denominator becomes the energy term. If we are seeking to minimize the variational free-energy, then this means we need to both minimize the negative entropy (since entropy is defined as $-\mathbb{E}_{Q(x_t)}[\ln Q(x_t)]$) and also minimize the negative energy (or maximize the energy). This can be interpreted as saying we require that the variational posterior be as entropic as possible while also maximizing the likelihood that the $x$s proposed as probable by the variational posterior also be judged as probable under the generative model.
The second decomposition, into accuracy and complexity, perhaps has a more straightforward interpretation. We wish to minimize the negative accuracy (and thus maximize the accuracy), which means we want the actually observed observation to be as likely as possible under the $x$s predicted by the variational posterior. However, we also want to minimize the complexity term, which is a KL divergence between the variational posterior and the prior. That is, we wish to keep the posterior as close to the prior as possible while still maximizing accuracy. The complexity term then functions as a kind of implicit regulariser, making sure we do not overfit to any specific observation.
The final decomposition speaks to the inferential functions of the VFE. It serves as an upper bound on the negative log model evidence, since the posterior divergence term, as a KL divergence, is always non-negative. Moreover, we see that by minimizing the free-energy, we must also be minimizing the posterior divergence, which is the difference between the approximate and true posterior, and we are thus improving our variational approximation. This is because the log model evidence is constant, and so if the VFE is decreasing, it must be doing so through the minimization of the posterior divergence.
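For reference, the three decompositions discussed above can be written side by side. This is a reconstruction from the prose, assuming the definition $F = D_{KL}[Q(x_t|o_t;\phi) \,\|\, p(o_t, x_t)]$ given earlier; the under-brace labels follow the terminology of the text.

```latex
\begin{align}
F &= \underbrace{\mathbb{E}_{Q(x_t|o_t;\phi)}\big[\ln Q(x_t|o_t;\phi)\big]}_{\text{Negative Entropy}}
     \; \underbrace{-\, \mathbb{E}_{Q(x_t|o_t;\phi)}\big[\ln p(o_t, x_t)\big]}_{\text{Negative Energy}} \\
  &= \underbrace{-\, \mathbb{E}_{Q(x_t|o_t;\phi)}\big[\ln p(o_t|x_t)\big]}_{\text{Negative Accuracy}}
     + \underbrace{D_{KL}\big[Q(x_t|o_t;\phi) \,\|\, p(x_t)\big]}_{\text{Complexity}} \\
  &= \underbrace{D_{KL}\big[Q(x_t|o_t;\phi) \,\|\, p(x_t|o_t)\big]}_{\text{Posterior Divergence}}
     \; \underbrace{-\, \ln p(o_t)}_{\text{Negative Log Model Evidence}}
\end{align}
```

Each line follows from a different factorisation of the joint: none for the first, $p(o_t, x_t) = p(o_t|x_t)p(x_t)$ for the second, and $p(o_t, x_t) = p(x_t|o_t)p(o_t)$ for the third.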

B Decompositions of the EFE
In this section we provide a comprehensive overview of the many decompositions of the EFE. The EFE is defined as $\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)]$. The standard decomposition is into an extrinsic term (the expected log likelihood of the desired observations) and an epistemic term (the information gain, or KL divergence between the variational prior and the posterior from the generative model).

Similar to the VFE, it is also possible to split the EFE into an energy and an entropy term. While the energy term is similar to that of the VFE, being the expectation of the generative model (albeit an expectation over the joint instead of the posterior), the entropy term is different, as it is the entropy of the variational prior, not the approximate posterior, which results.
It is also possible to decompose the biased generative model the other way around, in line with the decomposition of the VFE. Unlike the VFE, however, the divergence is between the variational prior and the true prior, rather than between the variational posterior and the true prior. Finally, the EFE can also be represented in observation space by using Bayes' rule to flip the likelihoods and priors.
It is also possible to factorise the biased generative model the other way around, in terms of an unbiased likelihood and biased states: $\tilde{p}(o_\tau, x_\tau) = p(o_\tau|x_\tau)\tilde{p}(x_\tau)$. This different factorisation leads to a new decomposition in terms of risk and ambiguity, as well as potentially different behaviour due to the change from desired observations to desired states.
Here the agent is driven to minimize the divergence between desired and prior expected states, while also trying to minimize the entropy of the observations it receives. This drives the agent to try to sample observations with a minimally complex mapping back to states.
This formulation is mathematically equivalent to the previous decompositions despite defining desired states instead of desired observations, as can be seen by substituting the alternative factorisation of the biased generative model into the definition of the EFE. The risk-ambiguity formulation has very close relations to KL control (K. Rawlik et al., 2013), in that it encompasses KL control with an additional "epistemic" ambiguity term.
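The risk-ambiguity form can be checked numerically. The sketch below assumes the EFE takes the form $\mathbb{E}_{Q(o_\tau, x_\tau|\pi)}[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)]$ with $Q(o_\tau, x_\tau|\pi) = p(o_\tau|x_\tau)Q(x_\tau|\pi)$, and uses random toy distributions; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_obs = 3, 4

# Hypothetical toy distributions (illustrative only).
q_x = rng.dirichlet(np.ones(n_states))              # variational prior Q(x | pi)
lik = rng.dirichlet(np.ones(n_obs), size=n_states)  # unbiased likelihood p(o | x)
p_des_x = rng.dirichlet(np.ones(n_states))          # biased prior over desired states

q_joint = q_x[:, None] * lik                        # Q(o, x | pi) = p(o | x) Q(x | pi)
biased_joint = p_des_x[:, None] * lik               # biased joint: p(o | x) times desired states

# EFE under the assumed definition E_Q(o,x)[ln Q(x | pi) - ln biased_joint(o, x)].
efe = np.sum(q_joint * (np.log(q_x)[:, None] - np.log(biased_joint)))

# Risk: KL between expected and desired states.
risk = np.sum(q_x * (np.log(q_x) - np.log(p_des_x)))
# Ambiguity: expected entropy of the likelihood mapping.
ambiguity = -np.sum(q_joint * np.log(lik))

assert np.isclose(efe, risk + ambiguity)
```

With this factorisation the identity is exact, since the likelihood appears in both the predicted and the biased joint and only its entropy survives.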

C Trajectory Derivation of the Expected Model Evidence
Here we present the derivation of the free energy of the future (FEF) from the expected model evidence for the full trajectory distribution rather than a single time-step. Importantly, the derivation uses a temporal mean-field approximation on the approximate posterior $p(x_{1:T}|o_{1:T})$.

The trajectory derivation of the FEEF follows an almost identical scheme to that of the FEF. The only difference is that now the term inside the log also contains an additional $-\ln \tilde{p}(o)$, which is then combined with the likelihood from the generative model to form the extrinsic-value KL divergence.

D EFE Bound on the Negative Log Model Evidence
It is important to note that the EFE is also a bound on the negative log model evidence, but a lower bound, not an upper bound. This means that, in theory, one should want to maximize the EFE, rather than minimize it, to make the bound as tight as possible.
The fact that the EFE is a lower bound is straightforward to demonstrate, since the extrinsic value term of the EFE simply is the log model evidence.

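$$\mathrm{EFE} \approx \underbrace{-\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]}_{\text{Extrinsic Value}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\big[D_{KL}[Q(x_\tau|o_\tau) \,\|\, Q(x_\tau|\pi)]\big]}_{\text{Information Gain}}$$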
This derivation assumes that the true and approximate posteriors are approximately equal, $p(x_\tau|o_\tau) \approx Q(x_\tau|o_\tau)$, which effectively means that this condition is only true after the variational inference process is complete.
We wish to minimize both the negative log model evidence and the EFE. Since the information gain term is a KL divergence, which is always $\geq 0$, and we have a negative information gain term, this means that the EFE is always less than or equal to the negative log model evidence and so is a lower bound. However, this bound becomes tight when the information gain is 0, so to maximally tighten the bound we wish to reduce the information gain, while the EFE demands we maximize it. In effect, this means that the EFE bound is the wrong way around.
We can see this more clearly when we retrace the logic for the FEF. From equation 4, we have that the FEF is an upper bound on the negative log model evidence. This means that minimizing the FEF necessarily tightens the bound, while this is not true of the EFE lower bound, where minimizing the EFE can actually cause it to diverge from the log model evidence. We can see this even more clearly by doing an analogous decomposition of the FEF.

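$$\mathrm{FEF} = -\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] + \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\big[D_{KL}[Q(x_\tau|o_\tau) \,\|\, p(x_\tau|o_\tau)]\big]}_{\text{Posterior Approximation Error}}$$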
Here, since the KL is between the generative model and the approximate posterior, and we then decompose the generative model into a true posterior and marginal, we can no longer make the assumption, made in the EFE derivation, that the true and approximate posterior are approximately equal, since that would leave us with only the model evidence.
Therefore, instead we get a posterior approximation error term which is the KL divergence between the approximate and true posteriors. When the true and approximate posterior are equal, we are just left with the log model evidence.
Since the posterior approximation error is always $\geq 0$, the FEF is an upper bound on the negative log model evidence, and thus by minimizing the FEF we make the bound tighter. This logic is essentially a reprise of the standard variational inference logic from a slightly different perspective.
If we do not make the assumption in the EFE that the approximate and true posterior are the same, we can derive a similar expression to the EFE which will shed more light on the relation.

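$$\mathrm{EFE} = -\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] + \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\big[D_{KL}[Q(x_\tau|o_\tau) \,\|\, p(x_\tau|o_\tau)]\big]}_{\text{Posterior Approximation Error}} - \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\big[D_{KL}[Q(x_\tau|o_\tau) \,\|\, Q(x_\tau|\pi)]\big]}_{\text{Information Gain}}$$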
Without the true posterior assumption, we thus find that the EFE could be either an upper or a lower bound on the negative log model evidence, since the two additional KL divergence terms have opposite signs. If the posterior approximation error is larger than the information gain, then the EFE functions correctly as an upper bound. However, if the information gain is larger, then the EFE will become a lower bound and could diverge from the log model evidence. Moreover, this latter situation is more likely, since the goal of variational inference is to reduce the approximation error, while EFE agents seek to maximize information gain. This means that the EFE only functions correctly as an upper bound during the early stages of optimization, where the posterior approximation is poor. Further optimization steps likely drive the EFE further away from the model evidence. The bound is tight when the information gain equals the posterior approximation error. We can also see that the first two terms of the EFE are simply the FEF. We have thus rederived, by a rather roundabout route, the fact that the EFE is simply the FEF minus the information gain.
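These relations can be confirmed numerically. The following sketch assumes the definitions $\mathrm{EFE} = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}[\ln Q(x_\tau|\pi) - \ln \tilde{p}(o_\tau, x_\tau)]$ and $\mathrm{FEF} = \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}[\ln Q(x_\tau|o_\tau) - \ln \tilde{p}(o_\tau, x_\tau)]$, with the biased generative model taken to be an arbitrary random joint. All names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_obs = 3, 4

# Predicted beliefs (toy values): arrays are indexed as [state, observation].
q_x = rng.dirichlet(np.ones(n_states))               # Q(x | pi)
lik = rng.dirichlet(np.ones(n_obs), size=n_states)   # Q(o | x) = p(o | x)
q_joint = q_x[:, None] * lik                         # Q(o, x | pi)
q_o = q_joint.sum(axis=0)                            # Q(o | pi)
q_x_given_o = q_joint / q_o[None, :]                 # Q(x | o)

# Hypothetical biased generative model: an arbitrary joint over (x, o).
biased_joint = rng.dirichlet(np.ones(n_states * n_obs)).reshape(n_states, n_obs)
biased_o = biased_joint.sum(axis=0)                  # biased marginal over o
true_post = biased_joint / biased_o[None, :]         # its exact posterior over x

efe = np.sum(q_joint * (np.log(q_x)[:, None] - np.log(biased_joint)))
fef = np.sum(q_joint * (np.log(q_x_given_o) - np.log(biased_joint)))

info_gain = np.sum(q_joint * (np.log(q_x_given_o) - np.log(q_x)[:, None]))
approx_error = np.sum(q_joint * (np.log(q_x_given_o) - np.log(true_post)))
neg_log_evidence = -np.sum(q_o * np.log(biased_o))

# The EFE is the FEF minus the information gain.
assert np.isclose(efe, fef - info_gain)
# The FEF upper-bounds the expected negative log model evidence.
assert np.isclose(fef, neg_log_evidence + approx_error) and fef >= neg_log_evidence
# The EFE sits above or below depending on which KL term dominates.
assert np.isclose(efe, neg_log_evidence + approx_error - info_gain)
```

Whether the EFE lands above or below the evidence term in a given draw depends on the relative size of `approx_error` and `info_gain`, which is the ambiguity described in the text.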
We thus see that the EFE as a bound on the log model evidence is shaky, since it depends on the information gain always being larger or smaller than the posterior approximation error. Moreover, the bounding behaviour seems to emerge directly from the relation of the EFE to the FEF rather than the intrinsic qualities of the EFE, and it is primarily the information-seeking properties of the EFE which serve to damage the clean bounding behaviour of the FEF.

E Attempts at Naturalising the EFE
In this Appendix, we review several attempts to derive the EFE directly from the expected model evidence.
Since we have derived the FEF by importance sampling the expected model evidence with the approximate posterior, one obvious avenue would be to importance sample on the variational prior instead. While this approach gets the correct form of the EFE inside the expectation, the expectation itself is under the product of the two marginals rather than the joint required for the full EFE. While this may seem minor, this difference must underpin all the other differences and relations we have explored throughout this paper.
To get to the full EFE we must make some assumption that allows us to combine the expectation under two marginals into an expectation under the joint. The first and simplest assumption is that they simply are the same, such that the joint factorises into the two marginals: $Q(o_\tau, x_\tau|\pi) \approx p(o_\tau)Q(x_\tau|\pi)$. This assumption is equivalent to assuming independence of observations and latent states, which rather defeats the point of a latent variable model.
A second approach is to assume that the variational prior equals the variational posterior, $Q(x_\tau|\pi) \approx Q(x_\tau|o_\tau)$. This then allows the marginal and posterior to be combined into a joint, giving the EFE as desired. However, this assumption has several unfortunate consequences. Firstly, it eliminates the entire idea of inference: since the prior and posterior are assumed to be the same, no real inference can have taken place. This is not necessarily an issue if we separate the inference and planning stages of the algorithm, such that they optimize different objective functions. However, it is more elegant to be able to use the same objective function for both planning and inference, as the FEEF does, thus casting them as simply different facets of the same underlying process. A more serious consequence is that this assumption also eliminates the information gain term in active inference: since the prior and posterior are assumed to be the same, the divergence between them (which is the information gain) must be zero.
A slightly different approach is taken in a proof in (Parr, 2019), which begins with the KL divergence between two distributions, one encoding beliefs about future states and observations, and the other being the biased generative model. By definition, this KL divergence is always ≥ 0, which allows us to write.
A further simplification follows under the assumption that $p(x_\tau|o_\tau) \approx Q(x_\tau|\pi)$. This proof derives the FEF not as a bound on the expected model evidence, by our definition, but rather as a bound on the entropy of expected observations given a policy. The EFE is then derived from the FEF by assuming that the prior and posterior are the same, which comes with all the drawbacks explained above. This proof is primarily unworkable because of the assumption that the prior and the posterior are identical. While this may be arguable in the continuous-time limit, where it is equivalent to the assumption that $\frac{dQ(x_\tau|o_\tau)}{dt} \approx 0$, which is when the continuous-time inference has reached an equilibrium, it is definitely not true in discrete time. There, although there is a relation between the prior in the current time-step and the posterior in the previous one, it must be mapped through the transition dynamics: $Q(x_t|\pi) = \mathbb{E}_{Q(x_{t-1})}[p(x_t|x_{t-1}, \pi)]$.
One can also attempt a related proof by splitting the KL divergence the other way, which is just another way of showing that the FEF is a bound on the expected model evidence.

F Related Quantities
Recently a new free-energy, the generalised free energy (GFE), has been proposed in the literature as an alternative or an extension to the EFE. The GFE shares some close similarities with the FEEF. Both fundamentally extend the EFE by proposing a unified objective function which is valid for both inference at the current time and planning into the future, whereas the EFE can only be used for planning. Moreover, both the GFE and the FEEF encode future observations as latent unobserved variables, over which posterior beliefs can be formed. Agents also maintain prior beliefs over these variables, which encode their preferences or desires.
There are two key differences, mathematically and intuitively, between the GFE and the FEEF. The first is that the GFE maintains a factorised posterior over beliefs and observations, where the posterior beliefs of the two are separated by a mean-field approximation and assumed to be separate. By contrast, the FEEF maintains a joint approximate belief over both observations and states simultaneously. This joint in the case of the FEEF effectively functions as a veridical generative model, since $Q(o_\tau|x_\tau) = p(o_\tau|x_\tau)$ and $Q(x_\tau|\pi) = \mathbb{E}_{Q(x_{\tau-1})}[p(x_\tau|x_{\tau-1}, \pi)]$. This means that posterior beliefs of the future are computed simply by rolling forward the generative model given the beliefs about the current time.
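Rolling the generative model forward in this way amounts to repeated matrix-vector products in the discrete case. The sketch below is a minimal illustration for a single fixed policy; the function name `roll_forward` and the two-state matrices are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def roll_forward(q_x_now, transition, likelihood, horizon):
    """Predict future state and observation beliefs under one fixed policy.

    transition[i, j] = p(x_tau = j | x_{tau-1} = i, pi)
    likelihood[i, k] = p(o_tau = k | x_tau = i)
    """
    q_x = np.asarray(q_x_now, dtype=float)
    state_beliefs, obs_beliefs = [], []
    for _ in range(horizon):
        # Q(x_tau | pi) = E_{Q(x_{tau-1})}[p(x_tau | x_{tau-1}, pi)]
        q_x = transition.T @ q_x
        # Q(o_tau | pi) = E_{Q(x_tau | pi)}[p(o_tau | x_tau)]
        q_o = likelihood.T @ q_x
        state_beliefs.append(q_x)
        obs_beliefs.append(q_o)
    return np.array(state_beliefs), np.array(obs_beliefs)

# Hypothetical two-state, two-observation model.
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
likelihood = np.array([[0.8, 0.2],
                       [0.3, 0.7]])

states, observations = roll_forward([1.0, 0.0], transition, likelihood, horizon=2)
# states[0] is [0.9, 0.1] and observations[0] is [0.75, 0.25]
```

No observation enters the update, so these are predictions rather than posteriors conditioned on data; they only propagate the current belief through the transition and likelihood mappings.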
A second and more important difference lies in the generative models. The GFE assumes that the agent is only equipped with a single generative model with both veridical and biased components. The preferences of a GFE agent are encoded as a separate factorisable marginal over observations. This means that the generative model of the GFE agent factorises as $\tilde{p}(o_\tau, x_\tau)_{GFE} \propto p(o_\tau|x_\tau)p(x_\tau)\tilde{p}(o_\tau)$. This means that for the GFE the likelihood and the prior are unbiased and there is simply an additional prior preferences term in the free-energy expression. By contrast, the FEEF eschews this unusual factorisation of the generative model and instead presupposes a separate warped generative model for use in the future which is intrinsically biased. The biased generative model in the FEEF thus decomposes as $\tilde{p}(o_\tau, x_\tau)_{FEEF} = \tilde{p}(o_\tau|x_\tau)\tilde{p}(x_\tau)$, which is the standard factorisation of the joint distribution in a generative model, but where both the likelihood and prior distributions are biased towards generating more favourable states of affairs for the agent. This inherent optimism bias then drives action.
A further free-energy proposed in the literature has been the Bethe free-energy and the Bethe approximation (Schwöbel, Kiebel, & Marković, 2018). This approach eschews the standard mean-field assumption on the approximate posterior in favour of a Bethe approximation from statistical physics (Yedidia, Freeman, & Weiss, 2001), which instead represents the approximate posterior as the product of pairwise marginals, thus preserving a constraint of pairwise temporal consistency which the mean-field assumption lacks. Due to this greater representation of temporal constraints (the approximate posteriors at each time-step being no longer assumed to be independent), the Bethe free-energy has the potential to be significantly more accurate than the standard mean-field variational free energy (and is, in fact, exact for factor graphs without cycles such as the standard non-hierarchical POMDP model). In this paper, we focus entirely on the standard mean-field variational free-energy used in the vast majority of active inference publications, and thus the Bethe free-energy is out of scope for this paper. However, exploring the nature of any intrinsic terms which might arise from the Bethe free-energy is an interesting avenue for future work. Although primarily focused on the Bethe free-energy, Schwöbel et al. (2018) also introduced a 'predicted free energy' functional. This functional is equivalent to the FEF as we have defined it here, and so has a complexity term instead of an information gain term, leading to minimization of the prior-posterior divergence.
Finally, Biehl et al. (2018) suggested that if the EFE is not mandated by the free-energy principle, which we have argued for in this paper, then in theory any standard intrinsic measure, such as empowerment, could be used as an objective. We believe that exploring the effect of these other potential loss functions could be an area of great interest for future work.