## Abstract

The expected free energy (EFE) is a central quantity in the theory of active inference. It is the quantity that all active inference agents are mandated to minimize through action, and its decomposition into extrinsic and intrinsic value terms is key to the balance of exploration and exploitation that active inference agents evince. Despite its importance, the mathematical origins of this quantity and its relation to the variational free energy (VFE) remain unclear. In this letter, we investigate the origins of the EFE in detail and show that it is not simply “the free energy in the future.” We present a functional that we argue is the natural extension of the VFE but actively discourages exploratory behavior, thus demonstrating that exploration does not directly follow from free energy minimization into the future. We then develop a novel objective, the free energy of the expected future (FEEF), which possesses both the epistemic component of the EFE and an intuitive mathematical grounding as the divergence between predicted and desired futures.

## 1 Introduction

The free-energy principle (FEP) (Friston, 2010; Friston & Ao, 2012; Friston, Kilner, & Harrison, 2006) is an emerging theory from theoretical neuroscience that offers a unifying explanation of the dynamics of self-organizing systems (Friston, 2019; Parr, Da Costa, & Friston, 2020). It proposes that such systems can be interpreted as embodying a process of variational inference that minimizes a single information-theoretic objective: the variational free-energy (VFE). In theoretical neuroscience, the FEP translates into an elegant account of brain function (Friston, 2003, 2005, 2008a, 2008b; Friston, Trujillo-Barreto, & Daunizeau, 2008), extending the Bayesian brain hypothesis (Deneve, 2005; Doya, Ishii, Pouget, & Rao, 2007; Knill & Pouget, 2004) by postulating that the neural dynamics of the brain perform variational inference. Under certain assumptions about the forms of the densities embodied by the agent, this theory can even be translated down to the level of neural circuits in the form of a biologically plausible neuronal process theory (Bastos et al., 2012; Friston, 2008a; Kanai, Komura, Shipp, & Friston, 2015; Shipp, 2016; Spratling, 2008).

Action is then subsumed into this formulation, under the name of *active inference* (Friston, 2011; Friston & Ao, 2012; Friston, Daunizeau, & Kiebel, 2009) by mandating that agents act so as to minimize the VFE with respect to action (Buckley, Kim, McGregor, & Seth, 2017; Friston et al., 2006). This casts action and perception as two aspects of the same imperative of free-energy minimization, resulting in a theoretical framework for control that applies to a variety of continuous-time tasks (Baltieri & Buckley, 2017, 2018; Calvo & Friston, 2017; Friston, Mattout, & Kilner, 2011; Millidge, 2019b).

Recent work has extended these ideas to account for inference over temporally extended action sequences (Friston & Ao, 2012; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2017; Friston, FitzGerald, Rigoli, Schwartenbeck, & Pezzulo, 2016; Friston et al., 2015; Tschantz, Seth, & Buckley, 2019). Here it is assumed that rather than action minimizing the instantaneous VFE, sequences of actions (or policies) minimize the cumulative sum over time of a quantity called the *expected free energy* (EFE) (Friston et al., 2015). Active inference using the EFE has been applied to a wide variety of tasks and applications, from modeling human and animal choice behavior (FitzGerald, Schwartenbeck, Moutoussis, Dolan, & Friston, 2015; Friston et al., 2015; Pezzulo, Cartoni, Rigoli, Pio-Lopez, & Friston, 2016), simulating visual saccades and other “epistemic foraging behavior” (Friston, Lin, et al., 2017; Friston, Rosch, Parr, Price, & Bowman, 2018; Mirza, Adams, Mathys, & Friston, 2016; Parr & Friston, 2017a, 2018a), solving reinforcement learning benchmarks (Çatal, Verbelen, Nauta, De Boom, & Dhoedt, 2020; Millidge, 2019a, 2020; Tschantz, Baltieri, Seth, & Buckley, 2019; Ueltzhöffer, 2018; van de Laar & de Vries, 2019), to modeling psychiatric disorders as cases of aberrant inference (Cullen, Davey, Friston, & Moran, 2018; Mirza, Adams, Parr, & Friston, 2019; Parr & Friston, 2018b). Like the continuous-time formulation, active inference also comes equipped with a biologically plausible process theory with variational update equations, which have been argued to be homologous with observed neural firing patterns (Friston, FitzGerald, et al., 2017; Friston, Parr, & de Vries, 2017; Parr, Markovic, Kiebel, & Friston, 2019).

A key property of the EFE is that it decomposes into both an extrinsic, value-seeking and an intrinsic (epistemic), information-seeking term (Friston et al., 2015). The latter mandates active inference agents to resolve uncertainty by encouraging the exploration of unknown regions of the environment, a property that has been extensively investigated (Friston, FitzGerald, et al., 2017a; Friston et al., 2015; Schwartenbeck, FitzGerald, Dolan, & Friston, 2013; Schwartenbeck et al., 2019). The fact that intrinsic drives naturally emerge from this formulation is argued to be an advantage over other formulations, which typically encourage exploration by adding ad hoc exploratory terms to their loss function (Burda et al., 2018; Mohamed & Rezende, 2015; Oudeyer & Kaplan, 2009; Pathak, Agrawal, Efros, & Darrell, 2017). While the EFE is often described as a straightforward extension to the free energy principle that can account for prospective policies and is typically expressed in similar mathematical form (Da Costa et al., 2020; Friston, FitzGerald, et al., 2017; Friston et al., 2015; Parr & Friston, 2017b, 2019), its origin remains obscure. Minimization of the EFE is sometimes motivated by a reductio ad absurdum argument following from the FEP (Friston et al., 2015; Parr & Friston, 2019): agents are driven to minimize the VFE, and therefore the only way they can act is to minimize their free energy into the future. Since the future is uncertain, however, they must instead minimize the expected free energy. Central to this logic is the formal identification of the VFE with the EFE.

In this letter, we set out to investigate the origin of the EFE and its relation to the VFE. We provide a broader perspective on this question, showing that the EFE is not the only way to extend the VFE to account for action-conditioned futures. We derive an objective that we believe to be a more natural analog of the VFE, which we call the *free energy of the future* (FEF), and make a detailed side-by-side comparison of the two functionals. Crucially, we show that the FEF actively discourages information-seeking behavior, thus demonstrating that epistemic terms do not necessarily arise simply from extending the VFE into the future. We then investigate the origin of the epistemic term of the EFE and show that the EFE is simply the FEF minus the expected information gain, which provides a straightforward perspective on the relation between the two functionals. We propose our own mathematically principled starting point for action selection under active inference: the divergence between desired and expected futures, from which we obtain a novel functional, the free energy of the expected future (FEEF), which has close relations to the generalized free energy (Parr & Friston, 2019). This functional has a natural interpretation in terms of the divergence between a veridical and a biased generative model; it allows use of the same functional for both inference and policy selection, and it naturally decomposes into an extrinsic value term and an epistemic action term, thus maintaining the attractive exploratory properties of EFE-based active inference while also possessing a mathematically principled starting point with an intuitive interpretation.

## 2 The Variational Free Energy

The variational free energy (VFE) is a core quantity in variational inference and constitutes a tractable bound on both the log model evidence and the Kullback-Leibler (KL) divergence between prior and posterior (Beal, 1998; Blei, Kucukelbir, & McAuliffe, 2017; Fox & Roberts, 2012; Wainwright & Jordan, 2008). (For an in-depth motivation of the VFE and its use in variational inference, see appendix A.)

The agent receives observations $o_t$ and must infer the values of hidden states $x_t$. The agent assumes that the environment evolves according to a Markov process, so that the distribution over states at the current time step depends only on the state at the previous time step, and the observation generated at the current time step depends only on the state at the current time step. Under these Markov assumptions, the distribution over a trajectory of states and observations factorizes as $p(o_{0:T}, x_{0:T}) = p(x_0)\prod_{t=0}^{T} p(o_t|x_t)\,p(x_{t+1}|x_t)$. In this letter, we also consider inference over future states and observations that have yet to be observed. Such future variables are denoted $o_\tau$ or $x_\tau$, where $\tau > t$. To avoid dealing with infinite sums, agents only consider futures up to some finite time horizon, denoted $T$. $Q(x_t|o_t;\varphi)$ denotes an approximate posterior density parameterized by $\varphi$, which, during the course of variational inference, is fit as closely as possible to the true posterior. Note that there is a slight difference in notation here compared to that usually used in variational inference. Normally the approximate posterior is written as $Q(x_t;\varphi)$, without the dependence on $o$ made explicit. This is because the variational posterior is not a direct function of observations but rather the result of an optimization process that depends on the observations. Here, we make the dependence on $o$ explicit to keep a clear distinction between the variational posterior $Q(x_t|o_t;\varphi)$, obtained through optimization of the variational parameters $\varphi$, and the variational prior $Q(x_t) = \mathbb{E}_{p(x_t|x_{t-1})}[Q(x_{t-1}|o_{t-1};\varphi)]$, obtained by mapping the previous posterior through the transition dynamics. Throughout this letter, we assume that inference occurs in a discrete-time partially observed Markov decision process (POMDP).
This is to ensure compatibility with the EFE formulation later, which is also situated within discrete-time POMDPs.^{1}
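As an illustration of the factorized model above, the following minimal sketch (with invented two-state, two-observation distributions; all numbers are for illustration only) draws a trajectory $(x_{0:T}, o_{0:T})$ by ancestral sampling:

```python
import numpy as np

# Assumed toy distributions for a discrete POMDP:
# initial prior p(x_0), transition p(x_{t+1}|x_t), likelihood p(o_t|x_t).
rng = np.random.default_rng(0)
p_x0 = np.array([0.6, 0.4])
transition = np.array([[0.9, 0.1],   # rows: x_t, cols: x_{t+1}
                       [0.2, 0.8]])
likelihood = np.array([[0.7, 0.3],   # rows: x_t, cols: o_t
                       [0.1, 0.9]])

def sample_trajectory(T):
    """Ancestral sampling from p(o_{0:T}, x_{0:T})."""
    x = rng.choice(2, p=p_x0)
    xs, obs = [], []
    for _ in range(T + 1):
        xs.append(int(x))
        obs.append(int(rng.choice(2, p=likelihood[x])))  # o_t ~ p(o_t|x_t)
        x = rng.choice(2, p=transition[x])               # x_{t+1} ~ p(x_{t+1}|x_t)
    return xs, obs

xs, obs = sample_trajectory(T=5)
```

Each state depends only on its predecessor and each observation only on the current state, exactly as the factorization requires.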

The utility of the VFE for inference comes from the fact that it equals, up to a constant, the divergence between the approximate and true posteriors, and hence bounds that divergence from above: $F_t \geq D_{KL}[Q(x_t|o_t;\varphi)\,\|\,p(x_t|o_t)]$. Thus, minimizing $F_t$ with respect to the parameters of the variational distribution makes $Q(x_t|o_t;\varphi)$ a good approximation of the true posterior.
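For reference, the same functional rearranges into the familiar complexity-accuracy form discussed in the next paragraph (a standard identity, written here in the letter's notation):

$$
F_t \;=\; \underbrace{D_{KL}\big[Q(x_t|o_t;\varphi)\,\|\,p(x_t)\big]}_{\text{complexity}} \;-\; \underbrace{\mathbb{E}_{Q(x_t|o_t;\varphi)}\big[\ln p(o_t|x_t)\big]}_{\text{accuracy}}
$$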

This decomposition is the one typically used to compute the VFE in practice and has a straightforward interpretation. Specifically, minimizing the negative accuracy (and thus maximizing accuracy) ensures that the observations are as likely as possible under the states $x_t$ predicted by the variational posterior, while simultaneously minimizing the complexity term, the KL divergence between the variational posterior and the prior. Thus, the goal is to keep the posterior as close to the prior as possible while maximizing accuracy. Effectively, the complexity term acts as an implicit regularizer, reducing the risk of overfitting to any specific observation.
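As a concrete check of these two readings of the VFE, the following toy example (all probabilities invented for illustration) computes the complexity-accuracy form for a discrete model and verifies that it equals the negative log evidence plus the posterior divergence, and hence bounds the negative log evidence:

```python
import numpy as np

# Toy discrete model: 3 hidden states, 2 observations (all values assumed).
prior = np.array([0.5, 0.3, 0.2])               # p(x)
likelihood = np.array([[0.9, 0.1],              # p(o|x): rows states, cols observations
                       [0.5, 0.5],
                       [0.2, 0.8]])
o = 1                                           # index of the observed outcome

# Exact posterior p(x|o) via Bayes' rule, used only to check the bound.
joint = prior * likelihood[:, o]
evidence = joint.sum()                          # p(o)
posterior = joint / evidence

# A deliberately imperfect approximate posterior q(x|o; phi).
q = np.array([0.2, 0.4, 0.4])

def kl(p, r):
    return float(np.sum(p * np.log(p / r)))

complexity = kl(q, prior)                       # KL[q || p(x)]
accuracy = float(np.sum(q * np.log(likelihood[:, o])))  # E_q[ln p(o|x)]
F = complexity - accuracy                       # the VFE
```

The identity `F == -ln p(o) + KL[q || p(x|o)]` then holds numerically, confirming that minimizing `F` over `q` both tightens the evidence bound and pulls `q` toward the true posterior.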

## 3 The Expected Free Energy

While variational inference as presented only allows us to perform inference at the current time given observations, it is possible to extend the formalism to allow for inference over actions or policies in the future.

To achieve this extension, a variational objective is required that can be minimized contingent on future states and policies, which allows the problem of adaptive action selection to be reformulated as a process of variational inference. To do this, the formalism must be extended in two ways. First, the generative model is augmented to include actions $a_\tau$ and policies, which are sequences of actions $\pi = [a_1, a_2, \ldots, a_T]$. The action taken at the current time can affect future states, and thus future observations. In order to transform action selection into an inference problem, policies are treated as an inferred distribution $Q(\pi)$ that is optimized to meet the agent's goals. The second extension required is to translate the notion of an agent's goals into this probabilistic framework. Active inference encodes an agent's goals as a desired distribution over observations $\tilde{p}(o_{\tau:T})$. We denote the biased distribution using a tilde over the probability density $\tilde{p}$ rather than the random variable to make clear that the random variables themselves are unchanged; it is only the agent's subjective distribution over the variables that is biased.^{2} This distribution is then incorporated into a biased generative model of the world, $\tilde{p}(o_\tau, x_\tau) \approx \tilde{p}(o_\tau)Q(x_\tau|o_\tau)$,^{3} where we have additionally assumed that the true posterior can be well approximated by the variational posterior, $p(x_\tau|o_\tau) \approx Q(x_\tau|o_\tau)$, which simply states that the variational inference procedure was successful.^{4} Active inference proceeds by inferring a variational policy distribution $Q(\pi)$ that maximizes the evidence for this biased generative model. Intuitively, this approach turns the action selection problem on its head.
Instead of asking, “I have some goal; what do I have to do to achieve it?” the active inference agent asks, “Given that my goals were achieved, what would have been the most probable actions that I took?”

A further complication of extending the VFE into the future comes from future observations. While agents have access to current observations when performing inference at the present time, for planning they must also reason about observations that have not yet been received. This is dealt with by taking the expectation of the objective with respect to predicted observations $o_\tau$ drawn from the generative model.
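Combining these ingredients gives the EFE at a future time step $\tau$ under policy $\pi$. A sketch of its definition and of the extrinsic-epistemic decomposition discussed next, using the biased-model approximation $\tilde{p}(o_\tau, x_\tau) \approx \tilde{p}(o_\tau)Q(x_\tau|o_\tau)$ introduced above:

$$
\mathrm{EFE}_\tau(\pi) \;=\; \mathbb{E}_{Q(o_\tau, x_\tau|\pi)}\left[\ln \frac{Q(x_\tau|\pi)}{\tilde{p}(o_\tau, x_\tau)}\right] \;\approx\; \underbrace{-\,\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]}_{\text{extrinsic value}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\Big[D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]\Big]}_{\text{expected information gain}}
$$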

While the EFE admits many decompositions (see appendix B for a comprehensive overview), the one presented in equation 3.1 is perhaps the most important because it separates the EFE into an extrinsic, goal-directed term (sometimes also called instrumental value in the literature) and an intrinsic, information-seeking term.^{5} The first term requires agents to maximize the likelihood of the desired observations $\tilde{p}(o_\tau)$ under beliefs about the future. It thus directs an agent to act to maximize the probability of its desires occurring in the future. It is called the extrinsic value term since it is the term in the EFE that accounts for the agent's preferences.

The second term in equation 3.1 is the expected information gain, often termed the epistemic value since it quantifies the amount of information gained by visiting a specific state. Since this term enters the EFE with a negative sign, minimizing the EFE as a whole mandates maximizing the expected information gain. This drives the agent to maximize the divergence between its posterior and prior beliefs, thus inducing it to take actions that maximally inform its beliefs and reduce uncertainty. It is the combination of extrinsic and intrinsic value terms that underwrites active inference's claim to a principled approach to the exploration-exploitation dilemma (Friston, FitzGerald, et al., 2017; Friston et al., 2015).
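To make the sign conventions concrete, here is a toy numerical sketch (all distributions invented for illustration) computing the extrinsic and epistemic terms of the EFE for discrete states and observations:

```python
import numpy as np

# Assumed toy setup: 2 states, 2 observations.
prior = np.array([0.7, 0.3])                  # variational prior Q(x|pi)
likelihood = np.array([[0.8, 0.2],            # Q(o|x)
                       [0.3, 0.7]])
p_desired = np.array([0.1, 0.9])              # biased prior over observations p~(o)

q_o = prior @ likelihood                      # predicted observations Q(o|pi)
# Posterior Q(x|o) for each possible observation, via Bayes' rule.
post = (prior[:, None] * likelihood) / q_o[None, :]

def kl(p, r):
    return float(np.sum(p * np.log(p / r)))

extrinsic = -float(np.sum(q_o * np.log(p_desired)))   # -E_{Q(o|pi)}[ln p~(o)]
info_gain = float(np.sum(q_o * np.array(
    [kl(post[:, o], prior) for o in range(2)])))      # E_{Q(o|pi)} KL[Q(x|o)||Q(x|pi)]
efe = extrinsic - info_gain                           # equation-3.1 form
```

Because `info_gain` is an expected KL divergence it is non-negative, so the epistemic term can only lower the EFE: policies that resolve more uncertainty are preferred, all else being equal.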

The idea of maximizing expected information gain or “Bayesian surprise” (Itti & Baldi, 2009) to drive exploratory behavior has been argued for in neuroscience (Baldi & Itti, 2010; Ostwald et al., 2012) and has been regularly proposed in reinforcement learning (Houthooft et al., 2016; Still & Precup, 2012; Sun, Gomez, & Schmidhuber, 2011; Tschantz, Millidge, Seth, & Buckley, 2020). It is important to note, however, that in these prior works, information gain has often been proposed as an ad hoc addition to an existing objective function with only the intuitive justification of boosting exploration. In contrast, expected information gain falls naturally out of the EFE formalism, arguably lending the formalism a degree of theoretical elegance.

## 4 Origins of the EFE

Given the centrality of the EFE to the active inference framework, it is important to explore the origin and nature of this quantity. The EFE is typically motivated through a reductio ad absurdum argument (Friston et al., 2015; Parr & Friston, 2019).^{6} The logic is as follows. Agents have prior beliefs over policies that drive action selection. By the FEP, all states of an organism, including those determining policies, must change so as to minimize free energy. Thus, the only self-consistent prior belief over policies is that the agent will minimize free energy into the future through its policy selection process. If the agent did not have such a prior belief, then it would select policies that did not minimize the free energy into the future and would thus not be a free energy minimizing agent. This logic requires a well-defined notion of the free energy of future states and observations given a specific policy. The active inference literature implicitly assumes that the EFE is the natural functional that fits this notion (Friston, FitzGerald, et al., 2017; Friston et al., 2015). In the following section, we argue that the EFE is not in fact the only functional that can quantify the notion of the free energy of policy-conditioned futures, and indeed we propose a different functional, the free energy of the future, which we argue is a more natural extension of the VFE to account for future states.

### 4.1 The Free Energy of the Future

We argue that the natural extension of the free energy into the future must possess direct analogs to the two crucial properties of the VFE: it must be expressible as a KL-divergence between a posterior and a generative model, such that minimizing it causes the variational density to better approximate the true posterior, and it must also bound the log model evidence of future observations. Bounding the log model evidence (or surprisal) is vital since the surprisal is the core quantity that, under the FEP, all systems are driven to minimize. If the VFE extended into the future failed to bound the surprisal, then minimizing this extension would not necessarily minimize surprisal, and thus any agent that minimized such an extension would be in violation of the FEP. Here, we present a functional that we claim satisfies these desiderata: the free energy of the future (FEF).
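One way to write down such a functional, mirroring the VFE but taking an expectation over as-yet-unknown observations, is the following sketch:

$$
\mathrm{FEF}_\tau(\pi) \;=\; \mathbb{E}_{Q(o_\tau|\pi)}\left[\mathbb{E}_{Q(x_\tau|o_\tau)}\left[\ln \frac{Q(x_\tau|o_\tau)}{\tilde{p}(o_\tau, x_\tau|\pi)}\right]\right]
$$

Since the inner expectation is a VFE for each possible future observation, it follows that $\mathrm{FEF}_\tau(\pi) \geq -\mathbb{E}_{Q(o_\tau|\pi)}[\ln \tilde{p}(o_\tau|\pi)]$, so the FEF bounds the expected log model evidence of future observations, in direct analogy with the VFE.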


### 4.2 Bounds on the Expected Model Evidence

Since the expected information gain is an expected KL-divergence, it must be $\geq 0$, and thus the negative expected information gain must be $\leq 0$. Since the EFE aims to minimize the negative information gain (thus maximizing the positive information gain), minimizing the EFE actually drives it further from the expected model evidence.^{8}
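In symbols, this point can be sketched using the extrinsic-epistemic decomposition:

$$
\mathrm{EFE}_\tau(\pi) \;=\; -\,\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big] \;-\; \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\Big[D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]\Big]}_{\geq\, 0} \;\leq\; -\,\mathbb{E}_{Q(o_\tau|\pi)}\big[\ln \tilde{p}(o_\tau)\big]
$$

so the EFE lies below the negative expected log evidence, and minimizing it increases the gap rather than tightening a bound.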

We further investigate the EFE and its properties as a bound in appendix D. Additionally, in appendix E we review other attempts in the literature to derive the EFE as a bound on the expected model evidence and discuss their shortcomings.

### 4.3 The EFE and the FEF

While the two formulations might initially look very similar, the key difference is the variational term. The FEF, analogous to the VFE, measures the difference between a variational posterior $Q(x_\tau|o_\tau)$ and the generative model. The EFE, on the other hand, measures the difference between a variational prior $Q(x_\tau|\pi)$ and the generative model. It is this difference that makes the EFE not a straightforward extension of the VFE to future time steps, and that underwrites its unique epistemic value term.

Equation 4.2 demonstrates that the FEF and the EFE can be decomposed in a similar fashion. We note that the extrinsic value term involves a likelihood for the FEF but a marginal for the EFE. The most important difference, however, lies in the sign of the epistemic value term. Since optimizing either the FEF or the EFE requires their minimization, minimizing the FEF mandates minimizing information gain, while minimizing the EFE requires maximizing it. An FEF agent thus tries to maximize its extrinsic value while exploring as little as possible. A key question then arises: Where does the negative information gain in the EFE come from?
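The relation between the two functionals can be stated compactly (a sketch; the second term is the expected information gain):

$$
\mathrm{EFE}_\tau(\pi) \;=\; \mathrm{FEF}_\tau(\pi) \;-\; \mathbb{E}_{Q(o_\tau|\pi)}\Big[D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]\Big]
$$

That is, the epistemic drive of the EFE arises precisely from subtracting the expected information gain from the FEF.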

While this proof illustrates the relation between the EFE and the FEF, it is theoretically unsatisfying as an account of the origin of the EFE. A large part of the appeal of the EFE is that it purports to show that epistemic value arises “naturally” out of minimizing free energy into the future. In contrast, here we have shown that minimizing free energy into the future requires no commitment to exploratory behavior. While this does not question the usefulness of using an information gain term for exploration, or the use of the EFE as a loss function, it does raise questions about the mathematically principled nature of the objective. It is thus not straightforward to see why agents are directly mandated by the FEP to minimize the EFE specifically, as opposed to some other free energy functional. While this fact may at first appear concerning, we believe it ultimately enhances the power of the formalism by licensing the extension of active inference to encompass other objective functions in a principled manner (Biehl, Guckelsberger, Salge, Smith, & Polani, 2018). In the following section, we propose an alternative objective to the EFE, which results in the same information-seeking epistemic value term, but derives it in a mathematically principled and intuitive way as a bound on the divergence between expected and desired futures.

## 5 Free Energy of the Expected Future

The FEEF can be interpreted as the divergence between a veridical and a biased generative model, and thus furnishes a direct intuition of the goals of a FEEF-minimizing agent. The divergence objective compels the agent to bring the biased and the veridical generative model into alignment. Since the predictions of the biased generative model are heavily biased toward the agent's a priori preferences, the only way to achieve this alignment is to act so as to make the veridical generative model predict desired outcomes in line with the biased generative model. The FEEF objective encompasses the standard active inference intuition of an agent acting through biased inference to maximize accuracy of a biased model. However, the maintenance of two separate generative models (one biased and one veridical) also helps finesse the conceptual difficulty of how the agent manages to make accurate posterior inferences and future predictions about complex dynamics if all it has access to is a biased generative model. It seems plausible that the biased model would also bias these crucial parts of inference, which need to be unimpaired for the scheme to function at all. However, by keeping both a veridical generative model (the same one used at the present time and learned through environmental interactions) and a biased generative model (created by systematically biasing a temporary copy of the veridical model), we elegantly accommodate the need for both veridical and biased inferential components in future prediction.^{9}
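A sketch of this divergence interpretation in the letter's notation, where $Q$ denotes the veridical predictive model and $\tilde{p}$ the biased one:

$$
\mathrm{FEEF}_\tau(\pi) \;=\; D_{KL}\Big[\,Q(o_\tau, x_\tau|\pi)\;\Big\|\;\tilde{p}(o_\tau, x_\tau)\,\Big]
$$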

The first thing to note is that the intrinsic value terms of the FEEF and the EFE are identical under the assumption that the variational posterior is approximately correct, $Q(x_\tau|o_\tau) \approx p(x_\tau|o_\tau)$, such that FEEF-minimizing agents will necessarily show identical epistemic behavior to EFE-minimizing agents. Unlike the EFE, however, the FEEF also possesses a strong naturalistic grounding as a bound on a theoretically relevant quantity. The FEEF can maintain both its information-maximizing imperative and its theoretical grounding because it is derived from the minimization of a KL-divergence rather than the maximization of a log model evidence.

The key difference with the EFE lies in the likelihood term. While the EFE simply tries to maximize the expected evidence of the desired observations, the FEEF minimizes the KL-divergence between the likelihood of observations predicted under the veridical generative model^{10} and the marginal likelihood of observations under the biased generative model. This difference is effectively equivalent to an additional expected entropy term over the veridical likelihood, $H[Q(o_\tau|x_\tau)]$, subtracted from the EFE. The extrinsic value term thus encourages the agent to choose its actions such that its predictions over states lead to observations that are close to its preferred observations, while also trying to move to states where the entropy over observations is maximized, thus leading the agent toward states where the generative model is least certain about the likely outcome. In effect, the FEEF possesses another exploratory term, in addition to the information gain, which the EFE lacks.
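Putting the two terms side by side, the decomposition described above can be sketched as:

$$
\mathrm{FEEF}_\tau(\pi) \;=\; \underbrace{\mathbb{E}_{Q(x_\tau|\pi)}\Big[D_{KL}\big[Q(o_\tau|x_\tau)\,\|\,\tilde{p}(o_\tau)\big]\Big]}_{\text{extrinsic value}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau|\pi)}\Big[D_{KL}\big[Q(x_\tau|o_\tau)\,\|\,Q(x_\tau|\pi)\big]\Big]}_{\text{expected information gain}}
$$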

Another important advantage of the FEEF is that it is mathematically equivalent to the VFE (with a biased generative model) at the present time, when a current observation is available. This is because when we have a real observation $\bar{o}_\tau$, the distribution over possible observations collapses to a delta distribution, so that the outer expectation has no effect: $\mathbb{E}_{Q(o_\tau, x_\tau|\pi)} = \int Q(x_\tau|o_\tau)Q(o_\tau|\pi)\,do_\tau = \int Q(x_\tau|o_\tau)\delta(o_\tau - \bar{o}_\tau)\,do_\tau = Q(x_\tau|\bar{o}_\tau)$. Similarly, the veridical model can be factorized as $Q(o_\tau, x_\tau) = Q(x_\tau|o_\tau)Q(o_\tau)$, and when the observation is known, the entropy of the observation marginal $Q(o_\tau|\pi)$ is 0, thus resulting in the VFE. Simultaneously, the biased likelihood is equivalent to the veridical likelihood, $\tilde{p}(\bar{o}_\tau|x_\tau) = Q(\bar{o}_\tau|x_\tau)$, assuming that (barring counterfactual reasoning capability) one cannot usefully desire things to be other than how they are at the present moment. This means that, theoretically, we can consider an agent to be both inferring and planning using the same objective, which is not true of the EFE. The EFE does not reduce to the VFE when observations are known and thus requires a separate objective function for planning compared to perceptual inference. Because of this, it is possible to argue that the FEEF is mandated by the free-energy principle. On this view, there is no distinction between present and future inference, and both follow from minimizing the same objective but under different informational constraints.

Since the FEEF and the EFE are identical in their intrinsic value term and share deep similarities in their extrinsic term, we believe that the FEEF can serve as a relatively straightforward “plug-in replacement” for the EFE for many active inference agents. Moreover, it has a much more straightforward intuitive basis than the EFE, is arguably a better continuation of the VFE into the future, and possesses a strong naturalistic grounding as a bound on the divergence between predicted and desired futures.

## 6 Discussion

We believe it is valuable at this point to step back from the morass of various free energies and take stock of what has been achieved. First, we have shown that it is not possible to directly derive epistemic value from variational inference objectives, which serve as a bound on model evidence. However, it is possible to derive epistemic value terms from divergences between the biased and veridical generative models. A deep intuitive understanding of why this is the case is an interesting avenue for future work. The intuition behind the FEEF as a divergence between desired and expected future observations is also similar to probabilistic formulations of the reinforcement learning problem (Attias, 2003; Kappen, 2005; Levine, 2018; Toussaint, 2009), which typically try to minimize the divergence between a controlled trajectory and an optimal trajectory (Kappen, 2007; Theodorou & Todorov, 2012; Williams, Aldrich, & Theodorou, 2017). These schemes also obtain some degree of (undirected) exploratory behavior through their objective functionals, which contain entropy terms, and the FEEF can be seen as a way of extending these schemes to partially observed environments. Understanding precisely how active inference and the free-energy principle relate mathematically to such schemes is another fruitful avenue for future work.

It seems intuitive that a Bayes-optimal solution to the exploration-exploitation dilemma should arise directly out of the formulation of reward maximization as inference, given that sources of uncertainty are correctly quantified. However, in this letter, we have shown that merely quantifying uncertainty in states and observations through mean-field-factorized time steps is insufficient to derive such a principled solution to the dilemma, as seen in the exploration-discouraging behavior of the FEF. We therefore believe that deriving Bayes-optimal exploration policies in the context of active learning, where an agent must select the actions that yield the most information now in order to maximize reward later, is likely to require modeling multiple interconnected time steps, as well as the mechanics of learning with parameters and update rules, and correctly quantifying the uncertainties therein. This is beyond the scope of this letter but is a very interesting avenue for future work.

The comparison of the FEEF and the EFE also raises an interesting philosophical point about the number and types of generative models employed in the active-inference formalism. One interpretation of the FEEF is in terms of two generative models, but other interpretations are possible, such as between a single unbiased generative model and a simple density of desired states and observations. It is also important to note that due to requiring different objective functions for inference and planning, the EFE formulation also appears to implicitly require two generative models: the generative model of future states and the generative model of states in the future (Friston et al., 2015). While the mathematical formalism is relatively straightforward, the philosophical question of how to translate the mathematical objects into ontological objects called “generative models” remains unclear, and progress on this front would be useful in determining the philosophical status, and perhaps even the neural implementation, of active inference.

The implications of our results for studies of active inference are varied. Nothing in what we have shown argues directly against the use of the EFE as an objective for an active inference agent. However, we believe we have shown that the EFE is not necessarily the only, or even the natural, objective function to use. We thus follow Biehl et al. (2018) in encouraging experimentation with different objective functions for active inference. We especially believe that our objective, the FEEF, has promise due to its intuitive interpretation, its largely equivalent terms to the EFE, its straightforward use of two generative models rather than a single biased one, and its close connections to similar probabilistic objectives used in variational reinforcement learning, while also maintaining the crucial epistemic properties of the EFE. Moreover, while in this letter we have argued for the FEF instead of the EFE as a direct extension of the VFE into the future, the question of exactly which functional (if any) is in fact mandated by the free-energy principle remains open. We believe that elucidating the exact constraints that the free-energy principle places on a theory of variational action, and understanding more deeply the relations between the various free energies, could shed light on deep questions regarding notions of Bayes-optimal epistemic action in self-organizing systems.

Finally, it is important to note that although in this letter we have been solely concerned with the EFE and active inference in discrete-time POMDPs, the original intuitions and mathematical framework of the free-energy principle arose out of a continuous-time formulation, deeply interwoven with concerns from information theory and statistical physics (Friston, 2019; Friston & Ao, 2012; Friston et al., 2006; Parr et al., 2020). As such, there may be deep connections between the EFE, FEF, and log model evidence that exist only in the continuous-time limit and that furnish a mathematically principled origin of epistemic action.

## 7 Conclusion

In this letter, we have examined in detail the nature and origin of the EFE. We have shown that it is not a direct analog of the VFE extended into the future. We then derived a novel objective, the FEF, which we argued is a more natural extension, and showed that it lacks the beneficial epistemic value term of the EFE. We then proved that this term arises in the EFE directly as a result of its nonstandard definition, since the EFE can be expressed as simply the FEF minus the expected information gain. Taking this into account, we then proposed another objective, the free energy of the expected future (FEEF), which attempts to get the best of both worlds by preserving the desirable information-seeking properties of the EFE while also maintaining a mathematically principled origin.

## Appendix A: Variational Inference

Given an approximate posterior distribution $q$, parameterized by some parameters $\varphi$, the goal is to adjust the parameters to make $q$ as close as possible to the true posterior $p(x_t|o_t)$. Mathematically, this means we want to minimize the KL-divergence $\mathrm{KL}[q(x_t;\varphi) \| p(x_t|o_t)]$.

In effect, the variational free energy is useful because it has two properties. The first is that it is an upper bound on the divergence between the true and approximate posterior. By adjusting our approximate posterior to minimize this bound, we drive it closer to the true posterior, thus achieving more accurate inference. Second, the variational free energy is a bound on the log model evidence. This is an important term that scores the likelihood of the data observed given the model and so can be used in Bayesian model selection.

The log model evidence takes on additional importance in terms of the free-energy principle, since the negative log model evidence $-\ln p(o_t)$ is surprisal, which all agents, it is proposed, are driven to minimize (Friston et al., 2006). This is because the expected log model evidence is the entropy of observations, the minimization of which is postulated as a necessary condition for any self-sustaining organism to maintain itself as a unique system. The free-energy minimization comes about since the VFE is, as we have seen, a tractable bound on the log model evidence, or surprisal.
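To make this bounding relation concrete, here is a minimal numerical sketch of a discrete latent-variable model (all numbers are made up for illustration, not drawn from the letter). It shows that the VFE upper-bounds the surprisal $-\ln p(o)$ for an arbitrary approximate posterior, and that the bound is tight at the exact posterior:

```python
import numpy as np

# Discrete generative model p(o, x) = p(o|x) p(x): 4 states, 3 observations.
p_x = np.array([0.25, 0.25, 0.25, 0.25])            # prior p(x)
p_o_given_x = np.array([[0.7, 0.2, 0.1],            # likelihood p(o|x)
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6],
                        [0.3, 0.3, 0.4]])

o = 1                                                # an observed outcome
p_o = np.sum(p_x * p_o_given_x[:, o])                # model evidence p(o)
surprisal = -np.log(p_o)                             # -ln p(o)

# An arbitrary, unoptimized approximate posterior Q(x).
q = np.array([0.4, 0.3, 0.2, 0.1])

# VFE = E_Q[ln Q(x) - ln p(o, x)]: an upper bound on the surprisal.
log_joint = np.log(p_x * p_o_given_x[:, o])
vfe = np.sum(q * (np.log(q) - log_joint))

# With the exact posterior, the bound is tight: VFE = surprisal.
exact_post = p_x * p_o_given_x[:, o] / p_o
vfe_tight = np.sum(exact_post * (np.log(exact_post) - log_joint))
```

Minimizing the VFE over `q` therefore drives the approximate posterior toward the exact one while tightening the bound on surprisal.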

In the first entropy-energy decomposition, we simply split the KL-divergence using the properties of logarithms so that the numerator of the fraction becomes the entropy term and the denominator becomes the energy term. If we are seeking to minimize the variational free energy, we need to minimize both the negative entropy (since entropy is defined as $-\mathbb{E}_{Q(x)}[\ln Q(x)]$) and the negative energy (or, equivalently, maximize the energy $\mathbb{E}_{Q(x_t|o_t;\varphi)}[\ln p(o_t,x_t)]$). This can be interpreted as saying that we require the variational posterior to be as entropic as possible while also maximizing the likelihood that the $x$s proposed as probable by the variational posterior are also judged probable under the generative model.

The second decomposition, into accuracy and complexity, perhaps has a more straightforward interpretation. We wish to minimize the negative accuracy (and thus maximize the accuracy), which means we want the observed observation to be as likely as possible under the $x$s predicted by the variational posterior. However, we also want to minimize the complexity term, which is a KL-divergence between the variational posterior and the prior. That is, we wish to keep our posterior as close to our prior as possible while still maximizing accuracy. The complexity term then functions as a kind of implicit regularizer, ensuring we do not overfit to any specific observation.

The final decomposition speaks to the inferential functions of the VFE. It serves as an upper bound on the negative log model evidence, since the posterior divergence term, as a KL-divergence, is always nonnegative. Moreover, we see that by minimizing the free energy, we must also be minimizing the posterior divergence, which is the difference between the approximate and true posteriors, and we are thus improving our variational approximation.
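The three decompositions above are algebraic rearrangements of the same quantity. A quick numerical check (illustrative numbers only, not from the letter) confirms that they coincide:

```python
import numpy as np

# Discrete model: prior, likelihood, and one observed outcome o.
p_x = np.array([0.5, 0.3, 0.2])
p_o_given_x = np.array([[0.6, 0.4],
                        [0.1, 0.9],
                        [0.5, 0.5]])
o = 0
joint = p_x * p_o_given_x[:, o]                  # p(o, x) at the observed o
p_o = joint.sum()                                # model evidence p(o)

q = np.array([0.6, 0.25, 0.15])                  # approximate posterior Q(x)

# Definition: F = E_Q[ln Q(x) - ln p(o, x)]
F = np.sum(q * (np.log(q) - np.log(joint)))

# Entropy-energy: F = -H[Q] - E_Q[ln p(o, x)]
F_ee = np.sum(q * np.log(q)) - np.sum(q * np.log(joint))

# Accuracy-complexity: F = -E_Q[ln p(o|x)] + KL[Q(x) || p(x)]
F_ac = -np.sum(q * np.log(p_o_given_x[:, o])) + np.sum(q * np.log(q / p_x))

# Evidence plus posterior divergence: F = -ln p(o) + KL[Q(x) || p(x|o)]
post = joint / p_o
F_ev = -np.log(p_o) + np.sum(q * np.log(q / post))
```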

## Appendix B: Decompositions of the EFE

Taking the expectation under the *prior*, not the approximate posterior, results in the risk and ambiguity decomposition of the EFE.^{11}

## Appendix C: Trajectory Derivation of the Expected Model Evidence

The derivation proceeds one time step at a time,^{12} with each step dependent on the past only through the prior term $p(x_t)=\mathbb{E}_{Q(x_{t-1}|o_{t-1})}[p(x_t|x_{t-1})]$. We name this final approximation the factorization approximation; it simply states that the prior at the current time step is the posterior of the previous time step mapped through the transition dynamics $p(x_t|x_{t-1})$:
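Under this factorization approximation, computing the current prior reduces to pushing the previous posterior through the transition matrix. A minimal sketch with made-up numbers:

```python
import numpy as np

# Posterior beliefs Q(x_{t-1} | o_{t-1}) over 3 hidden states (illustrative).
q_prev = np.array([0.7, 0.2, 0.1])

# Transition dynamics p(x_t | x_{t-1}); row i is the distribution over x_t
# given x_{t-1} = i, so each row sums to 1.
B = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])

# p(x_t) = E_{Q(x_{t-1}|o_{t-1})}[ p(x_t | x_{t-1}) ]
# reduces to a vector-matrix product.
prior_t = q_prev @ B
```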

The trajectory derivation of the FEEF follows an almost identical scheme to that of the FEF. The only difference is that the term inside the log now also contains an additional $-\ln \tilde{p}(o)$, which is then combined with the likelihood from the generative model to form the extrinsic-value KL-divergence.

## Appendix D: EFE Bound on the Negative Log Model Evidence

It is important to note that the EFE is also a bound on the negative log model evidence, but a lower bound rather than an upper bound. This means that, in theory, one should want to maximize the EFE, instead of minimizing it, to make the bound as tight as possible.

*is* the log model evidence:

This derivation assumes that the true and approximate posteriors are approximately equal, $p(x_\tau|o_\tau) \approx Q(x_\tau|o_\tau)$, which holds only after a variational inference procedure has been completed.

We wish both to minimize the negative log model evidence and to minimize the EFE. Since the information gain term is a KL-divergence, which is always $\geq 0$, and it enters with a negative sign, the EFE is always less than the negative log model evidence and so is a lower bound. This bound becomes tight when the information gain is 0, so to maximally tighten the bound, we would wish to reduce the information gain, whereas minimizing the EFE demands that we maximize it. In effect, this means that the EFE bound is the wrong way around.

Here, the KL is between the generative model and the approximate posterior; when we then decompose the generative model into a true posterior and a marginal, we can no longer make the assumption, made in the EFE derivation, that the true and approximate posteriors are approximately equal, since that would leave us with only the model evidence. Instead, we obtain a posterior approximation error term, which is the KL-divergence between the approximate and true posteriors. When the true and approximate posteriors are equal, we are left with the log model evidence. Since the posterior approximation error is always $\geq 0$, the FEF is an upper bound on the negative log model evidence, and thus by minimizing the FEF, we make the bound tighter. This logic is essentially a reprise of the standard variational inference logic from a slightly different perspective.

Without the true posterior assumption, we thus find that the EFE could be either an upper or a lower bound on the negative log model evidence, since the two additional KL-divergence terms have opposite signs. If the posterior approximation error is larger than the information gain, the EFE functions correctly as an upper bound. However, if the information gain is larger, the EFE becomes a lower bound and could diverge from the log model evidence. This latter situation is the more likely one, since the goal of variational inference is to reduce the approximation error, while EFE agents seek to maximize information gain. This means that the EFE functions correctly as an upper bound on the negative log model evidence only during the early stages of optimization, when the posterior approximation is poor. Further optimization steps likely drive the EFE further away from the model evidence. The bound is tight when the information gain equals the posterior approximation error. We can also see that the first two terms of the EFE are simply the FEF; we have thus rederived, by a rather roundabout route, the fact that the EFE is simply the FEF minus the information gain.
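The bound-direction argument can be checked numerically. The sketch below (all numbers invented for illustration; `kl` is a helper we define) uses the decompositions described above, with the FEF written as negative log evidence plus posterior approximation error and the EFE as FEF minus information gain:

```python
import numpy as np

def kl(a, b):
    """KL-divergence between two discrete distributions."""
    return np.sum(a * np.log(a / b))

# Model (illustrative numbers): prior, likelihood, and one observation o.
p_x = np.array([0.5, 0.5])
p_o_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
o = 0
p_o = np.sum(p_x * p_o_given_x[:, o])
true_post = p_x * p_o_given_x[:, o] / p_o

Qx = np.array([0.5, 0.5])                      # variational prior Q(x)

# Case 1: exact approximate posterior. The posterior error vanishes, so the
# EFE sits BELOW the negative log evidence by exactly the information gain.
Q_post = true_post
fef = -np.log(p_o) + kl(Q_post, true_post)     # posterior error term is 0
efe = fef - kl(Q_post, Qx)                     # EFE = FEF - information gain

# Case 2: inaccurate posterior close to the prior. The posterior error now
# exceeds the (small) information gain, so the EFE sits ABOVE the negative
# log evidence: the upper-bound regime.
Q_bad = np.array([0.55, 0.45])
fef_bad = -np.log(p_o) + kl(Q_bad, true_post)
efe_bad = fef_bad - kl(Q_bad, Qx)
```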

We thus see that the status of the EFE as a bound on the negative log model evidence is shaky, since it depends on the relative magnitudes of the information gain and the posterior approximation error. Moreover, the bounding behavior seems to emerge directly from the relation of the EFE to the FEF rather than from the intrinsic qualities of the EFE, and it is primarily the information-seeking properties of the EFE that serve to damage the clean bounding behavior of the FEF.

It can be argued that although the mathematical justification of the EFE as a bound may be shaky, the additional information gain term may be beneficial, and the bound may be recovered in the long run: as a result of short-term actions to minimize the EFE (and thus maximize information gain), the epistemic value itself goes to 0, and thus the EFE exactly approximates the bound, while also potentially increasing the ultimate expected reward achieved. This argument is valid heuristically and is identical to the standard justifications for ad hoc intrinsic-measure terms in the literature (Oudeyer & Kaplan, 2009): namely, that exploration hurts in the short run but helps in the long run. We do not dispute that argument in this letter; instead, we simply show that the EFE cannot straightforwardly be justified mathematically as a result of variational inference into the future or as a bound on model evidence. We do not argue at all against its heuristic use to encourage exploration of the environment and thus (we hope) better performance overall.

## Appendix E: Attempts at Naturalizing the EFE

In this appendix, we review several attempts to derive the EFE directly from the expected model evidence.

While this approach gets the correct form of the EFE inside the expectation, the expectation itself is taken under the product of the two marginals rather than under the joint required for the full EFE. While this may seem minor, this difference must underpin all the other differences and relations we have explored throughout this letter.

To get to the full EFE, we must make some assumptions that allow us to combine the expectation under the two marginals into an expectation under the joint. The first and simplest assumption is that they are the same, such that the joint factorizes into the two marginals: $Q(o_\tau,x_\tau|\pi) \approx Q(o_\tau|\pi)Q(x_\tau|\pi)$. This assumption is equivalent to assuming independence of observations and latent states, which rather defeats the point of a latent variable model.
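The gap between the joint and the product of marginals is exactly the mutual information between observations and latent states, so the factorization assumption holds only when observations carry no information about the latents. A small numerical sketch (made-up numbers) shows the gap is generally nonzero:

```python
import numpy as np

# Joint Q(x, o) built from a marginal and a likelihood (illustrative numbers).
Qx = np.array([0.6, 0.4])
Q_o_given_x = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
joint = Qx[:, None] * Q_o_given_x          # Q(x, o)
Qo = joint.sum(axis=0)                     # marginal Q(o)
product = np.outer(Qx, Qo)                 # Q(x) Q(o)

# KL[ Q(x, o) || Q(x)Q(o) ] is the mutual information between x and o;
# the factorization assumption is exact only when this vanishes.
mutual_info = np.sum(joint * np.log(joint / product))
```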

A second approach is to assume that the variational prior equals the variational posterior, $Q(x_\tau|\pi) \approx Q(x_\tau|o_\tau)$. This allows one to combine the marginal and posterior into a joint, giving the EFE as desired. However, this assumption has several unfortunate consequences. First, it eliminates the entire idea of inference: since the prior and posterior are assumed to be the same, no real inference can have taken place. This is not necessarily an issue if we separate the inference and planning stages of the algorithm so that they optimize different objective functions; however, the FEEF approach is more elegant, as it enables the optimization of the same objective function for both inference and planning, thus casting them as different facets of the same underlying process. A more serious issue is that this assumption also eliminates the information gain term in active inference: since the prior and posterior are the same, the divergence between them (which is the information gain) must be zero.

This proof derives the FEF as a bound not on the expected model evidence by our definition, but on the entropy of expected observations given a policy. The EFE is then derived from the FEF by assuming that the prior and posterior are the same, which comes with all the drawbacks explained above. This proof is primarily unworkable because of the assumption that the prior and the posterior are identical. While this may be arguable in the continuous-time limit, where it is equivalent to the assumption that $\frac{dQ(x|o)}{dt} \approx 0$, which holds when continuous-time inference has reached equilibrium, it is definitely not true in discrete time; although there is a relation between the prior in the current time step and the posterior in the previous one, the posterior must be mapped through the transition dynamics: $Q(x_t|\pi)=\mathbb{E}_{Q(x_{t-1}|\pi)}[p(x_t|x_{t-1},\pi)]$.

## Appendix F: Related Quantities

Recently a new free energy, the generalized free energy (GFE) (Parr & Friston, 2019), has been proposed in the literature as an alternative or an extension to the EFE. The GFE shares some close similarities with the FEEF. Both fundamentally extend the EFE by proposing a unified objective function that is valid both for inference at the current time and for planning into the future, whereas the EFE can only be used for planning. Moreover, both the GFE and the FEEF encode future observations as latent unobserved variables, over which posterior beliefs can be formed. In addition, agents maintain prior beliefs over these variables, which encode their preferences or desires.^{13}

There are two key differences, both mathematical and intuitive, between the GFE and the FEEF. The first is that the GFE maintains a factorized posterior over states and observations, where the posterior beliefs over the two are separated by a mean-field approximation and assumed to be independent. By contrast, the FEEF maintains a joint approximate belief over both observations and states simultaneously. In the case of the FEEF, this joint effectively functions as a veridical generative model, since $Q(o|x)=p(o|x)$ and $Q(x_t)=\mathbb{E}_{Q(x_{t-1}|\pi)}[p(x_t|x_{t-1})]$. This means that posterior beliefs about the future are computed simply by rolling the generative model forward given the beliefs about the current time.

A second and more important difference lies in the generative models. The GFE assumes that the agent is equipped with only a single generative model with both veridical and biased components. The preferences of a GFE agent are encoded as a separate factorizable marginal over observations, so that the generative model of the GFE agent factorizes as $\tilde{p}(o,x)_{GFE} \propto p(o|x)p(x)\tilde{p}(o)$. For the GFE, the likelihood and the prior are thus unbiased, and there is simply an additional prior-preferences term in the free-energy expression. By contrast, the FEEF eschews this unusual factorization of the generative model and instead presupposes a separate warped generative model for use in the future that is intrinsically biased. The FEEF generative model thus decomposes as $\tilde{p}(o,x)_{FEEF}=\tilde{p}(o|x)\tilde{p}(x)$, which is the standard factorization of the joint distribution in a generative model, but where both the likelihood and prior distributions are biased toward generating more favorable states of affairs for the agent. This inherent optimism bias then drives action.
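To illustrate the contrast, the following sketch constructs the two biased joints: the GFE-style globally normalized product and the FEEF-style biased likelihood and prior. All densities here are invented for illustration, including the particular bias chosen for the FEEF components:

```python
import numpy as np

# Shared unbiased components (illustrative numbers).
p_x = np.array([0.5, 0.5])                      # prior p(x)
p_o_given_x = np.array([[0.9, 0.1],             # likelihood p(o|x)
                        [0.2, 0.8]])
p_tilde_o = np.array([0.8, 0.2])                # preference density over o

# GFE-style: p~(o, x) proportional to p(o|x) p(x) p~(o), with a single
# global normalization over the whole joint.
gfe_joint = p_o_given_x * p_x[:, None] * p_tilde_o[None, :]
gfe_joint /= gfe_joint.sum()

# FEEF-style: a separately biased likelihood and prior,
# p~(o, x) = p~(o|x) p~(x), normalized by construction.
p_tilde_x = np.array([0.7, 0.3])                # biased prior (made up)
p_tilde_o_given_x = np.array([[0.95, 0.05],     # biased likelihood (made up)
                              [0.30, 0.70]])
feef_joint = p_tilde_o_given_x * p_tilde_x[:, None]
```

The design difference is visible in the code: the GFE joint only becomes a probability distribution after an explicit renormalization, whereas the FEEF joint inherits normalization from its standard likelihood-times-prior factorization.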

A further free energy proposed in the literature has been the Bethe free energy and the Bethe approximation (Schwöbel et al., 2018). This approach eschews the standard mean-field assumption on the approximate posterior in favor of a Bethe approximation from statistical physics (Yedidia, Freeman, & Weiss, 2001, 2005), which instead represents the approximate posterior as the product of pairwise marginals, thus preserving a constraint of pairwise temporal consistency that the mean-field assumption lacks. Due to this greater representation of temporal constraints (the approximate posteriors at each time step being no longer assumed to be independent), the Bethe free energy has the potential to be significantly more accurate than the standard mean-field variational free energy (and is, in fact, exact for factor graphs without cycles such as the standard nonhierarchical POMDP model). In this letter, we focus entirely on the standard mean-field variational free energy used in the vast majority of active inference publications, and thus the Bethe free energy is out of scope for this article. However, exploring the nature of any intrinsic terms that might arise from the Bethe free energy is an interesting avenue for future work. Although primarily focused on the Bethe free energy, Schwöbel et al. (2018) also introduced a “predicted free energy” functional. This functional is equivalent to the FEF as we have defined it here, and so has a complexity instead of an information gain term, leading to minimizing the prior-posterior divergence.

Finally, Biehl et al. (2018) suggested that if the EFE is not mandated by the free-energy principle, as we have argued in this letter, then in theory any standard intrinsic measure, such as empowerment, could be used as an objective. We believe that exploring the effect of these other potential loss functions could be an area of great interest for future work.

## Notes

^{1}

It is important to note that the original FEP was formulated in continuous time with generalized coordinates (Friston, 2008a; Friston et al., 2006) (where the hidden states are augmented with their temporal derivatives up to a theoretically infinite order). The generalized coordinates mean that the agent is effectively performing variational inference over a Taylor-expanded future trajectory instead of a temporally instant hidden state (Friston, 2008a; Friston et al., 2008). Action is derived by minimizing the gradients of the instantaneous VFE with respect to action, which requires the use of a forward model. More recent work on active inference and the FEP returns to the continuous-time formulation (Friston, 2019; Parr, Da Costa, & Friston, 2020), and the conclusions drawn in this article may look different in the continuous-time domain.

^{2}

It is important to note that this encoding of preferences through a biased generative model is unique to active inference. Other variational control schemes (Levine, 2018; Rawlik, Toussaint, & Vijayakumar, 2013; Rawlik, 2013; Theodorou, Buchli, & Schaal, 2010; Theodorou & Todorov, 2012) instead encode desires through binary optimality variables and optimize the posterior given that the optimal path was taken. The relation between these frameworks is explored further in Millidge, Tschantz, Seth, and Buckley (2020).

^{3}

Some more recent work (Da Costa et al., 2020; Friston, 2019) prefers an alternative factorization of the biased generative model in terms of an unbiased likelihood and a biased prior state distribution: $\tilde{p}(o_\tau,x_\tau)=p(o_\tau|x_\tau)\tilde{p}(x_\tau)$. This leads to a different decomposition of the EFE, in terms of risk and ambiguity (see appendix B), which is mathematically equivalent to the factorization described here.

^{4}

For additional information on the effect of this assumption, see appendix D.

^{5}

The approximation in the final line of equation 3.1 is that we assume that the true and approximate posteriors are the same: $Q(x_\tau|o_\tau) \approx p(x_\tau|o_\tau)$. Without this assumption, one obtains an additional KL-divergence between the true and approximate posterior, which exactly quantifies the discrepancy between them (see appendices B and D for more detail).

^{6}

An alternative motivation exists that situates the expected free energy in terms of a nonequilibrium steady-state distribution (Da Costa et al., 2020; Friston, 2019; Parr, 2019). This argument reframes everything in terms of a Gibbs free energy, from which the EFE can be derived as a special case. The problem becomes, then, one of the motivation of the Gibbs free energy as an objective function.

^{7}

An objective functional equivalent to the FEF—the predicted free energy—has also been proposed in Schwöbel, Kiebel, and Marković (2018). See appendix F for more details.

^{8}

There is a slight additional subtlety here involving the fact that there is also a posterior approximation error term that is positive. In general, the EFE functions as an upper bound when the posterior error is greater than the information gain and a lower bound when the posterior error is smaller. Since the goal of variational inference is to minimize posterior error, and EFE agents are driven to maximize expected information gain, we expect this latter condition to occur rarely. For more detail, see appendix D.

^{9}

This approach bears a resemblance to that taken in Friston (2019), which separates the evolving dynamical policy-dependent density of the agent and a desired steady-state density that is policy invariant. This approach arises from deep thermodynamic considerations in continuous time, while ours is applicable to discrete time reinforcement learning frameworks.

^{10}

The term *veridical* needs some contextualizing. We simply mean that the model is not biased toward the agent's desires. The veridical generative model is not required to be a perfectly accurate map of the agent's entire world, only of action-relevant submanifolds of the total space (Tschantz, Seth et al., 2019).

^{11}

For further detail on this factorization see Da Costa et al. (2020).

^{12}

We assume discrete time so there is a sum over time steps. We also assume continuous states so there is an integral over states $x$. However, the derivation is identical in the case of discrete states where the integral is simply replaced with a sum.

^{13}

To help make clear the similarity between the GFE and the FEEF, we have defined the veridical generative model as $Q(o_\tau,x_\tau)$.

## Acknowledgments

B.M. is supported by an EPSRC-funded PhD studentship. A.T. is funded by a PhD studentship from the Dr. Mortimer and Theresa Sackler Foundation and the School of Engineering and Informatics at the University of Sussex. C.L.B. is supported by BBSRC grant BB/P022197/1. A.T. is grateful to the Dr. Mortimer and Theresa Sackler Foundation, which supports the Sackler Centre for Consciousness Science.