Abstract
Active inference is a probabilistic framework for modeling the behavior of biological and artificial agents, which derives from the principle of minimizing free energy. In recent years, this framework has been applied successfully to a variety of situations where the goal was to maximize reward, often offering comparable and sometimes superior performance to alternative approaches. In this article, we clarify the connection between reward maximization and active inference by demonstrating how and when active inference agents execute actions that are optimal for maximizing reward. Precisely, we show the conditions under which active inference produces the optimal solution to the Bellman equation, a formulation that underlies several approaches to model-based reinforcement learning and control. On partially observed Markov decision processes, the standard active inference scheme can produce Bellman optimal actions for planning horizons of 1 but not beyond. In contrast, a recently developed recursive active inference scheme (sophisticated inference) can produce Bellman optimal actions on any finite temporal horizon. We append the analysis with a discussion of the broader relationship between active inference and reinforcement learning.
1 Introduction
1.1 Active Inference
Active inference is a normative framework for modeling intelligent behavior in biological and artificial agents. It simulates behavior by numerically integrating equations of motion thought to describe the behavior of biological systems, a description based on the free energy principle (Barp et al., 2022; Friston et al., 2022). Active inference comprises a collection of algorithms for modeling perception, learning, and decision making in the context of both continuous and discrete state spaces (Barp et al., 2022; Da Costa et al., 2020; Friston et al., 2021, 2010; Friston, Parr, et al., 2017). Briefly, building active inference agents entails (1) equipping the agent with a (generative) model of the environment, (2) fitting the model to observations through approximate Bayesian inference by minimizing variational free energy (i.e., optimizing an evidence lower bound; Beal, 2003; Bishop, 2006; Blei et al., 2017; Jordan et al., 1998), and (3) selecting actions that minimize expected free energy, a quantity that can be decomposed into risk (i.e., the divergence between predicted and preferred paths) and ambiguity, leading to context-specific combinations of exploratory and exploitative behavior (Millidge, 2021; Schwartenbeck et al., 2019). This framework has been used to simulate and explain intelligent behavior in neuroscience (Adams et al., 2013; Parr, 2019; Parr et al., 2021; Sajid et al., 2022), psychology and psychiatry (Smith, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Kuplicki, Feinstein, et al., 2020; Smith, Kuplicki, Teed, et al., 2020; Smith, Mayeli, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022), machine learning (Çatal et al., 2020; Fountas et al., 2020; Mazzaglia et al., 2021; Millidge, 2020; Tschantz et al., 2019; Tschantz, Millidge, et al., 2020), and robotics (Çatal et al., 2021; Lanillos et al., 2020; Oliver et al., 2021; Pezzato et al., 2020; Pio-Lopez et al., 2016; Sancaktar et al., 2020; Schneider et al., 2022).
1.2 Reward Maximization through Active Inference?
In contrast, the traditional approaches to simulating and explaining intelligent behavior—stochastic optimal control (Bellman, 1957; Bertsekas & Shreve, 1996) and reinforcement learning (RL; Barto & Sutton, 1992)—derive from the normative principle of executing actions to maximize reward, a signal scoring the utility afforded by each state of the world. This idea dates back to expected utility theory (Von Neumann & Morgenstern, 1944), an economic model of rational choice behavior, which also underwrites game theory (Von Neumann & Morgenstern, 1944) and decision theory (Berger, 1985; Dayan & Daw, 2008). Several empirical studies have shown that active inference can successfully perform tasks that involve collecting reward, often (but not always) showing comparable or superior performance to RL (Cullen et al., 2018; Marković et al., 2021; Mazzaglia et al., 2021; Millidge, 2020; Paul et al., 2021; Sajid, Ball, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022; van der Himst & Lanillos, 2020) and marked improvements when interacting with volatile environments (Marković et al., 2021; Sajid, Ball, et al., 2021). Given the prevalence and historical pedigree of reward maximization, we ask: How and when do active inference agents execute actions that are optimal with respect to reward maximization?
1.3 Organization of Paper
In this article, we explain (and prove) how and when active inference agents exhibit (Bellman) optimal reward-maximizing behavior.
For this, we start by restricting ourselves to the simplest problem: maximizing reward on a finite horizon Markov decision process (MDP) with known transition probabilities—a sequential decision-making task with complete information. In this setting, we review the backward-induction algorithm from dynamic programming, which forms the workhorse of many optimal control and model-based RL algorithms. This algorithm furnishes a Bellman optimal state-action mapping, which means that it provides provably optimal decisions from the point of view of reward maximization (see section 2).
We then introduce active inference on finite horizon MDPs (see section 3)—a scheme consisting of perception as inference followed by planning as inference, which selects actions so that future states best align with preferred states.
In section 4, we show how and when active inference maximizes reward in MDPs. Specifically, when the preferred distribution is a (uniform mixture of) Dirac distribution(s) over reward-maximizing trajectories, selecting action sequences according to active inference maximizes reward (see section 4.1). Yet active inference agents, in their standard implementation, can select actions that maximize reward only when planning one step ahead (see section 4.2). It takes a recursive, sophisticated form of active inference to select actions that maximize reward—in the sense of a Bellman optimal state-action mapping—on any finite time-horizon (see section 4.3).
In section 5, we introduce active inference on partially observable Markov decision processes with known transition probabilities—a sequential decision-making task where states need to be inferred from observations—and explain how the results from the MDP setting generalize to this setting.
In section 6, we step back from the focus on reward maximization and briefly discuss decision making beyond reward maximization, learning unknown environments and reward functions, and outstanding challenges in scaling active inference. We append this with a broader discussion of the relationship between active inference and reinforcement learning in appendix A.
Our findings are summarized in section 7.
All of our analyses assume that the agent knows the environmental dynamics (i.e., transition probabilities) and reward function. In appendix A, we discuss how active inference agents can learn their world model and rewarding states when these are initially unknown—and the broader relationship between active inference and RL.
2 Reward Maximization on Finite Horizon MDPs
In this section, we consider the problem of reward maximization in Markov decision processes (MDPs) with known transition probabilities.
2.1 Basic Definitions
MDPs are a class of models specifying environmental dynamics widely used in dynamic programming, model-based RL, and more broadly in engineering and artificial intelligence (Barto & Sutton, 1992; Stone, 2019). They are used to simulate sequential decision-making tasks with the objective of maximizing a reward or utility function. An MDP specifies environmental dynamics unfolding in discrete space and time under the actions pursued by an agent.
$S$, a finite set of states.
$\{0, 1, \ldots, T\}$, a finite set that stands for discrete time. $T$ is the temporal horizon (a.k.a. planning horizon).
$A$, a finite set of actions.
$P(s_{\tau+1} \mid s_\tau, a_\tau)$, the probability that action $a_\tau \in A$ in state $s_\tau \in S$ at time $\tau$ will lead to state $s_{\tau+1} \in S$ at time $\tau + 1$. $S_0, \ldots, S_T$ are random variables over $S$ that correspond to the state being occupied at time $\tau = 0, \ldots, T$.
$P(s_0)$, the probability of being at state $s_0 \in S$ at the start of the trial.
$R(s)$, the finite reward received by the agent when at state $s \in S$.
(On the Definition of Reward). More generally, the reward function can be taken to depend on the previous action and previous state: $R(s_{\tau+1}, a_\tau, s_\tau)$ is the reward received after transitioning from state $s_\tau$ to state $s_{\tau+1}$ due to action $a_\tau$ (Barto & Sutton, 1992; Stone, 2019). However, given an MDP with such a reward function, we can recover our simplified setting by defining a new MDP whose states comprise the previous action, previous state, and current state of the original MDP. By inspection, the resulting reward function on the new MDP depends only on the current state.
(Admissible Actions). In general, it is possible that only some actions can be taken at each state. In this case, one defines $A(s) \subseteq A$ to be the finite set of (allowable) actions from state $s \in S$. All forthcoming results concerning MDPs can be extended to this setting.
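To make definition 1 concrete, the following minimal Python sketch encodes its data; the container and field names are illustrative choices rather than notation from the text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteHorizonMDP:
    """Data of a finite horizon MDP (definition 1); field names are illustrative."""
    n_states: int                # |S|
    n_actions: int               # |A|
    horizon: int                 # T, the temporal (planning) horizon
    transition: np.ndarray       # shape (A, S, S): P(s' | s, a)
    reward: np.ndarray           # shape (S,): R(s), reward received in state s
    init: np.ndarray             # shape (S,): P(s_0)

    def __post_init__(self):
        # Each transition matrix row and the initial distribution must be probabilities.
        assert np.allclose(self.transition.sum(axis=-1), 1.0)
        assert np.isclose(self.init.sum(), 1.0)
```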
To formalize what it means to choose actions in each state, we introduce the notion of a state-action policy.
(Time-Dependent State-Action Policies). The way an agent chooses actions at the end of its life will usually differ from the way it chooses them when it has a longer life ahead of it. In finite horizon decision problems, state-action policies should therefore be taken to be time-dependent, as time-independent optimal state-action policies may not exist. To see this, consider the following simple example: the states are the integers mod 5, actions move one step left or right, the temporal horizon is 2, and reward is available only at states 4 and 1, with a larger reward at state 4. Optimal state-action policies are necessarily time-dependent, as the reward-maximizing trajectory from state 2 at time 0 consists of reaching state 4, while the optimal trajectory from state 2 at time 1 consists of reaching state 1. This is particular to finite-horizon decisions; in infinite-horizon (discounted) problems, optimal state-action policies can always be taken to be time-independent (Puterman, 2014, theorem 6.2.7).
(Conflicting Terminologies: Policy in Active Inference). In active inference, a policy is defined as a sequence of actions indexed in time.1 To avoid terminological confusion, we use action sequences to denote policies under active inference.
2.2 Bellman Optimal State-Action Policies
A state-action policy is Bellman optimal if it is better than all alternatives; that is, if from every state and at every time it yields at least as high an expected return as any other state-action policy.
It is important to verify that this concept is not vacuous.
(Existence of Bellman Optimal State-Action Policies). Given a finite horizon MDP as specified in definition 1, there exists a Bellman optimal state-action policy $\Pi$.
A proof is found in appendix B.1. Note that the uniqueness of the Bellman optimal state-action policy is not implied by proposition 1; indeed, multiple Bellman optimal state-action policies may exist (Bertsekas & Shreve, 1996; Puterman, 2014).
Now that we know that Bellman optimal state-action policies exist, we can characterize them as a return-maximizing action followed by a Bellman optimal state-action policy.
(Characterization of Bellman Optimal State-Action Policies). For a state-action policy $\Pi$, the following are equivalent:
$\Pi$ is Bellman optimal.
$\Pi$ is both
- Bellman optimal when restricted to the time steps $1, \ldots, T$; in other words, the restriction of $\Pi$ to these times is a Bellman optimal state-action policy on the MDP that begins at time 1, and
- at time 0, selects actions that maximize return:
$$\Pi(a_0 \mid s_0) > 0 \implies a_0 \in \underset{a \in A}{\arg\max}\; \mathbb{E}\left[\sum_{\tau=0}^{T} R(S_\tau) \,\middle|\, A_0 = a, S_0 = s_0, \Pi\right]. \quad (2.1)$$
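To make the notion of return in equation 2.1 concrete, here is a sketch of policy evaluation by backward recursion, reusing the hypothetical FiniteHorizonMDP container above; it computes the expected cumulative reward of a given time-dependent state-action policy.

```python
import numpy as np

def expected_return(mdp, policy):
    """Expected cumulative reward under a time-dependent state-action policy.

    policy: array of shape (T, S, A); policy[t, s, a] = probability of action a
        in state s at time t. Returns V with V[t, s] = expected sum of rewards
        from time t to the horizon T, starting in state s.
    """
    T, S = mdp.horizon, mdp.n_states
    V = np.zeros((T + 1, S))
    V[T] = mdp.reward                     # at the final time, only R(s_T) remains
    for t in reversed(range(T)):
        # Expected next-step value of each (state, action) pair.
        Q = mdp.transition @ V[t + 1]     # shape (A, S): sum_s' P(s'|s,a) V[t+1, s']
        V[t] = mdp.reward + np.einsum('sa,as->s', policy[t], Q)
    return V
```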
2.3 Backward Induction
Proposition 2 suggests a straightforward recursive algorithm to construct Bellman optimal state-action policies known as backward induction (Puterman, 2014). Backward induction has a long history. It was developed by the German mathematician Zermelo in 1913 to prove that chess has Bellman optimal strategies (Zermelo, 1913). In stochastic control, backward induction is one of the main methods for solving the Bellman equation (Adda & Cooper, 2003; Miranda & Fackler, 2002; Sargent, 2000). In game theory, the same method is used to compute subgame perfect equilibria in sequential games (Fudenberg & Tirole, 1991).
Backward induction entails planning backward in time, from a goal state at the end of a problem, by recursively determining the sequence of actions that enables reaching the goal. It proceeds by first considering the last time at which a decision might be made and choosing what to do in any situation at that time in order to get to the goal state. Using this information, one can then determine what to do at the second-to-last decision time. This process continues backward until one has determined the best action for every possible situation or state at every point in time.
A proof of proposition 3 (the backward induction algorithm, equation 2.2) is in appendix B.3.
Intuitively, the backward induction algorithm (equation 2.2) consists of planning backward: starting from the end goal and working out the actions needed to achieve it. To give a concrete example of this kind of planning, backward induction would consider the following actions in the order shown:
Desired goal: I would like to go to the grocery store.
Intermediate action: I need to drive to the store.
Current best action: I should put my shoes on.
Proposition 3 tells us that to be optimal with respect to reward maximization, one must plan like backward induction. This will be central to our analysis of reward maximization in active inference.
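As an illustration, the following sketch implements backward induction on a variant of the integers-mod-5 example from section 2.1, reusing the hypothetical FiniteHorizonMDP container above. The specific reward values (R(4) = 2, R(1) = 1) are assumed here for illustration only; the point is that the optimal action in state 2 differs between times 0 and 1.

```python
import numpy as np

def backward_induction(mdp):
    """Backward induction: build a Bellman optimal time-dependent policy.

    Returns (policy, V) where policy[t, s] is an optimal action at time t in
    state s, and V[t, s] is the corresponding optimal expected return.
    """
    T, S = mdp.horizon, mdp.n_states
    V = np.zeros((T + 1, S))
    V[T] = mdp.reward
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):                  # plan backward in time
        Q = mdp.transition @ V[t + 1]             # (A, S): value of each action in each state
        policy[t] = np.argmax(Q, axis=0)
        V[t] = mdp.reward + np.max(Q, axis=0)
    return policy, V

# Integers-mod-5 illustration; rewards below are assumed for illustration only.
S, A, T = 5, 2, 2                                 # actions: 0 -> step -1, 1 -> step +1
P = np.zeros((A, S, S))
for s in range(S):
    P[0, s, (s - 1) % S] = 1.0
    P[1, s, (s + 1) % S] = 1.0
R = np.zeros(S); R[4], R[1] = 2.0, 1.0
mdp = FiniteHorizonMDP(S, A, T, P, R, np.eye(S)[2])
policy, V = backward_induction(mdp)
print(policy[0, 2], policy[1, 2])                 # optimal action in state 2 differs across time
```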
3 Active Inference on Finite Horizon MDPs
We now turn to introducing active inference agents on finite horizon MDPs with known transition probabilities. We assume that the agent's generative model of its environment is given by the previously defined finite horizon MDP (see definition 1). We do not consider the case where the transitions have to be learned but comment on it in appendix A.2 (see also Da Costa et al., 2020; Friston et al., 2016).
3.1 Perception as Inference
In active inference, perception entails inferences about future, past, and current states given observations and a sequence of actions. When states are partially observed, this is done through variational Bayesian inference by minimizing a free energy functional also known as an evidence bound (Beal, 2003; Bishop, 2006; Blei et al., 2017; Wainwright & Jordan, 2007).
3.2 Planning as Inference
Now that the agent has inferred future states given alternative action sequences, we must assess these alternative plans by examining the resulting state trajectories. The objective that active inference agents optimize—in order to select the best possible actions—is the expected free energy (Barp et al., 2022; Da Costa et al., 2020; Friston et al., 2021). Under active inference, agents minimize expected free energy in order to maintain themselves distributed according to a target distribution over the state-space encoding the agent's preferences.
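For orientation, one standard way of writing the expected free energy of an action sequence $a$, consistent with the risk and ambiguity decomposition mentioned in section 1.1 (the notation below is ours, not a quotation of equation 3.2), is

$$ G(a) = \sum_{\tau=1}^{T} \underbrace{D_{\mathrm{KL}}\big[Q(s_\tau \mid a) \,\|\, C(s_\tau)\big]}_{\text{risk}} \;+\; \underbrace{\mathbb{E}_{Q(s_\tau \mid a)} \mathrm{H}\big[P(o_\tau \mid s_\tau)\big]}_{\text{ambiguity}}. $$

On MDPs, where states are observed directly, the ambiguity term vanishes and minimizing expected free energy reduces to minimizing risk, that is, the divergence between predicted and preferred state trajectories.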
Expression 3.5 is much easier to handle: for each action sequence, one evaluates the summands sequentially over time steps, and if and when the partial sum becomes significantly higher than the lowest expected free energy encountered during planning, the expected free energy of that action sequence is set to an arbitrarily high value. Setting it to a high value is equivalent to pruning away unlikely trajectories. This bears some similarity to decision tree pruning procedures used in RL (Huys et al., 2012). It finesses exploration of the decision tree in full depth and provides an Occam's window for selecting action sequences.
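A minimal sketch of this pruning procedure follows; the function name, the numeric tolerance, and the dictionary representation of action sequences are illustrative assumptions.

```python
import numpy as np

def prune_action_sequences(efe_terms, window=3.0):
    """Occam's-window-style pruning of action sequences (illustrative sketch).

    efe_terms: dict mapping each candidate action sequence (a tuple) to a list
        of its per-time-step expected free energy summands.
    window: tolerance above the best (lowest) expected free energy seen so far.
    Returns total expected free energies, with pruned sequences set to np.inf.
    """
    best = np.inf
    totals = {}
    for seq, terms in efe_terms.items():
        partial, pruned = 0.0, False
        for g in terms:
            partial += g
            if partial > best + window:    # partial sum already too high: prune
                pruned = True
                break
        totals[seq] = np.inf if pruned else partial
        best = min(best, totals[seq])
    return totals
```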
Complementary approaches can help make planning tractable. For example, hierarchical generative models factorize decisions into multiple levels. By abstracting information at a higher-level, lower levels entertain fewer actions (Friston et al., 2018), which reduces the depth of the decision tree by orders of magnitude. Another approach is to use algorithms that search the decision tree selectively, such as Monte Carlo tree search (Champion, Bowman, et al., 2021; Champion, Da Costa, et al., 2021; Fountas et al., 2020; Maisto et al., 2021; Silver et al., 2016) and amortizing planning using artificial neural networks (i.e., learning to plan) (Çatal et al., 2019; Fountas et al., 2020; Millidge, 2019; Sajid, Tigas, et al., 2021).
4 Reward Maximization on MDPs through Active Inference
Here, we show how active inference solves the reward maximization problem.
4.1 Reward Maximization as Reaching Preferences
From the definition of expected free energy, equation 3.2, active inference on MDPs can be thought of as reaching and remaining at a target distribution over state-space.
The (inverse temperature) parameter $\beta > 0$ scores how motivated the agent is to occupy reward-maximizing states. Note that states that maximize the reward also maximize the target probability and minimize its negative logarithm for any $\beta > 0$.
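As a numerical illustration, take the target distribution to be a softmax of the reward scaled by the inverse temperature, $C(s) \propto \exp(\beta R(s))$, which is one natural reading of the description above (an assumption made here for illustration):

```python
import numpy as np

def target_distribution(reward, beta):
    """Softmax target distribution over states, C(s) proportional to exp(beta * R(s)).

    This functional form is assumed for illustration: beta is an inverse
    temperature scoring how strongly preferences concentrate on
    reward-maximizing states.
    """
    logits = beta * reward
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

R = np.array([0.0, 1.0, 0.0, 0.0, 2.0])
for beta in (0.5, 2.0, 10.0):
    print(beta, np.round(target_distribution(R, beta), 3))
# As beta grows (the zero temperature limit), the target distribution
# concentrates all of its mass on the reward-maximizing state(s).
```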
We now show how reaching preferred states can be formulated as reward maximization:
4.2 Reward Maximization on MDPs with a Temporal Horizon of 1
In this section, we first consider the case of a single-step decision problem (i.e., a temporal horizon of $T = 1$) and demonstrate how the standard active inference scheme maximizes reward on this problem in the zero temperature limit $\beta \to +\infty$. This will act as an important building block for when we subsequently consider more general multistep decision problems.
In summary, this scheme selects the first action within action sequences that, on average, maximize their exponentiated negative expected free energies. As a corollary, if the first action is in a sequence with a very low expected free energy, this adds an exponentially large contribution to the selection of this particular action. We summarize this scheme in Table 1.
Table 1: Summary of the standard active inference scheme on MDPs.

| Process | Computation |
| --- | --- |
| Perceptual inference | |
| Planning as inference | |
| Decision making | |
| Action selection | |
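The decision-making and action-selection steps of Table 1 can be sketched as follows; this is a schematic rendering of the description above (a softmax over negative expected free energies of action sequences, followed by marginalization onto the first action), not the authors' reference implementation.

```python
import itertools
import numpy as np

def select_first_action(G, n_actions):
    """Standard active inference action selection (illustrative sketch).

    G: dict mapping each action sequence (tuple of ints) to its expected free
       energy. The posterior over sequences is a softmax of -G; the selected
       first action is the one with the greatest marginal posterior mass.
    """
    seqs = list(G.keys())
    q = np.exp(-np.array([G[s] for s in seqs]))
    q /= q.sum()                                  # Q(action sequence) proportional to exp(-G)
    marginal = np.zeros(n_actions)
    for seq, p in zip(seqs, q):
        marginal[seq[0]] += p                     # marginalize onto the first action
    return int(np.argmax(marginal))

# Toy usage: two actions, horizon 2, arbitrary expected free energies.
G = {seq: np.random.rand() for seq in itertools.product(range(2), repeat=2)}
print(select_first_action(G, n_actions=2))
```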
A proof is in appendix B.5. Importantly, the standard active inference scheme, equation 4.4, falls short in terms of Bellman optimality on planning horizons greater than one; this rests on the fact that it does not coincide with backward induction. Recall that backward induction offers a complete description of Bellman optimal state-action policies (see proposition 3). In contrast, active inference plans by adding weighted expected free energies of each possible future course of action. In other words, unlike backward induction, it considers future courses of action beyond the subset that will subsequently minimize expected free energy, given subsequently encountered states.
4.3 Reward Maximization on MDPs with Finite Temporal Horizons
Table 2: Summary of the recursive (sophisticated) active inference scheme on MDPs.

| Process | Computation |
| --- | --- |
| Perceptual inference | |
| Planning as inference | |
| Decision making | |
| Action selection | |
Equation 4.7 is strikingly similar to the backward induction algorithm (proposition 3), and indeed we recover backward induction in the zero temperature limit $\beta \to +\infty$.
Note that maximizing the entropy of future states keeps the agent's options open (Klyubin et al., 2008) in the sense of committing the least to a specified sequence of states. A proof of theorem 2 is in appendix B.6.
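A minimal sketch of such a recursive scheme on a fully observed MDP follows, written to parallel the backward-induction code in section 2.3 and reusing the hypothetical FiniteHorizonMDP container and the assumed target distribution $C(s) \propto \exp(\beta R(s))$. The recursion used here, $G_t(s) = \min_a \{ D_{\mathrm{KL}}[P(\cdot \mid s, a) \,\|\, C] + \mathbb{E}[G_{t+1}] \}$, is our paraphrase of the backward recursion discussed above, not a quotation of equation 4.7.

```python
import numpy as np

def sophisticated_plan(mdp, beta):
    """Recursive expected-free-energy planning on an MDP (illustrative sketch).

    G[t, s] is the minimal expected free energy-to-go from state s at time t;
    policy[t, s] is the corresponding action. As beta grows, the selected
    actions approach those returned by backward induction.
    """
    T, S, A = mdp.horizon, mdp.n_states, mdp.n_actions
    C = np.exp(beta * mdp.reward)
    C /= C.sum()                                          # assumed target distribution
    G = np.zeros((T + 1, S))
    policy = np.zeros((T, S), dtype=int)
    for t in reversed(range(T)):
        for s in range(S):
            efe = np.zeros(A)
            for a in range(A):
                p = mdp.transition[a, s]                  # predicted next-state distribution
                risk = np.sum(p * (np.log(p + 1e-16) - np.log(C)))
                efe[a] = risk + p @ G[t + 1]              # risk now + expected EFE-to-go
            policy[t, s] = np.argmin(efe)
            G[t, s] = efe[policy[t, s]]
    return policy, G
```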
5 Generalization to POMDPs
Partially observable Markov decision processes (POMDPs) generalize MDPs in that the agent observes an outcome $o_\tau$, which carries incomplete information about the current state $s_\tau$, as opposed to observing the current state itself.
(Finite Horizon POMDP). A finite horizon POMDP is an MDP (see definition 1) with the following additional data:
$O$, a finite set of observations.
$P(o_\tau \mid s_\tau)$, the probability that the state $s_\tau \in S$ at time $\tau$ will lead to the observation $o_\tau \in O$ at time $\tau$. $O_0, \ldots, O_T$ are random variables over $O$ that correspond to the observation being sampled at time $\tau = 0, \ldots, T$.
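Because states must now be inferred from observations, perception involves (approximate) Bayesian filtering. A minimal sketch of the exact belief update over hidden states is given below; the schemes discussed in this article use variational approximations to this posterior, so this is only an idealized reference point.

```python
import numpy as np

def belief_update(prior, likelihood, transition, action, observation):
    """One step of exact Bayesian filtering in a POMDP (illustrative sketch).

    prior: current belief over states, shape (S,).
    likelihood: P(o | s), shape (O, S).
    transition: P(s' | s, a), shape (A, S, S).
    Returns the posterior belief over the next state after taking `action`
    and observing `observation`.
    """
    predicted = prior @ transition[action]            # P(s' | past) = sum_s prior(s) P(s'|s,a)
    posterior = likelihood[observation] * predicted   # Bayes rule, unnormalized
    return posterior / posterior.sum()
```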
5.1 Active Inference on Finite Horizon POMDPs
We briefly introduce active inference agents on finite horizon POMDPs with known transition probabilities (for more details, see Da Costa et al., 2020; Parr et al., 2022; Smith, Friston, et al., 2022). We assume that the agent's generative model of its environment is given by the POMDP (see definition 5).
5.1.1 Perception as Inference
5.1.2 Planning as Inference
5.2 Maximizing Reward on POMDPs
Crucially, our reward maximization results translate to the POMDP case. To make this explicit, we rehearse lemma 1 in the context of POMDPs.
In other words, they least commit to a prespecified sequence of future states and ensure that their expected observations are maximally informative of states. Of course, when inferences are inexact, the extent to which proposition 4 holds depends on the accuracy of the approximation, equation 5.3. A proof of proposition 4 is in appendix B.7.
The schemes of Tables 1 and 2 exist in the POMDP setting (e.g., Barp et al., 2022, section 5, and Friston et al., 2021, respectively). Thus, in POMDPs with known transition probabilities, provided that inferences are exact (see equation 5.2) and in the zero temperature limit (see equation 4.3), standard active inference (Barp et al., 2022, section 5) maximizes reward on temporal horizons of one but not beyond, and a recursive scheme such as sophisticated active inference (Friston et al., 2021) maximizes reward on finite temporal horizons. Note that for computational tractability, the sophisticated active inference scheme presented in Friston et al. (2021) does not generally perform exact inference; thus, the extent to which it will maximize reward in practice will depend on the accuracy of its inferences. Nevertheless, our results indicate that sophisticated active inference will vastly outperform standard active inference in most reward-maximization tasks.
6 Discussion
In this article, we have examined a specific notion of optimality, namely, Bellman optimality, defined as selecting actions to maximize future expected rewards. We demonstrated how and when active inference is Bellman optimal on finite horizon POMDPs with known transition probabilities and reward function.
These results highlight important relationships among active inference, stochastic control, and RL, as well as conditions under which they would and would not be expected to behave similarly (e.g., environments with multiple reward-maximizing trajectories, those affording ambiguous observations). We refer readers to appendix A for a broader discussion of the relationship between active inference and reinforcement learning.
6.1 Decision Making beyond Reward Maximization
More broadly, it is important to ask if reward maximization is the right objective underwriting intelligent decision making. This is an important question for decision neuroscience. That is, do humans optimize a reward signal, expected free energy, or other planning objectives? This can be addressed by comparing the evidence for these competing hypotheses based on empirical data (Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022). Current empirical evidence suggests that humans are not purely reward-maximizing agents; they also engage in both random and directed exploration (Daw et al., 2006; Gershman, 2018; Mirza et al., 2018; Schulz & Gershman, 2019; Wilson et al., 2021, 2014; Xu et al., 2021) and keep their options open (Schwartenbeck, FitzGerald, Mathys, Dolan, Kronbichler, et al., 2015). As we have illustrated, active inference implements a clear form of directed exploration through minimizing expected free energy. Although not covered in detail here, active inference can also accommodate random exploration by sampling actions from the posterior belief over action sequences, as opposed to selecting the most likely action as presented in Tables 1 and 2.
Note that behavioral evidence favoring models that do not solely maximize reward within reward-maximization tasks—that is, where “maximize reward” is the explicit instruction—is not a contradiction. Rather, gathering information about the environment (exploration) generally helps to reap more reward in the long run, as opposed to greedily maximizing reward based on imperfect knowledge (Cullen et al., 2018; Sajid, Ball, et al., 2021). This observation is not new, and many approaches to simulating adaptive agents employed today differ significantly from their reward-maximizing antecedents (see appendix A.3).
6.2 Learning
When the transition probabilities or reward function are unknown to the agent, the problem becomes one of reinforcement learning (RL; Shoham et al., 2003), as opposed to stochastic control. Although we did not explicitly consider it above, this scenario can be accommodated by active inference by simply equipping the generative model with a prior over the unknown quantities and updating the model using variational Bayesian inference to best fit observed data. Depending on the specific learning problem and generative model structure, this can involve updating the transition probabilities and/or the target distribution over states. In POMDPs, it can also involve updating the probabilities of observations under each state. We refer to appendix A.2 for a discussion of reward learning through active inference and connections to representative RL approaches, and to Da Costa et al. (2020) and Friston et al. (2016) for learning transition probabilities through active inference.
6.3 Scaling Active Inference
When comparing RL and active inference approaches generally, one outstanding issue for active inference is whether it can be scaled up to solve the more complex problems currently handled by RL in machine learning contexts (Çatal et al., 2020, 2021; Fountas et al., 2020; Mazzaglia et al., 2021; Millidge, 2020; Tschantz et al., 2019). This is an area of active research.
One important issue along these lines is that planning ahead by evaluating all or many possible sequences of actions is computationally prohibitive in many applications. Three complementary solutions have emerged: (1) employing hierarchical generative models that factorize decisions into multiple levels and reduce the size of the decision tree by orders of magnitude (Çatal et al., 2021; Friston et al., 2018; Parr et al., 2021); (2) efficiently searching the decision tree using algorithms like Monte Carlo tree search (Champion, Bowman, et al., 2021; Champion, Da Costa, et al., 2021; Fountas et al., 2020; Maisto et al., 2021; Silver et al., 2016); and (3) amortizing planning using artificial neural networks (Çatal et al., 2019; Fountas et al., 2020; Millidge, 2019; Sajid, Tigas, et al., 2021).
Another issue rests on learning the generative model. Active inference may readily learn the parameters of a generative model; however, more work needs to be done on devising algorithms for learning the structure of generative models themselves (Friston, Lin, et al., 2017; Smith, Schwartenbeck, Parr, et al., 2020). This is an important research problem in generative modeling, called Bayesian model selection or structure learning (Gershman & Niv, 2010; Tervo et al., 2016).
Note that these issues are not unique to active inference. Model-based RL algorithms deal with the same combinatorial explosion when evaluating decision trees, which is one primary motivation for developing efficient model-free RL algorithms. However, other heuristics have also been developed for efficiently searching and pruning decision trees in model-based RL (Huys et al., 2012; Lally et al., 2017). Furthermore, model-based RL suffers the same limitation regarding learning generative model structure. Yet RL may have much to offer active inference in terms of efficient implementation and the identification of methods to scale to more complex applications (Fountas et al., 2020; Mazzaglia et al., 2021).
7 Conclusion
In summary, we have shown that under the specification that the active inference agent prefers maximizing reward, equation 4.3:
On finite horizon POMDPs with known transition probabilities, the objective optimized for action selection in active inference (i.e., expected free energy) produces reward-maximizing action sequences when state estimation is exact. When there are multiple reward-maximizing candidates, this selects those sequences that maximize the entropy of future states—thereby keeping options open—and that minimize the ambiguity of future observations so that they are maximally informative. More generally, the extent to which action sequences will be reward maximizing will depend on the accuracy of state estimation.
The standard active inference scheme (e.g., Barp et al., 2022, section 5) produces Bellman optimal actions for planning horizons of one when state estimation is exact but not beyond.
A sophisticated active inference scheme (e.g., Friston et al., 2021) produces Bellman optimal actions on any finite planning horizon when state estimation is exact. Furthermore, this scheme generalizes the well-known backward induction algorithm from dynamic programming to partially observed environments. Note that for computational efficiency, the scheme presented in Friston et al. (2021) does not generally perform exact state estimation; thus, the extent to which it will maximize reward in practice will depend on the accuracy of its inferences. Nevertheless, it is clear from our results that sophisticated active inference will vastly outperform standard active inference in most reward-maximization tasks.
In conclusion, the sophisticated active inference scheme should be the method of choice when applying active inference to optimally solve the reward-maximization problems considered here.
Appendix A: Active Inference and Reinforcement Learning
This article considers how active inference can solve the stochastic control problem. In this appendix, we discuss the broader relationship between active inference and RL.
Loosely speaking, RL is the field of methodologies and algorithms that learn reward-maximizing actions from data and seek to maximize reward in the long run. Because RL is a data-driven field, algorithms are selected based on how well they perform on benchmark problems. This has produced a plethora of diverse algorithms, many designed to solve specific problems, each with its own strengths and limitations. This makes RL difficult to characterize as a whole. Thankfully, many approaches to model-based RL and control can be traced back to approximating the optimal solution to the Bellman equation (Bellman & Dreyfus, 2015; Bertsekas & Shreve, 1996) (although this may become computationally intractable in high dimensions; Barto & Sutton, 1992). Our results showed how and when decisions under active inference and such RL approaches are similar.
This appendix discusses how active inference and RL relate and differ more generally. Their relationship has become increasingly important to understand, as a growing body of research has begun to (1) compare the performance of active inference and RL models in simulated environments (Cullen et al., 2018; Millidge, 2020; Sajid, Ball, et al., 2021), (2) apply active inference to model human behavior on reward learning tasks (Smith, Kirlic, Stewart, Touthang, Kuplicki, Khalsa, et al., 2021; Smith, Kirlic, Stewart, Touthang, Kuplicki, McDermott, et al., 2021; Smith, Schwartenbeck, Stewart, et al., 2020; Smith, Taylor, et al., 2022), and (3) consider the complementary predictions and interpretations each offers in computational neuroscience, psychology, and psychiatry (Cullen et al., 2018; Huys et al., 2012; Schwartenbeck, FitzGerald, Mathys, Dolan, & Friston, 2015; Schwartenbeck et al., 2019; Tschantz, Seth, et al., 2020).
A.1 Main Differences between Active Inference and Reinforcement Learning
A.1.1 Philosophy
Active inference and RL differ profoundly in their philosophy. RL derives from the normative principle of maximizing reward (Barto & Sutton, 1992), while active inference describes systems that maintain their structural integrity over time (Barp et al., 2022; Friston et al., 2022). Despite this difference, these frameworks have many practical similarities. For example, recall that behavior in active inference is completely determined by the agent's preferences, specified as priors in its generative model. Crucially, log priors can be interpreted as reward functions and vice versa, which is how behavior under RL and active inference can be related.
A.1.2 Model Based and Model Free
Active inference agents always embody a generative (i.e., forward) model of their environment, while RL comprises both model-based and simpler model-free algorithms. In brief, “model-free” means that agents learn a reward-maximizing state-action mapping, based on updating cached state-action pair values through initially random actions that do not consider future state transitions. In contrast, model-based RL algorithms attempt to extend stochastic control approaches by learning the dynamics and reward function from data. Recall that stochastic control calls on strategies that evaluate different actions on a carefully handcrafted forward model of dynamics (i.e., known transition probabilities) to finally execute the reward-maximizing action. Under this terminology, all active inference agents are model-based.
A.1.3 Modeling Exploration
Exploratory behavior—which can improve reward maximization in the long run—is implemented differently in the two approaches. In most cases, RL implements a simple form of exploration by incorporating randomness in decision making (Tokic & Palm, 2011; Wilson et al., 2014), where the level of randomness may or may not change over time as a function of uncertainty. In other cases, RL incorporates ad hoc information bonuses in the reward function or other decision-making objectives to build in directed exploratory drives (e.g., upper-confidence-bound algorithms or Thompson sampling). In contrast, directed exploration emerges naturally within active inference through interactions between the risk and ambiguity terms in the expected free energy (Da Costa et al., 2020; Schwartenbeck et al., 2019). This addresses the explore-exploit dilemma and confers the agent with artificial curiosity (Friston, Lin, et al., 2017; Schmidhuber, 2010; Schwartenbeck et al., 2019; Still & Precup, 2012), as opposed to the need to add ad hoc information bonus terms (Tokic & Palm, 2011). We expand on this relationship in appendix A.3.
A.1.4 Control and Learning as Inference
Active inference integrates state estimation, learning, decision making, and motor control under the single objective of minimizing free energy (Da Costa et al., 2020). In fact, active inference extends previous work on the duality between inference and control (Kappen et al., 2012; Rawlik et al., 2013; Todorov, 2008; Toussaint, 2009) to solve motor control problems via approximate inference (i.e., planning as inference: Attias, 2003; Botvinick & Toussaint, 2012; Friston et al., 2012, 2009; Millidge, Tschantz, Seth, et al., 2020). Therefore, some of the closest RL methods to active inference are control as inference, also known as maximum entropy RL (Levine, 2018; Millidge, Tschantz, Seth, et al., 2020; Ziebart, 2010), though one major difference is in the choice of decision-making objective. Loosely speaking, these aforementioned methods minimize the risk term of the expected free energy, while active inference also minimizes ambiguity.
A.1.5 Useful Features of Active Inference
Active inference allows great flexibility and transparency when modeling behavior. It affords explainable decision making as a mixture of information- and reward-seeking policies that are explicitly encoded (and evaluated in terms of expected free energy) in the generative model as priors, which are specified by the user (Da Costa, Lanillos, et al., 2022). As we have seen, the kind of behavior that can be produced includes the optimal solution to the Bellman equation.
Active inference accommodates deep hierarchical generative models combining both discrete and continuous state-spaces (Friston, Parr, et al., 2017; Friston et al., 2018; Parr et al., 2021).
The expected free energy objective optimized during planning subsumes many approaches used to describe and simulate decision making in the physical, engineering, and life sciences, affording it various interesting properties as an objective (see Figure 3 and Friston et al., 2021). For example, exploratory and exploitative behavior are canonically integrated, which finesses the need for manually incorporating ad hoc exploration bonuses in the reward function (Da Costa, Tenka, et al., 2022).
Active inference goes beyond state-action policies that predominate in traditional RL to sequential policy optimization. In sequential policy optimization, one relaxes the assumption that the same action is optimal given a particular state and acknowledges that the sequential order of actions may matter. This is similar to the linearly solvable MDP formulation presented by Todorov (2006, 2009), where transition probabilities directly determine actions and an optimal policy specifies transitions that minimize some divergence cost. This way of approaching policies is perhaps most apparent in terms of exploration. Put simply, it is clearly better to explore and then exploit than the converse. Because expected free energy is a functional of beliefs, exploration becomes an integral part of decision making—in contrast with traditional RL approaches that try to optimize a reward function of states. In other words, active inference agents will explore until enough uncertainty is resolved for reward-maximizing, goal-seeking imperatives to start to predominate.
Such advantages should motivate future research to better characterize the environments in which these properties offer useful advantages—such as where performance benefits from learning and planning at multiple temporal scales and from the ability to select policies that resolve both state and parameter uncertainty.
A.2 Reward Learning
The update rule consisting of accumulating state-observation counts in the likelihood matrix (see equation A.2; i.e., not incorporating Dirichlet priors) bears some similarity to off-policy learning algorithms such as Q-learning. In Q-learning, the objective is to find the best action given the current observed state. For this, the Q-learning agent accumulates values for state-action pairs with repeated observation of rewarding or punishing action outcomes—much like state-observation counts. This allows it to learn the Q-value function that defines a reward-maximizing policy.
As always in partially observed environments, we cannot guarantee that the true likelihood mapping will be learned in practice. Smith et al. (2019) provides examples where, although not in an explicit reward-learning context, learning the likelihood can be more or less successful in different situations. Learning the true likelihood fails when the inference over states is inaccurate, such as when using too severe a mean-field approximation to the free energy (Blei et al., 2017; Parr et al., 2019; Tanaka, 1999), which causes the agent to misinfer states and thereby accumulate Dirichlet parameters in the wrong locations. Intuitively, this amounts to jumping to conclusions too quickly.
If so desired, reward learning in active inference can also be equivalently formulated as learning transition probabilities. In this alternative setup (as exemplified in Sales et al., 2019), the mappings between reward states and reward outcomes in the likelihood are set as identity matrices, and the agent instead learns the probability of transitioning to states that deterministically generate preferred (rewarding) observations under each action sequence. The transition probabilities under each action are learned in a similar fashion as above (see equation A.1), by accumulating counts on a Dirichlet prior over the transition probabilities. See Da Costa et al. (2020, appendix) for details.
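A minimal sketch of the count-accumulation update described above, in which an observed outcome adds soft counts to a Dirichlet parameter over the likelihood in proportion to the posterior belief over hidden states; the variable names and array conventions are illustrative.

```python
import numpy as np

def update_likelihood_counts(dirichlet_a, observation, state_posterior):
    """Accumulate state-observation counts on a Dirichlet prior (illustrative sketch).

    dirichlet_a: Dirichlet parameters over the likelihood, shape (O, S).
    observation: index of the observation received at this time step.
    state_posterior: posterior belief over hidden states, shape (S,).
    The observed outcome adds soft counts to its row, weighted by how likely
    each hidden state is believed to be; the expected likelihood is the
    column-normalized count matrix.
    """
    dirichlet_a = dirichlet_a.copy()
    dirichlet_a[observation] += state_posterior
    expected_likelihood = dirichlet_a / dirichlet_a.sum(axis=0, keepdims=True)
    return dirichlet_a, expected_likelihood
```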
Given the model-based Bayesian formulation of active inference, more direct links can be made between the active inference approach to reward learning described above and other Bayesian model-based RL approaches. For such links to be realized, the Bayesian RL agent would be required to have a prior over a prior (e.g., a prior over the reward function prior or transition function prior). One way to implicitly incorporate this is through Thompson sampling (Ghavamzadeh et al., 2016; Russo & Van Roy, 2014, 2016; Russo et al., 2017). While that is not the focus of this article, future work could further examine the links between reward learning in active inference and model-based Bayesian RL schemes.
A.3 Solving the Exploration-Exploitation Dilemma
An important distinction between active inference and reinforcement learning schemes is how they solve the exploration-exploitation dilemma.
The exploration-exploitation dilemma (Berger-Tal et al., 2014) arises whenever an agent has incomplete information about its environment, such as when the environment is partially observed or the generative model has to be learned. The dilemma is then about deciding whether to execute actions aiming to collect reward based on imperfect information about the environment or to execute actions aiming to gather more information—allowing the agent to reap more reward in the future. Intuitively, it is always best to explore and then exploit, but optimizing this trade-off can be difficult.
Active inference balances exploration and exploitation through minimizing the risk and ambiguity inherent in the minimization of expected free energy. This balance is context sensitive and can be adjusted by modifying the agent's preferences (Da Costa, Lanillos, et al., 2022). In turn, the expected free energy is obtained from a description of agency in biological systems derived from physics (Barp et al., 2022; Friston et al., 2022).
Modern RL algorithms integrate exploratory and exploitative behavior in many different ways. One option is curiosity-driven rewards to encourage exploration. Maximum entropy RL and control-as-inference make decisions by minimizing a KL divergence to the target distribution (Eysenbach & Levine, 2019; Haarnoja et al., 2017, 2018; Levine, 2018; Todorov, 2008; Ziebart et al., 2008), which combines reward maximization with maximum entropy over states. This is similar to active inference on MDPs (Millidge, Tschantz, Seth, et al., 2020). Similarly, the model-free Soft Actor-Critic (Haarnoja et al., 2018) algorithm maximizes both expected reward and entropy. This outperforms other state-of-the-art algorithms in continuous control environments and has been shown to be more sample efficient than its reward-maximizing counterparts (Haarnoja et al., 2018). Hyper (Zintgraf et al., 2021) proposes reward maximization alongside minimizing uncertainty over both external states and model parameters. Bayes-adaptive RL (Guez et al., 2013a, 2013b; Ross et al., 2008, 2011; Zintgraf et al., 2020) provides policies that balance exploration and exploitation with the aim of maximizing reward. Thompson sampling provides a way to balance exploiting current knowledge to maximize immediate performance and accumulating new information to improve future performance (Russo et al., 2017). This reduces to optimizing dual objectives, reward maximization and information gain, similar to active inference on POMDPs. Empirically, Sajid, Ball, et al. (2021) demonstrated that an active inference agent and a Bayesian model-based RL agent using Thompson sampling exhibit similar behavior when preferences are defined over outcomes. They also highlighted that when completely removing the reward signal from the environment, the two agents both select policies that maximize some sort of information gain.
In general, how these approaches to the exploration-exploitation dilemma differ, in theory and in practice, remains largely unexplored.
Appendix B: Proofs
B.1 Proof of Proposition 1
To show that this is an upper bound, take any state-action policy in the original chain, equation B.1. Then by the definition of an increasing subsequence, there exists an index beyond which the chain dominates it. Since limits commute with finite sums, the bound holds for any such policy. Thus, by Zorn's lemma, there exists a Bellman optimal state-action policy $\Pi$.
B.2 Proof of Proposition 2
B.3 Proof of Proposition 3
We first prove that state-action policies defined as in equation 2.2 are Bellman optimal by induction on .
We now show that any Bellman optimal state-action policy satisfies the backward induction algorithm equation 2.2.
B.4 Proof of Lemma 1
The inclusion follows from the fact that, as $\beta \to +\infty$, a minimizer of the expected free energy has to maximize expected reward. Among such action sequences, the expected free energy minimizers are those that maximize the entropy of future states.
B.5 Proof of Theorem 1
B.6 Proof of Theorem 2
We prove this result by induction on the temporal horizon of the MDP.
The proof of the theorem when $T = 1$ can be seen from the proof of theorem 1. Now suppose that $T$ is finite and that the theorem holds for MDPs with a temporal horizon of $T - 1$.
B.7 Proof of Proposition 4
The inclusion follows from the fact that, as $\beta \to +\infty$, a minimizer of the expected free energy has first and foremost to maximize expected reward. Among such action sequences, the expected free energy minimizers are those that maximize the entropy of (beliefs about) future states and resolve ambiguity about future outcomes by minimizing the expected entropy of the likelihood.
Notes
These are analogous to temporally extended actions or options introduced under the options framework in RL (Stolle & Precup, 2002).
The surprise (also known as self-information or surprisal) of states, that is, their negative log probability, is information-theoretic nomenclature (Stone, 2015) that scores the extent to which an observation is unusual under the corresponding probability distribution. It does not imply that the agent experiences surprise in a subjective or declarative sense.
Acknowledgments
We thank Dimitrije Markovic and Quentin Huys for providing helpful feedback during the preparation of the manuscript.
Funding Information
L.D. is supported by the Fonds National de la Recherche, Luxembourg (Project code: 13568875). N.S. is funded by the Medical Research Council (MR/S502522/1) and 2021-2022 Microsoft PhD Fellowship. K.F. is supported by funding for the Wellcome Centre for Human Neuroimaging (Ref: 205103/Z/16/Z), a Canada-U.K. Artificial Intelligence Initiative (Ref: ES/T01279X/1), and the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement 945539 (Human Brain Project SGA3). R.S. is supported by the William K. Warren Foundation, the Well-Being for Planet Earth Foundation, the National Institute for General Medical Sciences (P20GM121312), and the National Institute of Mental Health (R01MH123691). This publication is based on work partially supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1).
Author Contributions
L.D.: conceptualization, proofs, writing: first draft, review and editing. N.S., T.P., K.F., R.S.: conceptualization, writing: review and editing.