Abstract
In reinforcement learning (RL), artificial agents are trained to maximize numerical rewards by performing tasks. Exploration is essential in RL because agents must discover information before exploiting it. Two rewards encouraging efficient exploration are the entropy of the action policy and curiosity for information gain. Entropy is well established in the literature, promoting randomized action selection. Curiosity is defined in a broad variety of ways in the literature, promoting discovery of novel experiences. One example, prediction error curiosity, rewards agents for discovering observations they cannot accurately predict. However, such agents may be distracted by unpredictable observational noise, known as curiosity traps. Based on the free energy principle (FEP), this letter proposes hidden state curiosity, which rewards agents by the KL divergence between the predictive prior and posterior probabilities of latent variables. We trained six types of agents to navigate mazes: baseline agents without rewards for entropy or curiosity, and agents rewarded for entropy and/or either prediction error curiosity or hidden state curiosity. We find that entropy and curiosity result in efficient exploration, especially when both are employed together. Notably, agents with hidden state curiosity demonstrate resilience against curiosity traps, which hinder agents with prediction error curiosity. This suggests that implementing the FEP may enhance the robustness and generalization of RL models, potentially aligning the learning processes of artificial and biological agents.
1 Introduction
Reinforcement learning (RL) is a machine learning algorithm for training artificial agents to perform tasks by awarding or punishing their actions with numerical rewards (Barto et al., 1983; Watkins & Dayan, 1992; Mnih et al., 2015). This can be interpreted as akin to biological agents learning through evolution. Extrinsic rewards are awarded at human discretion based on the tasks at hand. Intrinsic rewards are generated by agents themselves (with human-provided hyperparameters) based on other goals such as exploration. Exploration is an important but difficult aspect of RL because an agent can exploit knowledge only after learning that knowledge. An exploration phase may be implemented just by selecting random actions for the agent, but this can be inefficient, especially with high-dimensional continuous state-action spaces and sparse extrinsic rewards. Hence, we study two intrinsic rewards for efficient exploration: entropy in the action-space for control as inference (Millidge et al., 2020) and curiosity about the environment for active inference (Tschantz et al., 2020, 2023). Meanwhile, Friston’s free energy principle (FEP) describes biological neuroscience with Bayesian statistics applicable to machine learning and AI (Kaplan & Friston, 2018; Parr & Friston, 2019). Our goal is to share a novel definition of curiosity derived from the FEP and demonstrate its robust usefulness in exploration.
Curiosity in RL has been presented in many ways, typically based on an agent’s ability to predict future observations. This leverages an adversarial relationship between the agent’s predictive accuracy using a forward or generative model, also known as a transitioner or world model, and its pursuit of observations that challenge this model’s accuracy. For example, Schmidhuber (2010) observed that curiosity gauged by “mean squared prediction error or similar measures,” which Oudeyer and Kaplan (2007) called predictive novelty motivation and we call prediction error curiosity, “may fail whenever high prediction errors do not imply expected prediction progress, e.g., in noisy environments.” Alternative forms of curiosity from these sources included estimating likelihoods of events, measuring prediction errors probabilistically, or assessing improvement in predictions, but these methods can incur high computational costs. Moreover, Oudeyer and Kaplan (2007) note that “in certain application contexts . . . intrinsic openness is a weakness” and can be counterproductive.
Pathak et al. (2017) elaborated on Schmidhuber’s critique of prediction error curiosity in noisy environments, which may be an example of what Oudeyer and Kaplan called a weakness of intrinsic openness. Pathak et al. asked readers to consider an agent that could observe tree leaves randomly dancing in the wind. The agent’s forward model would never be able to perfectly predict such observations, so the agent might become fixated on these leaves like a moth attracted to a lamp. Thus, such sources of observational noise, or expected uncertainties, are called curiosity traps. To remedy this, Pathak et al. trained an inverse dynamics model to predict the agent’s action between two consecutive observations, thereby encoding observations into a latent space without noisy details, relevant only to the agent’s actions. Then, the agent’s forward model could be trained to predict these refined latent states instead of chaotic observations. Prediction error curiosity based on that forward model could ignore the environment’s irrelevant noise and thus be minimally affected by curiosity traps, instead focusing on unexpected uncertainties. However, this could not function well with rarely experienced interactions. Pathak et al. suggested storing events in a memory buffer for experience replay, which we implement.
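As a rough illustration of this family of intrinsic rewards, the following Python sketch computes prediction error curiosity on encoded features; the `encoder` and `forward_model` callables and the scaling factor are placeholders for illustration, not the exact formulation of Pathak et al. (2017).

```python
import torch
import torch.nn.functional as F

def prediction_error_curiosity(encoder, forward_model, obs, action, next_obs, scale=1.0):
    """Intrinsic reward: how poorly the forward model predicts the next encoded observation.

    Encoding observations first (e.g., with an inverse-dynamics-trained encoder) is meant to
    discard action-irrelevant noise before the prediction error is measured.
    """
    with torch.no_grad():
        phi_next = encoder(next_obs)                    # latent features of the next observation
        phi_pred = forward_model(encoder(obs), action)  # predicted latent features
    # Mean squared error per transition, scaled into an intrinsic reward.
    return scale * F.mse_loss(phi_pred, phi_next, reduction="none").mean(dim=-1)
```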
Schwartenbeck et al. (2019) derived two intrinsic rewards for exploration directly from the FEP. An agent with these intrinsic rewards uses Bayesian inference to minimize free energy, meaning not only maximizing extrinsic rewards but also developing a thorough understanding of the environment. The first intrinsic reward, parameter exploration, encourages active learning: the agent seeks to resolve uncertainty about how its actions are rewarded. The second intrinsic reward, hidden state exploration, encourages active inference: the agent seeks to take actions that reveal uncertain observations. Together these intrinsic rewards establish curiosity for both the environment and the task at hand. However, Schwartenbeck et al. assumed agents in T-mazes knew the mazes had two arms, the left providing a constant extrinsic reward, the right providing an uncertain extrinsic reward. Agents in RL typically must infer such knowledge from observations, so these intrinsic rewards are not easily applied to RL directly.
Kawahara et al. (2022) derived both entropy and an RL-applicable definition of curiosity from the FEP. They constructed a forward model as a Bayesian neural network (BNN) such that the weight parameters are not fixed but are drawn from a multivariate gaussian distribution using the reparameterization trick for stochastic optimization (Blundell et al., 2015; Kingma et al., 2015). Kawahara et al. gauged curiosity values based on Kullback–Leibler divergence comparing the weights’ distribution before and after learning to predict each new observation with minimal free energy. This probabilistic approach must consider uncertainty in observational noise, so like the curiosity of Pathak et al. (2017), this curiosity should be able to effectively explore without negative influence from curiosity traps. However, a BNN’s computational cost is high, and Kawahara et al. only worked in terms of a Markov decision process (MDP) where states are completely observed. Additionally, this method produces just a single curiosity value per training update, which, if applied to an entire batch, would only produce one curiosity value for the batch as a whole, overlooking the contributions of each observation individually. Identifying which parts of the batch are important for exploration and which are not is a computationally demanding task because each observation must be evaluated one at a time.
In this letter, we overcome the problem of prediction error curiosity using hidden state curiosity defined in section 3. Like the curiosity of Kawahara et al., hidden state curiosity is derived from the FEP and gauged by the Kullback-Leibler divergence between predictive prior and posterior over future states (under a particular policy). Like the curiosity of Pathak et al. (2017), those states are efficiently encoded as latent variables. We train six types of agents: a baseline with no intrinsic rewards, entropy-driven, prediction error curious, hidden state curious, and two hybrids combining entropy with each form of curiosity. The abilities of these agents to find goals in a biased T-maze or an expanding T-maze (first a T-maze, then a double T-maze, and then a triple T-maze) will test the following two hypotheses:
Entropy and curiosity improve agent exploration, especially when both are implemented together as implied by the FEP.
Prediction error curiosity can be negatively influenced by observational noise also known as curiosity traps, while hidden state curiosity can be more resilient to such curiosity traps.
The results in section 4 are presented to evaluate the validity of these hypotheses, contributing to our understanding of artificial intelligence exploration in complex environments.
2 Prior Studies
2.1 Reinforcement Learning
In reinforcement learning (RL), an agent experiences an episode as a sequence of transitions of the form $(o_t, a_t, r_t, o_{t+1}, d_t)$. Variables $o_t$ and $o_{t+1}$ are the agent’s observations at times $t$ and $t+1$, equal to (in a Markov decision process, MDP) or derived from (in a partially observable Markov decision process, POMDP) the complete environmental states $s_t$ and $s_{t+1}$. Variable $a_t$ is the agent’s action performed at time $t$. Variable $r_t$ is the extrinsic reward the agent obtained by performing that action. And variable $d_t$ is one if time $t$ was the final step in the episode or zero otherwise.
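For concreteness, a transition with this structure might be stored as follows (a minimal sketch; the field names are ours, not the letter’s code):

```python
from typing import NamedTuple
import numpy as np

class Transition(NamedTuple):
    """One step of experience, following the (o_t, a_t, r_t, o_{t+1}, d_t) convention above."""
    obs: np.ndarray       # observation o_t, equal to or derived from state s_t
    action: np.ndarray    # action a_t chosen by the actor
    reward: float         # extrinsic reward r_t for that action
    next_obs: np.ndarray  # observation o_{t+1}
    done: float           # 1 if t was the final step of the episode, else 0
```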
In the actor-critic method (Barto et al., 1983), an agent has at least two neural networks using parameters $\phi$ and $\theta$ to instantiate implicit (i.e., amortized) mappings: an actor network $\pi_\phi$, also known as a policy, which chooses actions based on observations, and a critic network $Q_\theta$, which predicts future rewards based on observations and actions to estimate the state-action value function. The critic’s target value is $r_t$ plus discounted future rewards obtained through bootstrapping (see equation 2.3, similar to Bellman’s equation) predicted by a target critic $Q_{\bar{\theta}}$. The target critic begins with parameters equal to the critic’s, then slowly learns alongside the critic with Polyak averaging, $\bar{\theta} \leftarrow \tau \theta + (1 - \tau)\bar{\theta}$, with hyperparameter $\tau$ being the soft update coefficient.
An ensemble of multiple critics (and an equal number of target critics) can be implemented, in which case the actor’s loss function (see equation 2.1) uses the lowest value predicted among all critics. Furthermore, the actor’s training can be staggered with a fixed delay relative to the critics’ training, ensuring less frequent actor training for stability. Regardless of these implementations, the agent’s actor and critics are trained off-policy with experience replay by randomly sampling a batch of transitions from a memory buffer.
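The following PyTorch sketch gathers these mechanics: Polyak-averaged target critics, bootstrapped critic targets using the minimum over the target ensemble, and delayed actor updates. The module and batch attribute names, as well as the default hyperparameter values, are illustrative assumptions rather than the letter’s implementation.

```python
import torch

def soft_update(critic, target_critic, tau):
    """Polyak averaging: target parameters drift slowly toward the critic's parameters."""
    for p, p_targ in zip(critic.parameters(), target_critic.parameters()):
        p_targ.data.mul_(1.0 - tau).add_(tau * p.data)

def update(step, batch, actor, critics, target_critics, opts, gamma=0.99, tau=0.005, delay=2):
    # Bootstrapped target: reward plus the discounted minimum over the target ensemble.
    with torch.no_grad():
        next_action, _ = actor(batch.next_obs)
        next_q = torch.min(torch.stack(
            [tc(batch.next_obs, next_action) for tc in target_critics]), dim=0).values
        targets = batch.reward + gamma * (1.0 - batch.done) * next_q
    for critic, opt in zip(critics, opts["critics"]):
        critic_loss = ((critic(batch.obs, batch.action) - targets) ** 2).mean()
        opt.zero_grad(); critic_loss.backward(); opt.step()
    if step % delay == 0:  # staggered, less frequent actor training for stability
        action, _ = actor(batch.obs)
        q_min = torch.min(torch.stack(
            [c(batch.obs, action) for c in critics]), dim=0).values
        actor_loss = (-q_min).mean()
        opts["actor"].zero_grad(); actor_loss.backward(); opts["actor"].step()
    for critic, target_critic in zip(critics, target_critics):
        soft_update(critic, target_critic, tau)
```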
2.2 Recurrent RL
Recurrent layers such as a gated recurrent unit (GRU) give the actor and critics an internal state that summarizes past observations and actions (see Figure 1a). To train recurrent models with experience replay, a recurrent memory buffer must contain whole episodes, beginning to end, for temporal context. Episodes may vary in length, so batches sampled from recurrent memory buffers have all episodes standardized to the length of the longest episode using zero-filled transitions. Therefore, transitions of the batch include another variable, a mask, which is one if the transition actually occurred within the episode or zero if the transition was added only to reach the standard length.
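A minimal sketch of such episode padding, assuming transitions stored as in section 2.1 (the array shapes and helper name are ours):

```python
import numpy as np

def pad_episodes(episodes, obs_dim, act_dim):
    """Standardize a sampled batch of whole episodes to the longest episode's length.

    Shorter episodes are extended with zero-filled transitions; a mask records which
    steps were actually experienced (1) versus added only for standard length (0).
    """
    batch, max_len = len(episodes), max(len(ep) for ep in episodes)
    obs  = np.zeros((batch, max_len, obs_dim), dtype=np.float32)
    act  = np.zeros((batch, max_len, act_dim), dtype=np.float32)
    rew  = np.zeros((batch, max_len), dtype=np.float32)
    done = np.zeros((batch, max_len), dtype=np.float32)
    mask = np.zeros((batch, max_len), dtype=np.float32)
    for i, episode in enumerate(episodes):
        for t, tr in enumerate(episode):          # tr: one stored transition
            obs[i, t], act[i, t] = tr.obs, tr.action
            rew[i, t], done[i, t] = tr.reward, tr.done
            mask[i, t] = 1.0                      # real step, not padding
    return obs, act, rew, done, mask
```

The mask can then multiply per-step losses so that padded transitions contribute nothing to the gradients.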
(a) Implementing recurrent layers in an actor model and a critic model. Notice the previous action is implicitly included in the recurrent hidden state. (b) Implementing the forward model’s hidden state in an actor model and a critic model. Black arrows indicate forward computations. Red arrows indicate loss functions for backpropagation.
We use this method and another method, described in section 3 and depicted in Figure 1b. Either method enables the agent to maintain a continuity of experience, giving it a temporal edge in learning and decision making. This may be essential for navigating environments where states are not directly observable.
2.3 Curiosity Derived from the FEP
Among various definitions of curiosity such as those from Oudeyer and Kaplan (2007), Schmidhuber (2010), Pathak et al. (2017), and Schwartenbeck et al. (2019), the most relevant to this letter is that of Kawahara et al. (2022) because it is directly derived from the FEP for RL. Here we first describe the FEP as used by Kawahara et al. and then describe their definition of curiosity in that framework.
Thus, Kawahara et al. (2022) used the FEP’s framework to define curiosity such that an observation has a low curiosity value if the forward model does not need to change much to accommodate it or a high curiosity value if the forward model must change drastically. This encourages an adversarial relationship between the agent’s forward model and actor: the forward model trains to improve its weights representing a probabilistic interpretation of the environment, but the critic rewards the actor for finding information that substantially alters the forward model’s weights. This active inference complements entropy’s control as inference. Importantly, the agent’s probabilistic interpretation of the environment should account for observational noise, so observing anticipated noise should not alter it much; hence, the free energy–based curiosity defined by Kawahara et al. should be able to explore effectively without negative influence from curiosity traps. However, this definition of curiosity is constrained regarding batch processing: individual transitions within a batch may differ in exploratory importance, but comparing the forward model’s weights before and after training with the entire batch as a whole returns only one curiosity value. Investigating the significance of each transition individually requires great computational cost. That limitation, and the restriction to fully observable MDPs, suggest there are opportunities for further development.
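In symbols (our notation and ordering convention, not necessarily that of Kawahara et al.), this curiosity is proportional to how far one training update moves the BNN’s weight distribution:

```latex
r^{\text{curiosity}} \;\propto\; D_{\mathrm{KL}}\!\left[\, q_{\text{after}}(\mathbf{w}) \,\middle\|\, q_{\text{before}}(\mathbf{w}) \,\right]
```

where $q_{\text{before}}$ and $q_{\text{after}}$ are the gaussian distributions over the forward model’s weights before and after a free energy–minimizing update; because the weights are updated once per batch, only one such value is available for the whole batch.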
3 Proposed Model
Motivated by Kawahara et al. (2022) and Pathak et al. (2017), we define hidden state curiosity using a forward model with the architecture of a variational RNN (VRNN) (Chung et al., 2016). Our forward model, pictured in Figure 2 and described by algorithm 1, is recurrent using a hidden state $h_t$, enabling it to account for temporal dependencies and uncertainties in a partially observable Markov decision process (POMDP). This offers a second method for providing the actor and critic temporal knowledge: replacing observations in the models’ inputs with $h_t$ (see Figure 1b). We apply this method to the actor, with delayed training, while two critics use their own recurrent layers, as shown in Figure 1a (see section 2.1 regarding delayed actor training and multiple critics). Unlike the Bayesian neural network (BNN) used by Kawahara et al., our forward model does not have probabilistic weights; instead, we use the reparameterization trick, in the style of a SAC actor or variational autoencoder (VAE), to sample prior and posterior inner states $z^p_t$ and $z^q_t$ from corresponding probability distributions $p(z_t \mid h_{t-1}, a_{t-1})$ and $q(z_t \mid h_{t-1}, a_{t-1}, o_t)$. These distributions are derived from the previous hidden state $h_{t-1}$, the previous action $a_{t-1}$, and, in the case of the posterior inner state, the current observation $o_t$. The posterior inner state $z^q_t$ and the previous hidden state $h_{t-1}$ are used to generate the hidden state $h_t$.
Forward model’s architecture based on VRNN. Black arrows indicate forward computations. Red arrows indicate errors for backpropagation.
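A compact PyTorch sketch of the structure just described is given below; the layer sizes, helper names, and single-layer parameterizations are illustrative simplifications, not the architecture actually used (see appendix B for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardModel(nn.Module):
    """VRNN-style forward model: prior p(z_t | h_{t-1}, a_{t-1}),
    posterior q(z_t | h_{t-1}, a_{t-1}, o_t), and a GRU update of h_t."""

    def __init__(self, obs_dim, act_dim, z_dim=32, h_dim=32):
        super().__init__()
        self.prior_net = nn.Linear(h_dim + act_dim, 2 * z_dim)            # prior mean and spread
        self.post_net  = nn.Linear(h_dim + act_dim + obs_dim, 2 * z_dim)  # posterior mean and spread
        self.rnn       = nn.GRUCell(z_dim, h_dim)                         # h_t from z_t and h_{t-1}
        self.decoder   = nn.Linear(h_dim, obs_dim)                        # predicted observation

    @staticmethod
    def _sample(stats):
        mean, spread = stats.chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, F.softplus(spread) + 1e-6)
        return dist.rsample(), dist                                       # reparameterization trick

    def step(self, h_prev, a_prev, obs):
        z_prior, prior = self._sample(self.prior_net(torch.cat([h_prev, a_prev], dim=-1)))
        z_post,  post  = self._sample(self.post_net(torch.cat([h_prev, a_prev, obs], dim=-1)))
        h = self.rnn(z_post, h_prev)     # posterior inner state generates the new hidden state
        return h, prior, post, self.decoder(h)
```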
In practice, the complexity term (the KL divergence between the posterior and prior inner state distributions) is multiplied by a nonnegative hyperparameter $\beta$ describing its relative importance to accuracy.
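In our notation (with $\beta$ the complexity weight above, $\eta$ the curiosity weight of Table 1, $\hat{o}_t$ a predicted observation, and the divergence taken in the same direction as the complexity term), the forward model’s per-step objective and the two intrinsic rewards compared in this letter can be sketched as:

```latex
\mathcal{L}_t \;=\;
  \underbrace{-\,\mathbb{E}_{q}\!\left[\log p\!\left(o_t \mid h_t\right)\right]}_{\text{accuracy}}
  \;+\; \beta\,
  \underbrace{D_{\mathrm{KL}}\!\left[\, q\!\left(z_t \mid h_{t-1}, a_{t-1}, o_t\right)
  \,\middle\|\, p\!\left(z_t \mid h_{t-1}, a_{t-1}\right) \right]}_{\text{complexity}}

r^{\text{prediction error}}_t \;\propto\; \left\lVert \hat{o}_t - o_t \right\rVert^{2},
\qquad
r^{\text{hidden state}}_t \;=\; \eta\, D_{\mathrm{KL}}\!\left[\, q\!\left(z_t \mid h_{t-1}, a_{t-1}, o_t\right)
  \,\middle\|\, p\!\left(z_t \mid h_{t-1}, a_{t-1}\right) \right]
```

Because the divergence is computed per transition, hidden state curiosity assigns each observation in a batch its own intrinsic reward, avoiding the single-value-per-batch limitation discussed in section 2.3.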
4 Simulation Experiments
As established in section 1, we designed our experiments to investigate these two hypotheses:
Entropy and curiosity improve agent exploration, especially when both are implemented together as implied by the FEP.
Prediction error curiosity can be negatively influenced by observational noise, also known as curiosity traps, while hidden state curiosity can be more resilient to such curiosity traps.
To this end, our experiments feature six types of agents training to find goals in various mazes. These will be baseline agents devoid of intrinsic rewards, agents motivated by either entropy or one form of curiosity (prediction error or hidden state), and agents motivated by a combination of entropy and one type of curiosity. See Table 1 for details about these six types. All agents will share the same architecture and the same values for the hyperparameters introduced in section 2.1 and equation 3.1, as well as the same learning rate with Adam optimizers. No agents have periods of forced investigation with random actions, highlighting the importance of motivating exploration. See Tables 2 through 4 in appendix B for details about the models’ architectures in PyTorch; each table lists the parameters of a model layer by layer.
Hyperparameters for Six Types of Agents.
Name (and Acronym) | Entropy Weight α | Curiosity Weight η | Curiosity
---|---|---|---
No Entropy, No Curiosity (N) | 0 | 0 | None
Entropy (E) | None (adjusted dynamically; see section 2.1) | 0 | None
Prediction Error Curiosity (P) | 0 | 1 | Prediction error
Entropy and Prediction Error Curiosity (EP) | None (adjusted dynamically) | 1 | Prediction error
Hidden State Curiosity (H) | 0 | 1 | Hidden state
Entropy and Hidden State Curiosity (EH) | None (adjusted dynamically) | 1 | Hidden state
Regarding the first hypothesis, we predict the baseline agent will perform the least efficient exploration, while agents rewarded for both entropy and curiosity will outperform the rest, regardless of which kind of curiosity. However, regarding the second hypothesis, we predict that if we train agents in mazes with curiosity traps, agents with prediction error curiosity will be attracted to those traps, showcasing its susceptibility, while agents with hidden state curiosity are able to ignore them.
4.1 Experiment Design
In these experiments, we employ the PyBullet physics engine to simulate an RL agent embodied as a duck. The agent’s observations have two parts: the agent’s current speed and an 8 × 8 × 4 image of what is in front of it, with the four channels being red, green, blue, and distance (see Figure 3). The agent’s actions also have two parts: adjusting its yaw by up to 90 degrees left or right and choosing a speed between 0 meters per time step and a speed limit (the blocks constructing the mazes have a side length of one meter).
An agent’s observation includes its current speed in meters per time step and an 8 × 8 × 4 image of what is in front of it. The image’s four channels are red, green, blue (left), and distance (right). This is the agent’s first observation in the biased T-maze; see Figure 4a.
Each simulated episode is terminated when the agent exits the maze; the agent’s choice of exit will earn an extrinsic reward or punishment. If no exit is chosen within 30 steps, the episode ends with a punishment. Colliding with a maze wall at any step also punishes the agent. Any positive extrinsic reward is multiplied by a factor that decreases with the number of steps taken, encouraging haste.
In each epoch, the agent will carry out one episode. Memory of that episode’s transitions will be saved in that agent’s recurrent replay buffer; if that replay buffer contains memory of more than 250 episodes, the oldest episode will be deleted. Then a batch of 32 episodes will be sampled from the replay buffer to train the agent’s forward model, actor, and critics. In this manner, we will use different random seeds to train 360 agents of each of the six types described above. This will be carried out both with and without implementing curiosity traps by randomly changing colors of walls near inferior exits with every step, investigating which types of agents are disadvantaged by observational noise.
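Putting these pieces together, the per-epoch procedure can be sketched as follows; `run_episode`, `sample_and_pad`, and the agent’s update methods are placeholders for illustration, not the released code.

```python
def train_agent(agent, env, epochs, buffer_capacity=250, batch_size=32):
    """One episode per epoch, followed by one off-policy update from the recurrent buffer."""
    buffer = []                                      # stores whole episodes, oldest dropped first
    for epoch in range(epochs):
        episode = run_episode(agent, env)            # collect one episode of transitions
        buffer.append(episode)
        if len(buffer) > buffer_capacity:
            buffer.pop(0)                            # keep memory of at most 250 episodes
        batch = sample_and_pad(buffer, batch_size)   # 32 padded episodes, as in section 2.2
        agent.train_forward_model(batch)             # accuracy plus beta-weighted complexity
        agent.train_actor_and_critics(batch)         # extrinsic plus intrinsic rewards
```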
4.1.1 Biased T-Maze
In the biased T-maze simulation seen in Figure 4a, agents have a speed limit of 1 meter per time step. The biased T-maze has two exits. The exit out of the T’s left arm is easily accessible (nearby and unobstructed) and provides a consistent extrinsic reward. The exit out of the T’s right arm is difficult to access (farther away and behind an obstacle) and provides an inconsistent extrinsic reward that takes one of two values with equal probability and has a higher expected value than the left exit’s reward. Intuitively for human readers, the option with the highest expected extrinsic value is the exit to the right despite its distance, so we will call this the correct exit. An agent, however, must explore to the right despite the readily available reward on the left to discover the higher value of the correct exit. We trained agents for 500 epochs using the six sets of hyperparameters described in Table 1, with or without curiosity traps near the incorrect exit.
Agent starts where shown. Correct and incorrect exits are marked in each panel. With curiosity traps, blocks marked with ? change colors each step. Experiment 1 uses (a) the biased T-maze. Experiment 2 uses (b) the T-maze, (c) the double T-maze, and (d) the triple T-maze.
We predict agents will discover and exploit the higher value of the correct exit more often when encouraged with entropy or curiosity, especially both at once. We also expect curiosity traps to have a negative impact on the performance of agents with prediction error curiosity, while agents with hidden state curiosity are able to ignore them.
4.1.2 Expanding T-Maze
In the expanding T-maze simulation, agents have a speed limit of 2 meters per time step, first in the T-maze seen in Figure 4b, then in the double T-maze seen in Figure 4c, and then in the triple T-maze seen in Figure 4d. In each of the three mazes, only one exit is deemed correct, with its location alternating between successive mazes to challenge the agents’ ability to override previously learned habitual behaviors. If the agent takes the correct exit, it is rewarded, but if it takes any other exit, it is punished. We trained agents for 500 epochs in the T-maze, then for 2000 epochs in the double T-maze, and then for 4000 epochs in the triple T-maze, using the six sets of hyperparameters described in Table 1, with or without curiosity traps near the incorrect exits in the mazes’ bottom-left portions.
We predict all agents, even those without intrinsic rewards for exploration, will easily discover and exploit the correct exit on the right side of the T-maze. Then, when relocated to the double T-maze, we predict all agents will first move to the right, away from the correct exit now on the left side. We predict agents without intrinsic rewards for exploration will have difficulty extinguishing that learned behavior, while agents intrinsically rewarded with entropy or curiosity, especially both, will begin exploring again to find the new correct exit. Finally, relocation to the triple T-maze will present this challenge again on a larger scale. We also predict curiosity traps will have a negative impact on the performance of agents with prediction error curiosity, while agents with hidden state curiosity are able to ignore them.
4.2 Results
Figures 5 and 6 show the trajectories of agents trained in the biased T-maze and the expanding T-maze, depicting their behaviors. Figure 7 shows an example of an agent’s forward model predicting observations in the biased T-maze. Figures 10 and 11, in appendix A, show the proportions of agents choosing each exit in each epoch. Videos of how agent trajectories changed over time are available at github.com/oist-cnru/curious_maze.
Trajectories of agents after training in the biased T-maze. Correct and incorrect exits are marked as in Figure 4. If curiosity traps are applied, blocks marked with ? change to random colors with every step.
Trajectories of agents after training in the expanding T-maze. Correct and incorrect exits are marked as in Figure 4. If curiosity traps are applied, blocks marked with ? change to random colors with every step.
Predictions of an agent choosing the correct exit in the biased T-maze, trained with EH (see Table 1) without curiosity traps. Left: actual observations. Middle: predictions based on the hidden state generated with the prior inner state. Right: predictions based on the hidden state generated with the posterior inner state.
Biased T-maze results. (a) Bars show the rate at which agents chose the correct exit in their last 10 episodes after training using the hyperparameters labeled with acronyms from Table 1. Error bars show the 99% confidence interval. (b) In each pair of bars, if the left bar is taller than the right bar with 99% confidence, the left bar is colored green and the right bar red, showing a negative impact of curiosity traps.
Statistical results of these experiments are displayed in Figures 8 and 9, which show how often agents trained with the hyperparameters described in Table 1 reached the correct exit in their final 10 episodes.
4.2.1 Biased T-Maze
In Figure 8a, the leftmost bar shows that agents trained using the hyperparameters labeled “No Entropy, No Curiosity” (acronym N) were the least successful agents in the biased T-maze. The three bars between the dotted lines show that agents trained using “Entropy” (E), “Prediction Error Curiosity” (P), or “Hidden State Curiosity” (H), each with one intrinsic reward, all performed as well as or better than N. The rightmost bars show that agents trained with two intrinsic rewards, using “Entropy and Prediction Error Curiosity” (EP) or “Entropy and Hidden State Curiosity” (EH), performed best of all, demonstrating the importance of combining these intrinsic rewards.
In Figure 8b, to the left of the dotted line, note that agents trained using P or EP performed significantly worse when trained with curiosity traps. In contrast, to the right of the dotted line, agents trained using H or EH show no negative impact from curiosity traps. This demonstrates that hidden state curiosity can mitigate pitfalls that entrap prediction error curiosity.
These results can be visually confirmed in Figures 5 and 10. In Figure 5, some agents trained using EP with curiosity traps only travel in circles, fixating on the randomly changing walls and revealing the distraction caused by observational noise. Figure 10, in appendix A, shows how agents trained using N commit to the first exit they encounter: the rates of exit choice increase epoch to epoch but never decrease. In contrast, other agents select the incorrect exit at an increasing rate until a peak at approximately the 100th epoch, at which point selection of the correct exit increases instead, as if the agents became bored with the easy exit and explored instead.
4.2.2 Expanding T-Maze
Figure 9a displays the performance of agents at the end of training in the T-maze, then the double T-maze, and finally the triple T-maze. Consistently across all three mazes, and just as in the biased T-maze, the leftmost bar shows that agents trained using the hyperparameters labeled “No Entropy, No Curiosity” (acronym N) were the least successful agents. The three bars between the dotted lines show that agents trained using “Entropy” (E), “Prediction Error Curiosity” (P), or “Hidden State Curiosity” (H), each with one intrinsic reward, all performed as well as or better than N. The rightmost bars show that agents trained with two intrinsic rewards, using “Entropy and Prediction Error Curiosity” (EP) or “Entropy and Hidden State Curiosity” (EH), performed best of all, demonstrating the importance of combining these intrinsic rewards.
In Figure 9b, to the left of the dotted line, note that agents trained using P performed worse when trained with curiosity traps in all three of these mazes, and agents trained using EP were strongly affected by curiosity traps in the T-maze and the triple T-maze. In contrast, to the right of the dotted line, agents trained using H or EH show no negative impact from curiosity traps. This demonstrates that hidden state curiosity can mitigate pitfalls that entrap prediction error curiosity.
These results can be visually confirmed in Figures 6 and 11. In Figure 6, the impact of curiosity traps is clear: some agents trained using P or EP with curiosity traps are attracted to the randomly colored walls in all three of the T-mazes. In Figure 11, we see that many agents trained using N learned to reach the correct exit on the right of the T-maze, but many of them continued to select exits on the right side when relocated into the double T-maze even though the correct exit was now on the left. Likewise, many agents trained using E learned to reach the correct exit in the double T-maze with a left turn and then a right turn but continued to select exits on the left side when relocated into the triple T-maze even though the correct exit was now on the right. In contrast, agents trained using H or EH swiftly stopped choosing incorrect exits when relocated, whether with or without curiosity traps.
5 Discussion
As described in section 4, our experiments corroborated the hypotheses in the introduction: action entropy and curiosity improve agent exploration, especially when both are implemented together as implied by the FEP, and prediction error curiosity can be negatively influenced by observational noise, also known as curiosity traps, while hidden state curiosity is more resilient to such traps. These results indicate that applying the FEP can significantly benefit RL, encouraging agents to investigate and comprehend causal structures that would otherwise be difficult or impossible to understand. This could benefit robots in dynamic environments, interactive systems automatically personalizing content delivery for their users, or researchers seeking recommendations of directions to survey. However, we have not yet attempted transferring behaviors learned in simulation to physical agents, and we identified optimized hyperparameters only through extensive brute-force testing.
The hidden state curiosity (and action entropy) foregrounded in this work inherit from decompositions of expected free energy. Expected free energy in this setting simply refers to a free energy functional of distributions over (random) variables expected under a particular policy or path into the future. The expected free energy could be regarded as a universal objective function from the perspective of the physics of self-organizing agents that have well-defined or characteristic attracting sets. (For a recent derivation of expected free energy from the perspective of statistical physics, see Friston et al., 2023.) One interesting interpretation of expected free energy is in terms of a dual-aspect Bayes optimality. This follows from our decomposition of expected free energy into expected information gain and expected extrinsic reward. These are exactly the objective functions that underwrite the principles of optimal Bayesian experimental design (Lindley, 1956; MacKay, 1992) and decision theory (Berger, 2011), respectively. On this view, intrinsic and extrinsic reward are two sides of the same coin, where information and value have exactly the same currency (i.e., natural units). The implication here is that one can think about rewards in terms of information and, conversely, think about the value of information (Howard, 1966) as intrinsically rewarding.
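One common way to write this decomposition (our notation, with $Q$ the predictive distribution under policy $\pi$ and $P(o)$ the prior preference over outcomes) is:

```latex
G(\pi) \;=\;
  \underbrace{-\,\mathbb{E}_{Q(o,s \mid \pi)}\!\left[\ln Q(s \mid o, \pi) - \ln Q(s \mid \pi)\right]}_{-\ \text{expected information gain}}
  \;-\;
  \underbrace{\mathbb{E}_{Q(o \mid \pi)}\!\left[\ln P(o)\right]}_{\text{expected extrinsic value}}
```

so that minimizing expected free energy maximizes expected information gain and expected extrinsic value at the same time.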
The expected information gain can be applied to any latent variables (i.e., states or parameters) of a forward model. When applied to the latent states of a generative model, the implicit intrinsic reward or motivation is sometimes referred to as salience (Itti & Baldi, 2009). Conversely, when applied to the parameters of a forward model, the expected information gain is sometimes referred to as the novelty that underwrites curious behavior (Baldassarre et al., 2014; Da Costa et al., 2020; Schmidhuber, 2010; Schwartenbeck et al., 2019). This is important because we have absorbed the implicit active inference and learning into a reinforcement learning scheme based on the actor-critic model. This kind of reinforcement learning identifies state-action policies in the sense that an optimal policy is identified for every given state. This means one cannot select policies that maximize information gain about latent states (because each policy is conditioned on being in a particular state, as opposed to having posterior beliefs about latent states). However, it is still possible to learn policies that on average are information seeking, especially about the parameters of a generative model—as in our case. The parameters in question here are the transition parameters of the forward model. This leads to the interesting notion that one can learn state-action policies that are information seeking in exactly the way we have demonstrated with the above numerical experiments. These are sometimes referred to as epistemic habits (Friston et al., 2016), such as habitually watching a certain news channel in the evening to seek information about what happened during the day. This observation is potentially important because it suggests there are lawful and learnable ways of foraging changing environments for information, such as mazes that feature curiosity traps and change over time.
Looking forward, there are multiple ways future research can explore hidden state curiosity. For example, the PV-RNN architecture (Ahmadi & Tani, 2019) can introduce hierarchical processing within a VRNN (Chung et al., 2016). Each layer of such a framework could generate hidden states with different multiple timescale RNNs (MTRNNs; Yamashita & Tani, 2008; Jian et al., 2023), allowing agents to access both long-term and short-term memories. Using this architecture, agents could have curiosity about the environment in multiple temporal contexts.
Moreover, just as the hyperparameter $\alpha$ can dynamically adjust to satisfy a target entropy (see section 2.1), it may be possible to adjust the hyperparameter $\eta$ dynamically to satisfy a target curiosity. For example, just as $\alpha$ is optimized to minimize the gap between the policy’s entropy and the target entropy, $\eta$ could be optimized to minimize the gap between the obtained curiosity and the target curiosity. This may refine the agent’s engagement with curiosity over time and help users select optimal hyperparameters.
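A sketch of how that dual optimization might look, mirroring the automatic entropy tuning used with SAC; the targets, learning rate, and update rule here are illustrative assumptions rather than results from this letter.

```python
import torch

log_alpha = torch.zeros(1, requires_grad=True)  # entropy weight alpha, learned in log space
log_eta   = torch.zeros(1, requires_grad=True)  # curiosity weight eta, learned in log space
opt = torch.optim.Adam([log_alpha, log_eta], lr=1e-3)

def tune_weights(log_prob, curiosity, target_entropy, target_curiosity):
    """Raise a weight when its quantity falls short of the target, lower it otherwise."""
    alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
    eta_loss   = -(log_eta * (target_curiosity - curiosity).detach()).mean()
    opt.zero_grad()
    (alpha_loss + eta_loss).backward()
    opt.step()
    return log_alpha.exp().item(), log_eta.exp().item()
```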
Also, future research should investigate how the choices of $\beta$ and $\eta$ and the sizes of observations and inner states affect hidden state curiosity’s reaction to curiosity traps. Although the choice of $\beta$ and $\eta$ can be considered configurable customization, it can be challenging to fine-tune optimal pairings for ignoring useless noise in the task at hand.1
Finally, the intrinsic reward for imitation shown by Kawahara et al. (2022) in equation 2.11 should be investigated. This could enable agents to learn from human demonstrations, particularly for rare or complex situations.
In conclusion, emulating behaviors associated with biological agents, like curiosity-driven exploration, appears to be a promising frontier in advancing AI. Although current RL agents can have great computational power, they cannot yet achieve understandings as nuanced as those of inquisitive humans. The FEP offers a principled way to bring such useful organic practices to artificial agents. We hope in future work to apply hidden state curiosity to more intricate 3D agents training to perform compositional action goals, delving into the FEP’s influence on embodied cognition and the emergence of communication.
Find our code at github.com/oist-cnru/curious_maze.
Appendix A: Proportions of Exit Choice
Proportion of agents taking each exit in the biased T-maze. The correct exit is colored light gray.
Proportion of agents taking each exit in the expanding T-maze. The correct exit is colored light gray.
Appendix B: Model Architectures
These tables give details of the models’ architectures in PyTorch; each table lists the parameters of a model layer by layer. PReLU refers to LeakyReLU with the leak coefficient as a trainable parameter.
Architecture of the Forward Model.
Portion | Layer Type | Details
---|---|---
Image In | Convolution | channels in=4, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Average Pooling | kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)
| Convolution | channels in=16, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Average Pooling | kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)
| Flatten |
| Linear | in_features=64, out_features=32, bias=True
| PReLU | num_parameters=1
Speed In | Linear | in_features=1, out_features=32, bias=True
| PReLU | num_parameters=1
Action In | Linear | in_features=2, out_features=32, bias=True
| PReLU | num_parameters=1
Prior Inner State (Mean) | Linear | in_features=64, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| Tanh |
Prior Inner State (STD) | Linear | in_features=64, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| Softplus | beta=1, threshold=20
Posterior Inner State (Mean) | Linear | in_features=128, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| Tanh |
Posterior Inner State (STD) | Linear | in_features=128, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| Softplus | beta=1, threshold=20
GRU | Gated RNN | input size=32, hidden size=32
Image Out | Linear | in_features=64, out_features=16, bias=True
| PReLU | num_parameters=1
| Reshape |
| Convolution | channels in=4, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Upsampling | scale_factor=2, mode=‘bilinear’
| Convolution | channels in=16, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Upsampling | scale_factor=2, mode=‘bilinear’
| Convolution | channels in=16, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Convolution | channels in=16, channels out=4, kernel_size=(1, 1), stride=(1, 1)
Speed Out | Linear | in_features=64, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=1, bias=True
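For instance, the Image In portion of Table 2 corresponds roughly to the following PyTorch module (a sketch reconstructed from the table, not the released code); with 8 × 8 input images, the two stride-2 poolings leave a 2 × 2 × 16 feature map, hence the 64 input features of the final linear layer.

```python
import torch.nn as nn

image_in = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, stride=1, padding=1, padding_mode="reflect"),
    nn.PReLU(num_parameters=1),
    nn.AvgPool2d(kernel_size=3, stride=2, padding=1),   # 8 x 8 -> 4 x 4
    nn.Conv2d(16, 16, kernel_size=3, stride=1, padding=1, padding_mode="reflect"),
    nn.PReLU(num_parameters=1),
    nn.AvgPool2d(kernel_size=3, stride=2, padding=1),   # 4 x 4 -> 2 x 2
    nn.Flatten(),                                       # 16 * 2 * 2 = 64 features
    nn.Linear(64, 32),
    nn.PReLU(num_parameters=1),
)
```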
Architecture of the Actor Model.
Portion | Layer Type | Details
---|---|---
h In | Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=2, bias=True
| Linear | in_features=32, out_features=2, bias=True
| Softplus | beta=1, threshold=20
Note: The two final linear layers output the mean and the standard deviation (via Softplus) of a gaussian distribution. The final action is a bounded (squashed) sample drawn from that distribution with the reparameterization trick, and the log probability of that action includes the corresponding change-of-variables correction.
Architecture of the Critic Model.
Portion | Layer Type | Details
---|---|---
Image In | Convolution | channels in=4, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Average Pooling | kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)
| Convolution | channels in=16, channels out=16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), padding_mode=reflect
| PReLU | num_parameters=1
| Average Pooling | kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)
| Flatten |
| Linear | in_features=64, out_features=32, bias=True
| PReLU | num_parameters=1
Speed In | Linear | in_features=1, out_features=32, bias=True
| PReLU | num_parameters=1
Action In | Linear | in_features=2, out_features=32, bias=True
| PReLU | num_parameters=1
GRU | Gated RNN | input size=96, hidden size=32
Q Out | Linear | in_features=32, out_features=32, bias=True
| PReLU | num_parameters=1
| Linear | in_features=32, out_features=1, bias=True
Note: Not pictured: concatenation of Image In, Speed In, and Action In for GRU input.
Acknowledgments
Supported by the Japan Society for the Promotion of Science KAKENHI Grant Number JP23H04975 to KD. We thank an anonymous reviewer.
Note
Generally, in FEP-based schemes, a free hyperparameter can, in principle, be optimized with respect to variational free energy. For simple hyperparameters, this is usually best achieved with a line search over the hyperparameter to minimize the path integral of variational free energy—as a bound on log marginal likelihood or model evidence—accumulated over the time period in question.