Abstract

The adaptive regulation of the balance between exploitation and exploration is critical for the optimization of behavioral performance. Animal research and computational modeling have suggested that changes in exploitative versus exploratory control state in response to changes in task utility are mediated by the neuromodulatory locus coeruleus–norepinephrine (LC–NE) system. Recent studies have suggested that utility-driven changes in control state correlate with pupil diameter, and that pupil diameter can be used as an indirect marker of LC activity. We measured participants' pupil diameter while they performed a gambling task with a gradually changing payoff structure. Each choice in this task can be classified as exploitative or exploratory using a computational model of reinforcement learning. We examined the relationship between pupil diameter, task utility, and choice strategy (exploitation vs. exploration), and found that (i) exploratory choices were preceded by a larger baseline pupil diameter than exploitative choices; (ii) individual differences in baseline pupil diameter were predictive of an individual's tendency to explore; and (iii) changes in pupil diameter surrounding the transition between exploitative and exploratory choices correlated with changes in task utility. These findings provide novel evidence that pupil diameter correlates closely with control state, and are consistent with a role for the LC–NE system in the regulation of the exploration–exploitation trade-off in humans.

INTRODUCTION

Imagine you are in a restaurant and you are faced with the decision of what food to order. One option is to choose a familiar dish that you know and like. Alternatively, you could try an unfamiliar dish, and take the risk that you might not like it. However, it is also possible that the unfamiliar dish turns out to become your new favorite, which you would never have discovered if you stuck to the familiar dish. This example illustrates the dilemma between exploiting well-known options and exploring new ones. The trade-off between exploitation and exploration plays an important role in all kinds of decisions, especially in unfamiliar or changing environments. Although there has been a recent rise in studies investigating the strategies that are used to handle this trade-off and the neural mechanisms involved (for a review, see Cohen, McClure, & Yu, 2007), these issues are still poorly understood.

One relevant line of research that addresses this issue suggests that the locus coeruleus–norepinephrine (LC–NE) neuromodulatory system plays an important role in regulating the balance between exploitation and exploration (Aston-Jones & Cohen, 2005; Usher, Cohen, Servan-Schreiber, Rajkowski, & Aston-Jones, 1999). Aston-Jones and Cohen have proposed that exploitative and exploratory control states are mediated by two modes of LC activity, the “phasic” and the “tonic” modes, respectively. The phasic LC mode is characterized by an intermediate level of LC baseline activity and large phasic increases in activity in response to task-relevant stimuli. The ensuing phasic release of NE in cortical areas temporarily increases the responsivity (or gain) of these areas to their afferent input, selectively potentiating the processing of these task-relevant stimuli (Berridge & Waterhouse, 2003; Doya, 2002; Servan-Schreiber, Printz, & Cohen, 1990). Conversely, the tonic LC mode is characterized by an elevated level of LC baseline activity and tonic NE release, and the absence of phasic responses.1

According to the adaptive gain theory (Aston-Jones & Cohen, 2005), the two LC modes promote exploitation and exploration by adaptively adjusting the responsivity of cortical neurons: The phasic mode produces selective increases in neuronal responsivity in response to task-related stimuli, thereby optimizing performance in the current task (i.e., exploitation). In contrast, the tonic mode produces a more enduring and less discriminative increase in neuronal responsivity. Although this degrades performance within the current task, it facilitates the disengagement of attention from this task and the processing of other non-task-related stimuli and/or behaviors (i.e., exploration). A second assumption of the theory is that transitions between the phasic and tonic LC modes and corresponding control states are driven by on-line assessments of task-related utility carried out in ventral and medial frontal structures (Aston-Jones & Cohen, 2005). Consistent with this hypothesis, anatomical studies have shown that the primary neocortical projections to LC come from orbitofrontal and anterior cingulate cortex (Zhu, Iba, Rajkowski, & Aston-Jones, 2004; Aston-Jones et al., 2002; Rajkowski, Lu, Zhu, Cohen, & Aston-Jones, 2000)—areas known to be responsive to task-related rewards and costs of performance (Botvinick, 2007; Ridderinkhof, Ullsperger, Crone, & Nieuwenhuis, 2004). In order to adaptively regulate the balance between exploitation and exploration, utility assessments are integrated over both short (e.g., seconds) and longer (e.g., tens of seconds) timescales. If long-term utility is high, temporary decreases in utility augment the phasic LC mode, in order to restore task performance. Conversely, long-term decreases in utility drive the LC toward the tonic mode, which facilitates disengagement from the current task and exploration of alternative behaviors.

The adaptive gain theory has been supported mainly by computational modeling studies (Usher et al., 1999) and neurophysiological studies in monkeys that have used relatively simple tasks (Aston-Jones & Cohen, 2005). In contrast, with one notable exception (Gilzenrat, Nieuwenhuis, Jepma, & Cohen, 2010), there have been no tests of this theory in humans yet. In order to test the theory in humans, a noninvasive measure of LC activity is required. There is preliminary evidence that pupil diameter can provide such a measure: Although it does not appear to be under direct control of the LC, pupil diameter is correlated with LC activity, and thus, may be useful as a “reporter variable” (Nieuwenhuis, de Geus, & Aston-Jones, 2011). Rajkowski, Kubiak, and Aston-Jones (1993), for example, found a strong correlation in monkeys between baseline pupil diameter and tonic LC firing rate over the course of 90 min of performance in a target-detection task. Furthermore, a recent study that investigated how pupil diameter is related to experimental manipulations of task-related utility and behavioral indices of task (dis)engagement showed that pupil diameter varied in a way consistent with predicted LC dynamics (Gilzenrat et al., 2010). Specifically, this study showed that decreases in long-term utility and behavioral indices of task disengagement were associated with increased baseline pupil diameters and decreased pupil dilations, mirroring the high tonic and low phasic activity associated with the tonic LC mode. However, although this study assessed pupil effects associated with task (dis)engagement, it did not explicitly investigate the exploitation–exploration trade-off as participants were not given the opportunity to explore different task options.

Inspired by the recent evidence that pupil diameter might be used as an indirect index of LC activity, we measured participants' pupil diameter while they performed a “four-armed bandit” task with a gradually changing payoff structure in which the trade-off between exploitation and exploration is a central component (Daw, O'Doherty, Dayan, Seymour, & Dolan, 2006; Figure 1; Supplemental Information). Optimal performance in this task requires a delicate balance between exploitative and exploratory choices. We examined whether the relationship between pupil diameter, control state, and task-related utility was consistent with the two main assumptions of the adaptive gain theory, namely, that LC mode regulates the trade-off between exploitative and exploratory control states, and that transitions between LC modes are driven by assessments of task-related utility. The first assumption predicts that exploratory choices will be associated with a larger baseline pupil diameter, possibly reflecting a more tonic LC mode, than exploitative choices. In addition, this assumption suggests that individual differences in overall pupil diameter might be correlated with individual differences in exploratory choice behavior: Participants with larger overall pupil diameters, perhaps suggestive of a more tonic LC mode, may make more exploratory choices. The second assumption predicts that changes in utility surrounding the transition between control states will be accompanied by specific changes in baseline pupil diameter: A steady increase in baseline pupil diameter as decreasing utility drives the participant toward exploration; a monotonic decrease in baseline pupil diameter as utility increases after the participant has started a new series of exploitative choices.

Figure 1. 

The four-armed bandit task. Participants made repeated choices between four slot machines. Unlike standard slots, the mean payoffs of the four machines changed gradually and independently from trial to trial (four colored lines). Participants were encouraged to earn as many points as possible during the experiment. After the experiment, each choice was classified as exploitative or exploratory using a computational model of reinforcement learning.

METHODS

Participants

Seventeen volunteers participated (11 women; aged 18–33 years; mean age = 22.4). The experiment was approved by the local ethics review board and conducted according to the principles expressed in the Declaration of Helsinki. Informed consent was obtained from all participants.

Stimuli and Procedure

Participants performed a “four-armed bandit” task while their pupil diameters were continuously measured. The task was a slightly modified version of the task used by Daw et al. (2006). Participants were presented with pictures of four different colored slot machines (of equal luminance) on a medium gray background. The slot machines stayed on the screen during the entire experiment. Each trial started with a 4-sec interval during which the slot machines were displayed, but participants could not select a machine yet. After this, a black fixation cross appeared in the center of the screen, indicating that participants could select one of the slot machines by pressing the “Q,” “W,” “A,” or “S” key. Participants had a maximum of 1.5 sec in which to make their choice; if no choice was made during that interval, a “TIME OUT” message appeared in the center of the screen for 3 sec to signal a missed trial (average number of missed trials = 1.7). If participants responded within 1.5 sec, the lever of the chosen slot machine was lowered and the number of points earned was displayed in the chosen machine. These points were displayed until the end of the trial, which was 7 sec after trial onset. Importantly, the number of points paid off by the four slot machines gradually and independently changed from trial to trial (Figure 1; Supplemental Information).

The experiment was conducted at a slightly dimmed illumination level (room illumination = 100 lux). We recorded pupil diameter at 60 Hz using a Tobii T120 eye tracker, which is integrated into a 17-in. TFT monitor (Tobii Technology, Stockholm, Sweden). Participants were seated at a distance of approximately 60 cm from the monitor. Prior to the start of the experimental session, participants viewed visually presented instructions, including an instruction that the payoffs of the machines would change throughout the experiment, and were given 24 practice trials to familiarize them with the task. After the practice trials, participants were instructed that the machines had been reset for the experimental session. The experimental session consisted of 180 trials, and lasted about 20 min. We instructed the participants that they would be paid according to how many points they had earned during the experimental session. We also instructed them that, on average, participants earned €2.50 in this experiment. However, we did not tell participants how the number of points was converted into euros, or what their cumulative point total was. At the end of the experiment, each participant was paid €3.

Data Analysis

Behavioral Analysis

In order to classify each choice as exploitative or exploratory, we fitted a reinforcement-learning model to the data of each participant. We used the same model as that used by Daw et al. (2006). This model consists of a mean-tracking rule that estimates the mean payoff of each machine, and a choice rule that selects a machine based on these estimations (Supplemental Information). The choice rule used was the “softmax” rule. This rule assumes that choices between different options are made in a probabilistic manner, such that the probability that a particular machine is chosen depends on its relative estimated payoff. The exploitation–exploration balance is adjusted by a parameter referred to as gain, or inverse temperature: With higher gain, action selection is determined more by the relative estimated payoffs of the different options, whereas with lower gain, action selection is more evenly distributed across the different options. We classified each choice as exploitative or exploratory according to whether the chosen slot machine was the one with the maximum estimated payoff (exploitation) or not (exploration).
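The softmax choice rule and the exploit/explore classification described above can be illustrated with a minimal sketch. It assumes the standard exponential form of the softmax rule; the function names and the payoff values are hypothetical and serve only as an illustration, not as the fitted model or data of this study.

```python
import math

def softmax_probabilities(estimated_payoffs, gain):
    """Choice probabilities under a softmax rule with inverse temperature `gain`."""
    # Subtracting the maximum before exponentiating avoids overflow
    # and leaves the resulting probabilities unchanged.
    top = max(estimated_payoffs)
    weights = [math.exp(gain * (q - top)) for q in estimated_payoffs]
    total = sum(weights)
    return [w / total for w in weights]

def classify_choice(estimated_payoffs, chosen_index):
    """'exploit' if the chosen machine has the maximum estimated payoff, else 'explore'."""
    best = max(estimated_payoffs)
    return "exploit" if estimated_payoffs[chosen_index] == best else "explore"

# Illustrative estimated payoffs for the four machines (not data from the study).
payoffs = [62.0, 48.0, 55.0, 30.0]
high_gain = softmax_probabilities(payoffs, gain=0.5)   # choices concentrate on machine 0
low_gain = softmax_probabilities(payoffs, gain=0.01)   # choices spread across machines
```

With the higher gain value, nearly all probability mass falls on the machine with the highest estimated payoff; with the lower value, the four probabilities move toward 0.25 each, which corresponds to a more exploratory choice pattern.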

Pupil Analysis

Pupil data were processed and analyzed using the Brain Vision Analyzer (Brain Products, Gilching, Germany). Artifacts and blinks were removed using a linear interpolation algorithm. We assessed the baseline pupil diameter prior to the selection of a slot machine, as well as the magnitude of the pupil dilation following the selection of a slot machine. To determine baseline pupil diameter, we averaged the pupil data in the period from 2.5 to 0.5 sec before the keypress. The pupil data during the 0.5 sec immediately preceding the keypress were not included in the baseline period because most participants showed an anticipatory increase in pupil diameter starting about 0.5 sec before their keypress response. The pupil dilation evoked by choosing a machine and perceiving the received payoff was measured as the highest deviation from the baseline in the 3 sec following the keypress response.
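As an illustration of the two pupil measures defined above, the following sketch computes them from a single-trial trace. It is not the Brain Vision Analyzer pipeline; the function name and the synthetic trace are hypothetical, and "highest deviation from the baseline" is interpreted here as the peak positive deviation, which is an assumption.

```python
def baseline_and_dilation(times, diameters, keypress_time):
    """Return (baseline, dilation) for one trial.

    baseline: mean diameter from 2.5 to 0.5 sec before the keypress.
    dilation: largest positive deviation from baseline in the 3 sec after
              the keypress (assumed reading of "highest deviation").
    """
    base = [d for t, d in zip(times, diameters)
            if keypress_time - 2.5 <= t < keypress_time - 0.5]
    post = [d for t, d in zip(times, diameters)
            if keypress_time <= t <= keypress_time + 3.0]
    baseline = sum(base) / len(base)
    dilation = max(d - baseline for d in post)
    return baseline, dilation

# Synthetic trace sampled every 0.1 sec: a flat 3.9 mm pupil with a
# 4.1 mm plateau from 4.5 to 5.4 sec; the keypress occurs at 4.0 sec.
times = [i / 10 for i in range(71)]
diameters = [4.1 if 45 <= i < 55 else 3.9 for i in range(71)]
baseline, dilation = baseline_and_dilation(times, diameters, keypress_time=4.0)
```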

We compared the average baseline pupil diameter and pupil dilation on exploitation versus exploration trials. In addition, we calculated the degree of exploration for each exploratory choice by subtracting the estimated payoff of the chosen machine from the maximum estimated payoff. We divided all exploration trials into three equally sized bins based on the degree of exploration (low, medium, and high), and assessed the average baseline pupil diameter for these three exploration bins. Because the number of points earned was displayed immediately after the selection of a slot machine, the pupil dilation on each trial reflected both the selection of a machine and the processing of the received payoff. Due to this confound, we could not unequivocally interpret differences in pupil dilation between exploitation and exploration trials, and focused our analyses on the baseline pupil diameter.
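The degree-of-exploration measure and the three-way binning can be sketched as follows. Assigning bins by rank is one possible way to obtain equally sized bins and is an assumption; the function names and example values are hypothetical.

```python
def degree_of_exploration(estimated_payoffs, chosen_index):
    """Maximum estimated payoff minus the estimated payoff of the chosen machine."""
    return max(estimated_payoffs) - estimated_payoffs[chosen_index]

def tercile_bins(degrees):
    """Label each exploration trial 'low', 'medium', or 'high' by rank of its degree."""
    names = ("low", "medium", "high")
    order = sorted(range(len(degrees)), key=lambda i: degrees[i])
    labels = [None] * len(degrees)
    for rank, trial in enumerate(order):
        # Integer arithmetic maps ranks onto three (near-)equal groups.
        labels[trial] = names[min(rank * 3 // len(degrees), 2)]
    return labels

# Illustrative degrees of exploration for six exploration trials (not study data).
degrees = [5.0, 1.0, 9.0, 3.0, 12.0, 7.0]
bins = tercile_bins(degrees)
```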

Compared to exploitative choices, exploratory choices were more often preceded by other exploratory choices. In addition, exploratory choices were associated with a lower payoff and more negative prediction error on the previous trial, and a lower expected payoff and higher entropy on the current trial (see Results). Entropy is an index of the similarity of the four slot machines' expected payoffs; it increases as the expected payoffs of the four slot machines become more similar. Entropy thus provides an estimate of the level of uncertainty, or conflict, associated with figuring out which slot machine is the most valuable. The entropy H(X) on each trial was calculated as:
$$H(X) = -\sum_{i=1}^{4} P(x_i)\,\log_2 P(x_i)$$
where P(xi) is the probability of choosing slot machine xi. To assess whether these potential confounds could account for the differences in baseline pupil diameter on exploration and exploitation trials, we subjected the single-trial baseline pupil diameter values to a multiple linear regression analysis, separately for each participant. Choice strategy (explore vs. exploit) and the five abovementioned nuisance variables (expected payoff, entropy, and the payoff, prediction error and strategy on the previous trial) as well as a constant were included as explanatory factors. For choice strategy and choice strategy on the previous trial, we used binary factors that have a value of 1 on exploit trials and 0 on explore trials. To assess which variables were significant predictors of baseline pupil diameter, we conducted a one-sample t test on the regression coefficients of each explanatory factor (Lorch & Myers, 1990).
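Two steps of this analysis can be sketched in a few lines: the entropy of the four choice probabilities, and the second-level one-sample t test on per-participant regression coefficients. The base-2 logarithm is assumed because entropy is reported in bits; the function names and example values are hypothetical, and the per-participant regression itself is not shown.

```python
import math
import statistics

def choice_entropy(probabilities):
    """Shannon entropy in bits of the choice probabilities on one trial."""
    # Terms with zero probability contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def one_sample_t(coefficients):
    """t statistic and degrees of freedom for testing mean(coefficients) against 0."""
    n = len(coefficients)
    mean = statistics.mean(coefficients)
    sd = statistics.stdev(coefficients)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n)), n - 1

uniform_entropy = choice_entropy([0.25, 0.25, 0.25, 0.25])  # maximal uncertainty
certain_entropy = choice_entropy([1.0, 0.0, 0.0, 0.0])      # no uncertainty

# Illustrative regression coefficients for three participants (not study data).
t_value, df = one_sample_t([1.0, 2.0, 3.0])
```

With four equally likely machines the entropy reaches its maximum of 2 bits, and it falls toward 0 as one machine comes to dominate the choice probabilities.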

We also assessed whether individual differences in pupil diameter predicted individual differences in exploratory behavior. In this analysis, we computed the between-subjects correlation between the average baseline pupil diameter and the proportion of exploratory choices, and between the average baseline pupil diameter and the value of the gain/inverse temperature parameter of the reinforcement learning model.
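The between-subjects correlation can be sketched with a plain Pearson coefficient and the usual conversion of r to a t statistic with n − 2 degrees of freedom. The function names and the example values are hypothetical, and the use of this particular significance test is an assumption.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of per-participant values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for a correlation r over n participants."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Illustrative values for four participants (not study data).
pupil_mm = [3.5, 3.8, 4.0, 4.3]
prop_explore = [0.20, 0.26, 0.30, 0.36]
r = pearson_r(pupil_mm, prop_explore)

# A correlation of .50 over 17 participants corresponds to t of about 2.24 on 15 df.
t_example = t_from_r(0.50, 17)
```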

To assess the development of our utility measures (payoff, expected payoff, and entropy) and baseline pupil diameter surrounding the transition between exploitative and exploratory choice strategies, we averaged trials as a function of their position relative to the transition from an exploitative to an exploratory choice strategy, and vice versa. For this analysis, we only considered the exploration trials that were preceded or followed by a minimum of three exploitation trials.
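The trial-selection rule for this transition analysis can be sketched as follows. The sketch assumes a per-trial list of "exploit"/"explore" labels and ignores missed trials; the function name and the example sequence are hypothetical.

```python
def transition_positions(labels, min_run=3):
    """Find exploration trials that border a run of at least `min_run` exploit trials.

    Returns (starts, ends):
      starts: indices of the first trial of an exploration series that is
              preceded by `min_run` consecutive exploitation trials.
      ends:   indices of the last trial of an exploration series that is
              followed by `min_run` consecutive exploitation trials.
    """
    n = len(labels)
    starts, ends = [], []
    for i, label in enumerate(labels):
        if label != "explore":
            continue
        is_first = i == 0 or labels[i - 1] == "exploit"
        is_last = i == n - 1 or labels[i + 1] == "exploit"
        if is_first and i >= min_run and all(
                l == "exploit" for l in labels[i - min_run:i]):
            starts.append(i)
        if is_last and i + min_run < n and all(
                l == "exploit" for l in labels[i + 1:i + 1 + min_run]):
            ends.append(i)
    return starts, ends

# Illustrative label sequence: T = exploit, X = explore (not study data).
code = {"T": "exploit", "X": "explore"}
labels = [code[c] for c in "TTTXXTTTTXT"]
starts, ends = transition_positions(labels)
```

Measures such as baseline pupil diameter or entropy can then be averaged at fixed offsets from each returned index (for example, the three trials before each entry in `starts`).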

RESULTS

Participants alternated between choosing the slot machine with the highest estimated current payoff (exploitation) and choosing slot machines with a lower expected payoff (exploration). In comparison to the exploitation trials, exploration trials were more often preceded by other exploration trials (Table 1), indicating that participants tended to explore for several successive trials before settling on a new slot machine. The main characteristics of the exploitation and exploration trials are summarized in Table 1.

Table 1. 

Characteristics of Exploration and Exploitation Trials


Measure                                     Exploration    Exploitation   p
Proportion of total number of trials        0.31 (0.10)    0.69 (0.10)    <.001
Proportion preceded by exploration trial    0.41 (0.07)    0.28 (0.13)    .001
RT (msec)                                   492 (82)       508 (75)       .15
RT variability (SD of RTs)                  150 (45.5)     151 (40.0)     .912
RT trial n − 1 (msec)                       498 (72)       504 (79)       .36
Payoff (points)                             48 (1.6)       63 (1.9)       <.001
Prediction error (points)                   −2.8 (6.5)     −1.0 (5.1)     .07
Expected payoff (points)                    51 (6.4)       64 (4.0)       <.001
Entropy (bits)                              1.5 (0.14)     1.2 (0.33)     <.001
Payoff preceding trial (points)             54 (2.4)       60 (3.1)       <.001
Prediction error preceding trial (points)   −3.6 (4.4)     −1.0 (5.9)     .001

Standard deviations are in parentheses.

Pupil Diameter on Exploitation versus Exploration Trials

First, we compared the baseline pupil diameter preceding exploitative and exploratory choices. Baseline pupil diameters preceding exploratory choices were larger than those preceding exploitative choices [3.93 vs. 3.88 mm, t(16) = 3.0, p = .008; Figure 2, left panel]. Furthermore, within the exploration trials, baseline pupil diameter increased as a function of the degree of exploration (Methods), as revealed by a repeated measures linear trend analysis [F(1, 16) = 15.3, p = .001; Figure 2, right panel]. We also examined the pupil dilations evoked by exploratory and exploitative choices. There was a trend toward larger dilations on exploration than exploitation trials [0.17 vs. 0.13 mm, t(16) = 2.1, p = .051]. This was probably due to the higher incidence of negative prediction errors on exploration trials (Satterthwaite et al., 2007), as the effect disappeared when only the trials with positive prediction errors were included (p = .14).

Figure 2. 

Pupil diameter on exploration and exploitation trials. (A) Time course of grand-average pupil diameter aligned to the keypress indicating the selection of a slot machine for exploratory and exploitative choices. (B) Average baseline pupil diameter for exploitative choices (black bar), and exploratory choices with a low, medium, and high degree of exploration (striped bars).

The difference in baseline pupil size between exploitation and exploration trials already started to develop during the pupil response on the preceding trial (Figure 3): Trials immediately preceding exploration trials were associated with a larger pupil dilation than trials immediately preceding exploitation trials [0.17 vs. 0.13 mm, t(16) = 3.2, p = .006]. However, this effect on the preceding trial could not (fully) explain the difference in baseline pupil diameters between exploitation and exploration trials because the difference remained significant when pupil dilation on the previous trial was included as a covariate in the analysis [F(1, 15) = 4.69, p = .047].

Figure 3. 

Time course of grand-average post-choice pupil dilation for the trials preceding exploration and exploitation trials.

Exploitation and exploration trials differed in several aspects other than choice strategy (Table 1). Trials preceding exploration trials were characterized by a larger proportion of exploratory choices, a lower payoff, and a more negative prediction error than trials preceding exploitation trials. In addition, exploration trials were characterized by a lower model-estimated expected payoff (of the chosen slot machine) and higher entropy than exploitation trials. We investigated whether choice strategy (explore vs. exploit) could predict baseline pupil diameter independently of these potential nuisance variables by means of a linear multiple regression analysis (Methods). Importantly, when adjusted for all other variables, choice strategy made a unique contribution to the prediction of baseline pupil diameter [t(16) = 3.43, p = .003]. The only other significant predictor of baseline pupil diameter was the strategy on the previous trial [t(16) = 2.98, p = .009]. Additional control analyses that yielded similar results are reported in the Supplemental Information.

Together, these findings confirm our first prediction that exploratory choices are associated with a larger baseline pupil diameter, while excluding a range of alternative interpretations for the observed pupil effect.

Individual Differences in Pupil Diameter and Exploratory Choice Behavior

So far we have examined pupil diameter as a function of the within-subject factor choice strategy. We next assessed whether individual differences in overall pupil diameter were predictive of individual differences in exploratory choice behavior. There was a positive correlation, across participants, between the average pupil diameter over all trials and the proportion of exploratory choices (r = .50, p = .04; Figure 4, left panel). Similarly, there was a negative correlation between the average pupil diameter and the value of the gain parameter of the reinforcement learning model (r = −.53, p = .03; Figure 4, right panel). These correlations were also present when the baseline pupil diameters on exploitation and exploration trials were considered separately (pupil diameter on exploitation trials and proportion exploratory choices: r = .49, p = .04; pupil diameter on exploitation trials and gain parameter: r = −.52, p = .03; pupil diameter on exploration trials and proportion exploratory choices: r = .48, p = .05; pupil diameter on exploration trials and gain parameter: r = −.53, p = .03). Unlike the gain parameter, the other model parameters did not correlate with pupil diameter (decay parameter: r = −.24, p = .36; decay center: r = .07, p = .78).

Figure 4. 

Individual differences in pupil diameter and exploratory choice behavior. (A) Scatterplot of the between-subjects correlation between average baseline pupil diameter and the proportion of exploratory choices. (B) Scatterplot of the between-subjects correlation between average baseline pupil diameter and the value of the gain or inverse-temperature parameter of the reinforcement-learning model. A lower value of this parameter indicates a more exploratory choice strategy.

Obviously, individual differences in pupil diameter relate to many factors other than control state, such as age, personality, and intelligence (Janisse, 1977). Importantly, these factors presumably increased the between-subjects error variance in our data, which decreased the power for detecting a correlation. Thus, the fact that we found a correlation despite a presumably large error variance in the between-subjects pupil data affirms the existence of the correlation. However, it is also possible that individual differences in pupil diameter reflect individual differences in motivation or the amount of attention paid to the task. Such motivational factors might influence choice strategy, which could provide an alternative explanation for the correlations between pupil diameter and exploratory behavior across participants.

Changes in Utility and Pupil Diameter Surrounding a Transition between Choice Strategies

So far we have examined the difference in pupil diameter between exploitation and exploration trials. We next examined the changes in utility measures surrounding the transition between exploitative and exploratory choice strategies. As measures of utility, we used the model-estimated expected payoff of the chosen machine, the received payoff, and the entropy (Methods). Subsequently, we tested whether such changes in utility were accompanied by changes in pupil diameter.

Figure 5 (top) shows the expected payoff, received payoff, and entropy for the first and the last of a series of exploration trials and the three preceding and following exploitation trials. During the three exploitation trials that preceded the switch to an exploratory choice strategy, entropy gradually increased [F(1, 16) = 10.16, p = .006] and payoff gradually decreased [F(1, 16) = 50.72, p < .001], as revealed by a repeated measures linear trend analysis. Expected payoff also showed a decrease over the three trials preceding the first explore trial, but this effect missed significance [F(1, 16) = 2.85, p = .11]. Thus, there was a gradual decrease in utility preceding the switch from an exploitative to an exploratory choice strategy, suggesting that, on average, participants began exploring when task utility was at a minimum. In addition, during the three exploitation trials following the last exploration trial, entropy gradually decreased [F(1, 16) = 9.74, p = .007] and expected payoff gradually increased [F(1, 16) = 13.72, p = .002]. Thus, there was a gradual increase in utility following the switch from an exploratory to an exploitative choice strategy.

Figure 5. 

Grand-average dependent measures for the first and last of a series of exploration trials, and the three preceding and following exploitation trials. (A) Our measures of utility: expected payoff, received payoff, and entropy. (B) Baseline pupil diameter.

We next examined the development of baseline pupil diameter over the trials surrounding the switch between exploitative and exploratory choice strategies (Figure 5, bottom). Baseline pupil diameter did not differ significantly across the three exploitation trials preceding the first exploration trial [F(2, 32) = 1.30, p = .29]. However, baseline pupil diameter showed a gradual decrease over the three exploitation trials following the last exploration trial [F(1, 16) = 6.18, p = .024], resembling the gradual decrease in entropy and increase in expected payoff during these trials. As predicted, baseline pupil diameter correlated negatively with expected payoff [r = −.72, p(one-tailed) = .023] and positively with entropy [r = .68, p(one-tailed) = .032] across the eight trial positions in Figure 5. These findings provide some evidence for our second prediction: that changes in utility surrounding the transition between control states would be systematically correlated with changes in baseline pupil diameter.

DISCUSSION

We investigated the relationship between pupil diameter, choice strategy (exploitation vs. exploration), and task utility, in order to test predictions of the adaptive gain theory of LC function in humans. This study was inspired by recent observations that pupil diameter might be used as a reliable index of LC activity. Our main findings can be summarized as follows: (i) exploratory choices were associated with a larger baseline pupil diameter than exploitative choices; (ii) individual differences in baseline pupil diameter predicted individual differences in exploratory choice behavior: Participants with a larger pupil diameter made more exploratory choices and were characterized by a smaller gain parameter of the reinforcement-learning model; and (iii) trial-to-trial changes in baseline pupil diameter surrounding the transition between choice strategies correlated systematically with changes in utility, at least during the transition from exploration to exploitation. At the least, these findings provide novel evidence for a close relationship between pattern of pupillary response and control state. More tentatively, these findings provide indirect support for the two main assumptions of the adaptive gain theory, namely, that LC firing mode regulates the trade-off between exploitative and exploratory control states, and that changes in LC mode are driven by on-line assessments of task-related utility (Aston-Jones & Cohen, 2005).

Our finding that pupil diameter is predictive of choice strategy, in a manner consistent with the adaptive gain theory, corroborates recent findings by Gilzenrat et al. (2010) that pupil diameter is related to behavioral indications of the tonic and phasic LC modes. Gilzenrat et al. found that large baseline pupils were associated with slower, more variable reaction times and less accurate performance in a target-detection task, and with task disengagement in a task in which participants were given the opportunity to disengage from the current task context when utility decreased. Furthermore, several pharmacological studies have shown that drug-induced activation of the LC–NE system increases cognitive flexibility and behavioral disengagement. For example, drugs that increase tonic NE levels (i.e., mimic the effects of elevated NE release that characterize the tonic LC mode) have been found to improve attentional-set shifting and reversal learning in rats and monkeys (Seu, Lang, Rivera, & Jentsch, 2008; Lapiz, Bondi, & Morilak, 2007; Lapiz & Morilak, 2006; Steere & Arnsten, 1997; Devauges & Sara, 1990; but see Chamberlain et al., 2006). In humans, increased NE levels induced by the selective NE reuptake inhibitor atomoxetine have been found to improve the ability to stop an ongoing motor response when cued to do so (Chamberlain et al., 2006). A possible explanation for this finding is that the drug-related increase in cognitive flexibility facilitates disengaging from one task (responding) and switching to a new task (stopping the response). In addition, increased NE levels induced by the selective NE reuptake inhibitor reboxetine have been found to enhance social flexibility in human participants, as indicated by increased social engagement and cooperation and a reduction in self-focus (Tse & Bond, 2002).
Although none of these studies directly investigated exploitative versus exploratory behaviors, their findings support the idea that the tonic LC mode produces an enduring and largely nonspecific increase in responsivity, which promotes a flexible, exploratory control state.

Modeling studies have started to investigate the relationship between the LC mode and task-related utility, integrated over different timescales (Aston-Jones & Cohen, 2005, Figure 10; McClure, Gilzenrat, & Cohen, 2005). However, to date, there has been hardly any empirical research on the temporal dynamics of utility-driven changes in the LC mode. We addressed this issue by assessing the trial-to-trial changes in utility and baseline pupil diameter surrounding the switch between exploitative and exploratory choice strategies. The switch to an exploratory choice strategy was preceded by a gradual decrease in utility, but by an abrupt increase in baseline pupil diameter. When participants started to exploit a new machine after a period of exploration, utility gradually increased and baseline pupil diameter gradually decreased again. This pattern suggests that the transition from the tonic to the phasic mode is rather gradual, whereas the transition from the phasic to the tonic LC mode is more abrupt. A somewhat similar pattern was found by Gilzenrat and colleagues: Baseline pupil diameter showed a marked gradual decrease when participants started to engage in a new task; the increase in baseline pupil diameter leading up to task disengagement was less gradual and less pronounced. The implications of these data for our understanding of the specific mechanisms by which changes in short- and long-term utility control LC mode remain a matter for further research. One possibility is that LC baseline activity abruptly increases when long-term utility falls below a certain value. Consistent with this possibility, there is some evidence that tonic LC activity in monkeys can increase abruptly after a change in task contingency (Aston-Jones, Rajkowski, & Kubiak, 1997) or during the transition from a drowsy to an alert behavioral state (Rajkowski, Kubiak, & Aston-Jones, 1994). 
In any case, more empirical data are needed to determine how different measures of utility are integrated over different timescales and to specify the function relating overall utility to changes in LC mode. Such knowledge will also inform the implementation of a utility-sensitive adaptive gain mechanism in reinforcement-learning models. This would represent a significant advance over current models, such as the model used here, in which the gain parameter is estimated for each participant but fixed across the experiment.
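
The role of a fixed gain parameter can be illustrated with a minimal sketch. It assumes a softmax choice rule, which is a common choice in reinforcement-learning models but is not necessarily identical to the model fitted in this study. The function names, the payoff values, and the two gain values are hypothetical and serve only to show that a smaller gain yields flatter choice probabilities and therefore higher choice entropy.

```python
import math


def softmax_choice_probs(expected_payoffs, gain):
    """Return softmax choice probabilities for a list of expected payoffs.

    A larger gain concentrates probability on the option with the highest
    expected payoff (exploitation); a smaller gain flattens the distribution
    (more exploration). Assumed form, not the fitted model of the study.
    """
    best = max(expected_payoffs)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(gain * (v - best)) for v in expected_payoffs]
    total = sum(weights)
    return [w / total for w in weights]


def choice_entropy(probs):
    """Shannon entropy (in bits) of a choice-probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical expected payoffs for four slot machines.
payoffs = [60.0, 50.0, 45.0, 40.0]

# Hypothetical gain values: one high (exploitative), one low (exploratory).
probs_high_gain = softmax_choice_probs(payoffs, gain=0.5)
probs_low_gain = softmax_choice_probs(payoffs, gain=0.05)

print("high gain:", probs_high_gain, choice_entropy(probs_high_gain))
print("low gain: ", probs_low_gain, choice_entropy(probs_low_gain))
```

In this sketch the gain is a constant passed to the function. A utility-sensitive adaptive gain mechanism of the kind discussed above would instead let the gain vary from trial to trial as a function of recent utility; the form of that function is exactly what remains to be specified.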

The abrupt increase in baseline pupil diameter prior to an exploratory choice might also be related to the specific task that we used. An aspect of the task that could be important in this respect is the high learning rate (see Supplemental Information). A comparably high learning rate was found in a previous study using this task (Daw et al., 2006); thus, it seems to be characteristic of participants' choice behavior in this task. Such high learning rates imply that participants base their expectations regarding the slot machines' payoffs primarily on their most recent experience with each machine. Accordingly, a single bad outcome on a certain trial is likely to be experienced as a substantial decrease in utility and to promote the exploration of another machine. This may explain the abrupt increase in baseline pupil diameter that we observed immediately preceding the first of a series of exploratory choices. Thus, it will be important to assess in future studies whether tasks that are associated with lower learning rates will result in a more gradual increase in pupil diameter preceding the switch to an exploratory choice strategy.
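
The consequence of a high learning rate can be made concrete with a small worked example. It assumes a simple delta-rule update of the expected payoff; the model actually fitted is described in the Supplemental Information and may differ. The function name, the payoff values, and the two learning rates are hypothetical and chosen only for illustration.

```python
def delta_rule_update(expected, outcome, learning_rate):
    """Move the expected payoff toward the observed outcome.

    The prediction error (outcome - expected) is scaled by the learning
    rate, so a rate near 1 makes the new expectation depend almost
    entirely on the most recent outcome.
    """
    return expected + learning_rate * (outcome - expected)


# Hypothetical situation: a machine expected to pay 70 points
# delivers a single bad outcome of 30 points.
expected_before = 70.0
bad_outcome = 30.0

# Hypothetical learning rates, one high and one low.
after_high_rate = delta_rule_update(expected_before, bad_outcome, learning_rate=0.9)
after_low_rate = delta_rule_update(expected_before, bad_outcome, learning_rate=0.1)

print("expected payoff after one bad outcome (high rate):", after_high_rate)
print("expected payoff after one bad outcome (low rate): ", after_low_rate)
```

With the high learning rate, one bad outcome removes most of the machine's expected payoff in a single trial, which would be consistent with an abrupt drop in utility; with the low rate, the same outcome lowers the expectation only slightly, so several bad outcomes would be needed before utility falls by a comparable amount.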

Because the evidence for a close relationship between pupil diameter and LC activity is currently limited (Nieuwenhuis et al., 2011; Gilzenrat et al., 2010; Rajkowski et al., 1993), more neurophysiological studies are needed to further establish this relationship. In addition, the neural mechanism underlying this putative relationship remains to be determined. To date, there are no known direct connections from the LC to the autonomic centers that regulate pupil size. It is more likely that pupil diameter and LC activity are closely linked because they receive downstream influences from a common afferent source. This common afferent might be the paragigantocellularis (PGi) nucleus of the ventral medulla, which plays a pivotal role in controlling both the LC and the sympathetic axis of the autonomic nervous system (Nieuwenhuis et al., 2011; Aston-Jones, Ennis, Pieribone, Nickell, & Shipley, 1986). The notion that the LC and the autonomic nervous system receive their major input from a common source is consistent with several findings that suggest a strong temporal correlation between LC–NE activity and sympathetic nervous system activity (Abercrombie & Jacobs, 1987; Elam, Svensson, & Thorén, 1986; Reiner, 1986). Anatomical studies have revealed widespread afferents to the PGi from numerous brain areas, including medial prefrontal cortex, insula, hypothalamus, and periaqueductal gray, suggesting that activity in these areas might influence pupil diameter by way of the PGi (Aston-Jones et al., 1986). Consistent with this possibility, fMRI studies in humans and single-cell stimulation/recording studies in animals have shown that activity in this afferent network (including prefrontal cortex) is related to changes in pupil diameter (Critchley, Tang, Glaser, Butterworth, & Dolan, 2005; Siegle, Steinhauer, Stenger, Konecky, & Carter, 2003; Loewenfeld, 1993).

Although our study focused on a possible role of the LC–NE system, it is unlikely that this is the only brain system involved in regulating the exploration–exploitation trade-off. There is some evidence that the dopamine system also influences levels of exploration or task (dis)engagement (Frank, Doll, Oas-Terpstra, & Moreno, 2009; Dreisbach et al., 2005). For example, in one study, individuals with high spontaneous eyeblink rates (a marker of central dopaminergic activity) showed enhanced cognitive flexibility, as measured by the tendency to disengage from previously task-relevant stimuli and orient to novel stimuli (Dreisbach et al., 2005). Furthermore, this effect was modulated by the D4 dopamine receptor gene polymorphism. Another study reported that the val158met polymorphism of COMT, a gene that substantially affects prefrontal dopamine levels, could account for individual differences in uncertainty-based exploration (Frank et al., 2009). In addition to other neuromodulator systems, recent studies have implicated fronto-polar cortex in the control of exploratory behaviors (Bourdaud, Chavarriaga, Galan, & Millán, 2008; Daw et al., 2006), although the specific computations performed by fronto-polar cortex in this context are a topic of ongoing debate (Boorman, Behrens, Woolrich, & Rushworth, 2009). A key objective for future research is to specify the distinct contributions and interactions of the dopamine and LC–NE systems and prefrontal cortex in the regulation of the exploration–exploitation trade-off.

Our experimental design enabled examination of the baseline pupil diameter but, due to the overlap of the decision and outcome processing, did not allow examination of the decision-induced pupil dilation. Hence, the hypotheses we tested were restricted to the adaptive gain theory's assumptions about tonic LC activity. To provide complementary data with regard to phasic LC responses, an important aim for future studies is to use a task in which the decision and outcome presentation are separated in time such that the pupil dilations associated with these two processes can be isolated.

The present study tested specific predictions regarding the relationship between pupil diameter, utility measures, and choice strategy. These predictions were based on a mechanistic theory about the role of the LC–NE system in regulating control state, and on preliminary evidence from previous studies for a close relationship between LC activity and pupil diameter. Given the specificity, and therefore the intrinsic unlikelihood, of our predictions, the fact that the predicted effects were observed lends provisional support to the hypotheses that drove the predictions. However, because this is an inductive argument, it is important to note that we cannot rule out the possibility that the observed relationships were unrelated to LC-mediated modulation of control state. Thus, future studies using more direct measures or manipulations of the LC–NE system are needed to either confirm or invalidate the conclusions from the present study.

For a long time, the LC–NE system has been associated with basic functions such as arousal and the sleep–wake cycle. Only recently have researchers begun to examine its involvement in more specific cognitive functions, such as attention, memory, perceptual selection, and the signaling of unexpected uncertainty (Sara, 2009; Einhäuser, Stout, Koch, & Carter, 2008; Yu & Dayan, 2005; Robbins, 1997). The present study contributes to this work by addressing, albeit indirectly, the role of the LC–NE system in the control of human behavior. Specifically, the findings reported here support the adaptive gain theory (Aston-Jones & Cohen, 2005), which posits an important role for the LC–NE system in the optimization of behavioral performance by regulating the balance between exploitative and exploratory control states.

Acknowledgments

This research was supported by the Netherlands Organization for Scientific Research. We thank Nathaniel Daw and John O'Doherty for the four-armed bandit task script, Eric-Jan Wagenmakers for his help with the model analyses, Henk van Steenbergen and Rinus Verdonschot for their technical assistance, and four anonymous reviewers for valuable comments on an earlier version of this manuscript.

Reprint requests should be sent to Marieke Jepma, Department of Psychology, Cognitive Psychology Unit, Leiden University, Wassenaarseweg 52, 2333 AK Leiden, the Netherlands, or via e-mail: mjepma@fsw.leidenuniv.nl.

Note

1. Whereas we discuss the phasic and tonic LC modes as distinct, they likely represent the extremes of a continuum of function. When we refer to the phasic or tonic LC mode, we mean a more phasic or tonic LC mode, not necessarily the extremes of the continuum.

REFERENCES

Abercrombie, E. D., & Jacobs, B. L. (1987). Single-unit response of noradrenergic neurons in the locus coeruleus of freely moving cats: II. Adaptation to chronically presented stressful stimuli. Journal of Neuroscience, 7, 2844–2848.

Aston-Jones, G., & Cohen, J. D. (2005). An integrative theory of locus coeruleus–norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28, 403–450.

Aston-Jones, G., Ennis, M., Pieribone, V. A., Nickell, W. T., & Shipley, M. T. (1986). The brain nucleus locus coeruleus: Restricted afferent control of a broad efferent network. Science, 234, 734–737.

Aston-Jones, G., Rajkowski, J., & Kubiak, P. (1997). Conditioned responses of monkey locus coeruleus neurons anticipate acquisition of discriminative behavior in a vigilance task. Neuroscience, 80, 697–715.

Aston-Jones, G., Rajkowski, J., Lu, W., Zhu, Y., Cohen, J. D., & Morecraft, R. J. (2002). Prominent projections from the orbital prefrontal cortex to the locus coeruleus in monkey. Society for Neuroscience Abstracts, 28, 86–89.

Berridge, C. W., & Waterhouse, B. D. (2003). The locus coeruleus–noradrenergic system: Modulation of behavioral state and state-dependent cognitive processes. Brain Research, Brain Research Reviews, 42, 33–84.

Boorman, E. D., Behrens, T. E., Woolrich, M. W., & Rushworth, M. F. (2009). How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron, 62, 733–743.

Botvinick, M. M. (2007). Conflict monitoring and decision making: Reconciling two perspectives on anterior cingulate function. Cognitive, Affective & Behavioral Neuroscience, 7, 356–366.

Bourdaud, N., Chavarriaga, R., Galan, F., & Millán, J. D. R. (2008). Characterizing the EEG correlates of exploratory behavior. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 16, 549–556.

Chamberlain, S. R., Müller, U., Blackwell, A. D., Clark, L., Robbins, T. W., & Sahakian, B. J. (2006). Neurochemical modulation of response inhibition and probabilistic learning in humans. Science, 311, 861–863.

Cohen, J. D., McClure, S. M., & Yu, A. J. (2007). Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 362, 933–942.

Critchley, H. D., Tang, J., Glaser, D., Butterworth, B., & Dolan, R. J. (2005). Anterior cingulate activity during error and autonomic response. Neuroimage, 27, 885–895.

Daw, N. D., O'Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441, 876–879.

Devauges, V., & Sara, S. J. (1990). Activation of the noradrenergic system facilitates an attentional shift in the rat. Behavioural Brain Research, 39, 19–28.

Doya, K. (2002). Metalearning and neuromodulation. Neural Networks, 15, 495–506.

Dreisbach, G., Müller, J., Goschke, T., Strobel, A., Schulze, K., Lesch, K. P., et al. (2005). Dopamine and cognitive control: The influence of spontaneous eyeblink rate and dopamine gene polymorphisms on perseveration and distractibility. Behavioral Neuroscience, 119, 483–490.

Einhäuser, W., Stout, J., Koch, C., & Carter, O. (2008). Pupil dilation reflects perceptual selection and predicts subsequent stability in perceptual rivalry. Proceedings of the National Academy of Sciences, U.S.A., 105, 1704–1709.

Elam, M., Svensson, T. H., & Thorén, P. (1986). Locus coeruleus neurons and sympathetic nerves: Activation by cutaneous sensory afferents. Brain Research, 366, 254–261.

Frank, M. J., Doll, B. B., Oas-Terpstra, J., & Moreno, F. (2009). Prefrontal and striatal dopaminergic genes predict individual differences in exploration and exploitation. Nature Neuroscience, 12, 1062–1068.

Gilzenrat, M. S., Nieuwenhuis, S., Jepma, M., & Cohen, J. D. (2010). Pupil diameter tracks changes in control state predicted by the adaptive gain theory of locus coeruleus function. Cognitive, Affective & Behavioral Neuroscience, 10, 252–269.

Janisse, M. P. (1977). Pupillometry: The psychology of the pupillary response. Washington, D.C.: Hemisphere Publishing Co.

Lapiz, M. D., Bondi, C. O., & Morilak, D. A. (2007). Chronic treatment with desipramine improves cognitive performance of rats in an attentional set-shifting test. Neuropsychopharmacology (Berlin, Germany), 32, 1000–1010.

Lapiz, M. D., & Morilak, D. A. (2006). Noradrenergic modulation of cognitive function in rat medial prefrontal cortex as measured by attentional set shifting capability. Neuroscience, 137, 1039–1049.

Loewenfeld, I. (1993). The pupil: Anatomy, physiology, and clinical applications. Detroit: Wayne State University Press.

Lorch, R. F., & Myers, J. L. (1990). Regression-analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 149–157.

McClure, S., Gilzenrat, M., & Cohen, J. (2005). An exploration–exploitation model based on norepinepherine and dopamine activity. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems (Vol. 18, pp. 867–874). Cambridge, MA: MIT Press.

Nieuwenhuis, S., de Geus, E. J., & Aston-Jones, G. (2011). The anatomical and functional relationship between the P3 and autonomic components of the orienting response. Psychophysiology, 48, 162–175.

Rajkowski, J., Kubiak, P., & Aston-Jones, G. (1993). Correlations between locus coeruleus (LC) neural activity, pupil diameter and behavior in monkey support a role of LC in attention. Society for Neuroscience Abstracts, 19, 974.

Rajkowski, J., Kubiak, P., & Aston-Jones, G. (1994). Locus coeruleus activity in monkey: Phasic and tonic changes are associated with altered vigilance. Brain Research Bulletin, 35, 607–616.

Rajkowski, J., Lu, W., Zhu, Y., Cohen, J. D., & Aston-Jones, G. (2000). Prominent projections from the anterior cingulate cortex to the locus coeruleus (LC) in rhesus monkey. Society for Neuroscience Abstracts, 26, 2230.

Reiner, P. B. (1986). Correlational analysis of central noradrenergic neuronal activity and sympathetic tone in behaving cats. Brain Research, 378, 86–96.

Ridderinkhof, K. R., Ullsperger, M., Crone, E. A., & Nieuwenhuis, S. (2004). The role of the medial frontal cortex in cognitive control. Science, 306, 443–447.

Robbins, T. W. (1997). Arousal systems and attentional processes. Biological Psychology, 45, 57–71.

Sara, S. J. (2009). The locus coeruleus and noradrenergic modulation of cognition. Nature Reviews Neuroscience, 10, 211–223.

Satterthwaite, T. D., Green, L., Myerson, J., Parker, J., Ramaratnam, M., & Buckner, R. L. (2007). Dissociable but inter-related systems of cognitive control and reward during decision making: Evidence from pupillometry and event-related fMRI. Neuroimage, 37, 1017–1031.

Servan-Schreiber, D., Printz, H., & Cohen, J. D. (1990). A network model of catecholamine effects: Gain, signal-to-noise ratio, and behavior. Science, 249, 892–895.

Seu, E., Lang, A., Rivera, R. J., & Jentsch, J. D. (2008). Inhibition of the norepinephrine transporter improves behavioral flexibility in rats and monkeys. Psychopharmacology (Berlin, Germany), 202, 505–519.

Siegle, G. J., Steinhauer, S. R., Stenger, V. A., Konecky, R., & Carter, C. S. (2003). Use of concurrent pupil dilation assessment to inform interpretation and analysis of fMRI data. Neuroimage, 20, 114–124.

Steere, J. C., & Arnsten, A. F. (1997). The alpha-2A noradrenergic receptor agonist guanfacine improves visual object discrimination reversal performance in aged rhesus monkeys. Behavioral Neuroscience, 111, 883–891.

Tse, W. S., & Bond, A. J. (2002). Difference in serotonergic and noradrenergic regulation of human social behaviours. Psychopharmacology (Berlin, Germany), 159, 216–221.

Usher, M., Cohen, J. D., Servan-Schreiber, D., Rajkowski, J., & Aston-Jones, G. (1999). The role of locus coeruleus in the regulation of cognitive performance. Science, 283, 549–554.

Yu, A. J., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46, 681–692.

Zhu, Y., Iba, M., Rajkowski, J., & Aston-Jones, G. (2004). Projection from the orbitofrontal cortex to the locus coeruleus in monkeys revealed by anterograde tracing. Society for Neuroscience Abstracts, 30, 211–213.