The ability to adaptively shift between exploration and exploitation control states is critical for optimizing behavioral performance. Converging evidence from primate electrophysiology and computational neural modeling has suggested that this ability may be mediated by the broad norepinephrine projections emanating from the locus coeruleus (LC) [Aston-Jones, G., & Cohen, J. D. An integrative theory of locus coeruleus-norepinephrine function: Adaptive gain and optimal performance. Annual Review of Neuroscience, 28, 403–450, 2005]. There is also evidence that pupil diameter covaries systematically with LC activity. Although imperfect and indirect, this link makes pupillometry a useful tool for studying the locus coeruleus norepinephrine system in humans and in high-level tasks. Here, we present a novel paradigm that examines how the pupillary response during exploration and exploitation covaries with individual differences in fluid intelligence during analogical reasoning on Raven's Advanced Progressive Matrices. Pupillometry was used as a noninvasive proxy for LC activity, and concurrent think-aloud verbal protocols were used to identify exploratory and exploitative solution periods. This novel combination of pupillometry and verbal protocols from 40 participants revealed a decrease in pupil diameter during exploitation and an increase during exploration. The temporal dynamics of the pupillary response were characterized by a steep increase during the transition to exploratory periods, sustained dilation for many seconds afterward, and a gradual return to baseline. Moreover, the individual differences in the relative magnitude of pupillary dilation accounted for 16% of the variance in Advanced Progressive Matrices scores.
Assuming that pupil diameter is a valid index of LC activity, these results establish promising preliminary connections between the literature on locus coeruleus norepinephrine-mediated cognitive control and the literature on analogical reasoning and fluid intelligence.
The ability to adaptively regulate the balance between exploration and exploitation is critical for optimizing behavior in the diverse, dynamic environments we encounter on a daily basis. Despite the ubiquity of the exploration–exploitation trade-off and its broad importance in understanding executive control, the neural mechanisms involved are still not well understood (Cohen, McClure, & Yu, 2007; Berridge & Waterhouse, 2003). Recent animal and human studies suggest that the locus coeruleus norepinephrine (LC-NE) system may support the exploration–exploitation trade-off, but the work is limited by the use of low-level tasks (Jepma & Nieuwenhuis, 2011; Gilzenrat, Nieuwenhuis, Jepma, & Cohen, 2010; Aston-Jones & Cohen, 2005). Here, we present a unique paradigm that examines how the pupillary response during exploration and exploitation covaries with individual differences in fluid intelligence (Gf) by combining pupillometry and verbal protocol analysis during analogical reasoning on Raven's Advanced Progressive Matrices (APM; Raven, Raven, & Court, 1998). Expanding the study of the exploration–exploitation trade-off to a high-level analogical reasoning task and employing an individual differences approach provided novel insights into the relationship between the exploration–exploitation trade-off, noradrenergic function, and individual differences in Gf.
Converging evidence suggests that the LC-NE system plays a central role in mediating the exploration–exploitation trade-off (Cohen et al., 2007; Aston-Jones & Cohen, 2005). Specifically, the LC-NE system is thought to monitor for unexpected uncertainty and actively mediate the shift between exploration and exploitation in response to reward history (Aston-Jones & Cohen, 2005). Much of the current theory of LC-NE function is based on monkey electrophysiological recordings (see Aston-Jones & Cohen, 2005, for a review) and computational models thereof (Brown, Gilzenrat, & Cohen, 2005; Usher, Cohen, Servan-Schreiber, Rajkowski, & Aston-Jones, 1999). The electrophysiological data suggested that locus coeruleus (LC) neurons in monkeys exhibit distinct firing patterns that lie along a continuum of task performance from offline to phasic to tonic modes.
Offline mode occurs when the animal is drowsy or actively sedated and is characterized by low levels of LC activity and poor task performance. In phasic mode, the baseline firing rate remains low, but the LC neurons fire phasic bursts of activity synchronized to task-relevant events; the animals exhibit high task performance. Finally, the tonic mode is characterized by high baseline rates of LC firing, poor task performance, exploratory behaviors, and indiscriminate sensitivity to both task-related and task-unrelated stimuli. These electrophysiological findings have been incorporated into a broader theory of LC function in which the LC mediates the exploration–exploitation trade-off in response to online assessments of task utility (adaptive gain theory, AGT; Aston-Jones & Cohen, 2005). AGT postulates that the LC actively mediates the gain of cortical units through the release of norepinephrine to promote exploitation via phasic mode or exploration via tonic mode (Aston-Jones & Cohen, 2005). Although AGT provides an elegant account of the existing data, it is limited by a paucity of corroborative human studies to validate and test the theorized link between LC function and the exploration–exploitation trade-off.
One major obstacle to studying LC function in humans is identifying a noninvasive method for measuring LC activity. Recently, pupil diameter has emerged as one promising noninvasive measure for LC activity and is being increasingly employed for this purpose (e.g., Cheadle et al., 2014; Eldar, Cohen, & Niv, 2013; Jepma & Nieuwenhuis, 2011; Einhäuser, Koch, & Carter, 2010; Gilzenrat et al., 2010; Einhäuser, Stout, Koch, & Carter, 2008). Neuroimaging work has shown that the pupil diameter covaries with fMRI BOLD activity in the LC (Murphy, O'Connell, O'Sullivan, Robertson, & Balsters, 2014) and that both P3 ERP and pupil diameter are sensitive to LC-NE modes of task engagement (Cheadle et al., 2014; Murphy, Robertson, Balsters, & O'Connell, 2011). Converging evidence from electrophysiology (e.g., Rajkowski, Kubiak, & Aston-Jones, 1994) and pharmacology (e.g., Phillips, Szabadi, & Bradshaw, 2000; Koss, 1986) also suggests that pupil diameter correlates with LC activity in animals. The anatomical pathways linking the LC and the pupil are a topic of ongoing research but probably involve α2-adrenoreceptor-mediated inhibition of the parasympathetic Edinger–Westphal nucleus responsible for pupil constriction (Samuels & Szabadi, 2008).
Here, we extend the study of the exploration–exploitation trade-off and LC function by tracking real-time shifts in exploration and exploitation in a rich, temporally extended analogical reasoning task, Raven's APM (Raven et al., 1998). The APM is a geometric analogy test with excellent psychometric properties (Brouwers, Van de Vijver, & Van Hemert, 2009) that has been a popular and trusted instrument in psychology for 70 years (e.g., Hayes, Petrov, & Sederberg, 2011, 2015; Gray, Chabris, & Braver, 2003; Carpenter, Just, & Shell, 1990). Raven's APM is an excellent environment to induce strong shifts between exploration and exploitation because it repeatedly places participants in an unfamiliar relational environment in which they must engage in relational foraging to attempt to construct and/or identify the correct answer.
Although the term “relational foraging” is not commonly used in the literature on relational reasoning and Gf, it is pertinent to geometric analogy problems such as Raven's APM (Figure 1, left). The problem-solving process is a search of an abstract problem space and involves the formulation of hypotheses and subgoals that either succeed or fail (e.g., Taatgen, Huss, Dickison, & Anderson, 2008; Newell & Simon, 1976). There are deep structural parallels, although not strict isomorphism, with reinforcement learning concepts such as reward, punishment, exploration, and exploitation. Indeed, reinforcement learning algorithms are commonly used to learn the utilities of production rules in production systems (e.g., Taatgen, 2013; Anderson et al., 2004). These structural similarities provide connections with independently developed theories of LC-NE function.
Specifically in Raven's APM, the relational foraging process consists of an extended series of goals and subgoals, in which hypothesized patterns are extracted from one part of the problem matrix and tested on others. If the hypothesized pattern generalizes to another part of the matrix, the current pattern receives reinforcement. If the test fails, then a new pattern hypothesis must be generated (Carpenter et al., 1990). This iterative strategy is used to extract all the relations contained in a given Raven item or as many relations as needed to narrow the number of possible responses. This process has been formalized in models of matrix reasoning (Lovett, Tomai, Forbus, & Usher, 2009; Carpenter et al., 1990).
Importantly, for our present purposes, the relational foraging process maps onto key aspects of AGT. Each Raven problem is a miniature environment with a precise definition of optimal performance (i.e., pinpointing the item that best completes the relational pattern), fluctuations in task utility over time that result from testing hypothesized patterns, and implicit reinforcement received when hypothesized patterns generalize or fail to generalize. Last but not least, the use of Raven's APM instead of simpler reinforcement learning tasks is methodologically beneficial for a pupillometric study because it produces more extended periods of exploration and exploitation, which are better suited to the relatively low temporal resolution of the pupillary response.
In the current study, pupil diameter was recorded as an indirect proxy for LC activity on each trial. Periods of exploration and exploitation were identified with the aid of think-aloud verbal protocols (Ericsson & Simon, 1993) that were collected while the participants solved each Raven problem. The results revealed a decrease in pupillary response during exploitative periods and a significant increase during exploratory periods.
This pattern is consistent with prominent theories of LC-NE system function and provides the first evidence that this system may be involved in cognitive control of the exploration–exploitation trade-off during analogical reasoning. Moreover, the individual differences in the exploratory pupillary dilation could account for 16% of the variation in APM scores across participants.
This study was conducted in a larger context of several related experiments that combined think-aloud verbal protocols and eye tracking to investigate the role of strategic cognitive control during visual relational reasoning on Raven's APM (Hayes, 2015; Hayes et al., 2011, 2015). These experiments involved multiple sessions and various manipulations from Session 2 onward, but the first session was always the same: To establish a common baseline, eye-tracking and think-aloud protocols were collected while the participants worked on 14 Raven problems as detailed below. This study is based exclusively on data from this common baseline session. Although other aspects of this large and multifaceted data set have been published elsewhere (Hayes et al., 2011, 2015), the pupillometric and verbal protocol aspects are reported here for the first time.
One hundred thirty-six students at the Ohio State University participated in the experiments outlined above (Hayes, 2015). They responded to recruitment flyers posted in the Ohio State University Psychology Building and were paid $6 per hour plus $1 bonus for each correct answer to a Raven problem. Sixteen participants did not consistently provide think-aloud protocols throughout each trial and were excluded from further consideration for the present purposes. Because of the labor-intensive nature of verbal protocol preprocessing, only 40 sessions' worth of verbal protocols were coded and analyzed. Thus, all results reported below are based on a random stratified sample of 40 participants (20 women and 20 men).
The distribution of Gf scores in the large sample (N = 120) was partitioned into four ability groups as follows: high (APM scores of 13–14), medium-high (scores of 11–12), medium-low (scores of 8–10), and low (scores of ≤7) abilities. Ten participants were then drawn at random from each ability group, and the verbal protocols and pupillometric data from their first (baseline) session were processed.
The participants completed a short-form test from Raven's APM Set II (Raven et al., 1998). Participants completed either Items 2, 4, 6, 9, 10, 11, 16, 17, 19, 21, 23, 24, 26, and 29 or Items 1, 3, 5, 7, 12, 13, 14, 15, 18, 20, 22, 25, 27, and 28. The participant instructions followed the Raven APM Manual guidelines for untimed individual test administration (Raven et al., 1998). The two 14-item subsets of the complete (36-item) APM were chosen to be approximately matched for difficulty on the basis of their psychometric characteristics published in the manual (Raven et al., 1998). There were no statistically significant differences in the respective distributions of scores in our sample (Hayes, 2015).
The Raven items were presented on a 21-in. NEC AccuSync 120 color CRT using Experiment Builder (SR Research, Mississauga, Canada). Participants viewed the items binocularly from a chin-and-forehead rest located 935 mm away. Gaze position and pupil response data were recorded from the left eye using an EyeLink 1000 desktop eye tracker (SR Research) at a sampling rate of 250 Hz. The experimental room had a constant ambient illuminance with 25 lux incident at participants' eyes to control for the pupillary light reflex. Image analysis of the Raven APM items revealed high luminance consistency across the 28 Raven test items (grayscale intensity: M = 0.96, SD = 0.02) and across the individual matrix and response cells within each item (grayscale intensity: M = 0.90, SD = 0.04). Therefore, we did not alter the luminance properties of the original Raven APM test images, preserving their original psychometric properties.
Verbal protocols were recorded for each Raven item using a Shure Beta 58A supercardioid dynamic microphone (Shure, Inc., Niles, IL) and E-MU 0202 audio interface (E-MU, Scotts Valley, CA) controlled via Experiment Builder (SR Research). The microphone was placed close to the participant's mouth (≈5 cm) using a telescoping boom tripod microphone stand to provide clear audio recordings. The concurrent think-aloud verbal protocols were collected according to standard think-aloud procedures (Ericsson & Simon, 1993). After the participants received instructions on thinking aloud, they practiced it on unrelated items such as multiplication problems until the experimenter was confident they understood the instructions.
Before the study, participants completed the EyeLink 1000 9-point calibration procedure. Each Raven item was preceded by a beep and fixation cross (similar to the EyeLink 1000 9-point calibration procedure) that appeared in the middle of the screen (Figure 1, right). The fixation screen was equal in luminance to the subsequent Raven item to avoid luminance changes at stimulus onset. After the participant fixated for 1 sec, which allowed for equipment recalibration, the Raven problem appeared, and the participant had unlimited time to work on it. Once participants had chosen an answer, they used the mouse to click on one of the eight responses, thereby ending the trial. Moving the mouse out of the fixation box triggered an isoluminant mask to be drawn over the problem matrix, which delineated solution and response phases. No accuracy feedback was provided until the very end of the experimental session to avoid feedback-induced pupillary dilations. Accuracy and solution time data were collected for each trial. Accuracy was defined as the total number of Raven items answered correctly, and solution time was measured from stimulus onset until response selection. Pupil diameter and gaze position were recorded throughout (i.e., from pretrial fixation through the end of the intertrial interval).
Pupil Data Preprocessing
Before analysis, the pupillary data were corrected for blink artifacts and pupil foreshortening error. Following standard procedures, pupillary measurements were first filtered for blink artifacts, linearly interpolated, and then smoothed for measurement noise (Klingner, 2010; Beatty & Lucero-Wagoner, 2000). In addition, the pupil data were corrected to account for pupil foreshortening error—the systematic foreshortening of the pupil image as the eye rotates away from the eye-tracking camera (Hayes & Petrov, 2015). Pupil foreshortening error must be corrected before analysis because solving Raven items requires the participants to freely scan the screen. The pupil foreshortening error correction described in detail in Hayes and Petrov (2015) fits a geometric model that expresses the pupil foreshortening as a function of the cosine of the angle between the eye-to-camera axis and the eye-to-stimulus axis. In calibration studies with artificial eyes with known, fixed pupil diameters (Hayes & Petrov, 2015), the geometric correction successfully reduced the root mean squared error in pupil diameter estimates by 97.5% when the model parameters were optimized to fit the empirical error surface. The calibration results strongly indicated that the pupil foreshortening error is invariant across changes in pupil size and systematically varies as a function of the orientation of the eye with respect to the camera. The results corresponded well with previous empirical measurements of pupil foreshortening error in biological human eyes (Mathur, Gehrmann, & Atchison, 2013; Jennings & Charman, 1978; Jay, 1962; Spring & Stiles, 1948). Together, these findings suggest the geometric correction can be used to virtually eliminate pupil foreshortening error. In this study, we first performed artifact correction to measure the apparent pupil diameter and then applied the optimized geometric model correction from Hayes and Petrov (2015) to estimate the true pupil diameter.
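The artifact-correction steps described above (blink filtering, linear interpolation, smoothing) can be illustrated with a minimal Python sketch. It assumes that blink samples have already been flagged as `None`, uses a simple moving average as a stand-in for whatever smoothing filter was actually applied, and the function names are hypothetical. The last helper only computes the cosine of the angle between the eye-to-camera and eye-to-stimulus axes; the fitted geometric model of Hayes and Petrov (2015) that maps this cosine to a corrected diameter is not reproduced here.

```python
import math


def interpolate_blinks(samples):
    """Fill missing samples (None, e.g. blink gaps) by linear interpolation.

    Gaps at the very start or end are filled with the nearest valid sample.
    """
    out = list(samples)
    valid = [i for i, v in enumerate(out) if v is not None]
    if not valid:
        raise ValueError("no valid pupil samples to interpolate from")
    for i in range(valid[0]):                  # leading gap
        out[i] = out[valid[0]]
    for i in range(valid[-1] + 1, len(out)):   # trailing gap
        out[i] = out[valid[-1]]
    for left, right in zip(valid, valid[1:]):  # interior gaps
        gap = right - left
        for k in range(1, gap):
            out[left + k] = out[left] + (out[right] - out[left]) * k / gap
    return out


def moving_average(samples, window=5):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    n = len(samples)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed


def cos_viewing_angle(eye, camera, stimulus):
    """Cosine of the angle between the eye-to-camera and eye-to-stimulus axes.

    All three arguments are (x, y, z) positions in the same coordinate frame.
    """
    a = [c - e for c, e in zip(camera, eye)]
    b = [s - e for s, e in zip(stimulus, eye)]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, `interpolate_blinks([4.0, None, None, 7.0])` fills the two missing samples on the straight line between 4.0 and 7.0, and the result can then be passed to `moving_average`.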
Verbal Protocol Coding
Concurrent think-aloud verbal protocols were used to segment each trial into exploration and exploitation solution periods. A broad coding scheme was developed to assist the coder in identifying exploration and exploitation periods during each Raven trial. Exploration periods were indicated by utterances that described isolated Raven image features (e.g., “Alright, it looks like we have a bunch of circles and squares….”) or expressed uncertainty (“Not sure what is going on here…I don't see any patterns yet.”). Exploitation periods were indicated by utterances that described a specific pattern within the Raven item (e.g., “In each line it looks like we have circle and diamond and square…. So on this we have square…circle…So the bottom should be a diamond.”). Many transitions from exploration to exploitation were signaled by insight language (“Oh, I see!”). Early Raven items that were easier to solve often only contained one exploration-to-exploitation shift, whereas later more difficult Raven items contained multiple transitions between exploring and exploiting. On these more difficult items, subsequent transitions from exploiting back to exploring were preceded by participants realizing either that a pattern extracted on one row of the Raven problem matrix did not generalize to the subsequent rows or that there was no response option that matched the final solution they had in mind. In these cases, the return to exploration was signaled by failure utterances (e.g., “But that doesn't match the second row” or “ok, it doesn't look like that is even one of the possible options”) followed by a transition back to uncertainty utterances and/or isolated feature descriptions.
A semiautomated coding routine was developed in MATLAB (The MathWorks, Natick, MA) and was used to code all verbal protocol data. In this routine, for each trial, the human expert coder would be presented with an image of the relevant APM item while the recorded verbal protocol audio was played back in real time. The coder served as an “exploration detector” pressing one key to indicate the beginning of an exploratory period and another key to indicate the end of an exploratory period and the beginning of an exploitative period. The beginning of a trial was coded as neutral before any key presses. The MATLAB routine would then convert the time-stamped key presses into a code stream that contained the neutral (0), exploratory (+1), and exploitative (−1) codes for that verbal protocol, each sampled at 250 Hz. This procedure was completed for all participants (n = 40) and trials (n = 14), resulting in 560 individual protocol code streams.
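The conversion from time-stamped key presses to a sampled code stream can be sketched as follows. This is an illustrative Python reconstruction, not the MATLAB routine used in the study; the function names and the `(time, code)` input format are assumptions. Samples before the first key press stay neutral (0), and each key press sets the code (+1 exploratory, −1 exploitative) until the next key press or the end of the trial.

```python
def build_code_stream(key_presses, trial_duration_s, rate_hz=250):
    """Expand sorted (time_s, code) key presses into a per-sample code stream.

    code is +1 (exploratory) or -1 (exploitative); samples before the first
    key press are left at 0 (neutral).
    """
    n = int(round(trial_duration_s * rate_hz))
    stream = [0] * n
    for idx, (t, code) in enumerate(key_presses):
        start = min(n, int(round(t * rate_hz)))
        if idx + 1 < len(key_presses):
            end = min(n, int(round(key_presses[idx + 1][0] * rate_hz)))
        else:
            end = n  # last code runs to the end of the trial
        stream[start:end] = [code] * (end - start)
    return stream


def segments(stream):
    """Run-length encode a code stream into (code, start, end) triples.

    start is inclusive and end is exclusive, both in samples.
    """
    runs = []
    start = 0
    for i in range(1, len(stream) + 1):
        if i == len(stream) or stream[i] != stream[start]:
            runs.append((stream[start], start, i))
            start = i
    return runs
```

With key presses at 0.4 s (explore) and 1.0 s (exploit) in a 2-s trial, `segments(build_code_stream([(0.4, 1), (1.0, -1)], 2.0))` yields one neutral, one exploratory, and one exploitative run.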
Recent studies that have examined the effect of thinking aloud relative to silent control conditions have not shown any pupillary effect of vocalization (Hertzum & Holmegaard, 2013; Kammerer & Gerjets, 2013). Therefore, no distinction was made in the verbal protocol coding between periods of vocalization and gaps in vocalization.
All coding was performed by the first author (T. R. H.). He did not have access to any pupillometric data while he was coding the verbal protocols. Coder reliability was assessed by coding the data from five randomly sampled participants twice. The recoding was done approximately 1 full year after the original coding. The intrarater reliability for T. R. H. across the two coding sessions was high (mean % agreement = 82.16, 95% CI [80.17, 84.15]). This suggests that the coding scheme was applied consistently.
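A percent-agreement score between two codings of the same trial could be computed as in the sketch below. The text does not state the unit of comparison, so the sample-by-sample comparison of two equal-length code streams is an assumption, as is the function name.

```python
def percent_agreement(stream_a, stream_b):
    """Percentage of samples on which two code streams carry the same code."""
    if len(stream_a) != len(stream_b):
        raise ValueError("code streams must have the same length")
    if not stream_a:
        raise ValueError("code streams must not be empty")
    matches = sum(a == b for a, b in zip(stream_a, stream_b))
    return 100.0 * matches / len(stream_a)
```

Averaging this score over trials and participants would give a mean percent agreement of the kind reported above.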
Synchronizing Pupil and Verbal Protocol Streams
To synchronize the pupillary response stream with the verbal protocol code stream, three sources of latency were considered: participant latency, coder latency, and pupillary response/LC latency. Participant latency refers to the delay incurred as a participant processes the APM item information and transforms it into an utterance. Participant latency unfortunately cannot be accounted for in our study because it is known to vary across individuals and types of processing steps and, therefore, will invariably add some noise to our data (Ericsson & Simon, 1993). In contrast, coder latency and pupillary response latency could be, and were, accounted for before analysis. Coder response latency refers to the time it takes for the verbal protocol coder to process what they are hearing, decide to switch codes, and then actually press the key on the keyboard. To estimate this value, a random sample of 50 trials was used to compare the coder's key-press time stamps to the original audio time series using audio editing software (Apple, Cupertino, CA). The results showed a coder response latency of approximately 1 sec (M = 1014 msec, SD = 198 msec). Finally, we considered the documented lag between LC activity and the pupillary response. Single-cell studies of LC neurons show that LC activity is tightly linked to stimulus onset, with a lag of only ≈200 msec (Clayton, Rajkowski, Cohen, & Aston-Jones, 2004; Rajkowski, Majczynski, Clayton, & Aston-Jones, 2004). However, the temporal resolution of the pupillary response is much lower than that of LC neurons. The pupil acts as a low-pass filter of LC activity with a lag of approximately 1 sec after stimulus onset (Hayes & Petrov, submitted; van Steenbergen & Band, 2013; Gagl, Hawelka, & Hutzler, 2011). As the coder and pupillary response latencies were approximately equivalent (each about 1 sec), no additional preprocessing was necessary to synchronize the pupil and code streams before analysis.
Segmentation of the Pupillary Data
Finally, the pupillary data were segmented according to the exploratory and exploitative periods obtained from the verbal protocols. First, a baseline pupil diameter was calculated for each segment as follows: The baseline for the first nonneutral (exploratory) segment at the beginning of each trial was computed as the average pupil diameter during the first 500 msec of that segment to provide a more accurate baseline estimate as participants began the trial. The baseline for all subsequent segments until the end of the trial was computed as the average pupil diameter during the last 1000 msec of the immediately preceding segment.
Our main dependent variable is the percent change in pupil diameter (PCPD) relative to the relevant (most recent) baseline. The PCPD is measured in dimensionless units and is invariant with respect to the considerable individual differences in absolute pupil diameter as well as to slow drifts in pupillary tone. Specifically, the PCPD was computed as the task-evoked diameter minus the baseline diameter, divided by the baseline diameter. The mean PCPD was calculated by averaging the PCPD time series within each exploratory or exploitative segment for each participant on each trial (Beatty & Lucero-Wagoner, 2000).
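The baseline rule and the PCPD definition above can be sketched in Python as follows. The function names and the `(start, end)` segment format are hypothetical. Two details are assumptions not stated in the text: when a segment is shorter than the baseline window, the window is clipped to that segment, and the value is returned as a proportion (multiply by 100 to express it as a percentage).

```python
def _mean(values):
    return sum(values) / len(values)


def segment_baseline(pupil, segments, index, rate_hz=250):
    """Baseline diameter for the nonneutral segment at position `index`.

    segments: list of (start, end) sample indices (end exclusive), in trial
    order, covering only the exploratory/exploitative segments.
    """
    start, end = segments[index]
    if index == 0:
        # First segment of the trial: average over its own first 500 ms.
        n = int(round(0.5 * rate_hz))
        window = pupil[start:min(end, start + n)]
    else:
        # Later segments: average over the last 1000 ms of the previous one.
        prev_start, prev_end = segments[index - 1]
        n = int(round(1.0 * rate_hz))
        window = pupil[max(prev_start, prev_end - n):prev_end]
    return _mean(window)


def mean_pcpd(pupil, segments, index, rate_hz=250):
    """Mean of (diameter - baseline) / baseline over the segment's samples."""
    baseline = segment_baseline(pupil, segments, index, rate_hz)
    start, end = segments[index]
    return _mean([(d - baseline) / baseline for d in pupil[start:end]])
```

As a toy example at a 10-Hz sampling rate, a trace that sits at 4.0 for the first segment and at 5.0 for the second gives a mean PCPD of 0.25 (a 25% dilation) for the second segment.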
The accuracy and RT data replicated well-documented patterns in the literature on Raven's APM (e.g., Bors & Vigneau, 2003; Carpenter et al., 1990). There were substantial individual differences in overall APM scores, and the trial-by-trial accuracy decreased whereas RTs increased for the later, more difficult problems on the test. The verbal protocols indicated a slightly greater number of exploration than exploitation periods (990 explore, 945 exploit, 560 neutral). This is because the first nonneutral period on a trial was always exploratory, the two types alternated thereafter, and some trials ended in exploration mode. However, the exploitation periods were, on average, longer in duration (exploit: M = 25.9 sec, SD = 25.2 sec; explore: M = 18.7 sec, SD = 14.4 sec; neutral: M = 1.75 sec, SD = 0.9 sec).
A significant boost in mean PCPD was observed during exploration periods relative to exploitation periods (Figure 2, left). A repeated-measures ANOVA with Segment type (explore vs. exploit) as a fixed factor and Participant as a random factor confirmed a strong exploration/exploitation effect on the PCPD (F(1, 39) = 71.9, p < .001, ηp2 = .65). A pair of one-tailed t tests confirmed that the exploration effect was significantly greater than zero (t(39) = 7.59, p < .001, r2 = .59) and the exploitation effect was significantly less than zero (t(39) = −4.90, p < .001, r2 = .38). Under the linking hypothesis that PCPD is a valid index of LC-NE function, which in turn mediates the exploration–exploitation trade-off, this novel finding suggests that this LC-NE mediation also operates during high-level analogical reasoning.
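As a consistency check on the effect sizes reported with these tests, the standard conversions from test statistics to effect sizes can be sketched in Python. The assumption here, which the text does not state explicitly, is that the effect size accompanying the F test is a partial eta squared and that r2 for a t test is t²/(t² + df); the function names are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def r_squared_from_t(t_value, df):
    """Proportion of variance explained, recovered from a t statistic."""
    return t_value ** 2 / (t_value ** 2 + df)
```

Under these assumptions, `partial_eta_squared(71.9, 1, 39)` is approximately 0.648 and `r_squared_from_t(-4.90, 39)` is approximately 0.381, in line with the reported values of 0.65 and .38.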
Furthermore, the mean exploratory PCPD increased linearly as a function of fluid reasoning ability as indexed by the APM. The steady increase is evident in both the group-level (Figure 2, left) and individual-level (Figure 2, right) data. A linear regression with mean exploratory PCPD as the sole predictor accounted for 16% of the variance in individual APM scores (F(1, 38) = 7.05, p = .01, r2 = .16). Under the linking hypothesis outlined above, this additional novel finding suggests that individual differences in the mediation of the exploration–exploitation trade-off may contribute to individual differences in Gf. By contrast, no significant trends were observed in the exploitative pupillary response as a function of fluid ability.
The underlying temporal dynamics of the pupillary response revealed that these patterns in the averaged PCPD data were driven by sustained rather than momentary changes in pupil diameter (Figure 3). Figure 3A shows the grand-mean PCPD time-locked to exploration and exploitation onset.1 The transitions from exploratory to exploitative segments were not associated (on average) with steep changes in pupil diameter. Rather, the exploitative segments exhibited (on average) a slow, steady decrease in pupil diameter, depicted by the white line in Figure 3A. By contrast, the transitions from neutral/exploitative to exploratory segments were accompanied by steep increases in pupil diameter (black line) that began before exploratory language became manifest in the verbal protocols, as can be seen from the positive slope of the black line near the transition boundary in Figure 3A. This suggests that exploration likely preceded verbalization (on average). Because this same period was taken as the baseline for subsequent change in pupil diameter, the PCPD values used in the statistical analyses may underestimate the magnitude of the pupillary dilation during exploration.
Furthermore, the exploratory pupil dilation seemed to persist for many seconds into the exploration period, as depicted in Figure 3 (right). We interpret this sustained pupillary dilation as a marker of a temporally extended exploratory state as opposed to a transient event.
In the behavioral data, we expected that both mean error rate and mean solution time would increase as a function of trial number according to the progressive nature of Raven's test and in agreement with previous findings (e.g., Bors & Vigneau, 2003; Carpenter et al., 1990). A one-tailed Pearson's product–moment correlation test confirmed that trial number accounted for a significant amount of variance in both error rate (t(12) = 8.42, p < .001, r2 = .85) and solution time (t(12) = 9.13, p < .001, r2 = .87).
Given the strong trial effect on error rate and solution time, we tested for a trial difficulty effect on the pupillary response during exploration and exploitation. A trend analysis revealed a significant linear decrease in mean PCPD as a function of within-subject trial number during the exploration periods (F(1, 507) = 41.01, p < .001, ηp2 = .08), whereas no statistically significant trend was detected during the exploitation periods (F(1, 507) = 1.08, p = .299). This negative relationship between the magnitude of the exploratory pupillary dilation, on the one hand, and trial number, on the other, can be attributed to the much longer solution times on later, more difficult trials. Although the exploratory dilation could be sustained, on average, for at least 20 sec (Figure 3B), many exploratory periods were considerably longer on difficult trials, eventually diluting the exploratory PCPD increase.
To investigate this further, Figure 4 presents some basic descriptive statistics about the number and duration of exploratory and exploitative segments. Both quantities increased as a function of trial number (and trial difficulty). The earliest trials typically exhibited only one brief exploration period followed by a single brief exploitation period. On the more difficult middle and late items, however, the participants tended to alternate multiple times between exploring and exploiting. Trend analyses revealed significant linear (F(1, 507) = 193.38, p < .001, ηp2 = .27) and quadratic (F(1, 507) = 21.02, p < .001, ηp2 = .04) trends in the total number of transitions between exploration and exploitation as a function of Trial. Analogous analyses also revealed significant linear and quadratic trends in exploration duration (linear: F(1, 468) = 32.72, p < .001, ηp2 = .06; quadratic: F(1, 468) = 46.82, p < .001, ηp2 = .09) and exploitation duration (linear: F(1, 468) = 18.21, p < .001, ηp2 = .04; quadratic: F(1, 468) = 26.47, p < .001, ηp2 = .05).
Furthermore, there was evidence for interactions between the difficulty of the test items and the Gf of the participants as indexed by their APM scores. Recall that the participants were sampled from four ability groups. Repeated-measures ANOVAs with Group as a between-subject factor and Trial as a within-subject factor showed significant Trial × Group interactions for the number of transitions (F(39, 468) = 1.65, p < .01, ηp2 = .12), exploration duration (F(39, 468) = 1.83, p < .01, ηp2 = .13), exploitation duration (F(39, 468) = 2.22, p < .01, ηp2 = .16), and total solution time (F(39, 468) = 2.37, p < .001, ηp2 = .16). These Ability × Difficulty interactions reflected the differential ability of the participants to engage with the most difficult items (Trials 11–14). High-ability participants would struggle yet work through those difficult items over a lengthy sequence of alternating exploration and exploitation periods, whereas lower ability participants were prone to become overwhelmed, take a guess, and terminate the trial after a comparatively short effort.
These Ability × Difficulty interactions raise a possible alternative explanation for the correlation between exploratory PCPD and ability group depicted in Figure 2A above. It is possible that this correlation might simply be driven by the difference in performance between high- and low-ability participants on the most difficult items. However, the following analysis suggests that this is unlikely. When the mean exploratory PCPD, taken across Trials 1–10 only, excluding the most difficult trials (Trials 11–14), was used as a predictor in a linear regression, it accounted for 20% of the variance in individual APM scores (F(1, 38) = 9.28, p < .01, r² = .20). Recall that the mean exploratory PCPD across all 14 trials accounted for 16% of this variance. Therefore, when the high- and low-ability participants spent the same amount of time exploring, the correlation between exploratory PCPD and APM scores actually increased. This rules out the late trials as a potential confounding factor. If anything, the random guessing on the most difficult trials in the lower ability groups probably adds nonsystematic variance that degrades the correlation.
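The control analysis above is a single-predictor linear regression of APM score on mean exploratory PCPD. A minimal sketch of that kind of fit is given below. All data are synthetic and generated only for illustration, and the function name, the seed, and the simulated effect size are assumptions rather than the values or code used in the study.

```python
import numpy as np


def simple_regression(x, y):
    """Ordinary least-squares fit of y on one predictor x.

    Returns (slope, intercept, r2, f_value), where f_value has
    (1, n - 2) degrees of freedom.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    df_error = x.size - 2
    f_value = r2 / (1.0 - r2) * df_error
    return float(slope), float(intercept), r2, f_value


# Synthetic stand-ins for 40 participants (NOT the study data).
rng = np.random.default_rng(0)
mean_exploratory_pcpd = rng.normal(2.0, 1.0, size=40)
apm_score = 20.0 + 2.0 * mean_exploratory_pcpd + rng.normal(0.0, 4.0, size=40)

slope, intercept, r2, f_value = simple_regression(mean_exploratory_pcpd, apm_score)
print(f"r2 = {r2:.2f}, F(1, {mean_exploratory_pcpd.size - 2}) = {f_value:.2f}")
```

Running the same function once on the Trials 1–10 means and once on the all-trial means would give the two proportions of explained variance that the paragraph compares.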
We also checked the so-called time-on-task effect as a potential confound. Prior studies have found that pupil diameter can decrease systematically during the experimental session (Hayes & Petrov, submitted; Beatty, 1982; Kahneman & Beatty, 1967). It should be noted that these studies used low-level perceptual tasks with subsecond RTs (e.g., vigilance, auditory discrimination, visual-motion discrimination). These monotonous simple tasks are quite different from Raven's APM, which is designed to vary the figural material constantly to measure fluid (as opposed to crystallized) intelligence. It has been suggested that the decreasing pupil size in earlier studies may be a result of decreasing arousal as participants get bored with the task (Laeng, Sirois, & Gredeback, 2012; Beatty & Lucero-Wagoner, 2000). Some participants in our study worked for over 5 min on some of the (difficult) Raven problems, which raises the concern of a within-trial time-on-task effect as a potentially confounding factor in our data. To estimate the magnitude of the time-on-task effect over the course of individual trials, we performed a series of robust linear regressions (Note 2) on the pupillary diameter as a function of the time since each stimulus onset. This produced one slope-parameter estimate per trial. These were averaged across trials to produce one aggregate slope estimate per participant. The latter estimates did not differ significantly from zero (t(39) = 1.18, two-tailed p = .25). In addition, recall that the exclusion of the four longest trials from the analysis only strengthened the pattern in Figure 2A. Being the longest, these trials should be the most vulnerable to a possible time-on-task effect. Overall, this confound does not seem a viable explanation of our results. Apparently, the consistently novel and challenging nature of the Raven task kept our participants engaged throughout each trial and throughout the session as a whole.
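The time-on-task check above has three steps: a robust line fit per trial, an average slope per participant, and a one-sample t test of those averages against zero. The sketch below illustrates that pipeline with iteratively reweighted least squares and a bisquare weighting function. The tuning constant (4.685), the MAD-based scale estimate, the convergence settings, and all function names are assumptions chosen as common defaults, not details taken from the study, and the pupil traces are synthetic.

```python
import numpy as np


def bisquare_irls_slope(t, y, tuning=4.685, max_iter=50, tol=1e-8):
    """Slope of a robust line fit of y on t (IRLS with Tukey bisquare weights)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones_like(t), t])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least-squares start
    for _ in range(max_iter):
        resid = y - X @ beta
        # Robust residual scale: median absolute deviation, rescaled for normal noise.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale < 1e-12:
            break  # essentially a perfect fit; nothing left to reweight
        u = resid / (tuning * scale)
        weights = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
        sqrt_w = np.sqrt(weights)
        new_beta = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return float(beta[1])


def one_sample_t(values, popmean=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    values = np.asarray(values, dtype=float)
    n = values.size
    standard_error = values.std(ddof=1) / np.sqrt(n)
    return float((values.mean() - popmean) / standard_error), n - 1


# Synthetic demonstration only: 40 hypothetical participants x 14 trials of
# drift-free pupil traces with noise and sparse blink-like outliers.
rng = np.random.default_rng(0)
participant_slopes = []
for _participant in range(40):
    trial_slopes = []
    for _trial in range(14):
        n_samples = int(rng.integers(30, 200))
        seconds = np.arange(n_samples) / 10.0  # time since stimulus onset
        pupil = 4.0 + rng.normal(0.0, 0.1, n_samples)
        pupil[rng.random(n_samples) < 0.03] -= 2.0  # simulated artifacts
        trial_slopes.append(bisquare_irls_slope(seconds, pupil))
    participant_slopes.append(np.mean(trial_slopes))

t_stat, df = one_sample_t(participant_slopes)
print(f"t({df}) = {t_stat:.2f}")
```

The bisquare weights shrink toward zero for samples far from the current fit, so isolated artifacts have little influence on the estimated within-trial slope.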
Finally, we checked whether the exploration and exploitation periods differed in terms of missing values in the pupillometric time series and in terms of saccade frequency. Missing values occur when the eye tracker temporarily loses the pupil, for example, because of blink artifacts. Such artifacts were rejected during preprocessing. The artifact frequency in the raw pupil data was similar for exploration (M = 11.18, SD = 12.42 percent of period) and exploitation (M = 10.96, SD = 12.47; t(39) = 0.50, p = .62). Saccade frequency was significantly lower during exploration (M = 10.03, SD = 5.11 percent of period) than during exploitation (M = 11.19, SD = 5.58; t(39) = 6.28, p < .001). However, we are not aware of any studies showing a systematic effect of saccade frequency on the pupillary response, and a difference of roughly 1 percentage point is not likely to account for the large exploration/exploitation effect in our data. Saccades carry a known risk of pupil foreshortening error because they change the gaze position, but this source of systematic error was corrected during preprocessing (Hayes & Petrov, 2015).
A novel combination of pupillometry and verbal protocol analysis was used to compare changes in pupil diameter during exploration and exploitation control states during visual analogy making. The analysis revealed a significant increase in pupil diameter during exploration and decrease during exploitation. This broad finding is the first to generalize theories of the LC-NE system's role in the exploration–exploitation trade-off to a high-level analogical reasoning task such as Raven's APM. More importantly, individual differences in the relative magnitude of exploratory pupillary dilation accounted for 16% of the variance in APM scores. This novel result suggests that individual differences in general Gf may be related to underlying differences in noradrenergic function.
Our findings build upon and are consistent with previous studies of the exploration–exploitation trade-off that monitored the pupillary response as a noninvasive index of the LC-NE system (Jepma & Nieuwenhuis, 2011; Gilzenrat et al., 2010). Gilzenrat et al. (2010) had participants complete an auditory pitch discrimination task in which reward increased as the pitch discrimination difficulty increased, until the discrimination eventually became impossible. Importantly, their participants were allowed the option to escape before each trial. Escaping would reset the reference tone, difficulty, and reward levels. They found that baseline pupil diameter increased leading up to escape trials and decreased afterward. This is consistent with our observation of a decrease during exploitation and increase during exploration. However, the effect size was modest, and Gilzenrat et al. (2010) suggested that this may be because of the escape manipulation not sufficiently emulating exploration. In a follow-up study, Jepma and Nieuwenhuis (2011) tracked the exploration–exploitation trade-off during a dynamic n-armed bandit gambling task in which participants repeatedly had to choose to play one of four slot machines with nonstationary rewards. Although the dynamic n-armed bandit task strongly promoted shifts in the exploration–exploitation trade-off, gaze position was not controlled during bandit selection, and reward feedback was visually presented immediately after selection. To avoid measurement artifacts, Jepma and Nieuwenhuis (2011) restricted their pupillary response analysis to the pretrial baseline period only. The main result showed an overall increase in baseline pupillary response before exploratory trial choices (i.e., trials where participants switched their bandit choice) compared with exploitative trial choices (i.e., trials in which participants picked the same bandit). 
The present data add converging indirect support for the role of the LC-NE system in the exploration–exploitation trade-off.
Our study builds upon this earlier work (Jepma & Nieuwenhuis, 2011; Gilzenrat et al., 2010) in three important ways. First, it generalizes the exploration–exploitation trade-off findings from animal electrophysiology and lower-level human tasks to a high-level visual analogy-making task. The previously studied tasks such as perceptual discrimination (Gilzenrat et al., 2010) and forced choice (Jepma & Nieuwenhuis, 2011) are relatively simple tasks with subsecond RTs. Raven's APM provides a much richer task environment with solutions that often unfold over minutes and produce extended periods of exploration and exploitation. These more temporally extended periods are better suited to the lower temporal resolution of the pupillary response as an index for the LC-NE system. In addition, exploration–exploitation shifts in Raven's APM were often triggered by insight moments or pattern failures, providing a sharper boundary compared with previous studies where the explore–exploit transitions were more gradual in nature. Second, the combination of think-aloud verbal protocols and pupillometry allowed for pupillometric analysis of exploration–exploitation shifts as they happened. Combining these diverse data sources and addressing limitations of earlier human studies (i.e., removing overt task feedback and correcting for pupil foreshortening error) allowed for tighter experimental control over the pupillary response during control state shifts. Third, Raven's APM test is strongly correlated with a major dimension of individual differences, Gf. This allowed for a novel examination of how the exploration–exploitation trade-off and noradrenergic function may contribute to individual differences in Gf.
Examining how individual differences in APM score covary with the pupillary response during control state shifts expands the domain in a novel direction and offers a plausible explanation for past inconsistencies in the literature. Recent work (Van Der Meer et al., 2010) has indicated that high fluid-intelligence individuals have larger task-evoked pupillary responses when performing difficult tasks. This supports the view that people with high Gf may simply have more cognitive resources that can be recruited during demanding tasks (resource hypothesis; Van Der Meer et al., 2010). Earlier work (Ahern & Beatty, 1979, 1981) showed the opposite pattern in which higher intelligence individuals showed smaller task-evoked pupillary responses than those with average intelligence. This supports the view that high-intelligence individuals use their cognitive resources more efficiently (efficiency hypothesis; Ahern & Beatty, 1979, 1981).
Our results do not directly refute either of these earlier hypotheses but offer a third account: a control hypothesis. Higher fluid-ability individuals may be better able to regulate their task-relevant control state. Our finding that the exploratory boost in pupil diameter covaried with Gf opens up the interesting possibility that individual differences in Gf may be related to individual differences in mediating control state through stronger shifts in neural gain. The control hypothesis offers a parsimonious explanation for the conflicting earlier findings on the relationship between intelligence and pupillary response. In tasks that require exploration (such as the geometric analogy task used by Van Der Meer et al., 2010), high-Gf individuals who shift into higher gain states will have larger task-evoked pupillary responses than low-Gf individuals. On the other hand, overlearned tasks that primarily require exploitation (such as the mental multiplication and digit span tasks used by Ahern & Beatty, 1979, 1981) are easier for high-Gf than for low-Gf individuals. This produces a smaller task-evoked pupillary response in high-Gf individuals. Although our study does not directly bear on the role of the pupillary response during overlearned tasks, there are many cognitive load studies indicating that easier tasks induce smaller pupillary responses than difficult tasks (see Beatty & Lucero-Wagoner, 2000, for a review).
One limitation of our individual differences finding is that both exploratory pupillary response and Gf were measured simultaneously on a common task. Although Raven's APM is a strong psychometric test (e.g., Brouwers et al., 2009), it is not a noise-free measure of Gf (cf. Hayes et al., 2015). Therefore, it is possible that the exploratory pupillary response and Gf may share error variance because of other factors such as participant motivation or alertness. This study is a first step, but it will be important to test in future research whether its findings replicate in a design that measures Gf and exploratory pupillary response on independent tasks (e.g., use Raven's APM on Day 1 to assess Gf and an isoluminant foraging task on Day 2 to assess the exploratory pupillary response).
In conclusion, by combining verbal protocol analysis and pupillometry, we identified and tracked shifts in the exploration–exploitation trade-off during analogical reasoning on Raven's APM fluid intelligence test. The results showed decreased pupil diameter during exploitation and increased diameter during exploration, consistent with prominent theories of LC-NE function. Importantly, one sixth of the variance in Raven scores was accounted for by individual differences in exploratory pupillary dilation. These findings shed new light on the relationship between the exploration–exploitation trade-off, noradrenergic function, and individual differences in Gf.
This research was supported by the National Eye Institute (R21 EY022745).
Reprint requests should be sent to Taylor R. Hayes, Center for Mind and Brain, University of California, Davis, CA 95618, or via e-mail: email@example.com.
1. Note that the pupil baseline is applied retroactively to the data points before the segment transition boundary in Figure 3A. This is for plotting purposes only. In the statistical analyses, these data points were incorporated into the preceding segment.
2. Each regression used iteratively reweighted least squares with a bisquare weighting function. Ordinary regressions yielded similar results.