## Abstract

A substantial proportion of individuals who complete the widely used multiple price list (MPL) instrument switch back and forth between the safe and the risky choice columns, behavior that is believed to indicate low-quality decision making. We develop a conceptual framework to formally define decision-making quality, test explanations for the nature of low-quality decision making, and introduce a novel “nudge” treatment that reduced multiple switching behavior and increased decision-making quality. We find evidence in support of task-specific miscomprehension of the MPL and that non-multiple switchers and relatively high-cognitive-ability individuals are not immune to low-quality decision making.

## I. Introduction

Reliable and meaningful measurement of individual risk preferences is critical for understanding a wide range of economic decision making. Experimental economics has contributed many tools to measure individual risk preferences using incentivized choice situations (see Harrison & Rutström, 2008, for a survey), although different risk elicitation methods are not always correlated with one another (see, for example, Charness & Viceisza, 2016, and the references in Niederle, 2016). The MPL instrument, often called the Holt and Laury (2002) instrument, is one of the most widely used methods to elicit individual risk preferences. The MPL is also used in settings other than the measurement of risk preferences, such as pricing commodities (Kahneman, Knetsch, & Thaler, 1990; Cassar, Wordofa, & Zhang, 2016) and measuring discount rates (Harrison, Lau, & Williams, 2002; Andersen et al., 2008). An attractive feature of the MPL is that it can be used to elicit arbitrarily precise intervals of risk aversion estimates (Charness, Gneezy, & Imas, 2013; Tanaka, Camerer, & Nguyen, 2010; Jacobson & Petrie, 2009).

Despite its popularity, an empirical difficulty that researchers encounter when using the MPL is that a substantial proportion of subjects switch back and forth between the safe and the risky choice columns in the instrument (i.e., engage in multiple switching), which is behavior incompatible with standard assumptions on preferences (Charness et al., 2013). Such multiple switching behavior (MSB) is generally considered low-quality decision making, and the observed responses are treated as noise, although some studies argue that MSB may indicate indifference among a range of options (Andersen et al., 2006). MSB is especially pronounced in developing countries. Whereas typical studies in developed countries find multiple switching to affect approximately 10% of the subjects (Holt & Laury, 2002, report 13% multiple switchers; Dave et al., 2010, report 8.5% multiple switchers), in developing countries the multiple switching rate can be over 50% (Jacobson & Petrie, 2009; Charness & Viceisza, 2016). A survey of studies using the MPL found an inconsistency rate of 17.1% (including multiple switching and choosing dominated options) in a vast sample of 6,315 subjects over 54 published papers (Crosetto & Filippin, 2016; Filippin & Crosetto, 2016).

A common experimental practice used to reduce MSB is to ask subjects to indicate the row in which they would like to switch from the risky option to the safe option (Andersen et al., 2006; Tanaka et al., 2010). This eliminates MSB but also reduces the choice set, so we do not know whether the subject would have engaged in multiple switching if she had been free to do so. In this study, we develop a “nudge” protocol to increase cognitive effort without limiting the choice set.1 After subjects complete the MPL task, we ask them if they are sure of their responses and give them the option of hearing the instructions one more time.2 We found a reduction in MSB from 31% using the standard protocol to 10% in the nudge protocol ($p$-value of difference $<$ 0.001). This suggests that at least two-thirds of MSB can be categorized as mistakes that are corrected on further reflection, which sets it apart from the deliberate randomization behavior described in Agranov and Ortoleva (2017) and Chew et al. (2018), for example.

Although the literature generally views MSB as equivalent to low decision-making quality, an individual can make low-quality decisions that do not result in MSB. The potential for non–multiple switchers to make low-quality decisions is not well understood and may be an important source of noise in the data. We develop a conceptual framework, which formally defines decision-making quality independent of MSB and, using the covariance between responses on the MPL task and responses on a simpler lottery selection task, allows us to test three explanations for low-quality decision making in the MPL suggested by the findings in the literature.

Of the three explanations, perhaps the most pessimistic is that bad decision making in the MPL task is a stable attribute of the decision maker. This explanation is supported by the evidence in Jacobson and Petrie (2009), which shows that multiple switchers in MPL instruments make suboptimal decisions in other areas of their lives. Similarly, Choi et al. (2013) find that people make bad decisions in many areas of their lives, and individuals who make low-quality decisions in an experimental setting have less wealth, controlling for their current income and a slew of demographic and socioeconomic status variables.3 Under this view, decision-making quality is not readily improvable and should not respond to an unintrusive stimulus such as the nudge treatment.

At first pass, this explanation seems incompatible with our finding that MSB can be reduced by the nudge protocol. However, since we do not assume that MSB captures the full extent of low-quality decision making, the finding that MSB can be reduced does not necessarily imply that decision-making quality can be improved. For example, individuals could have inferred that experimenters wanted less MSB and, in their desire to be helpful, made fewer multiple switches while continuing to give noisy responses that do not reflect their true risk preferences. Charness et al. (2013) raise a similar concern in the context of treatments that eliminate MSB but do not induce higher-quality decision making, which would mask data quality issues.

A second explanation is that low-quality decision making in the MPL is incidental to the complexity of the MPL instrument. For example, Charness and Viceisza (2016) argue that a lack of comprehension is a serious concern with using the MPL in developing countries. Charness et al. (2013) use the term “failure to understand.” Andreoni and Sprenger (2011) directly tell subjects that “most people begin by preferring option A and then switch to option B, so one way to view this task is to determine the best row to switch from option A to option B,” in an effort to improve comprehension. Furthermore, Dave et al. (2010) demonstrate that the MPL produced noisier estimates of risk aversion than did a simpler lottery selection instrument developed in Eckel and Grossman (2002), especially for individuals with low math ability, which suggests that cognitive ability plays a role in comprehension of the MPL. Under this interpretation, short-term treatments such as the nudge protocol can improve decision-making quality in the MPL by improving comprehension.

A third explanation also views low-quality decisions in the MPL as improvable in the short term, but assumes that individuals who make low-quality decisions in the MPL also make low-quality decisions using other instruments because they are careless. This explanation was mentioned in Brick et al. (2012) but has not been given much attention in the literature. The main difference between carelessness and miscomprehension is that low-quality decision making due to carelessness is not unique to the MPL instrument. The implication is that the MPL, by virtue of allowing for MSB, which may be an indicator of low-quality decision making, may in fact be preferable to simpler instruments that do not allow for MSB and therefore obscure low-quality decision making.

To test the different explanations, we compare responses given in the MPL to responses given in a simpler lottery selection (LS) task in which MSB is not possible, in both the control and nudge treatment groups.4

In the conceptual framework that we develop below, we show that the covariance of the responses given in the MPL and the LS tasks will be the same for the control and nudge groups under the stable attribute explanation, larger for the nudge group than for the control group under the miscomprehension explanation, and weakly larger for the control group than for the nudge group under the carelessness explanation. To make inferences about the difference between two population covariances, we derive the variance of the difference between two sample covariances and propose an estimator that is consistent and asymptotically normal (see online appendix B.2).

Our test finds consistent evidence in support of the miscomprehension explanation and that the nudge treatment improved comprehension. Furthermore, several potentially useful applications arise from our framework and experimental design. First, a test of relative complexity between any two tasks can be performed using a procedure we describe involving the nudge treatment. Second, a comparison of covariances between a complex task and a simpler task can reveal the relative effectiveness of various devices for improving task comprehension in the complex task. And third, under the assumption that the comparison task is simple, a data quality metric that we propose can be used to generate a ranking of instrument complexity and can be used to compare task comprehension across studies.

Using the data quality metric, which defines data quality independent of MSB, we find that the nudge treatment conservatively improves data quality by 152%. These findings are broadly consistent with Imas et al. (2018), who find that imposing a waiting period before making intertemporal work allocations substantially increased patient choices and reduced overall workload. We find that data from both multiple switchers and nonmultiple switchers under the standard protocol are characterized by low decision-making quality, implying that MSB does not capture the full extent of low-quality decision making and that discarding multiple switchers does not ensure high data quality, as is often believed. Although cognitive ability predicts MSB, which is consistent with the previous literature, cognitive ability is not a significant determinant of data quality. To the extent that the nudge treatment was able to elicit more cognitive effort from respondents, the findings suggest that cognitive ability is not a limiting factor in achieving high data quality in the MPL, but cognitive effort is. In one sample of university students with relatively low MSB rates, we find data quality comparable to that of our control group.

Figure 1. Conceptual Framework

This paper is organized as follows. In section II, we develop the conceptual framework. In section III, we describe the experimental design and the subjects. Section IV presents the empirical analysis. Section V presents robustness checks. Section VI presents general applications of the framework and experimental design and proposes a data quality metric. Section VII concludes.

## II. Conceptual Framework

In this section, we develop a simple conceptual framework (see figure 1) to motivate our experimental design and analysis, presented in the next section. Let the signal in risk tolerance, or “true” risk preferences, be denoted $S$, with $\mu_s = E(S)$ and $\sigma_s^2 = \mathrm{Var}(S) > 0$. The terms $\eta_{ij}$ and $\nu_{ij}$ are i.i.d. noise terms in measured risk aversion for the MPL and the LS tasks, respectively. The index $i \in \{1,2\}$ denotes type, and $j \in \{s,n\}$ denotes treatment status, where $s$ represents standard MPL and $n$ represents nudged MPL. Suppose there are two types of individuals: those who are confused by the MPL and those who are not. Type 1, occurring with probability $p$, are not confused by the MPL and give “high-quality” responses that reflect both signal and noise. Without additional intervention, type 2 individuals, occurring with probability $1-p$, are confused by the MPL and give “low-quality” responses that reflect only noise. Note that $1-p$ can be larger than the proportion of multiple switchers because type 2 individuals could by chance avoid multiple switching. The maintained assumption is that $0 < p < 1$. The presence of the two types leads to a mixture distribution in measured risk aversion.

### A. Scenario I: Low-Quality Decision Making Is a Stable Attribute of the Decision Maker

For the control group using the MPL, the response from type 1 is $X_{1s} = S + \eta_{1s}$, and the response from type 2 is $X_{2s} = \eta_{2s}$ (refer to figure 1, panel a). For the LS task, the response from type 1 is $Y_{1s} = S + \nu_{1s}$. Because under this scenario, confused individuals consistently make low-quality decisions, we expect the response on the LS task for type 2 to be $Y_{2s} = \nu_{2s}$.

Since the nudge treatment cannot induce confused individuals to make “high-quality” decisions, the responses of the treatment group will be equal in distribution to those in the control group. That is, $X_{1n} \overset{d}{=} X_{1s}$, $X_{2n} \overset{d}{=} X_{2s}$, $Y_{1n} \overset{d}{=} Y_{1s}$, and $Y_{2n} \overset{d}{=} Y_{2s}$.

The response in the MPL is denoted by $MPL_j$, $j \in \{s,n\}$, whose density function is the mixture of the density functions of $X_{1j}$ and $X_{2j}$, where $p$ is the weight placed on the density function of $X_{1j}$. Similarly, the response in the LS task is denoted by $LS_j$, whose density function is the mixture of the density functions of $Y_{1j}$ and $Y_{2j}$, where $p$ is the weight placed on the density function of $Y_{1j}$.

Because of the equality in distribution of the component distribution functions, $MPL_s \overset{d}{=} MPL_n$ and $LS_s \overset{d}{=} LS_n$. This implies that $\mathrm{Cov}(MPL_s, LS_s) = \mathrm{Cov}(MPL_n, LS_n)$. Similarly, $\mathrm{Var}(MPL_s) = \mathrm{Var}(MPL_n)$ and $\mathrm{Var}(LS_s) = \mathrm{Var}(LS_n)$, implying $\mathrm{Corr}(MPL_s, LS_s) = \mathrm{Corr}(MPL_n, LS_n)$.

Note that the result is also consistent with the interpretation that the MPL task was fully comprehended using the standard protocol, that is, $p=1$.

### B. Scenario II: Task-Specific Miscomprehension

Under this scenario, the confusion is specific to the MPL task (refer to figure 1, panel a). For the control group using the MPL, the response is identical to scenario I: the response from type 1 is $X_{1s} = S + \eta_{1s}$, and the response from type 2 is $X_{2s} = \eta_{2s}$.

For the control group using the LS task, because the confusion is specific to the MPL, the responses from types 1 and 2 are $Y_{1s} = S + \nu_{1s}$ and $Y_{2s} = S + \nu_{2s}$, respectively, where $\nu_{1s} \overset{d}{=} \nu_{2s}$.

For the treatment group using the nudged MPL, the response from type 1 is $X_{1n} = S + \eta_{1n}$, where $\eta_{1n} \overset{d}{=} \eta_{1s}$. If the treatment fully “unconfuses” type 2s, then the response from type 2 is $X_{2n} = S + \eta_{2n}$, where $\eta_{2n} \overset{d}{=} \eta_{1n}$.

Because the nudge treatment works only on the MPL, we expect no differences in the responses on the LS task by treatment status, for both types of individuals. That is, $Y_{1n} = S + \nu_{1n}$ and $Y_{2n} = S + \nu_{2n}$, where $\nu_{1n} \overset{d}{=} \nu_{1s}$ and $\nu_{2n} \overset{d}{=} \nu_{2s}$. Therefore, $LS_s \overset{d}{=} LS_n$.

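By the properties of mixture distributions,

$$\mathrm{Cov}(MPL_j, LS_j) = \sum_i \left[ p_i\,\mathrm{Cov}(X_{ij}, Y_{ij}) + p_i\,(\mu_{X_{ij}} - \mu_{X_j})(\mu_{Y_{ij}} - \mu_{Y_j}) \right], \tag{1}$$

where $\mu_{X_{ij}} = E(X_{ij})$, $\mu_{Y_{ij}} = E(Y_{ij})$, $\mu_{X_j} = E(X_j)$, and $\mu_{Y_j} = E(Y_j)$.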

For the control group, $\mathrm{Cov}(MPL_s, LS_s) = p\,\mathrm{Cov}(X_{1s}, Y_{1s}) + (1-p)\,\mathrm{Cov}(X_{2s}, Y_{2s}) = p\sigma_s^2$, and for the treatment group, $\mathrm{Cov}(MPL_n, LS_n) = \sigma_s^2$. This yields the result that $\mathrm{Cov}(MPL_s, LS_s) < \mathrm{Cov}(MPL_n, LS_n)$.

More generally, we can assume that the nudge treatment “unconfuses” a proper subset of type 2s (refer to figure 1, panel a). Online appendix B.1.1 shows that in that case $\mathrm{Cov}(MPL_s, LS_s) = p_1\sigma_s^2$ and $\mathrm{Cov}(MPL_n, LS_n) = (p_1 + p_2)\sigma_s^2$, where $p_1$ is the proportion who are not confused in the control group and $p_2$ is the additional proportion who have become unconfused by the nudge treatment in the treatment group (and $p_3 = 1 - p_1 - p_2$ is the proportion of individuals who remain confused by the MPL even with the nudge treatment). We will have the same result that $\mathrm{Cov}(MPL_s, LS_s) < \mathrm{Cov}(MPL_n, LS_n)$ in this general case if $p_2 > 0$. Scenario II also implies that for both the control and treatment groups, the lower the proportion of confused individuals, the higher will be the covariance between the responses on the MPL and the LS tasks.

Because this scenario does not produce clear predictions on the relative sizes of $\mathrm{Var}(MPL_s)$ and $\mathrm{Var}(MPL_n)$, it does not produce clear predictions on the relative sizes of $\mathrm{Corr}(MPL_n, LS_n)$ and $\mathrm{Corr}(MPL_s, LS_s)$.
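The scenario II prediction can be checked numerically. The following is a minimal simulation sketch, not the paper's estimation procedure: it assumes normally distributed signal and noise, a full "unconfusing" effect of the nudge, and illustrative parameter values ($p = 0.6$, $\sigma_s^2 = 1$). The function names `sample_cov` and `simulate_scenario_2` are hypothetical.

```python
import random


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_scenario_2(n, p, nudged, sigma_s=1.0, sigma_noise=1.0, seed=0):
    """Draw n (MPL, LS) response pairs under task-specific miscomprehension.

    Assumptions for illustration: S and both noise terms are normal, and the
    nudge fully "unconfuses" type 2 individuals.
    """
    rng = random.Random(seed)
    mpl, ls = [], []
    for _ in range(n):
        s = rng.gauss(0.0, sigma_s)      # true risk tolerance S
        confused = rng.random() >= p     # type 2 occurs with probability 1 - p
        # Type 2 responses on the standard MPL contain noise only.
        carries_signal = nudged or not confused
        mpl.append((s if carries_signal else 0.0) + rng.gauss(0.0, sigma_noise))
        # Confusion is specific to the MPL, so LS responses always carry S.
        ls.append(s + rng.gauss(0.0, sigma_noise))
    return mpl, ls


p = 0.6
cov_control = sample_cov(*simulate_scenario_2(100_000, p, nudged=False, seed=1))
cov_nudged = sample_cov(*simulate_scenario_2(100_000, p, nudged=True, seed=2))
print(f"control: {cov_control:.3f} (theory {p * 1.0:.3f})")
print(f"nudged:  {cov_nudged:.3f} (theory 1.000)")
```

Under these assumptions the simulated control covariance should sit near $p\sigma_s^2$ and the nudged covariance near $\sigma_s^2$, matching the ordering derived above.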

### C. Scenario III: Carelessness

This scenario assumes that the confusion of type 2 individuals is due to non-task-specific carelessness, affecting both the MPL and the LS tasks for the control group (refer to figure 1, panel a). Alternatively, this scenario can also be understood as assuming that the LS task is equally susceptible to miscomprehension as the MPL task. The nudge treatment, which is only applied to the MPL, removes confusion only in the MPL task.

Identical to scenarios I and II, for the control group using the MPL, the response from type 1 is $X_{1s} = S + \eta_{1s}$, and the response from type 2 is $X_{2s} = \eta_{2s}$.

For the control group using the LS task, the response from type 1 is $Y_{1s} = S + \nu_{1s}$, and the response from type 2 is $Y_{2s} = \nu_{2s}$. Because carelessness is not task specific, the responses on the LS task for type 2 individuals also only capture noise.

For the treatment group using the nudged MPL, the response from type 1 is $X_{1n} = S + \eta_{1n}$, where $\eta_{1n} \overset{d}{=} \eta_{1s}$. If the treatment fully “unconfuses” type 2s, then the response from type 2 is $X_{2n} = S + \eta_{2n}$, where $\eta_{1n} \overset{d}{=} \eta_{2n}$.

Because the nudge treatment works only on the MPL, we expect no differences in the responses on the LS task by treatment status. For the treatment group using the LS task, the response from type 1 is $Y_{1n} = S + \nu_{1n}$, and the response from type 2 is $Y_{2n} = \nu_{2n}$, where $\nu_{1s} \overset{d}{=} \nu_{1n}$ and $\nu_{2s} \overset{d}{=} \nu_{2n}$, so that $LS_s \overset{d}{=} LS_n$.

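Using the fact that $\mu = p\mu_1 + (1-p)\mu_2$, equation (1) simplifies to

$$\mathrm{Cov}(MPL_j, LS_j) = p\,\mathrm{Cov}(X_{1j}, Y_{1j}) + (1-p)\,\mathrm{Cov}(X_{2j}, Y_{2j}) + p(1-p)(\mu_{X_{1j}} - \mu_{X_{2j}})(\mu_{Y_{1j}} - \mu_{Y_{2j}}). \tag{2}$$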
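The simplification from equation (1) to equation (2) can be spelled out as follows, taking the type weights to be $p_1 = p$ and $p_2 = 1-p$. The shorthand $\Delta_X = \mu_{X_{1j}} - \mu_{X_{2j}}$ and $\Delta_Y = \mu_{Y_{1j}} - \mu_{Y_{2j}}$ is introduced here only for brevity.

$$\begin{aligned}
\mu_{X_{1j}} - \mu_{X_j} &= (1-p)\,\Delta_X, \qquad \mu_{X_{2j}} - \mu_{X_j} = -p\,\Delta_X, \\
\mu_{Y_{1j}} - \mu_{Y_j} &= (1-p)\,\Delta_Y, \qquad \mu_{Y_{2j}} - \mu_{Y_j} = -p\,\Delta_Y, \\
\sum_i p_i\,(\mu_{X_{ij}} - \mu_{X_j})(\mu_{Y_{ij}} - \mu_{Y_j}) &= \left[ p(1-p)^2 + (1-p)p^2 \right] \Delta_X \Delta_Y = p(1-p)\,\Delta_X \Delta_Y.
\end{aligned}$$

The first two lines follow from substituting $\mu_{X_j} = p\mu_{X_{1j}} + (1-p)\mu_{X_{2j}}$ and the analogous expression for $\mu_{Y_j}$; the last line is the cross-type term that appears in equation (2).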

For the control group, $\mathrm{Cov}(MPL_s, LS_s) = p\sigma_s^2 + p(1-p)(\mu_{X_{1s}} - \mu_{X_{2s}})(\mu_{Y_{1s}} - \mu_{Y_{2s}})$. For the treatment group, $\mathrm{Cov}(MPL_n, LS_n) = p\sigma_s^2$. As long as the difference in the expected value of measured risk tolerance in the control group for types 1 and 2 is not in the opposite direction for the two tasks, then $\mathrm{Cov}(MPL_s, LS_s) \geq \mathrm{Cov}(MPL_n, LS_n)$. We would violate this assumption if, for example, confused individuals without intervention are more risk averse in the MPL task but less risk averse in the LS task. We do not know of any theory or evidence that predicts this pattern.

The intuition for this result is that unlike in scenario II, there are no gains in the covariances between the two tasks for type 2 individuals under the nudge treatment because the carelessness of type 2 individuals is not improved for the LS task. On the other hand, the relative magnitudes of the expected values of risk tolerance for types 1 and 2 individuals are allowed to have the same pattern for MPL and LS tasks in the control group, which adds to the overall covariance of the control group, but they do not have this “similarity” in the treatment group.

It can be demonstrated that under the more general assumption where the nudge treatment “unconfuses” a proper subset of type 2s (refer to figure 1, panel a), we have the same result that $\mathrm{Cov}(MPL_s, LS_s) \geq \mathrm{Cov}(MPL_n, LS_n)$. See online appendix B.1.3 for the proof.
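The scenario III ordering can also be illustrated by simulation. This is a sketch under stated assumptions rather than the paper's procedure: signal and noise are normal, all noise terms have mean zero (so the gap in expected responses between types equals $\mu_s$ on both tasks in the control group), the nudge fully repairs the MPL responses of type 2 individuals, and the parameter values ($p = 0.6$, $\mu_s = 2$, $\sigma_s^2 = 1$) are illustrative. The names `sample_cov` and `simulate_scenario_3` are hypothetical.

```python
import random


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_scenario_3(n, p, nudged, mu_s=2.0, sigma_s=1.0, sigma_noise=1.0, seed=0):
    """Draw n (MPL, LS) response pairs when type 2 individuals are careless.

    Assumptions for illustration: normal signal and mean-zero normal noise;
    the nudge fully restores the signal in the MPL but leaves the LS task as is.
    """
    rng = random.Random(seed)
    mpl, ls = [], []
    for _ in range(n):
        s = rng.gauss(mu_s, sigma_s)              # true risk tolerance S
        careless = rng.random() >= p              # type 2 with probability 1 - p
        mpl_has_signal = nudged or not careless   # the nudge repairs the MPL only
        ls_has_signal = not careless              # the LS task is never nudged
        mpl.append((s if mpl_has_signal else 0.0) + rng.gauss(0.0, sigma_noise))
        ls.append((s if ls_has_signal else 0.0) + rng.gauss(0.0, sigma_noise))
    return mpl, ls


p, mu_s = 0.6, 2.0
cov_control = sample_cov(*simulate_scenario_3(100_000, p, nudged=False, seed=1))
cov_nudged = sample_cov(*simulate_scenario_3(100_000, p, nudged=True, seed=2))
theory_control = p * 1.0 + p * (1 - p) * mu_s * mu_s   # p*sigma_s^2 + p(1-p)*mu_s^2
print(f"control: {cov_control:.3f} (theory {theory_control:.3f})")
print(f"nudged:  {cov_nudged:.3f} (theory {p * 1.0:.3f})")
```

In this parameterization the cross-type mean term is positive in the control group and vanishes in the nudged group, so the control covariance exceeds the nudged covariance, the reverse of the scenario II ordering.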

## III. Experimental Design

### A. Experimental Setting

Subjects were recruited from a rural middle school (seventh to ninth grades) in an ethnically diverse region of southwest China. The county in which this middle school is located has been on the register of nationally recognized “poor” counties since the criteria for the designation were established in 1986. According to the provincial statistical yearbook, in 2014, the county annual average GDP per capita was RMB 11,345 (USD 1,650).

We randomly drew students from the complete year 7 and year 8 enrollment rosters.5 After cross-referencing with teachers' class lists to identify dropouts, replacements were randomly drawn from one year 9 class. Self-selection into our experiment was minimal. Our final sample consists of 193 of 212 students whom we selected, for a response rate of 91%. Nonresponse is largely due to student absenteeism and dropouts not reflected in class lists. There was no difference in test scores (average of math and verbal scores, standardized within each grade) between the no-shows and those who participated in our study ($p$-value $=$ 0.657).

An advantage of using middle school students is that self-selection is minimized (Sutter, Zoller, & Glätzle-Rützler, 2019). At the same time, monetary incentives are appropriate, unlike with very young children. The literature also suggests that risk preferences are mature by this age (see Sutter et al., 2013, Eckel et al., 2012, and a survey of relevant findings in Sutter et al., 2019). For some preferences that are also measured using a multiple price list, such as time preferences, adolescents are an especially relevant group to study, because their time preferences have been shown to predict health and educational outcomes (Sutter et al., 2013; Castillo et al., 2011; Castillo, Jordan, & Petrie, 2019), and interventions in this age group have been shown to be effective in promoting patience (Alan & Ertac, 2018). Multiple switching is a common phenomenon in these instruments as well; for example, Castillo et al. (2011) find 31% of subjects exhibited MSB. To the extent that this study sheds light on the issue of comprehension with multiple price lists eliciting risk preferences, the findings may also be relevant for time preferences.

The experiments were conducted in the spring semester of 2015, mainly during the 4:00 p.m. to 7:00 p.m. break time on campus.6 Students completed the experiments one-on-one with our experimenters. At the beginning of the experiments, students were told that they would play two games, and only one of them would be chosen randomly to realize their final payment. Students were also asked to fill out a short survey after all experiments were completed to capture basic demographic and socioeconomic status information. Subject payments were handed out after the surveys were completed. Average payout was RMB 6.19, not including a pencil and eraser as a show-up gift. Student test scores were separately obtained from the school administrators.

### B. Experimental Design

Both the treatment and control groups were administered two tasks: the MPL task and the LS task. The control group was administered the MPL using the standard protocol, while the treatment group was administered the MPL with a nudge protocol, explained below. There was no difference in the administration of the LS task for the control and treatment groups. The order of the two tasks was randomized within each group.

Table 1. Balance Check and MSB

| | Control (1) | Treatment (2) | $p$-value for H0: (1) $=$ (2) |
| --- | --- | --- | --- |
| Female | .51 (.5) | .61 (.49) | 0.192 |
| Age | 14.42 (1.2) | 14.34 (1) | 0.671 |
| Number of children in the household | 2.09 (.87) | 2.26 (.92) | 0.197 |
| Number of family members in the household | 5.34 (2.05) | 5.72 (2.8) | 0.290 |
| Distance from home to school ($=$1 if less than or equal to 30 min walk) | .38 (.49) | .43 (.5) | 0.410 |
| Mother's educational attainment ($=$1 if less than or equal to primary) | .68 (.47) | .68 (.47) | 0.981 |
| Mother's occupation ($=$1 if agricultural) | .71 (.45) | .77 (.42) | 0.354 |
| Monthly household income ($=$1 if less than or equal to 750 RMB) | .45 (.5) | .45 (.5) | 0.999 |
| Monthly allowance ($=$1 if less than or equal to 300 RMB) | .82 (.38) | .78 (.41) | 0.497 |
| Grade 7 dummy | .5 (.5) | .52 (.5) | 0.817 |
| Grade 8 dummy | .43 (.5) | .39 (.49) | 0.629 |
| Grade 9 dummy | .07 (.26) | .09 (.28) | 0.649 |
| Test score | .08 (.82) | −.09 (.92) | 0.177 |
| Multiple switcher | .31 (.46) | .1 (.3) | 0.000 |
| Observations | 101 | 92 | |

Means and standard deviations are presented. Standard deviations in parentheses. Exchange rate: RMB 1 $=$ USD 0.16.

We observed only the final responses and did not record the initial responses of those who were nudged. We made this design choice for several reasons. First, we wanted to minimize problematic experimenter demand effects. For example, if the subjects realized that we would be recording their choices before and after the nudge, they might perceive that the desired behavior was to change their choices after the nudge and therefore might make changes in a random manner without exerting more effort to understand the MPL task. Second, we wanted to keep the treatment and control conditions as similar as possible, with the exception of the nudge itself. These considerations relate to the well-known issue of “testing” in the experimental psychology literature, which finds that the act of taking an initial measure can have an impact on the value of the final measure and threaten both internal and external validity (Campbell & Stanley, 1963).

### C. Balance Tests

Because the experiment was conducted one-on-one, we were able to randomly assign treatment condition and task order at the individual level.7 Table 1 reports the balance of demographic and socioeconomic status variables, as well as school performance between the treatment and the control groups. We found no statistically significant differences between the control and treatment groups in age, size of household, distance to school, mother's educational attainment, mother's occupation, monthly household income, monthly allowance, a set of grade dummies, or students' test scores. In online appendix table A.1 we report balance by treatment status and the order in which the MPL and LS tasks were administered. The results show that there are no statistically significant differences among the four groups defined by treatment status and task order.

### D. MPL Instrument and the Nudge Treatment

The MPL task follows the design in Dohmen et al. (2011).8 Subjects are required to make six choices between a lottery and a certain payout (see online appendix C for the instrument). Option A is a coin flip lottery (risky choice) with a 50% chance of paying RMB 10 and 50% chance of paying RMB 0. Option A does not change across the six choices. Option B is the certain cash payout (safe choice), in increasing increments of RMB 1, from RMB 1 to RMB 6. One of the six pairs of options is randomly selected from the instrument for each subject after she makes her choices, and the option she chose from the selected pair is implemented. The instrument is incentive compatible. For example, a participant who values the coin flip lottery (option A) at RMB 3.5 certain payout should choose the lottery for all values of the certain payout below RMB 3.5 and should choose the certain payout when it is above RMB 3.5. For this subject, we should observe three risky choices followed by a single switch to three safe choices. Under standard assumptions on preferences, subjects should make at most one switch from the risky choice to the safe choice; nevertheless, in practice, we find that many subjects switch back to the risky choice after having made a safe choice.
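The single-switch logic described above can be sketched in a few lines. This is an illustrative sketch only: the function names `predicted_choices` and `switches_back_to_risky` are hypothetical, ties between the lottery's value and the safe payout are resolved in favor of the safe option by assumption, and any risky choice made after a safe choice is treated as a return to the risky column.

```python
SAFE_PAYOUTS = [1, 2, 3, 4, 5, 6]  # option B in rows 1 to 6, in RMB (from the text)


def predicted_choices(certainty_equivalent):
    """Choices of a subject who values option A at `certainty_equivalent` RMB.

    Returns one letter per row: "A" (risky coin flip) or "B" (safe payout).
    Assumption: an exact tie is resolved in favor of the safe payout.
    """
    return ["A" if safe < certainty_equivalent else "B" for safe in SAFE_PAYOUTS]


def switches_back_to_risky(choices):
    """True if a risky choice ("A") appears after a safe choice ("B")."""
    seen_safe = False
    for choice in choices:
        if choice == "B":
            seen_safe = True
        elif seen_safe:
            return True
    return False


example = predicted_choices(3.5)
print(example)                                    # ['A', 'A', 'A', 'B', 'B', 'B']
print(switches_back_to_risky(example))            # False
print(switches_back_to_risky(list("AABABB")))     # True
```

A subject with a well-defined valuation of the lottery produces a pattern with at most one switch, so only patterns such as the last one are flagged.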

In the nudge treatment, the subjects are first given the same instructions used in the control group to administer the MPL task. As each subject hands in his or her responses, we say: “Have you decided? You can think about your choices again carefully and can change your choices. If you would like, we can explain this game one more time.” (See online appendix C for the protocol.) Those who indicate the need are given the instructions again. The treatment is designed to encourage the subjects to put more cognitive effort into the task, without taking away their ability to engage in MSB. Indeed, 10% of the treatment group exhibited MSB.

Allowing subjects to reconsider their responses relates to a well-established finding in the education literature on changing answers to objective questions on exams. In a survey of 33 studies, Benjamin et al. (1984) find that most examinees who change answers improve their scores by doing so. Scores also typically improve when reviewing answers is allowed (as in computerized exams) and when students are instructed in the benefits of changing answers (Bauer, Kopp, & Fischer, 2007; Vispoel, 1998). Research into the reason for changing answers finds that students do not always understand the question on the first pass. McMorris et al. (1987) find that 28% cited “rereading the item and better understanding the question” as the reason for changing answers.

The choice results using the MPL instrument are shown in online appendix table A.2. We report the distribution of subjects choosing each possible number of lottery (risky) options (from 0 to 6) before making their first switch to the safe choice for the control and treatment group separately. We also report the range of the implied constant relative risk aversion (CRRA) coefficients corresponding to each number of lottery options chosen, assuming no multiple switching.9 In the full sample, about 20% of subjects were multiple switchers, which falls in the range of many previous findings (Charness et al., 2013). We also show in online appendix table A.4 the distribution of MPL choices limited only to the nonmultiple switchers.

### E. LS Instrument

The format of the LS instrument follows Barr and Genicot (2008). The appeal of this design is its simplicity. Subjects are only allowed to choose one coin flip lottery out of six, with the first lottery offering a certain amount and all other alternatives offering higher expected payoff along with higher variance (see online appendix C for the instrument). A more risk-tolerant individual is more likely to choose lotteries with higher expected payoffs and higher variance. All choices are consistent with standard assumptions on preferences.

Online appendix table A.3 reports the simple LS game results for the control and treatment group separately. We report the low and high payoffs for each lottery, the implied CRRA coefficient range corresponding to each choice, and the percentage of subjects choosing each lottery in each group. Comparing the distribution of choices in the two groups allows us to conduct another balance test. In each group, the lottery chosen with the highest frequency is the third-safest lottery, and a Mann-Whitney test finds no significant distributional differences between the treatment and control groups in the lottery chosen ($p$-value $=$ 0.33).

Table 2. Covariance between Responses on the MPL and LS Tasks

| | Control (1) | Treatment (2) | $p$-values of H0: (1) $=$ (2) |
| --- | --- | --- | --- |
| Method A: Number of risky choices | | | |
| Covariance with LS task | 0.300 | 1.514*** | 0.014 |
| Method B: First switch point | | | |
| Covariance with LS task | 0.713* | 1.670*** | 0.076 |
| Method C: Average switch point | | | |
| Covariance with LS task | 0.332 | 1.543*** | 0.013 |
| Randomly generated MPL choices | | | |
| Mean of covariance with LS task | −0.0017 | −0.0027 | 0.4957 |
| Standard error of covariance with LS task | 0.3840 | 0.3879 | |
| $N$ | 101 | 92 | |

Method A defines MPL response as the total number of risky choices; method B defines MPL response as the number of risky choices made before the “first switch point.” Method C defines MPL response as the average switch point when the subject exhibits MSB. Randomly generated MPL choices use 10,000 bootstrap samples of MPL choices from the empirical distribution of the number of risky choices made in the MPL (coded using method B), separately for the control and treatment groups. Significant at *10%, **5%, and ***1%.

## IV. Empirical Analysis

The last row of table 1 reports the share of subjects who are multiple switchers in the control and treatment groups. The nudge treatment reduces the share of multiple switchers from 31% in the control group to 10% in the treatment group. The $p$-value of the difference in the multiple switching rate is less than 0.001. The fact that the nudge treatment, which does not limit choice sets or overtly discourage MSB, is able to eliminate 66% of MSB suggests that the majority of MSB are mistakes that are corrected upon further reflection rather than the result of deliberate choice.

As demonstrated in section II, the relative size of the covariances between the MPL and the LS task in the control and treatment groups allows us to pin down the explanation for low-quality decision making most consistent with the data. Table 2 reports the covariance between choices made in the MPL task (number of risky choices) and the LS task (riskiness of lottery chosen), which correspond to the risk tolerance ranking of the implied CRRA coefficients of the choices made in each instrument. Because of the presence of multiple switching, we used three different methods to code the number of risky choices in the MPL task. Method A uses the total number of risky choices, a method suggested by Holt and Laury (2002). Method B uses the number of risky choices made before the first point at which individuals switch from the risky choice to the safe choice, or the “first switch point.” This method was used in, for example, Harrison and Rutström (2008) and Meier and Sprenger (2013). Method C uses the average of the decision number preceding the first switch from a risky choice to a safe choice and the decision number preceding the last switch from a risky choice to a safe choice, or the last decision number if the last decision was a risky choice. This method was inspired by the argument in the literature that multiple switching is due to indifference (Andersen et al., 2006). The last column in table 2 reports the $p$-value of the difference between the covariance in the control and treatment groups. Because we were unable to find an estimator in the literature for the difference between two population covariances, we derived the variance of the difference between two sample covariances in online appendix B.2 and propose an estimator that is consistent and asymptotically normal. This method is also used to find the significance levels of the estimated covariances.
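The three coding methods can be illustrated with a minimal sketch. It assumes each subject's MPL responses are stored as a list of booleans in decision order, with `True` marking a risky choice. The function names, and the handling of subjects who never switch from the risky to the safe choice, are illustrative assumptions rather than part of the original protocol.

```python
from typing import List


def switch_points(choices: List[bool]) -> List[int]:
    """1-based decision numbers that immediately precede a risky-to-safe switch."""
    return [
        i  # choices[i - 1] is decision number i (1-based)
        for i in range(1, len(choices))
        if choices[i - 1] and not choices[i]
    ]


def method_a(choices: List[bool]) -> int:
    """Method A: total number of risky choices."""
    return sum(choices)


def method_b(choices: List[bool]) -> int:
    """Method B: risky choices made before the first risky-to-safe switch."""
    points = switch_points(choices)
    if not points:
        # Assumption: with no risky-to-safe switch, count all risky choices.
        return sum(choices)
    return sum(choices[: points[0]])


def method_c(choices: List[bool]) -> float:
    """Method C: average of the first and last switch points."""
    points = switch_points(choices)
    if choices and choices[-1]:
        # Last decision was risky: use the last decision number.
        points.append(len(choices))
    if not points:
        # Assumption: a subject who never chooses the risky option is coded 0.
        return 0.0
    return (points[0] + points[-1]) / 2


consistent = [True, True, True, False, False, False]
multiple = [True, True, False, True, False, False]
print(method_a(consistent), method_b(consistent), method_c(consistent))
print(method_a(multiple), method_b(multiple), method_c(multiple))
```

For a subject who switches only once, the three methods coincide; they diverge only when the subject exhibits MSB.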

To simulate the magnitude of the covariance between the two tasks if subjects did not understand the MPL at all and made their choices randomly, we randomly generated 10,000 bootstrap samples of MPL choices from the empirical distribution of the number of risky choices made in the MPL, separately for the control and treatment groups.10 We find the covariance of MPL responses in each bootstrap sample with the actual responses in the LS task and report the mean of sample covariance and the standard error of sample covariance in table 2. The last column reports the average $p$-value of the difference between the covariance in the control and in the treatment groups over the 10,000 observations.
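The simulation can be sketched as follows. The arrays `mpl` and `ls` are hypothetical stand-ins for one group's responses, and drawing MPL responses with replacement, independently of the LS responses, is one reading of "randomly generated" choices; the exact procedure used in the study may differ.

```python
import numpy as np


def random_mpl_covariances(mpl, ls, n_draws=10_000, seed=0):
    """Covariances between randomly drawn MPL responses and actual LS responses."""
    rng = np.random.default_rng(seed)
    mpl = np.asarray(mpl, dtype=float)
    ls = np.asarray(ls, dtype=float)
    covs = np.empty(n_draws)
    for b in range(n_draws):
        # Draw MPL responses from the empirical distribution, ignoring LS.
        fake_mpl = rng.choice(mpl, size=mpl.size, replace=True)
        covs[b] = np.cov(fake_mpl, ls)[0, 1]
    return covs


# Hypothetical data for illustration only (not the study's data).
rng = np.random.default_rng(1)
ls = rng.integers(1, 7, size=100)
mpl = rng.integers(0, 7, size=100)

covs = random_mpl_covariances(mpl, ls, n_draws=2_000)
mean_cov, se_cov = covs.mean(), covs.std(ddof=1)
# Normal-approximation 95% interval for the covariance under random choice.
ci = (mean_cov - 1.96 * se_cov, mean_cov + 1.96 * se_cov)
print(mean_cov, se_cov, ci)
```

Because the simulated MPL responses are drawn without reference to the LS responses, the mean covariance should be close to 0, and the spread of the draws gives the benchmark against which the observed covariances are compared.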

In the treatment group, methods A, B, and C produce covariances between the MPL and LS tasks of 1.51, 1.67, and 1.54, respectively, all significantly different from 0 at the 1% level. However, in the control group, the covariance between the responses on these two tasks is small, and only method B results in a marginally significant covariance of 0.71. The $p$-values of the differences in the covariances are 0.01, 0.08, and 0.01, respectively. The randomly generated MPL choices produce a mean covariance close to 0 for both the control and treatment groups, which are not statistically different from each other. For the control group, the 95% confidence interval constructed from the bootstrap samples, using normal approximation, is $[-0.754,0.751]$, which contains 0.300, 0.713, and 0.332, indicating that the covariance of control group responses (using any of the three coding methods) is not statistically significantly different from the mean covariance obtained from randomly generated MPL choices.11

Online appendix table A.5 shows the Pearson correlation coefficients between responses in the MPL and LS tasks. Although the conceptual framework does not produce clear predictions on the relative sizes of the variances of the MPL and LS responses in the treatment and control groups, empirically we find that the variances of the MPL responses and the variances of the LS responses are not statistically different from each other in the treatment versus the control groups (see online appendix table A.6). Therefore, the correlation results should produce a pattern similar to that of the covariance results. In the treatment group, methods A, B, and C produce correlation coefficients between the MPL and LS tasks of 0.424, 0.449, and 0.432, respectively, all significantly different from 0 at the 1% level. However, in the control group, the correlation coefficients between these two tasks are small, and only method B results in a marginally significant correlation coefficient of 0.185. The differences in the correlation coefficients are statistically significant at the 5% level for all three coding methods. Results from the 10,000 randomly generated bootstrap samples also show a pattern similar to that of the covariance results.

This set of results consistently supports scenario II—task-specific miscomprehension of the MPL, which is the only explanation for low decision-making quality from the literature that predicts greater covariance in the treatment group. Furthermore, the results imply that the nudge was effective in increasing comprehension of the MPL task (increasing the proportion of high-quality responses).

A potential concern with interpreting the lower rate of MSB in the treatment group is that subjects may somehow infer that it is undesirable to engage in MSB without understanding why and without actually applying more cognitive effort in making their choices in the MPL. Since any such effect would add noise to the data, our finding that the covariance is larger in the treatment group indicates that the intended effect of the nudge is detectable despite any potentially problematic experimenter demand effects.12

Results using method C also speak to the potential for indifference to account for MSB. If we assume MSB is a result of indifference between the lottery and a range of certain payout values defined by the first switch point and last switch point from the risky to the safe choice, then of the three methods, method C, which uses the midpoint of these certain payouts to value the lottery, should give the best approximation to true risk aversion. However, as table 2 and online appendix table A.5 show, the covariance and the correlation coefficient between the two tasks are no larger using method C than the other two methods for either the control or treatment group. Indifference in the true sense also should not respond to the nudge treatment, whereas we find a substantial reduction in MSB. Because coding method B (the first switch point) produces the most conservative treatment effect, subsequent MPL results will be reported using this method, unless otherwise stated.

## V. Robustness Checks

### A. Spillover Effects

We have assumed that because the nudge is only applied to the MPL task, it has no impact on the LS task. However, it is possible that the LS task also receives spillover nudging in the treatment group (when the MPL task is administered first). To the extent that such spillover effects lead to greater covariance between the two tasks, the nudge effect may be overestimated.

To address this concern, we analyzed the data using the subsample of individuals who were administered the LS task first, thus precluding any spillover effects. We find results similar to those in the full sample. Online appendix table A.7 shows that in the treatment group, coding methods A, B, and C all produce covariances between the MPL and LS task responses that are significantly different from 0 at the 5% level. However, in the control group, the covariances between the responses on these two tasks are small and none is statistically significant. Furthermore, we find that the covariance of control group responses (using any of the three coding methods) is not statistically significantly different from the mean covariance obtained from randomly generated MPL choices (95% confidence interval is $[-0.972,0.971]$).

Moreover, if there is a spillover effect, the size of the treatment effect is expected to be larger among those administered the MPL task first compared to those administered the LS task first. We test for differences in the treatment effect in the two groups using a permutation test, which has the advantage of not relying on parametric assumptions. To perform this test, we randomly assign task order to the control group and treatment group in the same frequency as in our actual data set, 10,000 times. For each of the placebo data sets with randomly assigned task order, we obtain the difference in the treatment effects by subtracting the treatment effect in the MPL first group from the treatment effect in the LS first group. In online appendix figure A.2, we plot the distribution of these differences in treatment effects using the placebo data sets, and mark the location of the difference in the two treatment effects from the actual data. As expected, the distribution of the placebo difference in treatment effects is centered around 0. The share of the placebo difference in treatment effects that is larger than the actual difference in treatment effects (−0.17) in absolute value, which is analogous to a $p$-value from a two-sided test, is 0.88. The results indicate that the treatment effect among those who were potentially exposed to spillover nudging is no different from the treatment effect among those who were not.
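The permutation procedure can be sketched as below. The sketch assumes the treatment effect within each task-order subgroup is measured as the treatment-minus-control difference in the MPL-LS covariance; the variable names and the synthetic data are hypothetical and serve only to make the example runnable.

```python
import numpy as np


def cov(x, y):
    """Sample covariance between two response vectors."""
    return np.cov(x, y)[0, 1]


def effect_gap(mpl, ls, treated, mpl_first):
    """Treatment effect among LS-first subjects minus that among MPL-first subjects."""

    def effect(mask):
        t, c = mask & treated, mask & ~treated
        # Assumption: the treatment effect is the treatment-control covariance gap.
        return cov(mpl[t], ls[t]) - cov(mpl[c], ls[c])

    return effect(~mpl_first) - effect(mpl_first)


def permutation_share(mpl, ls, treated, mpl_first, n_perm=10_000, seed=0):
    """Share of placebo gaps larger in absolute value than the actual gap."""
    rng = np.random.default_rng(seed)
    actual = effect_gap(mpl, ls, treated, mpl_first)
    placebo = np.empty(n_perm)
    for b in range(n_perm):
        fake_order = mpl_first.copy()
        # Reassign task order at random within each arm, keeping its frequency.
        for arm in (treated, ~treated):
            fake_order[arm] = rng.permutation(mpl_first[arm])
        placebo[b] = effect_gap(mpl, ls, treated, fake_order)
    return actual, float(np.mean(np.abs(placebo) > abs(actual)))


# Hypothetical data for illustration only (not the study's data).
rng = np.random.default_rng(1)
n = 200
treated = np.arange(n) % 2 == 0
mpl_first = rng.random(n) < 0.5
ls = rng.integers(1, 7, size=n).astype(float)
mpl = rng.integers(0, 7, size=n).astype(float)

actual_gap, share = permutation_share(mpl, ls, treated, mpl_first, n_perm=500)
print(actual_gap, share)
```

Shuffling task order within each arm keeps the number of MPL-first and LS-first subjects fixed in both the control and treatment groups, so the placebo data sets differ from the actual data only in which subjects carry which order label.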

### B. Heuristics

One alternative explanation for our main results is that individuals are applying a heuristic when responding to the LS task, and the nudge induces individuals to apply the same heuristic in the MPL task as they did in the LS task. Several pieces of evidence mitigate this concern.

First, we consider two potential heuristics mentioned in the literature (Holt & Laury, 2014). One potential heuristic is to always choose the option with higher expected value. This would result in the choice of option A four (or five) times before switching to option B in the fifth (or sixth) choice (that is, choosing four or five risky choices). We do not find that this is a popular pattern of choices: fewer than 10% of the subjects in both the control and treatment groups made each of these choices (see online appendix table A.2), and the difference in the proportions choosing four or five risky choices is not statistically significant ($p$-value $=$ 0.63). Another potential heuristic is to choose the option that maximizes the minimum payoff. This would result in always choosing option B, the safe choice. Again, we do not find a significant difference in the proportions choosing 0 risky choices in the MPL by treatment status ($p$-value $=$ 0.78).

Table 3.
Gender Differences in MPL Choices

| Dependent Variable | Number of Risky Choices | First Switch Point | Average Switch Point |
| --- | --- | --- | --- |
| Panel A: MPL standard: Control group | | | |
| Female | −0.628 | −0.389 | −0.577 |
| | (0.384) | (0.449) | (0.379) |
| Panel B: MPL nudged: Treatment group | | | |
| Female | −0.812* | −0.740 | −0.761* |
| | (0.446) | (0.473) | (0.447) |

$N=101$ for panel A; $N=92$ for panel B. Linear regressions. Robust standard errors are in parentheses. Significant at *10%, **5%, and ***1%.

Second, the potential for individuals to heuristically conform their MPL responses to those on the LS task exists only when the LS task was administered first. Therefore, if the treatment effect is driven by heuristics, we would expect a larger treatment effect when the LS task is (randomly) administered first. As shown in online appendix figure A.2, a permutation test finds no difference in the treatment effect by task order.

### C. Gender Differences in Responses

A well-established body of literature finds that women are more risk averse than men. A meta-analysis of 150 studies spanning thirty years in the psychology literature found women to be more risk averse than men in fourteen of sixteen task categories (Byrnes, Miller, & Schafer, 1999). In their survey of the literature, Croson and Gneezy (2009) conclude that there are robust findings of women being more risk averse than men and such differences “are found in most domains of risk taking.” Charness and Gneezy (2012) assemble data from fifteen papers that use the Gneezy and Potters (1999) risk elicitation method and find consistent evidence of women being more risk averse than men across a variety of subject pools. Female adolescents have also been found to be more risk averse than their male counterparts using a variety of risk instruments (Sutter et al., 2019).

Table 3 displays the results of regressions of the responses on the MPL task on gender. When the standard protocol is used, the gender difference is insignificant for all three coding methods. In contrast, when the nudged MPL protocol is used, the magnitude of the gender differences is larger for all three coding methods, and these differences are statistically significant at the 10% level in two of three coding methods. Responses on the LS task, which we expected to be well understood and which we show in section VIA to be less prone to miscomprehension than the MPL task, show that men are significantly more risk tolerant than women ($p$-value $=$ 0.05). These findings are consistent with Charness et al. (2018), who find no gender differences in responses on the standard MPL but find that men are significantly more risk tolerant on a simpler task in which subjects are asked to make only one of the choices in the MPL.13 More generally, the findings highlight the potential for task miscomprehension to lead to erroneous conclusions about economic preferences. This also relates to Gillen, Snowberg, and Yariv (2019), who find that poorly measured risk preferences can lead to the “overidentification” of new phenomena that are in fact attributable to risk preferences.

## VI. General Applications of the Conceptual Framework and Experimental Design

### A. Relative Complexity of Tasks

Scenario III can also be interpreted as assuming that the LS task is as complex as the MPL task, where complexity is defined as the proportion of individuals who initially miscomprehend a task under control conditions. In online appendix B.1.4, we generalize this in scenario III' (see figure 1, panel c) to include all cases where initially the nudged task (task A) is as complex as or less complex than the nonnudged task (task B). Referring to panel c of figure 1, the baseline story in scenario III' assumes that task A is less complex than task B and that understanding of task A is improved after the nudge but some individuals remain confused. But scenario III' also allows for the case where initially task A was as complex as task B ($p_2=0$), where the nudge does not improve comprehension of task A ($p_3=0$), and where no individuals remain confused by task A after the nudge ($p_4=0$). The prediction of scenario III' is that $Cov(A_s,B_s)\geq Cov(A_n,B_n)$, where $Cov(A_s,B_s)$ and $Cov(A_n,B_n)$ are the covariances between the responses on tasks A and B in the control group and the nudge group, respectively. Therefore, the fact that we find $Cov(MPL_s,LS_s)<Cov(MPL_n,LS_n)$ implies that the LS task is less complex than the MPL task, which is consistent with the finding in the literature that the LS task produces less noisy estimates than the MPL task (Dave et al., 2010) and with Charness et al.'s (2013) classification of the LS task as “simple” and the MPL task as “complex.” To the extent that the particular MPL task we used is simpler than others (e.g., the Holt & Laury, 2002, instrument or the Tanaka et al., 2010, instrument is arguably more complex), we can conclude that those MPL tasks are also more complex than the LS task.

### B. Effect Size

In online appendix B.1.2, we generalize scenario II to any two tasks A and B, where task A is the relatively poorly understood task and is the task that is nudged. Referring to panel c in figure 1, in scenario II' we allow for a subset of individuals to be confused when using task B and to remain confused using task A even when nudged. The baseline story in scenario II' assumes that task A is initially more susceptible to miscomprehension than task B, and the nudge improves understanding of task A, but task B remains better understood even after the nudge. Type 1 individuals do not miscomprehend either task. Types 2 and 3 individuals miscomprehend task A but not task B. Type 2 individuals become unconfused by the nudge, whereas type 3 individuals remain confused even when nudged. Type 4 individuals miscomprehend both tasks A and B. Types 1, 2, 3, and 4 individuals occur with probabilities $p_1$, $p_2$, $p_3$, and $p_4$, respectively, where $p_4=1-p_1-p_2-p_3$. Scenario II' allows for the case where the nudge improves understanding of task A to the same degree as task B ($p_3=0$), for the nudge to have no effect ($p_2=0$), and for there to be complete miscomprehension of task A initially ($p_1=0$). It does not allow for the nudge to work so well that task B becomes the relatively poorly understood task after task A is nudged.14 Under this scenario, the result remains that $Cov(A_s,B_s)<Cov(A_n,B_n)$ as long as the nudge is effective, that is, $p_2>0$.

Under scenario II', assuming that task B, the relatively better understood task, remains at least as well understood after nudging, the treatment effect size equals
$$p_2\sigma_s^2+p_2p_4(\mu_{X_{4s}}-\mu_{X_{1s}})(\mu_{Y_{4s}}-\mu_{Y_{1s}}), \tag{3}$$
where $\mu_{X_{4s}}-\mu_{X_{1s}}$ (and $\mu_{Y_{4s}}-\mu_{Y_{1s}}$) represents the mean difference in responses on task A (and task B) between those who are not confused by task A and those who are confused by task B, without nudging. More generally, equation (3) will be the difference in covariance between task B and any task A versus the covariance between task B and task A', where task A' contains the same choices as task A but includes a device that may improve understanding, provided task B remains better understood than both tasks A and A'. For instance, if task A is a bare-bones MPL with the choices listed in table format with written instructions, a device that improves understanding can be framing the probabilities associated with each choice in day-to-day terms such as the weather (e.g., Charness & Viceisza, 2016), or instructing subjects to make at most one switch (e.g., Tanaka et al., 2010). If the device was effective, all else equal, the covariance between tasks A' and B will be higher than that between tasks A and B. Furthermore, the more effective the device is, the higher will be the covariance between tasks A' and B, which allows for a ranking of the effectiveness of various devices (or combinations of devices).15

### C. Data Quality Metric

As long as task B is the simpler task, referring back to equation (1), if we can further assume there is no mean disparity in responses on task B given by those confused and those not confused by task B, the covariance between any relatively more complex task A and task B is equal to $p_1\sigma_s^2$, where $p_1$ is the proportion that give high-quality responses on task A (refer to figure 1, panel c, scenario II'). For ease of exposition, in this subsection we use notation corresponding to the control group (using the standard protocol). All results extend in a straightforward manner to the treatment group (using the nudge protocol). The covariance between task A and task B will also reduce to $p_1\sigma_s^2$ if task B is fully understood, or “simple” ($p_4=0$). Although there is no test for whether a task is fully understood, this assumption can be falsified if another task is shown to be simpler using the procedure described in section VIA.

Under the assumption that task B is simple, as made in scenario II, the maximum possible covariance is $\sigma_s^2$. Therefore, the ratio of the actual covariance to the maximum possible covariance gives us $p_1$, the percent of maximum decision-making quality achieved. To approximate $\sigma_s^2$ we can use $Var(B_j)$, which is equal to $\sigma_s^2+\sigma_{\nu_j}^2$. This gives an upper-bound estimate on $\sigma_s^2$, or maximum decision-making quality. Therefore, $Cov(A_j,B_j)/Var(B_j)$, or the slope coefficient in a regression of task A responses on task B responses, gives a lower-bound estimate of percent of maximum decision-making quality achieved in task A.
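A small simulation can illustrate why the regression slope is a lower bound on $p_1$. The sketch below assumes a simple task B equal to true risk preference plus noise, and a task A on which a share `p1` of respondents answer according to their true preference while the rest answer at random with the same mean. All parameter values and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
p1 = 0.6            # hypothetical share giving high-quality responses on task A
sigma_s, sigma_nu = 1.0, 0.5

s = rng.normal(0.0, sigma_s, n)              # true risk preference S
b = s + rng.normal(0.0, sigma_nu, n)         # simple task B: S plus noise
understands = rng.random(n) < p1
noise_a = rng.normal(0.0, sigma_nu, n)
random_a = rng.normal(0.0, sigma_s, n)       # confused responses, unrelated to S
a = np.where(understands, s + noise_a, random_a)

cov_ab = np.cov(a, b)[0, 1]
metric = cov_ab / np.var(b, ddof=1)          # Cov(A, B) / Var(B)
slope = np.polyfit(b, a, 1)[0]               # OLS slope of A on B

print(round(metric, 3), round(slope, 3), p1)
```

Under these assumptions the covariance is approximately `p1 * sigma_s**2`, so the metric is approximately `p1 * sigma_s**2 / (sigma_s**2 + sigma_nu**2)`, which is 0.48 here and lies below `p1` because the noise in task B inflates the denominator.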

Table 4.
Regression Coefficient from Regressing MPL on LS

| | Control (1) | Treatment (2) | $p$-value of H0: (1) $=$ (2) |
| --- | --- | --- | --- |
| Method A: Number of risky choices | | | |
| Coefficient on LS | 0.103 | 0.557*** | 0.008 |
| Method B: First switch point | | | |
| Coefficient on LS | 0.244* | 0.615*** | 0.046 |
| Method C: Average switch point | | | |
| Coefficient on LS | 0.113 | 0.568*** | 0.007 |
| $N$ | 101 | 92 | |

Method A defines MPL response as the total number of risky choices; method B defines MPL response as the number of risky choices made before the “first switch point.” Method C defines MPL response as the average switch point when the subject exhibits MSB.

To the extent that task B captures factors other than $S$, true risk preferences in the same domain as that captured in task A, the data quality metric will be an underestimate of percent of maximum decision-making quality achieved. These factors can include noise or risk preferences in a different domain. For example, the willingness-to-risk (WTR) survey question in Dohmen et al. (2011) is shown to be correlated with risk-taking behavior across a range of domains. Therefore, if the WTR measure is used as the comparison task, or task B, MPL data quality estimated using the regression coefficient on WTR (from a regression of MPL responses on WTR responses) is expected to be more underestimated than if the regression coefficient on LS is used (obtained from a regression of MPL responses on LS responses), all else equal and assuming that both the LS task and the WTR question are simple.

If task B is not fully understood, as in scenario II', but there is no mean disparity in responses on task B given by those confused and those not confused by task B, $Var(LS_j)=(p_1+p_2+p_3)(\sigma_s^2+\sigma_{\nu_{j1}}^2)+p_4\sigma_{\nu_{j4}}$. Therefore, the regression coefficient will be a lower-bound estimate of $p_1/(p_1+p_2+p_3)$, the proportion of high-quality decisions made in task A relative to the proportion in task B.

To the extent we can assume the LS task is simple, we can interpret the coefficient on LS responses in a regression of MPL responses on LS responses as a data quality metric. According to table 4, a lower-bound estimate of 56% to 62% of maximum decision-making quality is achieved using the nudge protocol, whereas when the standard protocol is used, the figures range from 10% to 24%. This implies that the nudge treatment increased data quality or, more precisely, the proportion of high-quality responses, by 152% to 400%.

Furthermore, if task B is simple or if there is no mean disparity across those confused and those not confused by task B, using the same task B, and holding the population studied constant, the data quality metric can be used to produce a ranking of data quality across studies using various MPL instruments and to produce a ranking of complexity across these instruments.

### D. Benchmarking Data Quality

To benchmark the data quality in our study against findings from previous studies, we searched for all studies that used both the MPL and the LS tasks. To our knowledge, only four other studies used both tasks, and only two report the regression coefficient or correlation coefficient between the two tasks.16 Deck et al. (2013) report a correlation coefficient of 0.27 using only subjects that did not engage in MSB, and based on our calculations, their regression coefficient is 0.26.17 This lower-bound estimate of 26% maximum decision-making quality achieved in the MPL is comparable to that in our control group. Reynaud and Couture (2012) report regression coefficients of 0.57 and 0.78 depending on the size of the (nonincentivized) stakes. These figures represent somewhat higher data quality than in our treatment group. Unfortunately, the study did not report details of the protocol or the MSB rate.

We further use the regression coefficient to compare data quality across different implementations of the MPL using the WTR survey question as the comparison task. The regression coefficient in Dohmen et al. (2011) is 0.61. In Crosetto and Filippin (2016), the correlation coefficient between the MPL task and the WTR measure is 0.23. Without knowing the relative sizes of the standard deviations of the MPL task and the WTR measure, we cannot know whether a regression coefficient would be higher or lower than 0.23. However, a regression coefficient larger than 0.61 would imply that the variance of the MPL task is over seven times the size of the variance of the WTR measure. In Charness and Viceisza (2016), the correlation coefficient between the MPL task and the WTR measure is 0.06.
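The "seven times" figure follows from the standard relationship between a regression slope and a correlation coefficient. Writing $\beta$ for the slope from a regression of MPL responses on WTR responses, $r$ for their correlation, and $\sigma_{MPL}$ and $\sigma_{WTR}$ for the two standard deviations (symbols introduced here for illustration only), the step can be spelled out as:

$$\beta = r\,\frac{\sigma_{MPL}}{\sigma_{WTR}}, \qquad \beta > 0.61 \text{ with } r = 0.23 \;\Longrightarrow\; \frac{\sigma_{MPL}^2}{\sigma_{WTR}^2} > \left(\frac{0.61}{0.23}\right)^2 \approx 7.0.$$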

The regression coefficient in Dohmen et al. (2011) is comparable to that in our treatment group. In their study, subjects were asked, once they preferred the safe option to the risky option (i.e., switched from the risky option to the safe option), whether they would also prefer higher-valued safe options to the risky option (all subjects indicated they would). This device may have increased comprehension relative to a standard implementation of the MPL.

### E. Is MSB a Good Proxy for Data Quality?

Although the literature generally views MSB as indicative of low decision-making quality, potential low decision-making quality among nonmultiple switchers is not well understood. Here we can separately identify decision-making quality and MSB. Interpreting the regression coefficient as data quality, we explicitly test decision-making quality for multiple switchers and nonmultiple switchers in table 5. The results show that data quality is insignificantly different from 0 for both multiple switchers and nonmultiple switchers in the control group, and is only significantly different from 0 for the nonmultiple switchers in the treatment group. MSB is not a good indicator of data quality in the control group ($p$-value $=$ 0.567). Among nonmultiple switchers, data quality is significantly higher under the nudge treatment than under the standard protocol ($p$-value $=$ 0.025).

Table 5.
MSB and Data Quality: Regression Coefficient on LS

| | Multiple Switchers (1) | Nonmultiple Switchers (2) | $p$-values of H0: (1) $=$ (2) |
| --- | --- | --- | --- |
| Control | −0.019 | 0.137 | 0.567 |
| $N$ | 31 | 70 | |
| Treatment | 0.000 | 0.602*** | 0.350 |
| $N$ | | 83 | |

MPL is coded using method B, the number of risky choices made before the “first switch point.” Significant at *10%, **5%, and ***1%.

These results imply that the MSB rate is not necessarily a good indicator of data quality, and low MSB rates are not indicative of good comprehension and high data quality. Furthermore, the common practice of restricting the data to nonmultiple switchers (Charness et al., 2013) does not necessarily resolve data quality issues.

### F. Cognitive Ability, MSB, and Data Quality

The previous literature shows that MSB is related to cognitive ability, in particular math ability (Dave et al., 2010; Meier & Sprenger, 2013). In the following analysis, we first check whether subjects' multiple switching behaviors are correlated with their school test scores and then check if a relationship exists between test scores and data quality. The tests are uniform across the middle schools in the county, and the test results are provided to us by the school administrators.

Online appendix table A.8 reports the results from the linear regression of multiple switching on test scores for the control and treatment groups separately. Test scores are the average of standardized math and standardized verbal test scores, standardized within each grade. Column 1 shows that in line with our expectations, individuals' cognitive ability is strongly correlated with MSB in the standard MPL protocol. Column 2 adds a set of control variables: gender, monthly household income, mother's educational attainment, mother's occupation, the number of children in the household, and grade fixed effects. The results are essentially unchanged. Column 2 shows that an increase of 1 standard deviation in test scores is associated with a 14.7 percentage point decrease in the likelihood of multiple switching. This corresponds to a 48% ($=$ 0.147/0.307) reduction in the probability of multiple switching. Online appendix table A.9 shows that both math and verbal scores are significant predictors of MSB. These findings suggest that one reason that multiple switchers using the standard MPL protocol make worse financial decisions (Jacobson & Petrie, 2009) could be their lower cognitive ability, which leads to both MSB and poor financial decision making.

Columns 3 and 4 in online appendix table A.8 show that cognitive ability no longer predicts MSB when using the nudged MPL protocol. Online appendix table A.10 shows that this is also the case for both verbal and math test scores.

Because MSB is not a good proxy for data quality, to examine the relationship between decision-making quality and cognitive ability, table 6 reports the regression coefficient on the LS task by overall cognitive ability for the control and treatment groups. The results show that data quality is low (insignificantly different from 0) for both high- and low-cognitive-ability individuals using the standard MPL. For the treatment group, data quality is significantly different from 0 for both high- and low-cognitive-ability individuals. The point estimates indicate that low-cognitive-ability individuals exhibited higher-quality decision making with the nudge protocol than high-cognitive-ability individuals did using the standard protocol. Table 6 also shows that cognitive ability is not a significant determinant of data quality for either the control or treatment group.18

Table 6.
Cognitive Ability and Data Quality: Regression Coefficient on LS

| | Low (1) | High (2) | $p$-values of H0: (1) $=$ (2) |
| --- | --- | --- | --- |
| Control | 0.149 | 0.229 | 0.768 |
| $N$ | 49 | 50 | |
| Treatment | 0.523*** | 0.704*** | 0.486 |
| $N$ | 46 | 46 | |

MPL is coded using method B, the number of risky choices made before the “first switch point.” High cognitive ability is defined as having above-median test scores (average of standardized verbal and math scores) in the control and treatment groups separately. Significant at *10%, **5%, and ***1%.
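The coding rule and the median split described in this note can be sketched in Python as follows. This is a minimal illustration under stated assumptions: choices are recorded row by row as "R" (risky) or "S" (safe), subjects at the median are placed in the low group, and all function names and toy values are hypothetical rather than taken from the study's code.

```python
from statistics import mean, median


def first_switch_index(choices):
    """Index of the first row whose choice differs from the row above, else None."""
    for i in range(1, len(choices)):
        if choices[i] != choices[i - 1]:
            return i
    return None


def code_method_b(choices, risky="R"):
    """Method B: number of risky choices made before the first switch point."""
    stop = first_switch_index(choices)
    prefix = choices if stop is None else choices[:stop]
    return sum(1 for c in prefix if c == risky)


def is_multiple_switcher(choices):
    """True if the subject changes columns more than once down the list."""
    switches = sum(1 for a, b in zip(choices, choices[1:]) if a != b)
    return switches > 1


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def slopes_by_median_split(scores, ls, mpl):
    """Slope of MPL on LS for low (at or below median) and high (above median) scores."""
    cut = median(scores)
    groups = {"low": ([], []), "high": ([], [])}
    for s, x, y in zip(scores, ls, mpl):
        key = "high" if s > cut else "low"
        groups[key][0].append(x)
        groups[key][1].append(y)
    return {k: ols_slope(x, y) for k, (x, y) in groups.items()}


# Hypothetical responses for two subjects on a ten-row list.
consistent = "RRRSSSSSSS"
inconsistent = "RRSRSSSSSS"
print(code_method_b(consistent), is_multiple_switcher(consistent))
print(code_method_b(inconsistent), is_multiple_switcher(inconsistent))

# Hypothetical scores and task responses for six subjects.
toy_scores = [-1.2, -0.8, -0.3, 0.2, 0.9, 1.4]
toy_ls = [1, 3, 5, 2, 4, 6]
toy_mpl = [2, 4, 6, 3, 3, 3]
print(slopes_by_median_split(toy_scores, toy_ls, toy_mpl))
```

In the actual analysis the split is made within the control and treatment groups separately, so a function like `slopes_by_median_split` would be called once per group.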

To the extent that the nudge treatment was able to elicit more cognitive effort from respondents, the findings suggest that cognitive ability is not a limiting factor in achieving high data quality in the MPL, but cognitive effort is. Increased cognitive effort has also been proposed as an explanation for the reduction in MPL errors in choice situations where subjects can incur a loss (Von Gaudecker, Van Soest, & Wengström, 2011) and when stakes are high (Holt & Laury, 2002). Cognitive effort in the form of longer deliberation time has also been shown to increase patience in intertemporal decisions (Imas et al., 2018). This relates broadly to the literature on bounded rationality, which finds that cognitive effort plays an important role in decision making independent of intelligence and that humans are prone to cognitive laziness (Kahneman, 2003, 2011).

### G. Generalizability

The results in section VI.F have implications for generalizing our nudge treatment effect to populations with higher cognitive ability and generally lower MSB, such as university students in developed countries, the typical subject pool in experimental economics studies. To test whether our main results would hold for a subject pool with attributes more similar to the typical subject pool, we restrict the analysis to individuals with above-median test scores. We find that among the control group in this subsample, the MSB rate is 19%, which approaches the average inconsistency rate of 17.1% in a survey of 54 published MPL studies (Crosetto & Filippin, 2016; Filippin & Crosetto, 2016) and is lower than the MSB rate in some studies conducted with university students in developed countries.19 Among the subsample with above-median test scores, we find results similar to those in our full sample. The covariance between the two tasks is significantly different from 0 for all three coding methods in the treatment group, but not in the control group. See online appendix table A.11.
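One common way to judge whether a covariance between two task responses differs from 0 is a percentile bootstrap over subjects. The Python sketch below illustrates that idea only; the resampling scheme, the number of replications, the seed, and the toy responses are assumptions and may differ from the procedure used for table A.11.

```python
import random
from statistics import mean


def covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def bootstrap_covariances(x, y, n_boot=2000, seed=0):
    """Covariances from subject-level resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(covariance([x[i] for i in idx], [y[i] for i in idx]))
    return draws


def percentile_interval(draws, level=0.95):
    """Percentile bootstrap interval for the resampled statistic."""
    s = sorted(draws)
    lo = s[int((1 - level) / 2 * len(s))]
    hi = s[int((1 + level) / 2 * len(s)) - 1]
    return lo, hi


# Hypothetical paired responses (LS lottery number, MPL risky-choice count).
ls = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
mpl = [2, 3, 3, 5, 6, 7, 1, 2, 4, 4, 5, 6]
draws = bootstrap_covariances(ls, mpl)
low, high = percentile_interval(draws)
print(f"sample covariance: {covariance(ls, mpl):.3f}")
print(f"95% percentile interval: ({low:.3f}, {high:.3f})")
```

Under this approach, an interval that excludes 0 would be read as a covariance significantly different from 0 at the chosen level.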

The findings of online appendix table A.11 are consistent with our finding that MSB rates are correlated with cognitive ability, but data quality is not. Therefore, despite the generally lower MSB rates in studies conducted with university students, it cannot be taken for granted that these subjects comprehend the MPL task. For example, in Deck et al. (2013), the subjects were individuals affiliated with the University of Arkansas; over 80% were undergraduates, and the remainder were faculty, staff, and graduate students. The study found an MSB rate of 15.8%, and the data quality metric was 0.26, which is comparable to that in our control group and lower than that in our treatment group.20 This suggests that even among university students in developed countries, there can be room for improvement in data quality and comprehension of the MPL.

## VII. Conclusion

In this study, we developed a conceptual framework that defines decision-making quality in the MPL and used a novel experimental design and treatment protocol to test several prominent explanations of low decision-making quality suggested by the literature. In a departure from previous literature, our study provides a direct test of, and finds evidence in support of, task-specific miscomprehension as the explanation for low-quality decision making in the MPL. We conclude that the MPL task is more complex than the LS task, in the sense that a larger proportion of individuals miscomprehend it. Our findings are robust to considerations of spillover effects and the use of heuristics.

Using our framework and experimental design, we propose several general applications for testing task complexity and comprehension. In particular, using the data quality metric we propose, we make several observations from our data: MSB is not a good proxy for data quality in the MPL, as is often believed, and removing multiple switchers from the data does not necessarily improve data quality. Cognitive ability explains MSB in the standard protocol, but it is not a significant determinant of data quality. This suggests that cognitive effort is the limiting factor in achieving high data quality, and that even among populations of high cognitive ability exhibiting low rates of MSB, comprehension of the MPL task can be a problem. Indeed, one sample of university students with a relatively low MSB rate (Deck et al., 2013) exhibited data quality comparable to that of our control group.

The nudge protocol was shown to be effective in increasing comprehension of the MPL and data quality, which implies that protocol design innovations in the MPL can reveal risk preferences that would otherwise be obscured due to poor comprehension. While our study shows the nudge to be effective in improving comprehension and data quality, more research is needed to understand the impact of various other devices and strategies that are used to improve comprehension in the MPL. The importance of this research is underscored by the fact that low data quality can lead to erroneous conclusions about economic preferences and can lead to overidentification of new phenomena that are in fact attributable to risk preferences.

## Notes

1

A nudge, as defined by Thaler and Sunstein (2008), is “any aspect of the choice architecture that alters people's behavior in a predictable way without forbidding any options or significantly changing their economic incentives.”

2

In a similar vein, Imas, Kuhn, and Mironova (2018) implement a waiting period to study the impact of deliberation on intertemporal choices.

3

Low-quality decisions in Choi et al. (2013) are defined as choices inconsistent with the generalized axiom of revealed preferences in their experiment, which asks subjects to choose bundles of goods under varying budget slopes.

4

The lottery selection task was independently developed in Binswanger (1980) and Eckel and Grossman (2002). The LS task requires subjects to select one out of six different coin-flip lotteries. Selecting a lottery with higher expected value and/or higher variance is indicative of higher risk tolerance. In this task, all choices are compatible with standard assumptions on preferences. The LS task was categorized as “simple” by Charness and Viceisza (2016) on the basis that it requires subjects to make only one choice rather than multiple choices, as in the MPL. See also Charness et al. (2013), which surveys several popular risk elicitation methods and categorizes each as simple or complex. Besides the LS task, the balloon analogue risk task (Lejuez et al., 2002) and the Gneezy and Potters (1999) method are also categorized as simple tasks.

5

We omitted the honors classes at the request of school administrators.

6

This middle school is a boarding school, and all students are in school through the evening self-study period (wanzixi), which ends at 10:00 p.m. Over 85% of students live on campus and are allowed to return home only on weekends. The rest live within a 10-minute walk of the school.

7

Because treatment status was assigned prior to the date of the experiment, the final number of subjects in each group was not identical due to no-shows.

8

While Dohmen et al. (2011) ask subjects, after they make the first switch from the risky choice to the safe choice, whether they would also like higher amounts of the safe choice, we do not overtly discourage subjects from multiple switching in either the control or treatment groups.

9

Our results are robust to assuming CARA utility.

10

The number of risky choices is coded using method B.

11

See online appendix figure A.1 for the empirical distribution of the bootstrap sample covariances.

12

The suggestion to exert more cognitive effort is the treatment itself, and thus does not constitute a problematic experimenter demand effect (Davis & Holt, 1993, p. 26). In our context, problematic experimenter demand effects are those that lead to behavior resulting not from more cognitive effort but from subject perception of what is desired, such as subjects changing their responses in a random manner after the nudge without putting in more cognitive effort.

13

Charness et al. (2018) argue that because the simpler task is “easier to understand,” the responses produced are “more meaningful.”

14

If the nudge works “too” well, it is possible for $Cov(As,Bs)>Cov(An,Bn)$ even if task A was the more complex task before nudging. Therefore, such a finding is inconclusive with regard to relative task complexity, whereas a finding of $Cov(As,Bs)<Cov(An,Bn)$ implies that task A is the more complex task.

15

Several studies have compared the impact of various devices on MSB. Bauermeister and Musshoff (2016) find that a visual representation of the Holt and Laury (2002) MPL reduces MSB. Bruner (2011) finds that compared with providing only written instructions, verbally emphasizing that only one row will be paid reduced MSB. However, MSB is not necessarily indicative of comprehension or data quality, as we show in the next section. Moreover, this precludes the study of the effectiveness of devices that forbid or discourage MSB.

16

These studies are Dave et al. (2010), Reynaud and Couture (2012), Deck et al. (2013), and Crosetto and Filippin (2016).

17

Using tables 1 and 3 in a working paper version of the study, we calculated the variances of the MPL and the LS task to be 2.45 and 2.7, respectively.

18

We found that in a regression of MPL responses on LS responses interacted with test scores, the interaction term was insignificant for both the control and treatment groups (control group $p$-value $=$ 0.789; treatment group $p$-value $=$ 0.836), confirming the finding that cognitive ability is not a significant determinant of data quality.
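A regression of this kind could be written, under assumed notation and without any additional controls, as

$$
MPL_i = \alpha + \beta\, LS_i + \gamma\, Score_i + \delta\,(LS_i \times Score_i) + \varepsilon_i ,
$$

where the hypothesis of interest is $H_0\colon \delta = 0$. The symbols are illustrative only; whether the main effect of test scores or further controls enter the estimated model is not stated in this note.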

19

Charness et al. (2018) report an MSB rate of 22% using UC San Diego undergraduates; Bruner, McKee, and Santore (2008) report an MSB rate of 25% using University of Tennessee undergraduates (based on our calculations using table 1 in the study); and Bruner (2011) reports an MSB rate of 26% using university students in the United States. Bauermeister and Musshoff (2016) report an MSB rate of 21% using university students in Germany.

20

The MSB rate is based on our calculations using table 1 in the study.

## REFERENCES

Agranov, Marina, and Pietro Ortoleva, “Stochastic Choice and Preferences for Randomization,” Journal of Political Economy 125:1 (2017), 40–68.

Alan, Sule, and Seda Ertac, “Fostering Patience in the Classroom: Results from a Randomized Educational Intervention,” Journal of Political Economy 126:5 (2018).

Andersen, Steffen, Glenn W. Harrison, Morten I. Lau, and E. E. Rutström, “Eliciting Risk and Time Preferences,” Econometrica 76 (2008), 583–618.

Andersen, Steffen, Glenn W. Harrison, Morten I. Lau, and E. E. Rutström, “Elicitation Using Multiple Price List Formats,” Experimental Economics 9 (2006), 383–405.

Andreoni, James, and Charles Sprenger, “Uncertainty Equivalents: Testing the Limits of the Independence Axiom,” NBER working paper w17342 (2011).

Barr, Abigail, and Garance Genicot, “Risk Sharing, Commitment, and Information: An Experimental Analysis,” Journal of the European Economic Association 6 (2008), 1151–1185.

Bauer, Daniel, Veronika Kopp, and Martin R. Fischer, “Answer Changing in Multiple Choice Assessment Change That Answer When in Doubt – and Spread the Word!” BMC Medical Education 7 (2007), 28.

Bauermeister, Golo, and Oliver Musshoff, “Risk Aversion and Inconsistencies: Does the Choice of Risk Elicitation Method and Display Format Influence the Outcomes?” Paper presented at the 2016 Annual Meeting of the Agricultural and Applied Economics Association, 2016.

Benjamin Jr., Ludy T., Timothy A. Cavell, and William R. Shallenberger III, “Staying with Initial Answers on Objective Tests: Is It a Myth?” Teaching of Psychology 11 (1984), 133–141.

Binswanger, Hans P., “Attitudes toward Risk: Experimental Measurement in Rural India,” American Journal of Agricultural Economics 62 (1980), 395–407.

Brick, Kerri, Martine Visser, and Justine Burns, “Risk Aversion: Experimental Evidence from South African Fishing Communities,” American Journal of Agricultural Economics 94 (2012), 133–152.

Bruner, David, “Multiple Switching Behavior in Multiple Price Lists,” Applied Economics Letters 18 (2011), 417–420.

Bruner, David, Michael McKee, and Rudy Santore, “Hand in the Cookie Jar: An Experimental Investigation of Equity-Based Compensation and Managerial Fraud,” Southern Economic Journal 75 (2008), 261–278.

Byrnes, James P., David C. Miller, and William D. Schafer, “Gender Differences in Risk Taking: A Meta-Analysis,” Psychological Bulletin 125 (1999), 367.

Campbell, Donald T., and Julian C. Stanley, Handbook of Research on Teaching (Boston: Houghton Mifflin, 1963).

Cassar, Alessandra, Feven Wordofa, and Y. Jane Zhang, “Competing for the Benefit of Offspring Eliminates the Gender Gap in Competitiveness,” Proceedings of the National Academy of Sciences 113 (2016), 5201–5205.

Castillo, Marco, Paul J. Ferraro, Jeffrey L. Jordan, and R. Petrie, “The Today and Tomorrow of Kids: Time Preferences and Educational Outcomes of Children,” Journal of Public Economics 95:11–12 (2011), 1377–1385.

Castillo, Marco, Jeffrey L. Jordan, and Ragan Petrie, “Discount Rates of Children and High School Graduation,” Economic Journal 129 (2019), 1153–1181.

Charness, Gary, Catherine Eckel, Uri Gneezy, and Agne Kajackaite, “Complexity in Risk Elicitation May Affect the Conclusions: A Demonstration Using Gender Differences,” Journal of Risk and Uncertainty 56 (2018), 1–17.

Charness, Gary, and Uri Gneezy, “Strong Evidence for Gender Differences in Risk Taking,” Journal of Economic Behavior and Organization 83 (2012), 50–58.

Charness, Gary, Uri Gneezy, and Alex Imas, “Experimental Methods: Eliciting Risk Preferences,” Journal of Economic Behavior and Organization 87 (2013), 43–51.

Charness, Gary, and Angelino Viceisza, “Three Risk-Elicitation Methods in the Field: Evidence from Rural Senegal,” Review of Behavioral Economics 3 (2016), 145–171.

Chew, Soo Hong, Bin Miao, Qiang Shen, and Songfa Zhong, “Multiple Switching Behavior in Choice Lists,” presentation at the China Greater Bay Area Experimental Economics Workshop, University of Hong Kong, Hong Kong, 2018.

Choi, Syngjoo, Shachar Kariv, Wieland Müller, and Dan Silverman, “Who Is (More) Rational?” American Economic Review 104 (2013), 1518–1550.

Crosetto, Paolo, and Antonio Filippin, “A Theoretical and Experimental Appraisal of Four Risk Elicitation Methods,” Experimental Economics 19 (2016), 613–641.

Croson, Rachel, and Uri Gneezy, “Gender Differences in Preferences,” Journal of Economic Literature 47 (2009), 448–474.

Dave, Chetan, Catherine Eckel, Cathleen Johnson, and Christian Rojas, “Eliciting Risk Preferences: When Is Simple Better?” Journal of Risk and Uncertainty 41 (2010), 219–243.

Davis, Douglas D., and Charles A. Holt, Experimental Economics (Princeton, NJ: Princeton University Press, 1993).

Deck, Cary, Jungmin Lee, Javier A. Reyes, and Christopher C. Rosen, “A Failed Attempt to Explain within Subject Variation in Risk Taking Behavior Using Domain Specific Risk Attitudes,” Journal of Economic Behavior and Organization 87 (2013), 1–24.

Dohmen, Thomas, Armin Falk, David Huffman, Uwe Sunde, Jürgen Schupp, and Gert G. Wagner, “Individual Risk Attitudes: Measurement, Determinants, and Behavioral Consequences,” Journal of the European Economic Association 9:3 (2011), 522–550.

Eckel, Catherine C., and Philip J. Grossman, “Sex Differences and Statistical Stereotyping in Attitudes toward Financial Risk,” Evolution and Human Behavior 23 (2002), 281–295.

Eckel, Catherine C., Philip J. Grossman, Cathleen A. Johnson, Angela C. M. de Oliveira, Christian Rojas, and Rick K. Wilson, “School Environment and Risk Preferences: Experimental Evidence,” Journal of Risk and Uncertainty 45 (2012), 265–292.

Filippin, Antonio, and Paolo Crosetto, “A Reconsideration of Gender Differences in Risk Attitudes,” Management Science 62 (2016), 3138–3160.

Gillen, Ben, Erik Snowberg, and L. Yariv, “Experimenting with Measurement Error: Techniques with Applications to the Caltech Cohort Study,” Journal of Political Economy 127 (2019), 1826–1863.

Gneezy, Uri, and Jan Potters, “An Experiment on Risk Taking and Evaluation Periods,” Quarterly Journal of Economics 112 (1999), 631–645.

Harrison, Glenn W., Morten I. Lau, and Melonie B. Williams, “Estimating Individual Discount Rates in Denmark: A Field Experiment,” American Economic Review 92 (2002), 1606–1617.

Harrison, Glenn, and Elisabet Rutström, “Risk Aversion in the Laboratory,” in James Cox and Glenn W. Harrison, eds., Risk Aversion in Experiments (Bingley, UK: Emerald Group, 2008), 41–196.

Holt, Charles A., and Susan K. Laury, “Risk Aversion and Incentive Effects,” American Economic Review 92 (2002), 1644–1655.

Holt, Charles A., and Susan K. Laury, “Assessment and Estimation of Risk Preferences” (pp. 135–201), in Mark Machina and W. Viscusi, eds., Handbook of the Economics of Risk and Uncertainty (Amsterdam: North-Holland, 2014).

Imas, Alex, Michael Kuhn, and Vera Mironova, “Waiting to Choose,” CESifo working paper 6162 (2018).

Jacobson, Sarah, and Ragan Petrie, “Learning from Mistakes: What Do Inconsistent Choices over Risk Tell Us?” Journal of Risk and Uncertainty 38 (2009), 143–158.

Kahneman, Daniel, “Maps of Bounded Rationality: Psychology for Behavioral Economics,” American Economic Review 93 (2003), 1449–1475.

Kahneman, Daniel, Thinking, Fast and Slow (London: Macmillan, 2011).

Kahneman, Daniel, Jack L. Knetsch, and R. H. Thaler, “Experimental Tests of the Endowment Effect and the Coase Theorem,” Journal of Political Economy 98 (1990), 1325–1348.

Lejuez, Carl W., Jennifer P., Christopher W. Kahler, Jerry B. Richards, Susan E. Ramsey, Gregory L. Stuart, David R. Strong, and R. A. Brown, “Evaluation of a Behavioral Measure of Risk Taking: The Balloon Analogue Risk Task (BART),” Journal of Experimental Psychology: Applied 8 (2002), 75.

McMorris, Robert F., Lawrence P. DeMers, and Shirley P. Schwarz, “Attitudes, Behaviors, and Reasons for Changing Responses Following Answer-Changing Instruction,” Journal of Educational Measurement 24 (1987), 131–143.

Meier, Stephan, and Charles D. Sprenger, “Discounting Financial Literacy: Time Preferences and Participation in Financial Education Programs,” Journal of Economic Behavior and Organization 95 (2013), 159–174.

Niederle, Muriel, “Gender,” in Handbook in Experimental Economics, 2nd ed. (Princeton, NJ: Princeton University Press, 2016).

Reynaud, Arnaud, and Stéphane Couture, “Stability of Risk Preference Measures: Results from a Field Experiment on French Farmers,” Theory and Decision 73 (2012), 203–221.

Sutter, Matthias, Martin G. Kocher, Daniela Glätzle-Rützler, and Stefan T. Trautmann, “Impatience and Uncertainty: Experimental Decisions Predict Adolescents' Field Behavior,” American Economic Review 103 (2013), 510–531.

Sutter, Matthias, Claudia Zoller, and Daniela Glätzle-Rützler, “Economic Behavior of Children and Adolescents: A First Survey of Experimental Economics Results,” European Economic Review 111 (2019), 98–121.

Tanaka, Tomomi, Colin F. Camerer, and Quang Nguyen, “Risk and Time Preferences: Linking Experimental and Household Survey Data from Vietnam,” American Economic Review 100 (2010), 557–571.

Thaler, Richard H., and Cass R. Sunstein, Nudge: Improving Decisions about Health, Wealth, and Happiness (New Haven, CT: Yale University Press, 2008).

Vispoel, Walter P., “Reviewing and Changing Answers on Computer-Adaptive and Self-Adaptive Vocabulary Tests,” Journal of Educational Measurement 35 (1998), 328–345.

Von Gaudecker, Hans-Martin, Arthur Van Soest, and Erik Wengström, “Heterogeneity in Risky Choice Behavior in a Broad Population,” American Economic Review 101 (2011), 664–694.

## Author notes

We thank Catherine Eckel, Willa Friedman, Elaine Liu, Muriel Niederle, Angelino Viceisza, Rick Wilson, three anonymous referees, and many seminar and conference participants for helpful conversations and suggestions. We are indebted to Yang Haitao and Geirongduzhi for their field assistance. Glen Ng provided excellent research assistance.

A supplemental appendix is available online at https://doi.org/10.1162/rest_a_00895.