Critical Values Robust to P-hacking

P-hacking is prevalent in reality but absent from classical hypothesis testing theory. As a consequence, significant results are much more common than they are supposed to be when the null hypothesis is in fact true. In this paper, we build a model of hypothesis testing with p-hacking. From the model, we construct critical values such that, if the values are used to determine significance, and if scientists' p-hacking behavior adjusts to the new significance standards, significant results occur with the desired frequency. Such robust critical values allow for p-hacking so they are larger than classical critical values. To illustrate the amount of correction that p-hacking might require, we calibrate the model using evidence from the medical sciences. In the calibrated model the robust critical value for any test statistic is the classical critical value for the same test statistic with one fifth of the significance level.


I. Introduction
Definition of p-hacking. P-hacking occurs when scientists engage in various behaviors that increase their chances of reporting statistically significant results (Simonsohn, Nelson, & Simmons, 2014; Wasserstein & Lazar, 2016). Typical p-hacking practices include running many small-sample experiments rather than one large-sample experiment; reporting studies with significant results but suppressing studies with insignificant results; collecting data until a significant result is obtained; dropping inconvenient observations or outcomes from a study; and searching for statistical specifications that produce significant results (Nosek, Spies, & Motyl, 2012; Lindsay, 2015; Christensen, Freese, & Miguel, 2019; Stefan & Schoenbrodt, 2023).
Prevalence of p-hacking. P-hacking is prevalent in science (online appendix B.1). Scientists readily admit to it. It is visible in meta-analyses: the distributions of test statistics in entire literatures show that scientists tinker with their analyses to obtain significant results. And it appears when tracking cohorts of scientific studies: studies finding significant results are almost certain to be reported, whereas studies finding insignificant results are likely to remain unreported.
Reasons for p-hacking. That p-hacking is so prevalent is unsurprising because scientists face strong incentives to p-hack. First, significant results are more rewarded than insignificant ones (online appendix B.2). This is because scientific journals prefer publishing significant results. Publications, in turn, determine a scientist's career path, including promotions, salary, and honorific rewards.
Second, scientists enjoy a lot of flexibility in data collection and analysis (online appendix B.3).
Hence, even when the null hypothesis is true, they have ample opportunity to obtain significant results without violating scientific norms.
Problems caused by p-hacking. Despite its prevalence, p-hacking is not accounted for in classical hypothesis-testing theory. Therefore, classical critical values set a standard for significance that is too lax: a true null hypothesis is rejected more often than purported by the test's significance level. To solve this problem, we construct critical values that are robust to p-hacking. Once such critical values are in place, scientists continue to p-hack, but readers can be confident that true null hypotheses are not rejected more often than the advertised significance level.
Model of hypothesis testing with p-hacking. We consider a scientist who tests a hypothesis by conducting an experiment. If she obtains a significant result from the experimental data, she obtains a high payoff. By contrast, if she obtains an insignificant result, she obtains a lower payoff. The difference in payoffs reflects the facts that significant results are more likely to be published, and publications yield rewards to scientists. Therefore, if the scientist obtains an insignificant result, and if she still has resources to devote to the project, she has an incentive to conduct another experiment to try to obtain a significant result from the second experiment's data. Conducting a second experiment without disclosing the first constitutes p-hacking.1

Optimal p-hacking strategy. Using optimal stopping theory, we find that the scientist's optimal strategy is to conduct experiments until she finds a significant result (Ferguson, 2007). Not all projects yield significant results, however, because the resources that a scientist can devote to any project are finite (Chen, 2021). If the scientist runs out of resources before finding a significant result, she reports an insignificant result.
Probability of type 1 error. We begin by computing the expected number of experiments run by a scientist when the null hypothesis is true, as a function of the prevailing critical value. From this we compute the probability of type 1 error as a function of the critical value. The critical value influences the probability of type 1 error in two ways. First, it determines the probability that a true null hypothesis is rejected in each experiment, as in classical statistics. Second, it influences the number of experiments that the scientist runs, a feature unique to our model.1

1. Because the number of experiments is not observable, multiple-testing corrections cannot be used to correct for p-hacking.

Critical value robust to p-hacking. From these results we compute the critical value such that type 1 errors occur at the intended rate, given by the significance level. This critical value is robust to p-hacking, and it takes the form of a nonstandard Bonferroni correction. For any test statistic and any significance level, the robust critical value is the classical critical value for the same test statistic with the significance level divided by the expected number of experiments when the robust critical value is in place. Accordingly, the robust critical value is larger than the classical critical value for the same test statistic and significance level. An advantage of the model is that the expected number of experiments when the robust critical value is in place, and the robust critical value itself, are determined by just two parameters: the significance level and the probability of completing an experiment before running out of resources.
Numerical illustration. To illustrate the amount of correction that p-hacking might require, we calibrate the completion probability using evidence from medical science (Dwan et al., 2008). We obtain the rule of thumb that the robust critical value for any test statistic is the classical critical value for the same test statistic with one fifth of the significance level. Hence, the robust critical value for a significance level of 5% is the classical critical value for a significance level of 5%/5 = 1%. For a z-test with a significance level of 5%, and similarly for a large-sample t-test with a significance level of 5%, this means that the robust critical value is 2.33 instead of 1.64 if the test is one-sided, and 2.58 instead of 1.96 if the test is two-sided.
Extensions of the model. Our model of hypothesis testing is quite stylized, but it can be extended in various ways. In online appendix D, we add a cost of doing research, incurred by the scientist at each new experiment. In online appendix E, we add time discounting, which reduces the value of significant results obtained far into the future. And in online appendix F, we assume that consecutive experiments become more and more difficult to run, and thus less and less likely to be completed. In all these extensions, the robust critical value computed in the basic model continues to be operational: it maintains the probability of type 1 error below the significance level.

Other p-hacking strategies. In the model, scientists p-hack by repeatedly running experiments until they reach significant results. This p-hacking strategy appears to be quite common (Bakker, van Dijk, & Wicherts, 2012). However, the model can be adapted to describe a wider range of p-hacking strategies. In online appendix C.2, we consider scientists who pool data across experiments. In online appendix C.3, we consider scientists who remove more and more outliers until they reach significant results. In online appendix C.4, we consider scientists who successively examine different regression specifications so as to obtain significant results. Finally, in online appendix C.5, we consider scientists who successively examine different instruments to reach significant results. We find that the robust critical value computed under the repeated-experiment strategy remains useful under these other p-hacking strategies because it maintains the probability of type 1 error below the significance level.
Control of type 1 error rate for generic p-hacking strategies. More generally, the robust critical value derived in the basic model controls the type 1 error rate for any p-hacking strategy that induces positive dependence across test statistics (online appendix C.1). While the basic model assumes independent test statistics, each obtained from a separate experiment, real-world p-hacking often yields dependent test statistics. Nonetheless, our robust critical value remains valuable by maintaining the probability of type 1 error below the significance level even when p-hacking induces positive dependence across test statistics. Positive dependence results from various p-hacking strategies encountered in practice: when scientists pool data across experiments, when they remove outliers, or when they search across various statistical specifications. Our robust critical value can therefore be used even if the particular p-hacking strategies used by scientists are unknown, as long as these strategies can be expected to generate positive dependence across test statistics, and the completion probability is calibrated to the upper bound of plausible completion probabilities across strategies.
Difference between p-hacking and publication bias. P-hacking and publication bias are frequently conflated, but they represent distinct issues. P-hacking refers to the attempts by researchers to reach significance in individual studies, while publication bias refers to journals' preference for significant results, that is, their reluctance to publish insignificant results. P-hacking makes significance more likely within any published or unpublished study, while publication bias makes significant results more prevalent across the published literature. This paper solely tackles the distortions created by p-hacking in individual studies. Other methods are available to correct the distortions created by publication bias in meta-analyses (Begg & Berlin, 1988; Hedges, 1992; Duval & Tweedie, 2000; Stanley, 2005; Simonsohn, Nelson, & Simmons, 2014; Stanley & Doucouliagos, 2014; McCrary, Christensen, & Fanelli, 2016; Andrews & Kasy, 2019). Presumably these methods could continue to be used to debias meta-analytic estimates even if critical values robust to p-hacking replaced classical critical values.

II. Model of hypothesis testing with p-hacking
This section develops a simple model of hypothesis testing with p-hacking. A scientist runs experiments with the aim of reaching a significant result. Running experiments takes time, stamina, and money, which are all in finite supply. Because scientists must report results before running out of resources, not all projects yield significant results.

A. Hypothesis test
The scientist tests a null hypothesis H_0 against an alternative hypothesis H_1. The data are governed by a different probability distribution under each hypothesis. The scientist sets the test's significance level to α ∈ (0, 1). The significance level gives the intended probability of type 1 error, the error that occurs when a true null hypothesis is rejected. Common significance levels are 10%, 5%, and 1%.

B. Test statistic
To conduct the hypothesis test, the scientist collects a dataset from an experiment. From this dataset she constructs a test statistic T. Under the null hypothesis, the cumulative distribution function of the test statistic is F, its survival function is S = 1 - F, and its inverse survival function is Z = S^(-1).2

C. Classical critical value
The null hypothesis is rejected when the test statistic exceeds the critical value z. If the scientist obtains a test statistic t > z, the null hypothesis is rejected: the result is significant. But if she obtains a test statistic t ≤ z, the null hypothesis cannot be rejected: the result is insignificant. Accordingly, the probability of type 1 error is S(z). The classical critical value is set such that the probability of type 1 error in one single test equals the significance level:

S(z) = α, (1)

or equivalently z = Z(α).
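As a concrete illustration, the classical critical value z = Z(α) can be computed with Python's standard library. The sketch below assumes a one-sided z-test, so that the null distribution of the test statistic is standard normal; the definition itself applies to any test statistic.

```python
from statistics import NormalDist

# Classical critical value z = Z(alpha), where Z is the inverse survival
# function of the test statistic's null distribution. Here the null
# distribution is standard normal (one-sided z-test, an illustrative choice).
def classical_critical_value(alpha):
    return NormalDist().inv_cdf(1 - alpha)  # Z(alpha) = S^{-1}(alpha)

z_10 = classical_critical_value(0.10)  # ~1.28
z_05 = classical_critical_value(0.05)  # ~1.64
z_01 = classical_critical_value(0.01)  # ~2.33
```

A smaller significance level yields a larger critical value, as expected from (1).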

D. Rewards from significant results
The first nonclassical element of the model is the rewards accruing to significant results. To capture the facts that significant results are more likely to be published than insignificant results, and that publications yield rewards to scientists, we assume that the expected rewards v_s from a study with significant results are higher than the expected rewards v_i from a study with insignificant results.

E. Opportunities for p-hacking
Scientists have ample opportunity to p-hack. However, their resources (time, money, manpower, stamina) are not infinite. Hence, they cannot systematically obtain significant results (Chen, 2021).
We assume that it takes a random amount of resources to conduct an experiment, and the scientist must keep the cumulative resources used below a random limit L. Once the scientist has exhausted more resources than L, she must stop working on the project. The resource limit captures the many resource constraints faced by scientists: limited access to data, limited funding, limited coauthor time, limited time before publication of similar results by competing research teams, limited stamina to work on specific projects, or limited time before the opportunity to work on more promising projects arises. Following Ferguson (2007, p. 4.12), we assume that the resource limit has an exponential distribution with rate λ > 0, so P(L > l) = exp(-λl) for any l > 0.

2. For simplicity we focus on simple null hypotheses. For composite null hypotheses, we would use the distribution under the null hypothesis's configuration that is the easiest to reject. For example, when testing H_0: E(X) ≤ µ_0 versus H_1: E(X) > µ_0, we would use the distribution of the test statistic at the point E(X) = µ_0.

F. P-hacking process
Experiments. The experiments are denoted by n = 0, 1, 2, ..., ∞, with n = 0 corresponding to not starting the research project. It takes a random amount of resources to conduct an experiment and collect a dataset. The cumulative amount of resources required to complete 1, 2, ... experiments is D_1, D_2, ..., given by a renewal process independent of the resource limit L. That is, the resources required for each experiment, D_1, D_2 - D_1, D_3 - D_2, ..., are independent and identically distributed (iid) according to a distribution independent of L.

Test statistics. From the dataset collected in the nth experiment, the scientist constructs the nth test statistic, T_n, which is iid with T_1, T_2, ..., T_{n-1}.4 She may then submit the best of the n test statistics, max{T_1, ..., T_n}, or she may run yet another experiment.5

Infinite p-hacking. n = ∞ corresponds to running infinitely many experiments and never reporting any result.

G. Completion probability
Following Ferguson (2007, p. 4.13), we introduce the index of the first experiment that cannot be completed before resources are exhausted: K = min{n ≥ 1 : D_n > L}. Let γ = P(D_1 ≤ L) be the probability that the first experiment can be completed. The index K is independent of the test statistics T_1, T_2, ..., and it has a geometric distribution with success probability 1 - γ, so P(K > k) = γ^k for k = 0, 1, 2, ....6
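The geometric distribution of K can be checked by simulation. The sketch below assumes, purely for illustration, that each experiment consumes an exponentially distributed amount of resources with mean 1 while the limit L is exponential with rate λ = 0.25, so that γ = 1/(1 + 0.25) = 0.8; by the memorylessness of L, the geometric result holds for any iid cost distribution.

```python
import random

# Monte Carlo check that K = min{n >= 1 : D_n > L} is geometric with
# success probability 1 - gamma when L is exponential with rate lam.
# Experiment costs are iid exponential with mean 1 (illustrative choice).
def simulate_K(lam, runs, seed=0):
    rng = random.Random(seed)
    ks = []
    for _ in range(runs):
        limit = rng.expovariate(lam)          # resource limit L
        used, k = 0.0, 0
        while True:
            k += 1
            used += rng.expovariate(1.0)      # resources for experiment k
            if used > limit:                  # experiment k cannot be completed
                break
        ks.append(k)
    return ks

ks = simulate_K(lam=0.25, runs=100_000)
gamma_hat = sum(1 for k in ks if k > 1) / len(ks)  # estimate of P(K > 1) = gamma
```

With these parameters, gamma_hat should be close to 0.8, and the share of projects with K > 2 close to γ² = 0.64.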

H. Payoffs
No results. If the scientist does not start the research project, she receives a payoff normalized to y_0 = 0. If resources are exhausted before the end of the first experiment, the scientist does not obtain any result, so she receives the same payoff of y_1 = 0. If the scientist never concludes the research project and keeps on p-hacking forever, she also receives a payoff y_∞ = 0. In all other cases, she receives a positive payoff.

Exhausted resources. The scientist cannot continue p-hacking once the project resources are exhausted. To capture the constraint, we set to zero all payoffs once resources are exhausted: y_n = 0 in any step n > K. With these payoffs, the scientist never continues past step K. At step K, the scientist cannot obtain a new test statistic, but she can submit for publication the best test statistic from the previous K - 1 hypothesis tests, max{T_1, ..., T_{K-1}}. If the statistic is significant, the payoff is y_K = v_s; if not, the payoff is y_K = v_i.

Non-exhausted resources. Any experiment n < K can be completed before running out of resources, so the scientist can submit the best statistic from the n previous tests, max{T_1, ..., T_n}. If the statistic is significant, the payoff is y_n = v_s; if not, the payoff is y_n = v_i.7

4. By modeling successive test statistics as independent, we are able to derive robust critical values that control the probability of type 1 error across a wide variety of common p-hacking strategies that induce positive dependence, without having to specify which particular p-hacking strategy was used by the scientist (online appendix C.1).

5. Here the scientist analyzes the datasets obtained from successive experiments in isolation. The scientist might instead pool the datasets and analyze the pooled data. Thankfully, the robust critical values computed here maintain the probability of type 1 error below the significance level with data pooling (online appendix C.2).

6. Here each experiment is completed with the same probability γ. Experiments might instead become more and more difficult to run and less and less likely to be completed. Fortunately, the robust critical values computed here maintain the probability of type 1 error below the significance level with increasingly difficult experiments (online appendix F).

III. Optimal stopping time
The scientist p-hacks as long as she wishes. At each experiment, she may decide to stop and receive a payoff, or she may decide to continue to the next experiment. If she is able to complete the next experiment, she computes another test statistic. The scientist's problem, which we now solve, is to choose a time to stop p-hacking so as to maximize expected payoffs.

A. Scientist's problem
The stopping rule chosen by the scientist, the critical value z, and the random research events determine the random time N(z) at which the scientist stops p-hacking. The problem of the scientist is to choose a stopping time to maximize expected payoffs.
7. Here the scientist does not discount the future, so a significant result yields the same payoff irrespective of when it is obtained. But the robust critical values are not modified if the scientist discounts future payoffs (online appendix E).

B. Reported statistic
As long as she is able to complete at least one hypothesis test, the scientist reports a random statistic R(z) upon stopping. This is the best test statistic that she has been able to obtain through p-hacking.
It may be significant or insignificant, and the scientist may be able to publish it or not.

C. Existence of the optimal stopping time
An optimal stopping time N(z) exists because two conditions are satisfied (Ferguson, 2007, p. 3.3).
Let Y_n denote the random payoff received by the scientist when she stops at time n. The first condition is that sup_n Y_n < ∞ almost surely. This holds because Y_n ≤ v_s almost surely. The second condition is that Y_n → y_∞ almost surely as n → ∞. This holds because resources inevitably run out, after which point the payoff is Y_n = 0, which is the same as the payoff y_∞ = 0 received if the scientist never stops experimenting.

D. Characteristics of the optimal stopping time
Because sup_n Y_n < ∞ almost surely and Y_n → y_∞ almost surely as n → ∞, the optimal stopping time is given by the principle of optimality: the scientist optimally stops as soon as she receives a payoff that is at least as high as the best payoff that can be expected from continuing (Ferguson, 2007, pp. 3.6-3.7). We now characterize the optimal stopping time by considering the various situations faced by the scientist.
Starting the research project. If the scientist does not start the research project, she receives Y_0 = 0. In contrast, if she starts, she earns a nonnegative payoff: 0 if resources are exhausted before the first experiment is completed; v_i if she obtains an insignificant result; or v_s if she obtains a significant result. Hence it is always optimal to start the research project.
Continuing after insignificant results. How does the scientist behave when she still has resources to allocate to the project? A first possibility is that the result at experiment n and all the results before that are insignificant. Since the best result found by the scientist is insignificant, the scientist earns Y_n = v_i by stopping at experiment n. All payoffs that can be obtained by continuing are at least v_i, and a significant result, which yields v_s > v_i, occurs with positive probability, so the expected payoff from continuing exceeds v_i. It is therefore not optimal to stop without obtaining a significant result.
Stopping after a significant result. If the result of test n is significant, the best result found by the scientist is significant, so the scientist earns Y_n = v_s by stopping at experiment n. No payoff exceeds the payoff received for a significant result, v_s, so no expected payoff exceeds v_s. Hence, the scientist cannot do better by continuing. She optimally stops at experiment n and reports R(z) = max{T_1, ..., T_n} > z. In fact, the principle of optimality indicates that she should stop at the first occurrence of a significant result.
Stopping when resources are depleted. Once resources are depleted, the scientist must stop p-hacking. Hence, she stops at step K if she has not stopped before. Two possibilities emerge. If K = 1, resources are depleted before the first experiment, so the scientist has nothing to report. If K > 1, the scientist submits the best test statistic that she has collected. The best result is necessarily insignificant, otherwise she would have stopped earlier. So she reports R(z) = max{T_1, ..., T_{K-1}} ≤ z.
Summary. The optimality principle gives the following result:

Lemma 1. The scientist stops when she obtains a significant result or when she runs out of resources, whichever comes first. In the former case the scientist reports a significant result; in the latter case she reports an insignificant result.

So the scientist p-hacks: she never stops at insignificant results, unless she runs out of resources.
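Lemma 1's stopping rule is easy to simulate. The sketch below works directly with the completion probability γ = 80% and a per-experiment significance probability S(z) = 5% (as under the null with a classical 5% critical value); both values are illustrative choices.

```python
import random

# Minimal simulation of the optimal strategy in Lemma 1. Before each
# experiment, resources allow completion with probability gamma; each
# completed experiment is significant with probability s_z. The scientist
# stops at the first significant result, reports her best (insignificant)
# statistic when resources run out, or reports nothing if no experiment
# was completed.
def run_project(s_z, gamma, rng):
    n = 0
    while rng.random() < gamma:          # experiment n+1 is completed
        n += 1
        if rng.random() < s_z:           # significant result: stop now
            return n, "significant"
    return n, "insignificant" if n > 0 else "nothing"

rng = random.Random(3)
outcomes = [run_project(0.05, 0.80, rng) for _ in range(200_000)]
nothing_share = sum(o == "nothing" for _, o in outcomes) / len(outcomes)
```

The share of projects reporting nothing estimates 1 - γ, the probability that the first experiment cannot be completed.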

IV. Critical value robust to p-hacking
Based on the scientist's p-hacking strategy, we compute the critical value robust to p-hacking. This critical value ensures that the probability of type 1 error remains below the significance level even as the scientist adjusts her p-hacking behavior to the critical value itself.

A. Distribution of the optimal stopping time
We compute the distribution of the optimal stopping time. Since the distribution is used to calculate the robust critical value, we compute it under the null hypothesis.
Probability of finding a significant result at experiment n. Under the null hypothesis, the probability that the test statistic from experiment n reaches the critical value z is given by the test statistic's survival function: P(T_n > z) = S(z), where P denotes the probability measure under H_0. Conversely, the probability that the test statistic does not reach z is given by the test statistic's cumulative distribution function: P(T_n ≤ z) = F(z).

Probability of continuing after experiment n. The scientist continues p-hacking after any experiment if she has not run out of resources during that experiment, which happens with probability γ, and the latest result is insignificant, which happens with probability 1 - S(z) = F(z). The two events are independent, so the probability that the scientist continues p-hacking after any experiment is γF(z). Conversely, the probability that the scientist stops at any experiment is

1 - γF(z). (2)
Geometric distribution of the optimal stopping time. The probability of stopping at each experiment is constant, given by (2). The optimal stopping time therefore has a geometric distribution with success probability (2): the probability that the optimal stopping time is n ≥ 1 is [γF(z)]^(n-1) × [1 - γF(z)].

Expected number of experiments. Given that the optimal stopping time has a geometric distribution with success probability (2), we obtain the following result:

Proposition 1. Under the null hypothesis, the expected number of experiments is

E(N(z)) = 1/[1 - γF(z)], (3)

where E denotes the expectation operator under H_0.

P-hacking is prevalent (E(N(z)) > 1). And scientists p-hack more when the standard for significance is more stringent (E(N(z)) is higher when z is higher).
Since classical critical values are defined by (1), we infer the following result:

Corollary 1. With a classical critical value z, the expected number of experiments under the null hypothesis is

E(N(z)) = 1/[1 - (1 - α)γ]. (4)
Scientists p-hack more when the significance level is lower (E(N(z)) is higher when α is lower).
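For intuition about the magnitudes, the formula in Corollary 1, E(N(z)) = 1/(1 - (1 - α)γ), can be evaluated at illustrative parameter values:

```python
# Expected number of experiments under the null hypothesis with a classical
# critical value (Corollary 1). Parameter values are illustrative.
def expected_experiments_classical(alpha, gamma):
    return 1.0 / (1.0 - (1.0 - alpha) * gamma)

e = expected_experiments_classical(0.05, 0.80)         # ~4.17 experiments
e_strict = expected_experiments_classical(0.01, 0.80)  # ~4.81 experiments
```

A lower significance level raises the expected number of experiments, in line with the comparative static above.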
P-hacking under the alternative hypothesis. In (4), 1 - α represents the probability of obtaining an insignificant result from an experiment when the classical critical value is used to determine significance and the null hypothesis is true. When the alternative hypothesis is true instead, the probability of obtaining an insignificant result becomes β, where 1 - β is the power of the hypothesis test. Hence, if the alternative hypothesis is true, the expected number of experiments is 1/(1 - βγ).
In many fields, hypothesis tests are acceptable only if their power is above 80% (Duflo, Glennerster, & Kremer, 2007, p. 3928). Setting power to 1 - β = 80%, we find that the expected number of experiments under the alternative hypothesis is 1/(1 - 0.2 × γ) < 1/(1 - 0.2) = 1.25. So there is almost no p-hacking, which is unsurprising. If the alternative hypothesis is true and the study is well powered, the null hypothesis is rejected most of the time, which renders p-hacking unnecessary.
Hence, if we observe a lot of p-hacking, either the alternative hypothesis is false, or the alternative hypothesis is true but tests have low power (Ioannidis, 2005).

B. Probability of type 1 error
Next, we compute the probability of type 1 error as a function of the critical value.
Proposition 2. When the critical value is set to z, the probability of type 1 error in a reported hypothesis test is

S*(z) = S(z)/[1 - γF(z)]. (5)

The probability of type 1 error is larger when scientists p-hack (S*(z) > S(z)). In fact, the probability of type 1 error grows linearly with the expected number of experiments under the null hypothesis:

S*(z) = S(z) × E(N(z)). (6)

The proof is in online appendix A.1; it relies on an appropriate application of the law of total probability. Since classical critical values are defined by (1), we infer the following:

Corollary 2. Under a classical critical value z, the probability of type 1 error is larger than the significance level α:

S*(z) = α/[1 - (1 - α)γ] > α. (7)
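Corollary 2 can be verified by Monte Carlo: with α = 5% and γ = 80%, the share of reported projects whose best statistic is significant should be α/(1 - (1 - α)γ) = 0.05/0.24 ≈ 20.8%, about four times the nominal level. A minimal sketch under the null hypothesis, where each completed experiment is significant with probability α independently:

```python
import random

# Monte Carlo check of the type 1 error rate in reported tests when the
# scientist follows the optimal p-hacking strategy and significance is
# determined by a classical critical value.
def simulate_reported_significance(alpha, gamma, seed=2, runs=300_000):
    rng = random.Random(seed)
    significant = reported = 0
    for _ in range(runs):
        n_completed = 0
        got_significant = False
        while rng.random() < gamma:       # next experiment is completed
            n_completed += 1
            if rng.random() < alpha:      # significant result: stop (Lemma 1)
                got_significant = True
                break
        if n_completed > 0:               # at least one test is reported
            reported += 1
            significant += got_significant
    return significant / reported

share = simulate_reported_significance(0.05, 0.80)  # ~0.208, not 0.05
```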

C. Robust critical value
Influence of the critical value on the type 1 error rate. The critical value influences the probability of type 1 error through two channels (equation (6)). The first is a mechanical channel: a higher critical value reduces the probability that a test statistic exceeds it (S(z) is decreasing in z). The second is a behavioral channel: a higher critical value pushes scientists to p-hack more in the hope of finding a significant result (E(N(z)) is increasing in z). The novelty of our correction for p-hacking is to take this behavioral channel into account.
Computing the robust critical value. The robust critical value ensures that the probability of type 1 error equals the significance level α when scientists p-hack. Since the probability of type 1 error with p-hacking is given by (5), the robust critical value z* is implicitly defined by

S*(z*) = α. (8)

From this definition we obtain the following result (proof details are in online appendix A.2):

Proposition 3. For a hypothesis test with significance level α, the robust critical value is

z* = Z(α(1 - γ)/(1 - αγ)). (9)

The robust critical value is always larger than the classical critical value (z* > Z(α)).
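Proposition 3 can be sanity-checked numerically. The sketch below assumes a one-sided z-test, so that Z is the standard normal inverse survival function; it computes z* = Z(α(1 - γ)/(1 - αγ)) and verifies that plugging z* back into S*(z) = S(z)/(1 - γF(z)) returns the significance level:

```python
from statistics import NormalDist

# Robust critical value (Proposition 3), assuming a standard normal null
# distribution (one-sided z-test); the formula works for any test statistic.
def robust_critical_value(alpha, gamma):
    alpha_star = alpha * (1.0 - gamma) / (1.0 - alpha * gamma)
    return NormalDist().inv_cdf(1.0 - alpha_star)

# Probability of type 1 error with p-hacking, S*(z) = S(z)/(1 - gamma*F(z)).
def type1_error_with_phacking(z, gamma):
    s = 1.0 - NormalDist().cdf(z)          # S(z)
    return s / (1.0 - gamma * (1.0 - s))   # S*(z)

z_star = robust_critical_value(0.05, 0.80)        # ~2.31, vs classical ~1.64
check = type1_error_with_phacking(z_star, 0.80)   # should equal alpha = 0.05
```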
P-hacking with a robust critical value. The robust critical value corrects the distortion introduced by p-hacking without eliminating p-hacking. Because the significance standard imposed by the robust critical value is more stringent than the classical standard, scientists p-hack more under the robust critical value (proposition 1). In fact, combining (3) and (8), we obtain the following result:

Corollary 3. With the robust critical value z*, the expected number of experiments under the null hypothesis is

E(N(z*)) = (1 - αγ)/(1 - γ). (10)

Scientists p-hack more when the significance level is lower (E(N(z*)) is higher when α is lower).

D. Bonferroni correction
Our correction for p-hacking can be formulated as a nonstandard Bonferroni correction:

Corollary 4. The critical value that achieves a significance level α under p-hacking is the critical value that achieves a significance level

α* = α/E(N(z*)) (11)

under classical conditions, where E(N(z*)) is the expected number of experiments under the null hypothesis when the robust critical value is in place.
Equation (11) is obtained by evaluating (6) at z*, and using α* = S(z*) and S*(z*) = α. Unlike in a standard Bonferroni correction, the number of experiments used for the correction is not observed. Rather, it is the expected number of experiments under the robust critical value when the null hypothesis is true. Thanks to the model, we can link this number to the completion probability γ, which we can then calibrate (section V).
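The Bonferroni formulation can be checked numerically: α* = α(1 - γ)/(1 - αγ) coincides with α divided by E(N(z*)) = (1 - αγ)/(1 - γ). A minimal sketch with illustrative parameter values:

```python
# Classical significance level implied by the robust critical value.
def alpha_star(alpha, gamma):
    return alpha * (1.0 - gamma) / (1.0 - alpha * gamma)

# Expected number of experiments under the robust critical value (Corollary 3).
def expected_experiments_robust(alpha, gamma):
    return (1.0 - alpha * gamma) / (1.0 - gamma)

a, g = 0.05, 0.80
lhs = alpha_star(a, g)                         # ~0.0104
rhs = a / expected_experiments_robust(a, g)    # same value: Bonferroni form
```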

E. Influence of the completion probability
Finally, we discuss how the results are influenced by the completion probability γ, which is the main parameter of the model.

V. Calibration
A. Completion probability in medical science
Calibration method. In the model, with probability 1 - γ, the first experiment cannot be completed before running out of resources. The probability 1 - γ therefore is the share of studies that stop before completion, while the probability γ is the share of studies that are completed. We use data collected by Dwan et al. (2008) to calibrate γ (table 1). Dwan et al. review 16 metastudies that each follow a cohort of medical studies. The studies are followed from protocol approval to publication, so we can measure the fraction of studies that were stopped before completion, and thus γ.
Studies that never started. Overall the data include 5736 approved studies. We focus on the 4563 studies whose fate is known. The information about these studies is obtained both by surveying the scientists who conducted them and by searching the literature. In this pool, 658 studies never started, or 658/4563 = 14.4%.
Studies that started but stopped without analysis. In addition, not all the studies that started were completed. Of the 3905 studies that started, 228 were still ongoing when the metastudies were written, so 3677 studies started and stopped. Of these, 243 stopped before any analysis could be conducted, or 243/3677 = 6.6%.
Calibrated value of the completion probability. Adding the studies that never started to those that stopped without analysis, we find that 14.4% + (1 - 14.4%) × 6.6% = 20.0% of the approved studies could not be completed. This yields a completion probability of γ = 1 - 20.0% = 80.0%.
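The calibration arithmetic can be reproduced directly from the counts reported above:

```python
# Calibration of gamma from the Dwan et al. (2008) cohort counts.
never_started = 658 / 4563          # 14.4% of studies with known fate
stopped_no_analysis = 243 / 3677    # 6.6% of started-and-stopped studies

# Share of approved studies that could not be completed, then gamma.
share_not_completed = never_started + (1 - never_started) * stopped_no_analysis
gamma = 1 - share_not_completed     # completion probability, ~0.80
```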

B. Robust critical values
We now compute robust critical values using the Bonferroni correction (11) and the completion probability observed in medical science, γ = 80%.
Simple Bonferroni correction.Since the significance level α is always less than 10%, and since γ is less than 1, 1 -αγ is close to 1, and the expected number of experiments under the null hypothesis  2008).The studies with information are all approved studies minus studies for which there is no information and studies that are excluded from the metastudy (for instance because the researchers declined to participate in the metastudy).The studies that stopped without analysis are all studies that stopped early minus studies in which an interim analysis was conducted.NA indicates that the information is not available in the metastudy.A: The curve gives the expected number of experiments run by a scientist as a function of the probability of completing an experiment when the significance level is 5% and significance is determined by a classical critical value.It is obtained from (4) with α = 5%.B: The curve gives the probability of type 1 error as a function of the probability of completing an experiment when the significance level is 5%, significance is determined by a classical critical value, and the scientist optimally p-hacks.It is obtained from (7) with α = 5%.The black dots mark the calibrated value of the completion probability: γ = 80%.with the robust critical value in place is close to 1/(1 -γ) (equation ( 10)).Using equation ( 11), we therefore obtain a simple Bonferroni correction against p-hacking.The classical significance level α * required to correct p-hacking is approximately 1 -γ times the actual significance level α: Numerical application.With γ = 80%, the classical significance level required to address p-hacking is one fifth of the actual significance level: For instance, the critical value that achieves a significance level of 5% under p-hacking is the critical value that yields a significance level of 5%/5 = 1% under classical conditions.The rule of thumb works for any test statistic.For a z-test with a significance level of 5%, the robust 
critical value is 2.33 instead of 1.64 if the test is one-sided, and 2.58 instead of 1.96 if the test is two-sided. These robust critical values also apply to a large-sample t-test with a significance level of 5%.
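As an illustration of the rule of thumb, the reduced-level critical values quoted above can be checked with the standard normal quantile function. The sketch below is our own illustration (the function name is ours, not the paper's):

```python
from statistics import NormalDist  # standard normal distribution (Python stdlib)

def rule_of_thumb_critical_value(alpha, gamma, two_sided=False):
    """Classical critical value at the reduced level (1 - gamma) * alpha,
    the paper's Bonferroni-style approximation to the robust critical value."""
    level = (1 - gamma) * alpha
    if two_sided:
        level /= 2  # split the reduced level across both tails
    return NormalDist().inv_cdf(1 - level)

alpha, gamma = 0.05, 0.80  # 5% significance level, 80% completion probability
print(round(rule_of_thumb_critical_value(alpha, gamma), 2))                  # → 2.33 (one-sided)
print(round(rule_of_thumb_critical_value(alpha, gamma, two_sided=True), 2))  # → 2.58 (two-sided)
```

With γ = 90% the same function returns the one-sided value 2.58, i.e., the tenfold reduction in significance level discussed next.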
Comparison with the Benjamin et al. (2018) proposal. To address the replication crisis in science, Benjamin et al. (2018) propose that scientists replace the standard significance level of 5% by a lower significance level of 0.5%. Such a tenfold reduction in the significance level is a more aggressive response to p-hacking than the fivefold reduction obtained in this numerical exercise. However, a tenfold reduction in significance level would be appropriate for a completion probability of γ = 90% (equation (12)). In that way, our analysis provides a theoretical underpinning for proposals to reduce the significance levels used in science. It also links the proposed reductions to the amount of resources available to scientists for p-hacking.

C. Additional numerical results
Here we provide additional numerical results. We fix the significance level at 5%.

[Figure 2 notes. A: The curve gives the critical value robust to p-hacking for a one-sided z-test with significance level of 5%, as a function of the probability of completing an experiment; it is obtained from (9) where α = 5% and Z is the inverse survival function of the standard normal distribution. B: The curve gives the critical value robust to p-hacking for a two-sided z-test with significance level of 5%, as a function of the probability of completing an experiment; it is obtained from (9) where α = 5% and Z is the inverse survival function of the standard half-normal distribution. The black dots mark the calibrated value of the completion probability: γ = 80%.]

[Figure 3 notes. A: The curve gives the number of experiments run by a scientist, in expectation under the null hypothesis, as a function of the probability of completing an experiment, when the significance level is 5% and significance is determined by a robust critical value; it is obtained from (10) with α = 5%. B: The curve simultaneously gives the expected number of experiments run by a scientist under the classical critical value (horizontal axis) and under the robust critical value (vertical axis), for any probability of completing an experiment, and for a significance level of 5%; the numbers are obtained from (4) and (10) with α = 5% and γ ∈ (0, 1). The black dots mark the calibrated value of the completion probability: γ = 80%.]

Sensitivity to the completion probability. Robust critical values increase with the completion probability, but they are not very sensitive to it. For instance, as long as the completion probability remains between 70% and 90%, the robust critical value for one-sided z-tests remains between 2.16 and 2.56 (figure 2A), and the robust critical value for two-sided z-tests remains between 2.42 and 2.79 (figure 2B). This is reassuring: robust critical values remain close even across fields with different p-hacking intensities.
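Under the model's equations, the type 1 error with optimal p-hacking is S(z) × E(N(z)) = S(z)/(1 - γF(z)), so the robust critical value for a z-test admits the closed form z* = Z(α(1 - γ)/(1 - αγ)), where Z is the relevant inverse survival function. The sketch below is our own illustration of this closed form (function name ours); it reproduces the sensitivity values quoted above:

```python
from statistics import NormalDist  # standard normal distribution (Python stdlib)

def robust_critical_value(alpha, gamma, two_sided=False):
    """Critical value z* at which the type 1 error under optimal p-hacking,
    S(z) / (1 - gamma * F(z)), equals alpha; its survival probability is
    alpha * (1 - gamma) / (1 - alpha * gamma)."""
    tail = alpha * (1 - gamma) / (1 - alpha * gamma)
    if two_sided:
        tail /= 2  # two-sided z-test: survival function of the half-normal
    return NormalDist().inv_cdf(1 - tail)

for gamma in (0.70, 0.80, 0.90):
    print(gamma,
          round(robust_critical_value(0.05, gamma), 2),           # one-sided z*
          round(robust_critical_value(0.05, gamma, two_sided=True), 2))  # two-sided z*
```

The one-sided values run from 2.16 (γ = 70%) through 2.31 (γ = 80%) to 2.56 (γ = 90%), and the two-sided values from 2.42 to 2.79, matching figure 2.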
P-hacking with a robust critical value. The expected number of experiments under the null hypothesis with a robust critical value is given by (10). For the completion probability of 80%, the expected number of experiments is 4.8 (figure 3A). Moreover, the amount of p-hacking increases with the completion probability: when the completion probability rises from 70% to 90%, the expected number of experiments grows from 3.2 to 9.6. Further, p-hacking is more prevalent under a robust critical value than under a classical critical value (figure 3B). At the completion probability of 80%, the expected number of experiments is 4.2 under a classical critical value but 4.8 under a robust critical value.
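This comparison can be verified from the closed forms implied by (4) and (10): E(N) = 1/[1 - γ(1 - α)] under the classical critical value, and (1 - αγ)/(1 - γ) under the robust critical value. A minimal sketch of our own, assuming these closed forms:

```python
def expected_experiments(alpha, gamma, robust):
    """Expected number of experiments under the null, E[N] = 1/(1 - gamma*F(z)),
    evaluated at the classical critical value (where F(z) = 1 - alpha) or at
    the robust critical value (where F(z*) = (1 - alpha)/(1 - alpha*gamma))."""
    if robust:
        return (1 - alpha * gamma) / (1 - gamma)  # robust critical value, equation (10)
    return 1 / (1 - gamma * (1 - alpha))          # classical critical value, equation (4)

alpha, gamma = 0.05, 0.80
print(round(expected_experiments(alpha, gamma, robust=False), 2))  # ≈ 4.17 (classical)
print(round(expected_experiments(alpha, gamma, robust=True), 2))   # ≈ 4.8 (robust)
```

At γ = 70% and γ = 90% the robust values are about 3.2 and 9.6 experiments, as in the text.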

D. Iterative correction for p-hacking
The corrections for p-hacking proposed by Anscombe (1954), Lovell (1983), and Glaeser (2008) take scientists' p-hacking behavior as fixed, whereas this paper's correction accounts for the fact that scientists would change their p-hacking behavior as soon as the correction is implemented. Here we numerically illustrate the difference between the two approaches.

[Figure 4 notes. A: Step 1 uses the classical critical value, z1 = Φ⁻¹(95%) = 1.64, where Φ is the standard normal cumulative distribution function. At step i ≥ 2, the critical value is obtained from a Bonferroni correction using the number of experiments in the previous step: zi = Φ⁻¹(1 - 5%/E(N(zi-1))). B: The expected number of experiments at each step, E(N(zi)), comes from (3) with γ = 80%, F = Φ, and z = zi. C: The probability of type 1 error at each step comes from (5) with γ = 80%, F = Φ, S = 1 - Φ, and z = zi.]

Initial step. The procedure starts from the classical critical value, z1 = Φ⁻¹(95%) = 1.64. The expected number of experiments is E(N(z1)) = 1/[1 - 0.8 × Φ(z1)] = 4.17, from (3). The probability of type 1 error is S*(z1) = [1 - Φ(z1)] × E(N(z1)) = 20.8%, from (6).

Bonferroni correction. In the next step, we apply the correction discussed in the literature. The critical value is obtained from a Bonferroni correction that uses the average number of experiments calculated in the initial step. So the critical value is z2 = Φ⁻¹(1 - 5%/4.17) = 2.25. The expected number of experiments is E(N(z2)) = 1/[1 - 0.8 × Φ(z2)] = 4.77, from (3). Since the critical value is higher than initially, the number of experiments is higher: scientists p-hack more in response to the more stringent significance standard, which warrants additional correction. The probability of type 1 error remains greater than 5%, although it is much lower than without any correction: S*(z2) = [1 - Φ(z2)] × E(N(z2)) = 5.7%, from (6).
Next steps. We then iterate the Bonferroni correction. At any step i ≥ 2, the critical value is given by a Bonferroni correction that uses the average number of experiments calculated in the previous step. So the critical value is zi = Φ⁻¹(1 - 5%/E(N(zi-1))). The expected number of experiments is E(N(zi)) = 1/[1 - 0.8 × Φ(zi)], from (3). The probability of type 1 error is S*(zi) = [1 - Φ(zi)] × E(N(zi)), from (6).
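The iteration can be sketched in a few lines, using Φ from Python's statistics.NormalDist; this is an illustrative reimplementation of the steps above, not the paper's code:

```python
from statistics import NormalDist  # standard normal Φ and its inverse

nd = NormalDist()
alpha, gamma = 0.05, 0.80
z = nd.inv_cdf(1 - alpha)                      # step 1: classical critical value, 1.64
for step in range(2, 11):                      # iterate the Bonferroni correction
    expected_n = 1 / (1 - gamma * nd.cdf(z))   # E[N(z)], from (3)
    z = nd.inv_cdf(1 - alpha / expected_n)     # z_i = Φ⁻¹(1 - α / E[N(z_{i-1})])
type1 = (1 - nd.cdf(z)) / (1 - gamma * nd.cdf(z))  # S*(z), from (5)-(6)
print(round(z, 2), round(type1, 3))            # → 2.31 0.05
```

By step 3 the critical value is already within about 0.01 of its limit; continuing the iteration drives it to the robust critical value z* ≈ 2.31 and the type 1 error to 5%.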
Results. The results of the iterative procedure are displayed in figure 4. The sequence of critical values given by the procedure rapidly approaches the robust critical value (figure 4A). By step 3, the critical value is very close to the robust critical value, z* = 2.31, and the probability of type 1 error is very close to the significance level of 5% (figure 4C). We clearly see how scientists respond to an increase in critical value: by p-hacking more (figure 4B). Another takeaway is that the iterative application of the Bonferroni correction converges to the robust critical value computed in proposition 3.

VI. Conclusion
We conclude by summarizing our results and comparing our approach with the registration of pre-analysis plans.

First experiment.
If resources are exhausted before the first experiment is completed, L < D1, the scientist cannot obtain any results. If resources are not exhausted when the first experiment is completed, L > D1, the scientist is able to collect a first dataset and construct a test statistic. This first test statistic is T1, which is independent of the resource variables. The scientist then decides to submit the result to a journal, or to run another experiment.

Nth experiment. If the scientist chooses to run experiment n ≥ 2, the scientist begins collecting an nth dataset of the same size and drawn from the same underlying population as previous datasets. If resources are exhausted before experiment n is completed, L < Dn, the scientist must stop the project before obtaining the nth dataset and submits the best result obtained up to the previous experiment, max{T1, . . ., Tn-1}. If resources are not exhausted, L > Dn, the scientist obtains the nth dataset and constructs an nth test statistic.

[Footnote 3: Here research is costless to the scientist. But the robust critical values are not modified if the scientist incurs a cost of doing research (online appendix D).]

Review of Economics and Statistics, Just Accepted MS, by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
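The stopping process just described can be simulated directly. The sketch below is a reduced-form Monte Carlo of our own (not the paper's code), under the assumptions that each experiment completes with probability γ = 80%, completed experiments yield i.i.d. standard normal statistics under the null, the scientist stops at the first significant result, and the type 1 error is computed among scientists who complete at least one experiment. The simulated frequency of significant results approaches the 20.8% type 1 error implied by the classical one-sided 5% critical value.

```python
import random
from statistics import NormalDist

random.seed(0)                             # reproducible simulation
Z_CLASSICAL = NormalDist().inv_cdf(0.95)   # 1.64, one-sided 5% critical value
GAMMA = 0.80                               # probability of completing an experiment

significant = reported = 0
for _ in range(200_000):                   # trials under the null hypothesis
    completed_any = False
    while random.random() < GAMMA:         # the next experiment completes
        completed_any = True
        if random.gauss(0.0, 1.0) > Z_CLASSICAL:
            significant += 1               # stop at the first significant result
            break
    reported += completed_any              # scientist obtained at least one result

print(round(significant / reported, 3))    # ≈ 0.208
```

The simulated rate matches the closed form S(z)/(1 - γF(z)) = 0.05/0.24 ≈ 20.8% evaluated at the classical critical value.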