## Abstract

The increasing popularity of regression discontinuity methods for causal inference in observational studies has led to a proliferation of different estimating strategies, most of which involve first fitting nonparametric regression models on both sides of a treatment assignment boundary and then reporting plug-in estimates for the effect of interest. In applications, however, it is often difficult to tune the nonparametric regressions in a way that is well calibrated for the specific target of inference; for example, the model with the best global in-sample fit may provide poor estimates of the discontinuity parameter, which depends on the regression function at boundary points. We propose an alternative method for estimation and statistical inference in regression discontinuity designs that uses numerical convex optimization to directly obtain the finite-sample-minimax linear estimator for the regression discontinuity parameter, subject to bounds on the second derivative of the conditional response function. Given a bound on the second derivative, our proposed method is fully data driven and provides uniform confidence intervals for the regression discontinuity parameter with both discrete and continuous running variables. The method also naturally extends to the case of multiple running variables.

## I. Introduction

REGRESSION discontinuity designs, first developed in the 1960s (Thistlethwaite & Campbell, 1960), often allow for simple and transparent identification of treatment effects from observational data (Hahn, Todd, & Van der Klaauw, 2001; Imbens & Lemieux, 2008; Trochim, 1984), and their statistical properties have been the subject of recent interest (Armstrong & Kolesár, 2018; Calonico, Cattaneo, & Titiunik, 2014; Cheng, Fan, & Marron, 1997; Kolesár & Rothe, 2018). The sharp regression discontinuity design assumes a treatment assignment generated by a running variable $X\u2208Rk$, such that individuals get treated if and only $X\u2208A$ for some set $A\u2282Rk$. For example, in epidemiology, $X\u2208R$ could be a severity index (e.g., age or CD4 count), and patients are assigned a medical intervention whenever $X\u2265c$ for some threshold $c$ (i.e., $A=x\u2208R:x\u2265c$). In educational settings, $X\u2208R$ could be a test score that has to exceed a threshold $c$, or, in political science, $X\u2208R2$ could denote the latitude and longitude of a household, and $A$ could be an administrative region that has enacted a specific policy.

Given appropriate assumptions, we can identify a causal effect by comparing subjects $i$ with $Xi$ barely falling within the treatment region $A$ to those with $Xi$ just outside it. Variants of this identification strategy have proven to be useful in education (Angrist & Lavy, 1999; Black, 1999; Jacob & Lefgren, 2004), political science (Caughey & Sekhon, 2011; Keele & Titiunik, 2014; Lee, 2008), criminal justice (Berk & Rauma, 1983), program evaluation (Lalive, 2008; Ludwig & Miller, 2007), and other areas. As we discuss will in more detail, standard methods for inference in the regression discontinuity design rely on plug-in estimates from local linear regression.

In this paper, motivated by a large literature on minimax linear estimation (Armstrong & Kolesár, 2018; Cai & Low, 2003; Donoho, 1994; Donoho & Liu, 1991; Ibragimov & Khas'minskii, 1985; Johnstone, 2011; Juditsky & Nemirovski, 2009), we study an alternative approach based on directly minimizing finite sample error bounds via numerical optimization, under an assumption that the second derivative of the response surface is bounded away from the boundary of the treatment region.^{1} This approach has several advantages relative to local regression. Our estimator is well defined regardless of the shape of the treatment region $A$, whether it be a half line, as in the standard univariate regression discontinuity specification, or an oddly shaped region, as might appear with a geographic regression discontinuity; moreover, our implementation is not affected by potential discreteness of the running variable. Finally, even with univariate designs, our approach strictly dominates local linear regression in terms of minimax mean-squared error. We start by presenting our method in the context of classical univariate regression discontinuity designs with a single treatment cutoff: with $Xi\u2208R$ and $A=x\u2208R:x\u2265c$. A solution to the more general problem will then follow by direct extension. A software implementation, optrdd for R, is available on CRAN.

### A. Optimized Inference with Univariate Discontinuities

^{2}

The behavior of regression discontinuity estimation via local linear regression is fairly well understood. When the running variable $X$ is continuous (i.e., $X$ has a continuous positive density at $c$) and $\mu w(x)$ is twice differentiable with a bounded second derivative in a neighborhood of $c$, Cheng et al. (1997) show that the triangular kernel $K(t)=(1-t)+$ minimizes worst-case asymptotic mean-squared error among all possible choices of $K$; Imbens and Kalyanaraman (2012) provide a data-adaptive choice of $hn$ to minimize the mean-squared error of the resulting estimator; and Calonico et al. (2014) propose a method for removing bias effects due to the curvature of $\mu w(x)$ to allow for asymptotically unbiased estimation. Meanwhile, given a second-derivative bound $|\mu w''(x)|\u2264B$, Armstrong and Kolesár (2018) and Kolesár and Rothe (2018) construct confidence intervals centered at the local linear estimator $\tau ^$ that attain uniform asymptotic coverage, even when the running variable $X$ may be discrete.

Despite its ubiquity, however, local linear regression still has some shortfalls. First under the bounded second derivative assumption often used to justify local linear regression (i.e., that $\mu w(x)$ is twice differentiable and $|\mu w''(x)|\u2264B$ in a neighborhood of $c$), local linear regression is not the minimax optimal linear estimator for $\tau (c)$—even with a continuous running variable. Second, and perhaps even more important, all the motivating theory for local linear regression relies on $X$ having a continuous distribution; however, in practice, $X$ often has a discrete distribution with a modest number of points of support. When the running variable is discrete, there is no compelling reason to expect local linear regression to be particularly effective in estimating the causal effect of interest.^{3} In spite of these limitations, local linear regression is still the method of choice, largely because of its intuitive appeal.

The goal of this paper is to show that we can systematically do better. Regardless of the shape of the kernel $K(\xb7)$ in equation (2), local linear regression yields a linear estimator^{4} for $\tau $, that is, one of the form $\tau ^=\u2211i=1n\gamma ^iYi$ for weights $\gamma ^i$ that depend only on the distances $Xi-c$ (the weights $\gamma ^i$ underlying local linear regression can be written out using the closed-form solution to ordinary least squares regression). Here, we find that if we are willing to rely on numerical optimization tools, we can derive better weights.

Because the estimator 5 is minimax among the class of linear estimators and local linear regression is also a linear estimator, our estimator dominates local linear regression estimator in a minimax sense over all problems where we only know that $Var[Yi|Xi]=\sigma i2$ and $|\mu w''(x)|\u2264B$. For further discussion of related estimators, see Armstrong and Kolesár (2018), Cai and Low (2003), Donoho (1994), Donoho and Liu (1991), and Juditsky and Nemirovski (2009).

In practice, of course, we need methods for choosing the values of $\sigma i2$ and $B$ to run our method with. Tuning the noise scale $\sigma i2$ is not too difficult (it is comparable to estimating the irreducible noise in any regression problem); however, obtaining a good value for $B$ usually requires problem-specific insight. We discuss some approaches to choosing $B$ in the context of applications in sections III and IV, and recommend performing a sensitivity analysis for different choices of $B$. Specifying $B$ is closely related to choice of bandwidth, for example, in existing methods such as those of Calonico et al. (2014) or Imbens and Kalyanaraman (2012); the difference is that choosing $B$ directly reflects a quantitative belief about the world—in terms of regularity of the functions $\mu w(x)$)—whereas a choice of bandwidth interacts with the fundamental parameters of the problem in a more indirect way, which depends on the shape of the kernel $K(\xb7)$ and the distribution of the running variable $Xi$.

When $Xi$ has a discrete distribution, the parameter $\tau (c)$ is usually not point identified because there may not be any observations $Xi$ in a small neighborhood of $c$. However, we can get meaningful partial identification of $\tau (c)$ thanks to our bounds on the second derivative of $\mu w(x)$. Moreover, because our approach controls for bias in finite samples, the estimator (5) is still justified in the partially identified setting and, as discussed further in section IIA, provides valid confidence intervals for $\tau (c)$ in the sense of Imbens and Manski (2004). We view the fact that our estimator can seamlessly move between the point and partially identified settings as an important feature.

The top panel of figure 1 compares the weights $\gamma ^i$ obtained via equation (5) in two different settings: one with a discrete, asymmetric running variable $X$ depicted the lower left panel of the figure, and the other with a standard Gaussian running variable. We see that for $n=1,000$, the resulting weighting functions look fairly similar and are also comparable to the implicit weighting function generated by local linear regression with a triangular kernel. However, as $n$ grows and the discreteness becomes more severe, our method changes both the shape and the scale of the weights, and the discrepancy between the optimized weighting schemes for discrete versus continuous running variables becomes more pronounced.

In the lower right panel of figure 1, we also compare the worst-case conditional mean-squared error of our method relative to that of optimally tuned local linear regression, both with a rectangular and triangular kernel; For the smallest sample size we consider, $n=333$, the discreteness of the running variable has a fairly mild effect on estimation and, as one might have expected, the triangular kernel is noticeably better than the rectangular kernel, while our method is slightly better than the triangular kernel. However, as the sample size increases, the performance of local linear regression relative to our method ebbs and flows rather unpredictably.^{5}

### B. Optimized Inference with Generic Discontinuities

The methods we have presented extend naturally to the general case, where $Xi\u2208Rk$ may be multivariate and $A$ is unrestricted. The problem of regression discontinuity inference with multiple running variables is considerably richer than the corresponding problem with a single running variable because an investigator could now plausibly hope to identify many different treatment effects along the boundary of the treated region $A$. Most of the literature on this setup, including Papay, Willett, and Murnane (2011), Reardon and Robinson (2012), and Wong, Steiner, and Cook (2013), have focused on these questions of identification while using some form of local linear regression for estimation.

In the multivariate case, however, questions about how to tune local linear regression are exacerbated, as the problems of choosing the kernel function $K(\xb7)$ and the bandwidth $h$ are now multivariate. Perhaps for this reason, it is still popular to use univariate methods to estimate treatment effects in the multivariate setting by, for example, using shortest distance to the boundary of the treatment region $A$ as a univariate running variable (Black, 1999), or considering only a subset of the data where univariate methods are appropriate (Jacob & Lefgren, 2004; Matsudaira, 2008).

Here, we show how our optimization-based method can be used to sidestep the problem of choosing a multivariate kernel function by hand. In addition to providing a simple-to-apply algorithm, our method lets us explicitly account for the curvature of the mean-response function $\mu w(x)$ for statistical inference, thus strengthening formal guarantees relative to prior work.

Relative to the univariate case, the multivariate case has two additional subtleties we need to address. First, in equation (5), it is natural to impose a constraint $|\mu w''(x)|\u2264B$ to ensure smoothness; in the multivariate case, however, we have more choices to make. For example, do we constrain $\mu w(x)$ to be an additive function, or do we allow for interactions? Here, we opt for the more flexible specification and simply require that $||\u22072\mu w(x)||\u2264B$, where $||\xb7||$ denotes the operator norm (i.e., the largest absolute eigenvalue of the second derivative).

^{2}

This is the minimax linear estimator for the conditional average treatment effect at $c$. The upside of this approach is that it gives us an estimand that is easy to interpret; the downside is that the when curvature is nonnegligible, equation (6) can effectively make use of only data near the specified focal point $c$, thus resulting in relatively long confidence intervals.

In some cases, it is helpful to consider other interpretations of the estimand underlying equation (7). If we are willing to assume a constant treatment effect $\tau (x)=\tau $, then $\tau \xaf=\tau $, and $\tau ^*$ is the minimax linear estimator for $\tau $. Relatedly, we can always use the confidence intervals from section IIA built around $\tau ^*$ to test the global null hypothesis $\tau (x)=0$ for all $x$.

To gain intuition for the multivariate version of our method, we outline a simple example building on the work of Keele and Titiunik (2014) on the effect of television advertising on voter turnout in presidential elections. To estimate this effect, Keele and Titiunik (2014) examine a school district in New Jersey, half of which belongs to the Philadelphia media market and the other half to the New York media market. Before the 2008 presidential elections, the Philadelphia half was subject to heavy campaign advertising, whereas the New York half was not, thus creating a natural experiment for the effect of television advertising assuming the media market boundary did not coincide with other major boundaries within the school district. Keele and Titiunik (2014) use this identification strategy to build a regression discontinuity design, comparing sets of households straddling the media market boundary.

However, despite the multivariate identification strategy, Keele and Titiunik (2014) then reduce the problem to a univariate regression discontinuity problem for estimation. They first compute Euclidean distances $Di=\u2225Xi-c2\u2225$ to a focal point $c$ and then use $Di$ as a univariate running variable. In contrast, our approach allows for transparent inference without needing to rely on such a reduction. Figure 2 depicts $\gamma ^$-weights generated by our optimized approach; the resulting treatment effect estimator is then $\u2211i\gamma ^iYi$. Qualitatively, we replicate the “no measurable effect” finding of Keele and Titiunik (2014) while directly and uniformly controlling for spatial curvature effects. We discuss details, including placebo diagnostics and the choice of tuning parameter, in our working paper (Imbens & Wager, 2017).

We also see that, at least here, standard heuristics used to reduce the multivariate regression discontinuity problem to a univariate one are not sharp. In the setup of the left panel of figure 2, where we seek to estimate the treatment effect at a focal point $c$, some treated points due west of $c$ get a positive weight, whereas points the same distance south from $c$ get a mildly negative weight, thus violating the heuristic of Keele and Titiunik (2014) that weights should depend only on $Di=\u2225Xi-c2\u2225$. Meanwhile, we can compare the approach in the right panel of figure 2, where we allow for averaging of treatment effects along the boundary, to the popular heuristic of using shortest distance to the boundary of the treatment region as a univariate running variable (Black, 1999). But this reduction does not capture the behavior of our optimized estimator. There are some points at the eastern edge of the treated region that are very close to the boundary but get essentially 0 weight (presumably because there are no nearby units on the control side of the boundary).

### C. Related Work

The idea of constructing estimators of the type 5 that are minimax with respect to a regularity class for the underlying data-generating process has a long history in statistics. In early work, Legostaeva and Shiryaev (1971) and Sacks and Ylvisaker (1978) independently studied inference in “almost” linear models that arise from taking a Taylor expansion around a point (see also Cheng et al., 1997). For a broader discussion of minimax linear estimation over nonparametric function classes, see Cai and Low (2003), Donoho (1994), Ibragimov and Khas'minskii (1985), Johnstone (2011), Juditsky and Nemirovski (2009), and references in them. An important result in this literature is that for many problems of interest, minimax linear estimators are within a small explicit constant of being minimax among all estimators (Donoho & Liu, 1991).

^{6}

Now, although this class of functions is cosmetically quite similar to the bounded-second-derivative class used in equation (5), we note that the class of weights allowed for in equation (9) is substantially larger, even if the value of $B$ is the same. This is because the functions $\mu w(\xb7)$ underlying the above weighting scheme need not be continuous and in fact can have jumps of magnitude $B(x-c)2/2$. Given that the key assumption underlying regression discontinuity designs is continuity of the conditional means of the potential outcomes at the threshold for the running variable, it would appear to be reasonable to impose continuity away from the threshold as well. Allowing for jumps through the condition (9) can make the resulting confidence intervals for $\tau (c)$ substantially larger than they are under the smoothness condition with bounded second derivatives. One key motivation for the weighting scheme (9) rather than our proposed one, equation (5), appears to be that the optimization problem induced by equation (9) is substantially easier and allows for closed-form solutions for $\gamma ^i$. Conversely, we are aware of no closed-form solution for equation (5) and instead need to rely on numeric convex optimization.

In the special case where the running variable $X$ is assumed to have a continuous density around the threshold $c$, there have been a considerable number of recent proposals for asymptotic confidence intervals while imposing smoothness assumptions on $\mu w(x)$. Calonico et al. (2014) propose a bias correction to the local linear regression estimator that allows for valid inference, and Calonico, Cattaneo, and Farrell (2018) provide further evidence that such bias corrections may be preferable to undersmoothing. Meanwhile, Armstrong and Kolesár (2016) show that when $\mu w(x)$ is twice differentiable and $X$ has a continuous density around $c$, we can use local linear regression with a bandwidth chosen to optimize mean-squared error as the basis for bias-adjusted confidence intervals, provided we inflate confidence intervals by an appropriate, universal constant (e.g., to build 95% confidence intervals, one should use a critical threshold of 2.18 instead of 1.96). Gao (2017) characterizes the asymptotically optimal kernel for the regression discontinuity parameter under the bounded second derivative assumption with a continuous running variable. As discussed above, the value of our approach relative to this literature is that we allow for considerably more generality in the specification of the regression discontinuity design: the running variable $X$ may be discrete or multivariate, and the treatment boundary may be irregularly shaped.

Optimal inference with multiple running variables is less developed than in the univariate case. Papay et al. (2011) and Reardon and Robinson (2012) study local linear regression with a “small” bandwidth, but do not account for finite sample bias due to curvature. Zajonc (2012) extends the analysis of Imbens and Kalyanaraman (2012) to the multivariate case and studies optimal bandwidth selection for continuous running variables given second derivative bounds; the inference, however, again requires undersmoothing. Keele, Titiunik, and Zubizarreta (2015) consider an approach to geographic regression discontinuity designs based on matching. To our knowledge, the approach we present is the first to allow for uniform, bias-adjusted inference in the multivariate regression discontinuity setting.

Finally, although local methods for inference in the regression discontinuity design have desirable theoretical properties, many practitioners also seek to estimate $\tau (c)$ by fitting $E[Yi|Xi=x]$ using a global polynomial expansion along with a jump at $c$ (see Lee & Lemieux, 2010, for a review and examples. However, as Gelman and Imbens (forthcoming) argued, this approach is not recommended, as the model with the best in-sample fit may provide poor estimates of the discontinuity parameter. For example, high-order polynomials may give large influence to samples $i$ for which $Xi$ is far from the decision boundary $c$, and thus lead to unreliable performance.

Another approach to regression discontinuity designs (including in the discrete case) builds on randomization inference (see Cattaneo, Frandsen, & Titiunik, 2015; Cattaneo, Idrobo, & Titiunik, forthcoming; and Li, Mattei, & Mealli, 2015, for a discussion). The problem of specification testing for regression discontinuity designs is considered by Cattaneo, Jansson, & Ma (2016), Frandsen (2017), and McCrary (2008).

## II. Formal Considerations

### A. Uniform Asymptotic Inference

^{7}

### B. Implementation via Convex Optimization

In our presentation so far, we have discussed several nonparametric convex optimization problems and argued that solving them was feasible given advances in the numerical optimization literature over the past few decades (Boyd & Vandenberghe, 2004). Here, we present a concrete solution strategy via quadratic programming over a discrete grid and show that the resulting discrete solutions converge uniformly to the continuous solution as the grid size becomes small.^{8}

^{2}, available in our working paper (Imbens & Wager, 2017).

Suppose that $Xi\u2208X$ belong to some compact, convex set $X\u2282Rk$. Then for any tolerance level $\eta >0$, there exists a finite-dimensional quadratic program that can recover the solution $\gamma ^$ to equation (18) with $L2$-error at most $\eta $.

### C. Minimizing Confidence Interval Length

As formulated in equation (5), our estimator seeks to minimize the worst-case mean-squared error over the specified bounded-second-derivative class. However, in some applications, we may be more interested in making the confidence intervals (13) as short as possible; and our approach can easily be adapted for this objective. To do so, consider the minimization objective in equation (15). Writing $v^2=\u2211i=1n\gamma ^i2\sigma i2$, we see that both the worst-case mean-squared error, $v^2+B2t^2$, and the confidence interval length in equation (13) are monotone increasing functions of $v^$ and $t^$; the only difference is in how they weight these two quantities at the optimum.

Now, to derive the full Pareto frontier of pairs $(v^,t^)$, we can simply rerun equaiton (15) with the term $B2t2$ in the objective replaced with $\lambda B2t2$, for some $\lambda >0$. A practitioner wanting to minimize the length of confidence intervals could consider computing this whole solution path to equation (15) and then using the value of $\lambda $ that yields the best intervals; this construction provides minimax linear fixed-length confidence intervals (Donoho, 1994). Since this procedure never looks at the responses $Yi$, the inferential guarantees for the resulting confidence intervals remain valid.

In our applications, however, we did not find a meaningful gain from optimizing over $\lambda $ instead of just minimizing worst-case mean-squared error as in equation (15), and so did not pursue this line of investigation further. This observation is in line with the analytic results of Armstrong and Kolesár (2016), who showed that when $X$ has a continuous density and $\mu w(x)$ is twice differentiable, using the mean-squared error optimal bandwidth for local linear regression is over 99% efficient relative to using a bandwidth that minimizes the length of bias-adjusted confidence intervals.

Finally, although it is beyond the scope of this paper, it is interesting to ask whether we can generalize our approach to obtain asymptotically quasi-minimax estimators for $\tau (c)$ when the per observation noise scale $\sigma i$ needs to be estimated from the data. The resulting question is closely related to the classical issue of when feasible generalized least squares can emulate generalized least squares (see Romano and Wolf (2017) for a recent discussion).

## III. Univariate Optimized Designs in Practice

**Procedure 1: Optimized Regression Discontinuity Inference**

This algorithm provides confidence intervals for the conditional average treatment effect $\tau (c)$, given an a priori bound $B$ on the second derivative of the functions $\mu w(x)$. We assume that the conditional variance parameters $\sigma i2$ are unknown; if they are known, they should be used as in equation (5). This procedure is implemented in our R package optrdd.^{9}

Pick a large window $r$, such that data with $|Xi-c|>r$ can be safely ignored without loss of efficiency. (Here, we can select $r=\u221e$, but this may result in unnecessary computational burden.)

Run ordinary least-squares regression of $Yi$ on the interaction of $Xi$ and $Wi$ over the window $|Xi-c|\u2264r$. Let $\sigma ^2$ be the residual error from this regression.

Obtain $\gamma ^$ via the quadratic program, equation (15), with $\sigma i$ set to $\sigma ^$ and weights outside the range $|Xi-c|\u2264r$ set to 0.

Confirm that the optimized weights $\gamma ^i$ are small for $|Xi-c|\u2248r$. If not, start again with a larger value of $r$.

^{10}Estimate $\tau ^=\u2211i=1n\gamma i^Yi$ and $s^2=\u2211i=1n\gamma ^i2(Yi-\mu ^Wi(Xi))2$, where the $\mu ^Wi(Xi)$ are predictions from the least squares regression from step 1.

Build confidence intervals as in equation (13).

Conversely, obtaining good bounds on the curvature $B$ is more difficult and requires problem-specific insight. In particular, adapting to the true curvature $\mu w(x)$ without a priori bounds for $B$ is not always possible (see Armstrong & Kolesár, 2018 and Bertanha & Moreira, 2016, and references there). In applications, we recommend considering a range of plausible values of $B$ that could be obtained, for example, from subject matter expertise or from considering the mean-response function globally. For example, we could estimate $\mu w(x)$ using a quadratic function globally or over a large, plausible relevant interval around the threshold, and then multiply maximal curvature of the fitted model by a constant (e.g., 2 or 4). The larger the value of $B$ we use the more conservative the resulting inference. In practice, it is often helpful to conduct a sensitivity analysis for the robustness of confidence intervals to changing $B$ (see figure 4 for an example).^{11}

### A. Application: The Effect of Compulsory Schooling

In our first application, we consider a data set from Oreopoulos, 2006, who studied the effect of raising the minimum school-leaving age in the United Kingdom on earnings as an adult. The effect is identified by the change in the minimum school-leaving age from 14 to 15 in 1947, and the response is log-earnings among those with nonzero earnings (in 1998 pounds sterling). This data set exhibits notable discreteness in its running variable and was used by Kolesár and Rothe (2018) to illustrate the value of their bias-adjusted confidence intervals for discrete regression discontinuity designs. For our analysis, we preprocess our data exactly as in Kolesár and Rothe (2018). We refer readers to their paper and to Oreopoulos, 2006 for a more in-depth discussion of the data.

As in Kolesár and Rothe (2018), we seek to identify the effect of the change in minimum school-leaving age on average earnings via a local analysis around the regression discontinuity; our running variable is the year in which a person turned 14, with a treatment threshold at 1947. Kolesár and Rothe (2018) consider analysis using local linear regression with a rectangular kernel and a bandwidth chosen such as to make their honest confidence intervals as short as possible, (recall that we can measure confidence interval length without knowing the point estimate, and so tuning the interval length does not invalidate inference). Here, we also consider local linear regression with a triangular kernel, as well as our optimized design.^{12}

In order to obtain confidence intervals, it remains to choose a bound $B$. Following the above discussion, 3, a second-order polynomial fit with a “large” bandwidth of either 12 or 18 has a curvature of 0.006 (the estimate is insensitive to the choice of large bandwidth); thus, we tried $B=0.006$ and $B=0.012$. We also consider the more extreme choices $B=0.003$ and $B=0.03$. For $\sigma i2$, we proceed as discussed above. Figure 3 shows the effective $\gamma ^(Xi)$ weighting functions for all three considered methods, with $B=0.012$.

We present results in the top panel of table 1. Overall, these results are in line with those presented in figure 1. The optimized method yields materially shorter confidence intervals than local linear regression with a rectangular kernel; for example, with $B=0.03$, the rectangular kernel intervals are 11% longer. In comparison, the triangular kernel comes closer to matching the performance of our method, although the optimized method still has shorter confidence intervals. Moreover, when considering comparisons with the triangular kernel, we note that the rectangular kernel is far more prevalent in practice and that the motivation for using the triangular kernel often builds on the optimality results of Cheng et al. (1997). And once one has set out on a quest for optimal weighting functions, there appears to be little reason to not just use the actually optimal weighting function, equation 5.

A. Effect of Raising Minimum School-Leaving Age . | |||||||
---|---|---|---|---|---|---|---|

$B$ . | Rectangular Kernel . | Triangular Kernel . | Optimized . | ||||

0.003 | $0.0213\xb10.0761$ | $0.0321\xb10.0737$ | $0.0302\xb10.0716$ | ||||

0.006 | $0.0578\xb10.0894$ | $0.0497\xb10.0867$ | $0.0421\xb10.0841$ | ||||

0.012 | $0.0645\xb10.1085$ | $0.0633\xb10.1037$ | $0.0557\xb10.1003$ | ||||

0.03 | $0.0645\xb10.1460$ | $0.0710\xb10.1367$ | $0.0710\xb10.1329$ | ||||

B. Effect of Summer School on Math and Reading Scores | |||||||

Estimator | Unweighted CATE Equation (6) | Weighted CATE Equation (7) | |||||

Subject | $B$ | Confidence Interval | Maximum Bias | Sample Error | Confidence Interval | Maximum Bias | Sample Error |

Math | $0.5\xd740-2$ | $0.037\xb10.093$ | 0.030 | 0.038 | $0.076\xb10.037$ | 0.009 | 0.017 |

Math | $1.0\xd740-2$ | $0.013\xb10.126$ | 0.041 | 0.052 | $0.068\xb10.043$ | 0.011 | 0.019 |

Reading | $0.5\xd740-2$ | $0.014\xb10.098$ | 0.030 | 0.041 | $0.044\xb10.037$ | 0.009 | 0.017 |

Reading | $1.0\xd740-2$ | $-0.015\xb10.130$ | 0.040 | 0.054 | $0.047\xb10.043$ | 0.011 | 0.019 |

A. Effect of Raising Minimum School-Leaving Age . | |||||||
---|---|---|---|---|---|---|---|

$B$ . | Rectangular Kernel . | Triangular Kernel . | Optimized . | ||||

0.003 | $0.0213\xb10.0761$ | $0.0321\xb10.0737$ | $0.0302\xb10.0716$ | ||||

0.006 | $0.0578\xb10.0894$ | $0.0497\xb10.0867$ | $0.0421\xb10.0841$ | ||||

0.012 | $0.0645\xb10.1085$ | $0.0633\xb10.1037$ | $0.0557\xb10.1003$ | ||||

0.03 | $0.0645\xb10.1460$ | $0.0710\xb10.1367$ | $0.0710\xb10.1329$ | ||||

B. Effect of Summer School on Math and Reading Scores | |||||||

Estimator | Unweighted CATE Equation (6) | Weighted CATE Equation (7) | |||||

Subject | $B$ | Confidence Interval | Maximum Bias | Sample Error | Confidence Interval | Maximum Bias | Sample Error |

Math | $0.5\xd740-2$ | $0.037\xb10.093$ | 0.030 | 0.038 | $0.076\xb10.037$ | 0.009 | 0.017 |

Math | $1.0\xd740-2$ | $0.013\xb10.126$ | 0.041 | 0.052 | $0.068\xb10.043$ | 0.011 | 0.019 |

Reading | $0.5\xd740-2$ | $0.014\xb10.098$ | 0.030 | 0.041 | $0.044\xb10.037$ | 0.009 | 0.017 |

Reading | $1.0\xd740-2$ | $-0.015\xb10.130$ | 0.040 | 0.054 | $0.047\xb10.043$ | 0.011 | 0.019 |

A. Confidence interval ($\alpha =95%$) for the effect of raising the minimum school-leaving age on average log earnings, as given by local linear regression with a rectangular kernel, local linear regression with a triangular kernel, and our optimized method, equation (5). The confidence intervals account for curvature effects, provided the second derivative is bounded by $B$. B. Estimates for the effect of summer school on math and reading scores on the following year's test, using different estimators and choices of $B$. Reported are bias-adjusted 95% confidence intervals, a bound on the maximum bias given our choice of $B$, and an estimate of the sampling error conditional on $Xi$.

Finally, we note that a bound $B$ on the second derivative also implies that the quadratic approximation, equation (9), holds with the same bound $B$. Thus, we could in principle also use the method of Armstrong and Kolesár (2018) to obtain uniform asymptotic confidence intervals here. However, the constraint, equation (9), is weaker than the actual assumption we were willing to make—that the functions $\mu w(\xb7)$ have a bounded second derivative—and so the resulting confidence intervals are substantially larger. Using their approach on this data set gives confidence intervals of $0.0518\xb10.0969$ with $B=0.006$ and $0.0682\xb10.1760$ with $B=0.03$; these intervals are not only noticeably longer than our intervals, but are also longer than the best uniform confidence intervals we can get using local linear regression with a rectangular kernel as in Kolesár and Rothe (2018). Thus, the use of numerical convex optimization tools that let us solve equation (5) instead of equation (9) can be of considerable value in practice.

## IV. A Discontinuity Design with Two Cutoffs

We next consider the behavior of our method in a specific variant of the multiple running variable problem motivated by a common inference strategy in education. Some school districts require students to attend summer school if they fail a year-end standardized test in either math or reading (Jacob & Lefgren, 2004; Matsudaira, 2008), and it is of course important to understand the value of such summer schools. The fact that students are mandated to summer school based on a sharp test score cutoff suggests a natural identification strategy via regression discontinuities; however, standard univariate techniques cannot directly be applied as the regression discontinuity now no longer occurs along a point, but rather along a surface in the bivariate space encoding both a student's math and reading scores.

We illustrate our approach using Matsudaira's (2008) data set. As discussed above, the goal is to study the impact of summer school on future test scores, and the effect of summer school is identified by a regression discontinuity: at the end of the school year, students take tests in math and reading; those failing either of these tests must attend summer school. Here, we focus on the 2001 class of graduating fifth graders and filter the sample to include only the $n=30,741$ students whose fifth-grade math and reading scores fall between 40 points of the passing threshold; this represents $44.7%$ of the full sample. Matsudaira (2008) analyzed this data set with univariate methods by using reading score as a running variable and considering only the subset of students who passed the math or reading exam. This allows for a simple analysis, but may also result in a loss of precision.^{13} Not all students mandated to attend summer school in fact attend, and some students who pass both tests still need to attend for reasons discussed in Matsudaira (2008). That being said, the effect of passing tests on summer school attendance is quite strong; furthermore, the treatment effect of being mandated to summer school is interesting in its own right, so here we perform an intent-to-treat analysis without considering noncompliance. We consider both of our optimized estimators, equations 6 and 7.

In order to proceed with our approach, we again need to choose a value for $B$. Running a second-order polynomial regression on the next year's math and reading scores for both treated and control students separately, we find the largest curvature effect among the reading score of control students: roughly, a curvature of $0.46\xd740-2$ along the $(1,2)$ direction. Thus, we run our algorithm with both an optimistic choice of $B=0.5\xd740-2$ and a more conservative choice $B=1.0\xd740-2$.

The top row of figure 4 compares weight functions $\gamma ^$ learned by both methods. The estimator $\tau ^c$ is in fact quite conservative and gives large weights only to students who scored close to $c$. Our choice of estimating the conditional average treatment effect at $(0,0)$ may have been particularly challenging, as it is in a corner of control space and so does not have particularly many control neighbors. In contrast, the weighted method $\tau ^*$ appears to have effectively learned matching: it constructs pairs of observations all along the treatment discontinuity, thus allowing it to use more data while cancelling out curvature effects due to $\mu 0(x)$. In this sample, it is much more common to fail math and pass reading than vice versa; thus, the mean of the samples used for “matching” lies closer to the math pass/fail boundary than the reading one.

The bottom panel of table 1 displays corresponding estimates for the treatment effect. As expected, the confidence intervals using the weighted method, equation (7), are much shorter than those obtained using equation (6), allowing for a 0.95 level significant detection in the first case but not in the second. Since the weighting method allows for shorter confidence intervals and in practice seems to yield a matching-like estimator, we expect it to be more often applicable than the unweighted estimator, equation (6).

The bottom row of figure 4 presents some further diagnostics for our result. The first two panels depict a sensitivity analysis for our weighted CATE result. We vary the maximum bound $B$ on the second derivative and examine how our confidence intervals change.^{14} The result on the effect of summer school on math scores appears to be fairly robust, as we still get significant bias-aware 95% confidence intervals at $B=2\xd740-2$, which is four times the largest apparent curvature observed in the data. The third panel plots a measure of the effective size of the treated and control samples used by the algorithm, $ESSw=1/\u2211i:Wi=w\gamma ^i2$. Although our analysis sample has almost exactly the same number of treated and control units ($W\xaf=0.501$), it appears that our algorithm is able to make use of more control than treated samples, perhaps because the treated units are “wrapped around” the controls.

Finally, it is natural to ask whether the bivariate specification considered here gave us anything in addition to the simpler approach used by Matsudaira (2008) estimating the treatment effect of summer school on the next year's math exam by running a univariate regression discontinuity analysis on only students who passed the reading exam, and vice versa to the effect on the reading exam.^{15} We ran both of these analyses using our method, equation (15), again considering bounds $B=0.5\xd740-2,1\xd740-2$ on the second derivative. For math, we obtained 95% confidence intervals of $0.083\xb10.040$ and $0.079\xb10.047$ for the smaller and larger $B$-bounds, respectively; for reading, we obtained $0.037\xb10.075$ and $0.030\xb10.090$. In both cases, the corresponding bounds for the weighted estimator, equation (7), in the bottom panel of table 1 are shorter, despite accounting for the possibility of bivariate curvature effects. The difference is particularly strong for the reading outcome, since our estimator $\tau ^*$ can also use students near the math pass/fail boundary for improved precision.^{16}

## V. Discussion

In this paper, we introduced an optimization-based approach to statistical inference in regression discontinuity designs. By using numerical convex optimization tools, we explicitly derive the minimax linear estimator for the regression discontinuity parameter under bounds on the second derivative of the conditional response surface. Because any method based on local linear regression is also a linear estimator of this type, our approach dominates local linear regression in terms of minimax mean-squared error. We also show how our approach can be used to build uniformly valid confidence intervals.

A key advantage of our procedure is that given bounds on the second derivative, estimation of the regression discontinuity parameter is fully automatic. The proposed algorithm is the same whether the running variable is continuous or discrete and whether $\tau (c)$ is point identified. Moreover, it does not depend on the shape of treatment assignment boundary when $X$ is multivariate. We end our discussion with some potential extensions of our approach.

### A. Fuzzy Regression Discontinuities

In this paper, we considered only sharp regression discontinuities, where the treatment assignment $Wi$ is a deterministic function of $Xi$. However, there is also considerable interest in fuzzy discontinuities, where $Wi$ is random but $PWi=1|Xi=x$ has a jump at the threshold $c$ (see Imbens & Lemieux, 2008, for a review). In this case, it is common to interpret the indicator $1Xi\u2265c$ as an instrument and then to estimate a local average treatment effect via two-stage local linear regression (Imbens & Angrist, 1994). By analogy, we can estimate treatment effects with fuzzy regression discontinuity via two-stage optimized designs as $\tau ^LATE=\u2211i=1n\gamma ^iYi/\u2211i=1n\gamma ^iWi$, where the $\gamma ^i$ are obtained as in equation (15) with an appropriate choice penalty on the maximal squared imbalance $t2$. This approach would clearly be consistent based on results established in this paper; however, deriving the best way to trade off bias and variance in specifying $\gamma ^i$ and extending the approach of Donoho (1994) for uniform asymptotic inference is left for future work.

### B. Balancing Auxiliary Covariates

In many applications, we have access to auxiliary covariates $Zi\u2208Rp$ that are predictive of $Yi$ but unrelated to the treatment assignment near the boundary $c$. As discussed in, for example, Imbens and Lemieux (2008), such covariates are not necessary for identification, but controlling for them can increase robustness to hidden biases. One natural way to use such auxiliary covariates in our optimized designs is to require the weights $\gamma ^i$ to balance these covariates, that is, to add a constraint $\u2211i=1n\gamma ^iZij=0forallj=1,\u2026,p$ to the optimization problem, equation (5). In principle, if the distribution of $Zi$ is in fact independent of $Xi$ when $Xi$ is near the threshold $c$, we would expect the balance conditions to hold approximately even if we do not enforce them; however, explicitly enforcing such balance may improve robustness.^{17} If we have an additive, linear dependence of $Yi$ on $Zi$, then enforcing balance would also result in variance reduction, as the conditional variance of our estimator $\tau ^$ would now depend on $Var[Yi|Xi,Zi]$, which is always smaller or equal to $Var[Yi|Xi]$.

### C. Working with Generic Regularity Assumptions

Following standard practice in the regression discontinuity literature, we focused on minimax linear inference under bounds on the second derivative of $\mu w(\xb7)$ (Kolesár & Rothe, 2018; Imbens & Kalyanaraman, 2012). However, our conceptual framework can also be applied with higher-order smoothness assumptions via bounds on the $k$th derivative of $\mu w(\xb7)$ and can easily be combined with other forms of structural information about the conditional response functions (e.g., perhaps we know from theory that the functions $\mu w(\xb7)$ must be concave). Thanks to the flexibility of our optimization-based approach, acting on either of these ideas would simply involve implementing the required software using standard convex optimization libraries.

## Notes

^{1}

Of these papers, our work is most closely related to that of Armstrong and Kolesár (2018), who consider minimax linear estimation in the regression discontinuity model for an “approximately linear” model in the sense of Sacks and Ylvisaker (1978) that places restrictions on second differences relative to the response surface at the threshold. In contrast, we assume bounded second derivatives away from the threshold. An advantage of their approach is that it allows a closed-form solution for the weights. However, a disadvantage is that they allow for jumps in the response surface away from the threshold, which implies that given the same value for our bound on the second derivative and their bound on second differences, our confidence intervals can be substantially shorter (moreover, allowing for discontinuities in the response surface does not seem conceptually attractive given that the assumption of continuity of the conditional expectation at the threshold is fundamental to the regression discontinuity design). We discuss this comparison further in section IC.

^{2}

In the fuzzy regression discontinuity design where the probability of receiving the treatment changes discontinuously at $x=c$, but not necessarily from 0 to 1, the estimand can be written as the ratio of two such differences. The issues we address in this paper also arise in that setting, and our discussion extends naturally to it. See section V for a discussion.

^{3}

One inconvenience that can arise in local linear regression with discrete running variables is that if we use a data-driven rule to pick the bandwidth $h^$ (e.g., the one of Imbens & Kalyanaraman, 2012), we may end up with no data inside the specified range (i.e., there may be no observations with $|Xi-c|\u2264h$). The practitioner is then forced to select a different bandwidth ad hoc. Ideally, methods for regression discontinuity analysis should be fully data driven, even when $X$ is discrete.

^{4}

We note the unfortunate terminological overlap between “local linear regression” estimators of $\tau $ and “linear” estimators of type $\tau ^=\u2211i=1n\gamma ^iYi$. The word *linear* in these two contexts refers to different things. All “local linear regression” estimators are “linear” but not vice versa.

^{5}

As a matter of intellectual curiosity, it is intriguing to ask whether there exist discrete distributions for which the rectangular kernel may work substantially better than the triangular kernel or whether additional algorithmic tweaks—such as using different bandwidths on different sides of the threshold—may have helped (in the above example, we used the same bandwidth for local linear regression on both sides of the boundary). However, from a practical perspective, estimator, equation (5), removes the need to consider such questions in applied data analysis and automatically adapts to the structure of the data at hand.

^{6}

Armstrong and Kolesár (2018) also consider a more general setting where we assume accuracy of the $k$th order Taylor expansion of $\mu w(x)$ around $c$; in fact, our method also extends to this setting. Here, however, we focus on second-derivative bounds, which are by far the most common in applications.

^{7}

The bound on the relative contribution of any single $\gamma ^i$ is needed to obtain a Gaussian limit distribution for $\tau ^$. In related literature, Armstrong and Kolesár (2018) follow Donoho (1994) and sidestep this issue by assuming Gaussian errors $Yi(w)-\mu w(Xi)$, in which case no central limit theorem is needed. Conversely, Athey, Imbens, & Wager (2018) adopt an approach more similar to ours and explicitly bound $\gamma ^i$ from above during the optimization.

^{8}

In the case where $X$ is univariate, the resulting optimization problem is a classical one, and arguments made by Karlin (1973) imply that the weights $\gamma ^i$ can be written as $\gamma ^i=g(Xi)$, where $g$ is a perfect spline; our proposed optimization strategy reflects this fact. However, in the multivariate case, we are not aware of a similar simple characterization.

^{9}

Here, the algorithm assumes that all observations are of roughly the same quality (i.e., we do not know that $\sigma i2$ is lower for some observations than others). If we have a priori information about the relative magnitudes of the conditional variances of different observations, for example, some pairs outcomes $Yi$ are actually aggregated over many observations, then we should run steps 2 and 3 using appropriate inverse-variance weights. Our software allows for such weighting.

^{10}

Only considering data over an a priori-specified “large plausible relevant interval” around $c$ that safely contains all the data relevant to fitting $\tau (c)$ can also be of computational interest. Our method relies on estimating a smooth nonparametric function over the whole range of $x$, and being able to reduce the relevant range of $x$ a priori reduces the required computation. Although defining such plausibility intervals is of course heuristic, our method ought not be too sensitive to how exactly the interval was chosen. For example, in the setup considered in section IIIA, the optimal bandwidth for local linear regression is around three or six years depending on the amount of assumed smoothness (and choosing a good bandwidth is very important); conversely, using plausibility intervals extending ten, fifteen, or twenty years on both sides of $c$ appears to work reasonably well. When running the method (5), one should always make sure that the weights $\gamma ^i$ get very small near the edge of the plausibility interval; if not, the interval should be made larger.

^{11}

An interesting wrinkle is that if we are able to bound $B$ in large samples—but not uniformly—then confidence intervals built using estimated values of $B$ will have asymptotic but not uniform coverage.

^{14}

We note that if the CATE function $\tau (\xb7)$ is not constant, then our weighted CATE estimand $\tau \xaf*=\u2211iWi\gamma ^i\tau (Xi)$ may vary with $B$. This result should thus formally be interpreted as either a sensitivity analysis for the constant treatment effect parameter $\tau $ if we are willing to assume constant treatment effects or as a robustness check for our rejection of the global null hypothesis $\tau (x)=0$ for all $x$.

^{15}

Matsudaira's (2008) estimator is not exactly comparable to the two we consider. For example, when only focusing on students who passed the reading exam, his estimator effectively averages treatment effects over the math pass/fail boundary but not the reading pass/fail boundary. In contrast, we either estimate the conditional average treatment effect at a point $c$, equation (6), or allow for averaging over the full boundary, equation (7). It is unclear whether the restriction of Matsudaira (2008) to averaging over only one of the two boundary segments targets a meaningfully more interpretable estimand than equation (7).

^{16}

The corresponding headline numbers from Matsudaira (2008) are a 95% confidence interval of $0.093\xb10.029$ for the effect on the math score and $0.046\xb10.045$ for the reading score; see tables 2 and 3, reduced-form estimates for fifth graders. These confidence intervals, however, do not formally account for bias. They estimate the discontinuity parameter using a global cubic fit; such methods, however, do not reliably eliminate bias (Gelman & Imbens, forthcoming).

^{17}

A related idea would be to use the covariates $Zi$ for post-hoc specification testing as in Heckman and Hotz (1989) or Imbens and Lemieux (2008). Their strategy is to obtain weights $\gamma ^i$ without looking at the $Zi$, and then to reject the modeling strategy if balance does not hold approximately.

## REFERENCES

## Author notes

We are grateful for helpful comments from Joshua Angrist, Timothy Armstrong, Max Farrell, Michal Kolesár, Christoph Rothe, and Cun-Hui Zhang; seminar participants at Berkeley, Heidelberg, Munich, Stanford and the University of Chicago; as well as the editor and four anonymous referees.

A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/rest_a_00793.