## Abstract

Applied work often studies the effect of a binary variable (“treatment”) using linear models with additive effects. I study the interpretation of the OLS estimands in such models when treatment effects are heterogeneous. I show that the treatment coefficient is a convex combination of two parameters, which under certain conditions can be interpreted as the average treatment effects on the treated and untreated. The weights on these parameters are inversely related to the proportion of observations in each group. Reliance on these implicit weights can have serious consequences for applied work, as I illustrate with two well-known applications. I develop simple diagnostic tools that empirical researchers can use to avoid potential biases. Software for implementing these methods is available in R and Stata. In an important special case, my diagnostics require only the knowledge of the proportion of treated units.

## I. Introduction

The great appeal of the model in equation (1) comes from its simplicity (Angrist & Pischke, 2009). At the same time, however, a large body of evidence demonstrates the importance of heterogeneity in effects (Heckman, 2001; Bitler, Gelbach, & Hoynes, 2006), which is explicitly ruled out by this same model. In this paper, I contribute to the recent literature on interpreting $\tau$, the OLS estimand, when treatment effects are heterogeneous (Angrist, 1998; Humphreys, 2009; Aronow & Samii, 2016). I demonstrate that $\tau$ is a convex combination of two parameters, which under certain conditions can be interpreted as the average treatment effects on the treated (ATT) and untreated (ATU). Surprisingly, the weight that is placed by OLS on the average effect for each group is inversely related to the proportion of observations in this group. The more units are treated, the less weight is placed on ATT. One interpretation of this result is that OLS estimation of the model in equation (1) is generally inappropriate when treatment effects are heterogeneous.

It is also possible, however, to present a more pragmatic view of my main result. I derive a number of corollaries of this result that suggest several diagnostic methods that I recommend to applied researchers. These diagnostics are applicable whenever the researcher is (a) studying the effects of a binary treatment, (b) using OLS, and (c) unwilling to maintain that ATT is exactly equal to ATU. Typically, such a homogeneity assumption would be undesirably strong because those choosing or chosen for treatment may have unusually high or low returns from that treatment, which would directly contradict the equality of ATT and ATU.

In deriving my diagnostics, I assume that the researcher is ultimately interested in ATE, ATT, or both and that she wishes to estimate the model in equation (1) using OLS but is concerned about treatment effect heterogeneity. In this case, my diagnostics are able to detect deviations of the OLS weights from the pattern that would be necessary to consistently estimate a given parameter. These diagnostics are easy to implement and interpret; they are bounded between 0 and 1 in absolute value, and they give the proportion of the difference between ATU and ATT (or between ATT and ATU) that contributes to bias. Thus, if a given diagnostic is close to 0, OLS is likely a reasonable choice, but if a diagnostic is far from 0, other methods should be used.

In an important special case, these diagnostics become particularly simple and immediate to report. If we wish to estimate ATT, this rule-of-thumb variant of my diagnostic is equal to the proportion of treated units, $P(d=1)$; if our goal is to estimate ATE, the diagnostic is equal to $2 \times P(d=1) - 1$, twice the deviation of $P(d=1)$ from 50%. In short, OLS is expected to provide a reasonable approximation to ATE if both groups, treated and untreated, are of similar size. If we wish to estimate ATT, it is necessary that the proportion of treated units is very small.

It follows that OLS might often be substantially biased for ATE, ATT, or both. How common are these biases in practice? In a subset of 37 estimates from Card, Kluve, and Weber (2018), a survey of evaluations of active labor market programs, the mean proportion of treated units is 17.7%.^{1} Using the rule-of-thumb variants of my diagnostics, I establish that on average the difference between the OLS estimand and ATE is expected to correspond to 64.6% of the difference between ATT and ATU. Similarly, the expected difference between OLS and ATT is on average equal to 17.7% of the difference between ATU and ATT. In other words, these biases might often be large.

The remainder of the paper is organized as follows. Section II presents a leading example and the main theoretical results. Section III discusses two empirical applications. In a study of the effects of a training program (LaLonde, 1986), OLS estimates are very similar to $\widehat{ATT}$. Yet in a study of the effects of cash transfers (Aizer et al., 2016), OLS estimates are similar to $\widehat{ATU}$. Section IV concludes. Proofs and several extensions are provided in the online appendixes. The main results are implemented in newly developed R and Stata packages, hettreatreg.

## II. A Weighted Average Interpretation of OLS

### A. Leading Example

To illustrate the problem with OLS weights, consider the classic example of the National Supported Work (NSW) program. Because this program originally involved a social experiment, the difference in mean outcomes between the treated and control units provides an unbiased estimate of the effect of treatment. LaLonde (1986) studies the performance of various estimators at reproducing this experimental benchmark when the experimental controls are replaced by an artificial comparison group from the Current Population Survey (CPS) or the Panel Study of Income Dynamics (PSID). Angrist and Pischke (2009) reanalyze the NSW–CPS data and conclude that OLS estimates of the effect of the NSW program on earnings in 1978 are similar to the experimental benchmark of \$1,794.^{2} In particular, their richest specification delivers an estimate of \$794. As I will show, this conclusion is driven by the small proportion of treated units in these data.

In this example, ATT and ATU are likely to be substantially different. This is because the treated group, unlike the CPS comparison (untreated) group, was highly economically disadvantaged. It is plausible that ATU might be 0 or, due to the opportunity cost of program participation, even negative. Also, only 1.1% of the sample was treated, so ATE and ATU will be similar.

To demonstrate this, I modify the model in equation (1) to include all interactions between $d$ and $X$. Estimation of this expanded model, again using OLS, allows us to separately compute $\widehat{ATE}$, $\widehat{ATT}$, and $\widehat{ATU}$. This method is usually referred to as “regression adjustment” (Wooldridge, 2010) or “Oaxaca–Blinder” (Kline, 2011; Graham & Pinto, 2022). Using the control variables that deliver the estimate of \$794, we obtain $\widehat{ATE} = -\$4{,}930$, $\widehat{ATT} = \$796$, and $\widehat{ATU} = -\$4{,}996$. It turns out that since $\widehat{ATE}$ and $\widehat{ATU}$ are indeed negative, the OLS estimate and $\widehat{ATE}$ have different signs. Moreover, if we represent the OLS estimate as a weighted average of $\widehat{ATT}$ and $\widehat{ATU}$ with weights that sum to unity, we can write $\$794 = \hat{w}_{ATT} \times \$796 + (1 - \hat{w}_{ATT}) \times (-\$4{,}996)$, where $\hat{w}_{ATT}$ is the weight on $\widehat{ATT}$. Solving for $\hat{w}_{ATT}$ yields $\hat{w}_{ATT} = 99.96\%$. In other words, the hypothetical OLS weight on the effect on the treated is similar to the proportion of untreated units, 98.9%.
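The implied weight in this example can be recomputed directly. The snippet below is a minimal sketch that solves the weighted-average equation for the weight on the effect on the treated; the function name `implied_weight` is hypothetical, and the inputs are the rounded dollar figures quoted above, so the result (about 0.9997) matches the reported 99.96% only up to rounding of the inputs.

```python
def implied_weight(ols, att, atu):
    """Solve ols = w * att + (1 - w) * atu for the weight w on att."""
    if att == atu:
        raise ValueError("the weight is not identified when att equals atu")
    return (ols - atu) / (att - atu)


# Rounded estimates from the NSW-CPS example in the text (1978 dollars).
w_att = implied_weight(ols=794.0, att=796.0, atu=-4996.0)
print(f"implied weight on ATT: {w_att:.4f}")
print(f"implied weight on ATU: {1 - w_att:.4f}")
```

Because the two group-specific estimates differ by almost \$5,800, a gap of only \$2 between the OLS estimate and the estimate for the treated translates into a weight on the untreated of well under 0.1%.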

This “weight reversal” is not a coincidence. As I demonstrate below, the intuition from this example holds more generally, even though the OLS estimand is not necessarily a convex combination of two parameters from a procedure that controls for the full vector $X$.

### B. Main Result

This section presents my main result, which focuses on the algebra of OLS and descriptive estimands that I define below. A causal interpretation of OLS also requires introducing the notion of potential outcomes as well as certain conditions that I discuss in section IIC, including an ignorability assumption. However, this is not needed for my main result.

Assumption 1. (i) $E(y^2)$ and $E(\|X\|^2)$ are finite. (ii) The covariance matrix of $(d, X)$ is nonsingular.

Assumption 2. $V[p(X) \mid d=1]$ and $V[p(X) \mid d=0]$ are nonzero, where $V[\cdot \mid \cdot]$ denotes the conditional variance (with respect to $E[p(X) \mid d=j]$, $j=0,1$).

Assumption 1 guarantees the existence and uniqueness of the linear projections in equations (2) and (4). Similarly, assumption 2 ensures that the linear projections in equations (5) and (6) exist and are unique.^{3}

^{4}When the linear projections in equations (5) and (6) represent the conditional mean of $y$, the average partial linear effects of $d$ overlap with its average partial effects. It should be stressed, however, that theorem 1, the main result of this paper, is more general and requires only assumptions 1 and 2.

See online appendix A.

Theorem 1 shows that $\tau$, the OLS estimand, is a convex combination of $\tau_{APLE,1}$ and $\tau_{APLE,0}$. The definition of $\tau_{APLE,j}$ makes it clear that $\tau$ is equivalent to the outcome of a particular three-step procedure. In the first step, we obtain $p(X)$, the propensity score. Next, in the second step, we obtain $\tau_{APLE,1}$ and $\tau_{APLE,0}$, as in equation (8), from two linear projections of $y$ on $p(X)$, separately for $d=1$ and $d=0$. This is analogous to the regression adjustment procedure in section IIA, although now we control for $p(X)$ rather than the full vector $X$. Finally, in the third step, we calculate a weighted average of $\tau_{APLE,1}$ and $\tau_{APLE,0}$. The weight on $\tau_{APLE,1}$, $w_1$, is decreasing in $\frac{V[p(X) \mid d=1]}{V[p(X) \mid d=0]}$ and $\rho$, and the weight on $\tau_{APLE,0}$, $w_0$, is increasing in $\frac{V[p(X) \mid d=1]}{V[p(X) \mid d=0]}$ and $\rho$.^{5} This is clearly undesirable, since $\tau_{APLE} = \rho \times \tau_{APLE,1} + (1-\rho) \times \tau_{APLE,0}$.
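The three-step procedure can be checked numerically. The sketch below is an illustration on simulated data, not the implementation in hettreatreg. It assumes that $\tau_{APLE,j} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times E[p(X) \mid d=j]$, where $(\alpha_j, \gamma_j)$ are the group-specific projection coefficients, and that $w_1 = \frac{(1-\rho) V[p(X) \mid d=0]}{\rho V[p(X) \mid d=1] + (1-\rho) V[p(X) \mid d=0]}$ with $w_0 = 1 - w_1$. Both expressions are assumptions of this sketch, chosen to agree with the monotonicity described above; the function name and the data-generating process are hypothetical.

```python
import numpy as np


def three_step_decomposition(y, d, X):
    """Return (tau, tau_1, tau_0, w_1, w_0) for a binary treatment d."""
    n = len(y)
    const = np.ones(n)

    # OLS coefficient on d from the regression of y on (1, d, X).
    tau = np.linalg.lstsq(np.column_stack([const, d, X]), y, rcond=None)[0][1]

    # Step 1: p(X) is the fitted value from the linear projection of d on (1, X).
    W = np.column_stack([const, X])
    p = W @ np.linalg.lstsq(W, d, rcond=None)[0]

    # Step 2: project y on (1, p(X)) separately for treated and untreated units.
    treated = d == 1

    def fit(mask):
        A = np.column_stack([np.ones(mask.sum()), p[mask]])
        return np.linalg.lstsq(A, y[mask], rcond=None)[0]

    a1, g1 = fit(treated)
    a0, g0 = fit(~treated)
    tau_1 = (a1 - a0) + (g1 - g0) * p[treated].mean()
    tau_0 = (a1 - a0) + (g1 - g0) * p[~treated].mean()

    # Step 3: weights from the group shares and the variances of p(X)
    # (assumed form; population-style variances, ddof=0, in both groups).
    rho = treated.mean()
    v1, v0 = p[treated].var(), p[~treated].var()
    w_1 = (1 - rho) * v0 / (rho * v1 + (1 - rho) * v0)
    return tau, tau_1, tau_0, w_1, 1 - w_1


rng = np.random.default_rng(0)
n = 5_000
X = rng.normal(size=(n, 2))
d = (X[:, 0] + rng.normal(size=n) > 1.0).astype(float)
# Heterogeneous effect: the gain from treatment depends on the first covariate.
y = 1.0 + X @ np.array([0.5, -0.3]) + d * (2.0 + X[:, 0]) + rng.normal(size=n)

tau, tau_1, tau_0, w_1, w_0 = three_step_decomposition(y, d, X)
print(f"OLS: {tau:.4f}, weighted average: {w_1 * tau_1 + w_0 * tau_0:.4f}")
print(f"share treated: {d.mean():.3f}, weight on tau_1: {w_1:.3f}")
```

Under these assumptions the two printed estimates should agree up to floating-point error, since the decomposition is an algebraic property of least squares rather than a large-sample approximation.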

This weighting scheme is also surprising: the more units belong to group $j$, the less weight is placed on $\tau_{APLE,j}$, the effect for this group. There are several ways to provide intuition for this result. One is provided in the next section. Another follows from an alternative proof of theorem 1, which is provided with discussion in online appendix B2. It parallels the intuition in Angrist (1998) and Angrist and Pischke (2009) that OLS gives more weight to treatment effects that are better estimated in finite samples.^{6}

### C. Causal Interpretation

The fact that theorem 1 requires only the existence and uniqueness of several linear projections makes this result very general. However, one concern about this result might be that $\tau_{APLE,1}$ and $\tau_{APLE,0}$ do not necessarily correspond to the usual (causal) objects of interest. To define these objects, we need two potential outcomes, $y(1)$ and $y(0)$, only one of which is observed for each unit, $y = y(d) = y(1) \times d + y(0) \times (1-d)$. The parameters of interest, ATE, ATT, and ATU, are defined as $\tau_{ATE} = E[y(1)-y(0)]$, $\tau_{ATT} = E[y(1)-y(0) \mid d=1]$, and $\tau_{ATU} = E[y(1)-y(0) \mid d=0]$. A causal interpretation of OLS also entails the following assumptions.

Assumption 3 (Ignorability in Mean). (i) $E[y(1) \mid X, d] = E[y(1) \mid X]$; and (ii) $E[y(0) \mid X, d] = E[y(0) \mid X]$.

Assumption 4. (i) $E[y(1) \mid X] = \alpha_1 + \gamma_1 \times p(X)$; and (ii) $E[y(0) \mid X] = \alpha_0 + \gamma_0 \times p(X)$.

Assumptions 3 and 4 ensure that $\tau$ admits a causal interpretation. Assumption 3 is standard in the program evaluation literature (Wooldridge, 2010). Assumption 4 is not commonly used. Sufficient for this assumption, but not necessary, is that the conditional mean of $d$ is linear in $X$ and the conditional means of $y(1)$ and $y(0)$ are linear in the true propensity score, which is now equal to $p(X)$. Linearity of $E[d \mid X]$ is assumed in Aronow and Samii (2016) and Abadie et al. (2020). This assumption is not necessarily strong, since $X$ might include powers and cross-products of original control variables. It is also satisfied automatically in saturated models, as in Angrist (1998) and Humphreys (2009). The linearity assumption for $E[y(1) \mid p(X)]$ and $E[y(0) \mid p(X)]$ dates back to Rosenbaum and Rubin (1983) but is restrictive. See also Imbens and Wooldridge (2009) and Wooldridge (2010) for a discussion.

Assumption 3 implies that $E[y(1)-y(0) \mid X] = E[y \mid X, d=1] - E[y \mid X, d=0]$. Then, assumption 4 implies that $E[y(1)-y(0) \mid X] = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times p(X)$, which in turn implies that $\tau_{ATT} = \tau_{APLE,1}$ and $\tau_{ATU} = \tau_{APLE,0}$. This, together with theorem 1, completes the proof.

Corollary 1 states that under assumptions 1, 2, 3, and 4, the OLS weights from theorem 1 apply to the causal objects of interest, $\tau_{ATT}$ and $\tau_{ATU}$. Hence, $\tau$ has a causal interpretation. The greater the proportion of treated units, the smaller is the OLS weight on $\tau_{ATT}$. Again, this is undesirable since $\tau_{ATE} = \rho \times \tau_{ATT} + (1-\rho) \times \tau_{ATU}$.

To aid intuition for this surprising result, recall that an important motivation for using the model in equation (1) and OLS is that the linear projection of $y$ on $d$ and $X$ provides the best linear predictor of $y$ given $d$ and $X$ (Angrist & Pischke, 2009). However, if our goal is to conduct causal inference, then this is not, in fact, a good reason to use this method. Ordinary least squares is “best” in predicting actual outcomes, but causal inference is about predicting missing outcomes, defined as $y_m = y(1) \times (1-d) + y(0) \times d$. In other words, the OLS weights are optimal for predicting “what is.” Instead, we are interested in predicting “what would be” if treatment were assigned differently.

Intuition suggests that if our goal were to predict “what is” and, without loss of generality, group 1 were substantially larger than group 0, we would like to place a large weight on the linear projection coefficients of group 1 ($\alpha_1$ and $\gamma_1$), because these coefficients can be used to predict actual outcomes of this group. As noted by Deaton (1997) and Solon et al. (2015), the OLS weights are consistent with this idea.

where $\beta_1$ and $\beta_0$ are the coefficients on $X$ in the conditional means of $y(1)$ and $y(0)$, respectively. Equations (9) and (10) reiterate the point of corollary 1 that $\tau$ and $\tau_{ATE}$ have a very similar structure but differ substantially in how they assign weights. Indeed, in the case of $\tau_{ATE}$, when group 1 is large, the weight on $\beta_1$ is small, the opposite of what we have seen for OLS.^{7}

### D. Implications of Theorem 1

There are several practical implications of my main result. Throughout this section, I assume that the researcher is interested in estimating $\tau_{ATE}$, $\tau_{ATT}$, or both, and wishes to use OLS to estimate the model in equation (1) but is concerned about the implications of theorem 1 and corollary 1. In corollaries 2 and 3, I show how to decompose the difference between $\tau$ and $\tau_{ATE}$ or $\tau$ and $\tau_{ATT}$ into components attributable to (a) the difference between $\tau_{APLE,1}$ and $\tau_{ATT}$, (b) the difference between $\tau_{APLE,0}$ and $\tau_{ATU}$ (jointly referred to as “bias from nonlinearity”), and (c) the OLS weights on $\tau_{ATT}$ and $\tau_{ATU}$ (“bias from heterogeneity”).^{8} Because this paper generally focuses on what I now term “bias from heterogeneity,” my discussion below is restricted to this source of bias, which is equivalent to implicitly making assumptions 3 and 4.

The proofs of corollaries 2 and 3 follow from simple algebra and are omitted. These results show that regardless of whether we focus on $\tau_{ATE}$ or $\tau_{ATT}$, the bias from heterogeneity is equal to the product of a particular measure of heterogeneity, namely, the difference between $\tau_{ATU}$ and $\tau_{ATT}$, and an additional parameter that is easy to estimate, $\delta$ for $\tau_{ATE}$ and $w_0$ for $\tau_{ATT}$. While $w_0$ is guaranteed to be positive under assumptions 1 and 2, $\delta$ may be positive or negative. Both $w_0$ and $\delta$, however, are bounded between 0 and 1 in absolute value. Thus, $w_0$ and $|\delta|$ can be interpreted as the percentage of our measure of heterogeneity, $\tau_{ATU} - \tau_{ATT}$, which contributes to bias.^{9} It might be useful to report estimates of $w_0$ and $\delta$ in studies that use OLS to estimate the model in equation (1).

As an example, consider the empirical application in section IIA. In this case, $\hat{w}_0 = 0.017$ and $\hat{\delta} = -0.971$. The interpretation of these estimates is as follows: if our goal is to estimate $\tau_{ATT}$, using the model in equation (1) and OLS is expected to bias our estimates by only 1.7% of the difference between $\tau_{ATU}$ and $\tau_{ATT}$. If instead we wanted to interpret $\tau$ as $\tau_{ATE}$, our estimates would be biased by an estimated 97.1% of the difference between $\tau_{ATT}$ and $\tau_{ATU}$. Thus, in this application, it might perhaps be acceptable to interpret $\tau$ as $\tau_{ATT}$ but clearly not as $\tau_{ATE}$.
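Both diagnostics can be recovered from the OLS weight and the treated share. Subtracting $\tau_{ATE} = \rho \times \tau_{ATT} + (1-\rho) \times \tau_{ATU}$ from $\tau = w_1 \times \tau_{ATT} + w_0 \times \tau_{ATU}$ gives $\tau - \tau_{ATE} = (\rho - w_1) \times (\tau_{ATU} - \tau_{ATT})$, so the sketch below takes $\delta = \rho - w_1$. That expression is derived here from the two convex combinations rather than quoted from corollary 2, and the function name is hypothetical. With inputs rounded to three decimals, the result for $\delta$ is about $-0.97$, in line with the estimate reported above.

```python
def heterogeneity_diagnostics(w_0, rho):
    """Return (w_0, delta): shares of (ATU - ATT) that bias OLS for ATT and ATE."""
    if not (0.0 <= w_0 <= 1.0 and 0.0 <= rho <= 1.0):
        raise ValueError("w_0 and rho must both lie in [0, 1]")
    w_1 = 1.0 - w_0
    delta = rho - w_1  # assumed form, derived from the two weighted averages
    return w_0, delta


# Rounded inputs for the NSW-CPS example: weight on ATU and share of treated units.
w_0, delta = heterogeneity_diagnostics(w_0=0.017, rho=0.011)
print(f"share of (ATU - ATT) that biases OLS for ATT: {w_0:.3f}")
print(f"share of (ATU - ATT) that biases OLS for ATE: {delta:.3f}")
```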

Assumption 5. $V[p(X) \mid d=1] = V[p(X) \mid d=0]$.

The calculation of $\delta$ and $w_0$ is further simplified under assumption 5. If we use $\delta^*$ and $w_0^*$ to denote the values of $\delta$ and $w_0$ in this special case, we can write $\delta^* = 2\rho - 1$ and $w_0^* = \rho$. In this setting, the knowledge of $\delta$ and $w_0$ requires only information on $\rho$, the proportion of units with $d=1$. Of course, the special case where $V[p(X) \mid d=1] = V[p(X) \mid d=0]$ is hardly to be expected in practice. Still, $\delta^* = 2\rho - 1$ and $w_0^* = \rho$ can potentially serve as a rule of thumb.
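Since the rule of thumb depends on $\rho$ alone, it takes one line to compute. The sketch below, with a hypothetical function name, evaluates it at three treated shares that appear in this paper: 1.1% (the NSW–CPS sample), 17.7% (the mean across the 37 estimates discussed in the introduction), and 87.5% (the cash transfers application).

```python
def rule_of_thumb(rho):
    """Return (w0_star, delta_star) for a treated share rho in [0, 1]."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be a proportion between 0 and 1")
    return rho, 2.0 * rho - 1.0


for rho in (0.011, 0.177, 0.875):
    w0_star, delta_star = rule_of_thumb(rho)
    print(f"rho = {rho:.3f}: w0* = {w0_star:.3f}, delta* = {delta_star:+.3f}")
```

At $\rho = 0.177$, the rule of thumb gives $\delta^* = -0.646$, the 64.6% figure cited in the introduction.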

The practical implications of assumption 5 are particularly clear when $\rho$ is close to 0%, 50%, or 100%. When few units are treated, $\tau \simeq \tau_{ATT}$. When most of the units are treated, $\tau \simeq \tau_{ATU}$. Finally, when both groups are of similar size, $\tau \simeq \tau_{ATE}$. This can also be seen from corollary 4:

The proof follows immediately from simple algebra. Corollary 4 provides conditions under which OLS reverses the natural weights on $\tau_{APLE,1}$ and $\tau_{APLE,0}$ (or $\tau_{ATT}$ and $\tau_{ATU}$). Indeed, under assumption 5, $\tau$ is a convex combination of group-specific average effects, with reversed weights attached to these parameters. Namely, the proportion of units with $d=1$ is used to weight the average effect of $d$ on group 0, and vice versa.
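The weight reversal is easy to see with made-up numbers. The sketch below writes the reversed-weights combination as $(1-\rho) \times \tau_{ATT} + \rho \times \tau_{ATU}$, which is what $w_0^* = \rho$ implies under assumption 5; it is spelled out here for illustration rather than copied from the statement of corollary 4. The effect sizes (1 for the treated, 0 for the untreated) and the function names are hypothetical.

```python
def ols_with_reversed_weights(att, atu, rho):
    """OLS estimand under equal variances: weight (1 - rho) on ATT and rho on ATU."""
    return (1.0 - rho) * att + rho * atu


def average_treatment_effect(att, atu, rho):
    """ATE with the natural weights: rho on ATT and (1 - rho) on ATU."""
    return rho * att + (1.0 - rho) * atu


att, atu = 1.0, 0.0  # hypothetical effects, chosen only for illustration
for rho in (0.01, 0.50, 0.99):
    ols = ols_with_reversed_weights(att, atu, rho)
    ate = average_treatment_effect(att, atu, rho)
    print(f"rho = {rho:.2f}: OLS = {ols:.2f}, ATE = {ate:.2f}")
```

With these values OLS is close to the effect on the treated when almost no one is treated, close to the effect on the untreated when almost everyone is, and equal to ATE only at $\rho = 0.5$.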

The results in this section allow empirical researchers to interpret the OLS estimand when treatment effects are heterogeneous. Alternatively, it might be sensible to use any of the standard estimators for average treatment effects under ignorability, such as regression adjustment (see section IIA), weighting, matching, and various combinations of these approaches.^{10} It might also help to estimate a model with homogeneous effects using weighted least squares (WLS). Indeed, in online appendix B3, I demonstrate that when we regress $y$ on $d$ and $p(X)$, with weights of $\frac{1-\rho}{w_0}$ for units with $d=1$ and $\frac{\rho}{w_1}$ for units with $d=0$, the WLS estimand is equal to $\tau_{APLE}$. In practice, of course, $\tau_{APLE}$ can also be obtained directly from equation (7).
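The WLS correction can be illustrated on simulated data. The sketch below is not the derivation in online appendix B3. It assumes the same forms for $\tau_{APLE,j}$ and for the OLS weights as the three-step procedure, namely $\tau_{APLE,j} = (\alpha_1 - \alpha_0) + (\gamma_1 - \gamma_0) \times E[p(X) \mid d=j]$ and $w_1 = \frac{(1-\rho) V[p(X) \mid d=0]}{\rho V[p(X) \mid d=1] + (1-\rho) V[p(X) \mid d=0]}$, and it computes $\tau_{APLE}$ as $\rho \times \tau_{APLE,1} + (1-\rho) \times \tau_{APLE,0}$. The data-generating process is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4_000
X = rng.normal(size=(n, 2))
d = (X[:, 0] + rng.normal(size=n) > 1.0).astype(float)
y = 1.0 + X @ np.array([0.5, -0.3]) + d * (2.0 + X[:, 0]) + rng.normal(size=n)
treated = d == 1

# p(X): fitted values from the linear projection of d on (1, X).
W = np.column_stack([np.ones(n), X])
p = W @ np.linalg.lstsq(W, d, rcond=None)[0]


def fit(mask):
    """Intercept and slope from projecting y on (1, p(X)) within one group."""
    A = np.column_stack([np.ones(mask.sum()), p[mask]])
    return np.linalg.lstsq(A, y[mask], rcond=None)[0]


(a1, g1), (a0, g0) = fit(treated), fit(~treated)
rho = treated.mean()
tau_1 = (a1 - a0) + (g1 - g0) * p[treated].mean()
tau_0 = (a1 - a0) + (g1 - g0) * p[~treated].mean()
tau_aple = rho * tau_1 + (1 - rho) * tau_0

# OLS weights (assumed form), then the WLS weights described in the text.
v1, v0 = p[treated].var(), p[~treated].var()
w_1 = (1 - rho) * v0 / (rho * v1 + (1 - rho) * v0)
w_0 = 1 - w_1
weights = np.where(treated, (1 - rho) / w_0, rho / w_1)

# WLS of y on (1, d, p(X)): scale each row by the square root of its weight.
s = np.sqrt(weights)
Z = np.column_stack([np.ones(n), d, p])
tau_wls = np.linalg.lstsq(Z * s[:, None], y * s, rcond=None)[0][1]
print(f"WLS: {tau_wls:.4f}, tau_APLE: {tau_aple:.4f}")
```

The reweighting works by undoing the reversal: it rescales each group so that the implicit least squares weights become $\rho$ and $1-\rho$.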

### E. Related Work

where $\tau_s = E[y \mid d=1, x_s=1] - E[y \mid d=0, x_s=1]$. In online appendix B4, I demonstrate that this result follows from corollary 1 when the model for $y$ is saturated.^{11} At the same time, the interpretation of OLS in Angrist (1998) is different from theorem 1 and corollary 1. On the one hand, unlike corollary 1 and Humphreys (2009), Angrist (1998) does not restrict the relationship between $\tau_s$ and $P(d=1 \mid x_s=1)$ in any way. On the other hand, theorem 1 and corollary 1 make it arguably easier to identify whether in a given application the OLS estimand will be close to any of the parameters of interest (cf. corollaries 2 to 4). In particular, Angrist (1998) does not recover a pattern of weight reversal, which is discussed in detail in this paper.

Unlike Angrist (1998), Humphreys (2009) does not derive a new representation of $\tau_n$, instead presenting further analysis of the result in equation (12). In particular, Humphreys (2009) notes that $\tau_n$ can take any value between $\min(\tau_s)$ and $\max(\tau_s)$. Then he demonstrates that $\tau_n$ is also bounded by $\tau_{ATT}$ and $\tau_{ATU}$ if we restrict the relationship between $\tau_s$ and $P(d=1 \mid x_s=1)$ to be monotonic. According to corollary 1, $\tau$ is a convex combination of $\tau_{ATT}$ and $\tau_{ATU}$ if, among other things, both potential outcomes are linear in $p(X)$, which also implies a linear relationship between $\tau_s$ and $P(d=1 \mid x_s=1)$ when the model for $y$ is saturated. Of course, this linearity assumption is stronger than the monotonicity assumption in Humphreys (2009). However, in return, we are able to derive a closed-form expression for $\tau$ in terms of $\tau_{ATT}$ and $\tau_{ATU}$, a major advantage over the earlier literature, such as Angrist (1998) and Humphreys (2009).^{12}

## III. Empirical Applications

This section discusses two empirical illustrations of theorem 1 and its corollaries.^{13} In online appendixes C and D, I discuss the implementation of these results in Stata and R. Throughout this section, $\tau_{APLE}$, $\tau_{APLE,1}$, and $\tau_{APLE,0}$ are implicitly treated as equivalent to $\tau_{ATE}$, $\tau_{ATT}$, and $\tau_{ATU}$, respectively. Although this might be restrictive, I also demonstrate that in both applications sample analogs of $\tau_{APLE}$, $\tau_{APLE,1}$, and $\tau_{APLE,0}$, reported in the body of the paper, are similar to other estimates of $\tau_{ATE}$, $\tau_{ATT}$, and $\tau_{ATU}$, reported in online appendix E.

### A. The Effects of a Training Program on Earnings

I first consider the example from section IIA in more detail. This replication of the study of the effects of the NSW program in Angrist and Pischke (2009) constitutes an optimistic scenario for OLS. In this application, as I explained in section IIA, the effect for the treated group (ATT) is likely to be substantially larger than the effect for the CPS comparison group (ATU). Moreover, since the experimental benchmark of \$1,794 corresponds to $\widehat{ATT}$ and not to $\widehat{ATU}$, the researcher should also focus on ATT. It turns out that my diagnostic for estimating ATT, $\hat{w}_0$, indicates that this parameter should approximately be recovered by OLS, even if treatment effects are heterogeneous.^{14}

The top and middle panels of table 1 reproduce the estimates from Angrist and Pischke (2009) and report my diagnostics. The specification in column 4 was discussed in section IIA. It turns out that $\hat{w}_0$ is between 0.1% and 1.9% for all specifications; similarly, the rule-of-thumb value of this diagnostic, $\hat{w}_0^*$, is, as always, equal to the proportion of treated units (only 1.1% in this sample). These results are very simple to interpret. As in section IID, we estimate that the difference between the OLS estimand and ATT is less than 2% of the difference between ATU and ATT. In this case, it might indeed be sensible to rely on the OLS estimates of the effect of treatment.

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Original estimates | | | | |
| OLS | −3,437$^{***}$ | −78 | 623 | 794 |
| | (612) | (596) | (610) | (619) |
| Diagnostics | | | | |
| $\hat{w}_0$ | 0.019 | 0.001 | 0.017 | 0.017 |
| $\hat{w}_0^* = \hat{\rho}$ | 0.011 | 0.011 | 0.011 | 0.011 |
| $\hat{\delta}$ | −0.970 | −0.987 | −0.971 | −0.971 |
| $\hat{\delta}^* = 2\hat{\rho} - 1$ | −0.977 | −0.977 | −0.977 | −0.977 |
| Decomposition | | | | |
| $\widehat{ATT}$ | −3,373$^{***}$ | −69 | 754 | 928 |
| | (620) | (595) | (619) | (630) |
| $\hat{w}_1$ | 0.981 | 0.999 | 0.983 | 0.983 |
| $\widehat{ATU}$ | −6,753$^{***}$ | −6,289$^{**}$ | −6,841$^{***}$ | −6,840$^{***}$ |
| | (1,219) | (2,807) | (1,294) | (1,319) |
| $\hat{w}_0$ | 0.019 | 0.001 | 0.017 | 0.017 |
| $\widehat{ATE}$ | −6,714$^{***}$ | −6,218$^{**}$ | −6,754$^{***}$ | −6,751$^{***}$ |
| | (1,206) | (2,777) | (1,281) | (1,305) |
| Demographic controls | ✓ | | ✓ | ✓ |
| Earnings in 1974 | | | | ✓ |
| Earnings in 1975 | | ✓ | ✓ | ✓ |
| $\hat{\rho} = \hat{P}(d=1)$ | 0.011 | 0.011 | 0.011 | 0.011 |
| Observations | 16,177 | 16,177 | 16,177 | 16,177 |

The estimates in the top panel correspond to column 2 in table 3.3.3 in Angrist and Pischke (2009, p. 89). The dependent variable is earnings in 1978. Demographic controls include age, age squared, years of schooling, and indicators for married, high school dropout, Black, and Hispanic. For treated individuals, earnings in 1974 correspond to real earnings in months 13 to 24 prior to randomization, which overlaps with calendar year 1974 for a number of individuals. Formulas for $w_0$, $w_1$, and $\delta$ are given in theorem 1 and corollary 2. Following these results, $OLS = \hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}$. Estimates of ATE, ATT, and ATU are sample analogs of $\tau_{APLE}$, $\tau_{APLE,1}$, and $\tau_{APLE,0}$, respectively. Also, $\widehat{ATE} = \hat{\rho} \times \widehat{ATT} + (1-\hat{\rho}) \times \widehat{ATU}$. Huber–White standard errors (OLS) and bootstrap standard errors ($\widehat{ATE}$, $\widehat{ATT}$, and $\widehat{ATU}$) are in parentheses. Statistically significant at $*$10%, $**$5%, and $***$1%.
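The two identities in these notes can be checked against the rounded entries of table 1. The sketch below recomputes the OLS estimate and $\widehat{ATE}$ in each column from the decomposition estimates. Because the weights and $\hat{\rho}$ are rounded to three decimals while the group-specific estimates differ by several thousand dollars, gaps of a few dollars are to be expected; the variable names are hypothetical.

```python
# Rounded estimates from columns (1)-(4) of table 1 (1978 dollars).
ols = [-3437, -78, 623, 794]
w_1 = [0.981, 0.999, 0.983, 0.983]
att = [-3373, -69, 754, 928]
atu = [-6753, -6289, -6841, -6840]
ate = [-6714, -6218, -6754, -6751]
rho = 0.011  # rounded share of treated units

gaps_ols, gaps_ate = [], []
for col in range(4):
    # OLS as a weighted average with the OLS weights.
    implied_ols = w_1[col] * att[col] + (1 - w_1[col]) * atu[col]
    # ATE as a weighted average with the group shares.
    implied_ate = rho * att[col] + (1 - rho) * atu[col]
    gaps_ols.append(abs(implied_ols - ols[col]))
    gaps_ate.append(abs(implied_ate - ate[col]))
    print(f"({col + 1}) OLS {ols[col]:>7,} vs {implied_ols:>9,.1f}; "
          f"ATE {ate[col]:>7,} vs {implied_ate:>9,.1f}")
```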

The bottom panel of table 1 provides an application of corollary 1 to these results. In other words, the estimates from Angrist and Pischke (2009) are now decomposed into two components, $\widehat{ATT}$ and $\widehat{ATU}$. The difference between these estimates is substantial. In column 4, while the estimate of ATT is \$928, ATU is estimated to be −\$6,840. In other words, the OLS estimate of \$794, reported in Angrist and Pischke (2009) and discussed in section IIA, is actually a weighted average of these two estimates. The fact that it is close to \$928, and not to −\$6,840, is a consequence of the small proportion of treated units in this sample, 1.1%. The weight on \$928, $\hat{w}_1$, is 98.3%, and the weight on −\$6,840, $\hat{w}_0$, is only 1.7%.

We might expect that if the proportion of treated units were larger, the weight on $\widehat{ATT}$ would be smaller and the performance of OLS in replicating the experimental benchmark would deteriorate. I confirm this conjecture in online appendix E1 by quasi-discarding random subsamples of untreated units over a range of sample sizes. In particular, I reestimate the model in equation (1) using WLS, with weights of 1 for treated and $\frac{1}{k}$ for untreated units. Figures E1.1 to E1.4 show that in this application WLS estimates become more negative as $k$ increases. This is because larger values of $k$ correspond to greater proportions of untreated units being “discarded,” and hence larger weights on $\widehat{ATU}$, which is substantially more negative than $\widehat{ATT}$.
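A back-of-the-envelope version of this exercise uses only the rule of thumb. The sketch below is not the WLS computation in online appendix E1: it assumes equal conditional variances of $p(X)$, so that the weight on the effect on the untreated equals the effective treated share, and it treats down-weighting untreated units by $\frac{1}{k}$ as shrinking the untreated group by a factor of $k$. The function name is hypothetical.

```python
def effective_treated_share(rho, k):
    """Treated share after untreated units receive weight 1/k (treated keep weight 1)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return rho / (rho + (1.0 - rho) / k)


rho = 0.011  # rounded share of treated units in the NSW-CPS sample
for k in (1, 10, 100):
    share = effective_treated_share(rho, k)
    # Under the rule of thumb, this share is also the weight on the ATU estimate.
    print(f"k = {k:>3}: effective treated share = {share:.3f}")
```

Under these assumptions, the weight on $\widehat{ATU}$ rises from about 1% at $k=1$ to about 10% at $k=10$ and to just over one-half at $k=100$, which is the direction of the pattern described above.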

Additional extensions of my analysis are also presented in online appendix E1. For each specification in table 1, I provide both a linear and a nonparametric estimate of the conditional mean of the outcome given $p(X)$, separately for treated and untreated units (figures E1.5 to E1.8). A visual comparison of both estimates provides an informal test of assumption 4, which is necessary for a causal interpretation of $\tau_{APLE}$, $\tau_{APLE,1}$, and $\tau_{APLE,0}$. The linearity assumption appears to be approximately satisfied for the treated but usually not for the untreated units.

Thus, as a robustness check, I also report a number of alternative estimates of the effects of the NSW program in table E1.1. I consider regression adjustment, as in section IIA, as well as matching on $p(X)$ and on the logit propensity score.^{15} In each case, I separately estimate ATE, ATT, and ATU. These estimates are consistent with the claim that the general pattern of results in table 1 is driven by the OLS weights. The estimates of ATE and ATU are always negative and large in magnitude; the estimates of ATT are much closer to the experimental benchmark.

Finally, I repeat the following exercise from section IIA. When we match the OLS estimates in table 1 with the corresponding estimates of ATT and ATU in table E1.1, we can write $\hat{\tau} = \hat{w}_{ATT} \times \hat{\tau}_{ATT} + (1 - \hat{w}_{ATT}) \times \hat{\tau}_{ATU}$. Unless $\hat{\tau}_{ATT}$ and $\hat{\tau}_{ATU}$ are sample analogs of $\tau_{APLE,1}$ and $\tau_{APLE,0}$, $\hat{w}_{ATT}$ does not need to be bounded between 0 and 1. Yet we can solve for $\hat{w}_{ATT}$ for each set of estimates. The mean of $\hat{w}_{ATT}$ across all sets of estimates in table E1.1 is 98.3%, which is nearly identical to the sample proportion of untreated units, 98.9%. This is reassuring for my claims.

### B. The Effects of Cash Transfers on Longevity

In my second application, I replicate a recent paper by Aizer et al. (2016) and study the effects of cash transfers on longevity of the children of their beneficiaries, as measured by their log age at death. In particular, Aizer et al. (2016) analyze the administrative records of applicants to the Mothers' Pension (MP) program, which supported poor mothers with dependent children in the pre–World War II United States. In this study, the untreated group consists only of children of mothers who applied for a transfer and were initially deemed eligible but were ultimately rejected. This strategy is used to ensure that treated and untreated individuals are broadly comparable, and hence an ignorability assumption might be plausible. Nevertheless, rejected mothers were slightly older and came from slightly smaller and richer families than accepted mothers. Thus, as before, there is no reason to believe that ATT and ATU are equal, although it is perhaps less clear a priori which is larger. Unlike in section IIIA, it seems plausible that the researcher might be interested in either the average effect of cash transfers, ATE, or in their average effect for accepted applicants, ATT.

The top and middle panels of table 2 reproduce the baseline estimates from Aizer et al. (2016) and report my diagnostics. While the OLS estimates are positive and statistically significant, my diagnostics indicate that these results should be approached with caution. Namely, treated units constitute the vast majority (or 87.5%) of the sample. It follows that OLS is expected to place a disproportionately large weight on $\widehat{ATU}$, in which case the OLS estimates might be very biased for both ATE and ATT (see corollaries 2 and 3). Indeed, my estimates of $\delta$ suggest that the difference between the OLS estimand and ATE is equal to 65.9% to 74.5% of the difference between ATU and ATT. Also, the estimates of $w_0$ suggest that the difference between OLS and ATT corresponds to 78.4% to 87.0% of this measure of heterogeneity. The estimates of $\delta^*$ and $w_0^*$ are similar. It turns out that in this application the OLS estimates might be substantially biased for both of our parameters of interest. This would be a pessimistic scenario for OLS.

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Original estimates | | | | |
| OLS | $0.0157^{***}$ | $0.0158^{***}$ | $0.0182^{***}$ | $0.0167^{***}$ |
| | (0.0058) | (0.0059) | (0.0062) | (0.0061) |
| Diagnostics | | | | |
| $\hat{w}_0$ | 0.861 | 0.870 | 0.784 | 0.784 |
| $\hat{w}_0^* = \hat{\rho}$ | 0.875 | 0.875 | 0.875 | 0.875 |
| $\hat{\delta}$ | 0.736 | 0.745 | 0.659 | 0.659 |
| $\hat{\delta}^* = 2\hat{\rho} - 1$ | 0.750 | 0.750 | 0.750 | 0.750 |
| Decomposition | | | | |
| $\widehat{ATT}$ | $0.0129^{**}$ | $0.0149^{**}$ | 0.0097 | 0.0089 |
| | (0.0064) | (0.0071) | (0.0078) | (0.0079) |
| $\hat{w}_1$ | 0.139 | 0.130 | 0.216 | 0.216 |
| $\widehat{ATU}$ | $0.0162^{***}$ | $0.0160^{***}$ | $0.0206^{***}$ | $0.0188^{***}$ |
| | (0.0057) | (0.0059) | (0.0063) | (0.0064) |
| $\hat{w}_0$ | 0.861 | 0.870 | 0.784 | 0.784 |
| $\widehat{ATE}$ | $0.0133^{**}$ | $0.0150^{**}$ | 0.0110 | 0.0102 |
| | (0.0063) | (0.0068) | (0.0073) | (0.0074) |
| State fixed effects | ✓ | | | |
| County fixed effects | | | ✓ | ✓ |
| Cohort fixed effects | ✓ | ✓ | ✓ | ✓ |
| State characteristics | | ✓ | ✓ | ✓ |
| County characteristics | | ✓ | | |
| Individual characteristics | | ✓ | ✓ | ✓ |
| $\hat{\rho} = \hat{P}(d=1)$ | 0.875 | 0.875 | 0.875 | 0.875 |
| Observations | 7,860 | 7,859 | 7,859 | 7,857 |


The estimates in the top panel correspond to columns 1 to 4 in panel A of table 4 in Aizer et al. (2016, p. 952). The dependent variable is log age at death, as reported in the MP records (columns 1 to 3) or on the death certificate (column 4). State, county, and individual characteristics are listed in table E2.1 in online appendix E2. Formulas for $w_0$, $w_1$, and $\delta$ are given in theorem 1 and corollary 2. Following these results, $\text{OLS} = \hat{w}_1 \times \widehat{ATT} + \hat{w}_0 \times \widehat{ATU}$. Estimates of ATE, ATT, and ATU are sample analogs of $\tau_{APLE}$, $\tau_{APLE,1}$, and $\tau_{APLE,0}$, respectively. Also, $\widehat{ATE} = \hat{\rho} \times \widehat{ATT} + (1 - \hat{\rho}) \times \widehat{ATU}$. Huber–White standard errors (OLS) and bootstrap standard errors ($\widehat{ATE}$, $\widehat{ATT}$, and $\widehat{ATU}$) are in parentheses. Statistically significant at $*$10%, $**$5%, and $***$1%.
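The two identities in the table note can be checked against the published figures. The Python sketch below takes the rounded values transcribed from table 2 and recomputes the OLS estimate and the ATE estimate for each column. The dictionary layout and the function names are illustrative assumptions, and agreement is only expected up to rounding of the inputs.

```python
# Rounded values transcribed from table 2, columns (1) to (4).
RHO_HAT = 0.875
columns = {
    1: dict(ols=0.0157, att=0.0129, atu=0.0162, w1=0.139, w0=0.861, ate=0.0133),
    2: dict(ols=0.0158, att=0.0149, atu=0.0160, w1=0.130, w0=0.870, ate=0.0150),
    3: dict(ols=0.0182, att=0.0097, atu=0.0206, w1=0.216, w0=0.784, ate=0.0110),
    4: dict(ols=0.0167, att=0.0089, atu=0.0188, w1=0.216, w0=0.784, ate=0.0102),
}


def implied_ols(att, atu, w1, w0):
    """OLS as the weighted average w1 * ATT + w0 * ATU."""
    return w1 * att + w0 * atu


def implied_ate(att, atu, rho):
    """ATE as rho * ATT + (1 - rho) * ATU."""
    return rho * att + (1.0 - rho) * atu


for col, v in columns.items():
    ols = implied_ols(v["att"], v["atu"], v["w1"], v["w0"])
    ate = implied_ate(v["att"], v["atu"], RHO_HAT)
    print(f"({col}) OLS implied {ols:.4f} vs reported {v['ols']:.4f}; "
          f"ATE implied {ate:.4f} vs reported {v['ate']:.4f}")
```

In every column the recomputed values differ from the reported ones by less than 0.0001, which is consistent with the inputs having been rounded to three or four decimals.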

The results in the bottom panel of table 2 suggest that these biases are indeed substantial. In this panel, following corollary 1, each OLS estimate from Aizer et al. (2016) is represented as a weighted average of estimates of two effects, on accepted (ATT) and rejected (ATU) applicants. The estimates of ATU are consistently larger than those of ATT. Thus, OLS overestimates both ATE (since $\hat{\delta} > 0$) and ATT. While the implicit OLS estimates of these parameters remain statistically significant in columns 1 and 2, this is no longer the case in columns 3 and 4, following the inclusion of county fixed effects. Perhaps more importantly, the estimates of ATT in columns 3 and 4 are only about half as large as the corresponding OLS estimates (0.0097 versus 0.0182, and 0.0089 versus 0.0167). Clearly, this difference is economically quite meaningful.
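The direction and size of these biases follow from a short piece of algebra. The derivation below is a sketch that uses only the two weighted-average representations from the note to table 2, together with $w_0 = 1 - w_1$ and $\delta = \rho - w_1$. The numerical check in the comments uses the rounded column 3 values, so it holds only up to rounding.

```latex
\begin{align*}
\tau &= w_1 \tau_{ATT} + w_0 \tau_{ATU}, \qquad w_0 = 1 - w_1, \\
\tau_{ATE} &= \rho \, \tau_{ATT} + (1 - \rho) \, \tau_{ATU}, \\
\tau - \tau_{ATE} &= (\rho - w_1)(\tau_{ATU} - \tau_{ATT})
                   = \delta \, (\tau_{ATU} - \tau_{ATT}), \\
\tau - \tau_{ATT} &= w_0 \, (\tau_{ATU} - \tau_{ATT}).
\end{align*}
% Column 3 of table 2, rounded inputs:
%   0.659 x (0.0206 - 0.0097) = 0.0072, compared with OLS - ATE = 0.0182 - 0.0110 = 0.0072
%   0.784 x (0.0206 - 0.0097) = 0.0085, compared with OLS - ATT = 0.0182 - 0.0097 = 0.0085
```

Because $\hat{\delta}$ and $\hat{w}_0$ are both positive and $\widehat{ATU}$ exceeds $\widehat{ATT}$ in every column, both differences are positive, which is why OLS overstates both parameters in this application.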

To assess the robustness of these findings, I present several extensions of my analysis in online appendix E2. The informal test of assumption 4, as discussed in section IIIA, appears to suggest that the conditional mean of the outcome given $p(X)$ is approximately linear for both the treated and untreated units (see figures E2.5 to E2.8). I also report a number of alternative estimates of the effects of cash transfers in table E2.1. These additional results support my conclusion. Only one in twelve estimates of ATT is statistically different from 0, and four of the insignificant estimates are negative. While it is possible that cash transfers increase longevity, the OLS estimates reported in Aizer et al. (2016) are almost certainly too large. Interestingly, this bias appears to be driven by the implicit OLS weights on ATT and ATU, the focus of this paper.^{16}

## IV. Conclusion

This paper proposes a new interpretation of the OLS estimand for the effect of a binary treatment in the standard linear model with additive effects. According to the main result of this paper, the OLS estimand is a convex combination of two parameters, which under certain conditions are equivalent to the average treatment effects on the treated (ATT) and untreated (ATU). Surprisingly, the weights on these parameters are inversely related to the proportion of observations in each group, which can lead to substantial biases when interpreting the OLS estimand as ATE or ATT.
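The weight reversal can be illustrated with a small simulation. The Python sketch below is not based on the paper's data or code; the data-generating process is an assumption chosen for convenience, with one covariate `x`, a treatment probability `a + b * x` that is linear in `x`, and a unit-level effect of `1 + 2 * x`. The function name `simulate` is hypothetical. The coefficient on the treatment is obtained by the Frisch-Waugh-Lovell steps: residualize `d` on a constant and `x`, then regress `y` on that residual.

```python
import random


def simulate(a, b, n=200_000, seed=0):
    """Simulate one sample with P(d = 1 | x) = a + b * x and effect 1 + 2 * x."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    d = [1.0 if rng.random() < a + b * xi else 0.0 for xi in x]
    effect = [1.0 + 2.0 * xi for xi in x]  # heterogeneous unit-level effect
    y = [xi + di * ti + rng.gauss(0.0, 1.0) for xi, di, ti in zip(x, d, effect)]

    # Step 1: residualize d on a constant and x.
    x_bar, d_bar = sum(x) / n, sum(d) / n
    slope = (sum((xi - x_bar) * (di - d_bar) for xi, di in zip(x, d))
             / sum((xi - x_bar) ** 2 for xi in x))
    d_res = [di - d_bar - slope * (xi - x_bar) for xi, di in zip(x, d)]

    # Step 2: the OLS coefficient on d equals the slope of y on the residual.
    ols = sum(r * yi for r, yi in zip(d_res, y)) / sum(r * r for r in d_res)

    treated = [t for t, di in zip(effect, d) if di == 1.0]
    untreated = [t for t, di in zip(effect, d) if di == 0.0]
    return {
        "rho": d_bar,
        "ols": ols,
        "att": sum(treated) / len(treated),
        "atu": sum(untreated) / len(untreated),
    }


mostly_treated = simulate(a=0.80, b=0.15)    # about 87.5% of units treated
mostly_untreated = simulate(a=0.05, b=0.15)  # about 12.5% of units treated

for label, r in [("mostly treated", mostly_treated),
                 ("mostly untreated", mostly_untreated)]:
    print(f"{label}: rho={r['rho']:.3f} OLS={r['ols']:.3f} "
          f"ATT={r['att']:.3f} ATU={r['atu']:.3f}")
```

Under these assumptions, the OLS coefficient lands close to ATU when most units are treated and close to ATT when most units are untreated, which is the pattern that the main result describes.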

One lesson from this result is that it might be preferable, as suggested by a body of work in econometrics, to use any of the standard estimators of average treatment effects under ignorability, such as regression adjustment, weighting, matching, and various combinations of these approaches. Empirical researchers with a preference for OLS might instead want to use the diagnostic tools that this paper also provides. These diagnostics, which are implemented in the hettreatreg package in R and Stata, are applicable whenever the researcher is studying the effects of a binary treatment, using OLS, and unwilling to maintain that ATT is exactly equal to ATU. In an important special case, these diagnostics require only the knowledge of the proportion of treated units.

## Notes

^{1}

This sample is restricted to studies that Card et al. (2018) coded as “selection on observables” and “regression.”

^{2}

Subsequent to LaLonde (1986), these data were studied by Dehejia and Wahba (1999), Smith and Todd (2005), and many others. Angrist and Pischke (2009) analyze the subsample of the experimental treated units constructed by Dehejia and Wahba (1999), combined with CPS-1 or CPS-3, two of the nonexperimental comparison groups from CPS, constructed by LaLonde (1986). In this replication, I focus on CPS-1.

^{3}

Both assumptions are generally innocuous, although assumption 2 rules out a small number of interesting applications, such as regression adjustments in Bernoulli trials and completely randomized experiments. In these cases, however, OLS is consistent for the average treatment effect under general conditions (Imbens & Rubin, 2015).

^{5}

A formal proof that the relationship between $\rho$ and $w_1$ ($w_0$) is indeed always negative (positive) is provided in online appendix B1. This proof additionally assumes that the conditional mean of $d$ is linear in $X$.

^{7}

Note that the (infeasible) linear projection of the missing outcome, $y_m$, on $d$ and $X$ would solve our problem of weight reversal. The weights on $\tau_{ATT}$ and $\tau_{ATU}$ would still be different from $\rho$ and $1-\rho$ if $\mathrm{V}[p(X) \mid d=1]$ and $\mathrm{V}[p(X) \mid d=0]$ were different; but at least the weight on $\tau_{ATT}$ ($\tau_{ATU}$) would be increasing (decreasing) in $\rho$.

^{9}

To be precise, $|\delta|$ can be interpreted as the percentage of $\mathrm{sgn}(\delta) \times (\tau_{ATU} - \tau_{ATT})$ that contributes to bias when focusing on $\tau_{ATE}$. Both $\delta$ and $w_0$ also have an intuitive interpretation as the difference between the weight that we should place on $\tau_{ATT}$ when focusing on $\tau_{ATE}$ or $\tau_{ATT}$ and the weight that OLS actually places on this parameter. Indeed, $\delta$ is equal to the difference between $\rho$ and $w_1$. Similarly, $w_0 = 1 - w_1$.

^{11}

Also, note that Aronow and Samii (2016) show that this result in Angrist (1998) is not specific to saturated models; instead, it is sufficient to assume that the model for $d$ is linear in $X$. My analysis in online appendix B4 covers the results in both Angrist (1998) and Aronow and Samii (2016).

^{12}

Humphreys (2009) also provides a brief informal remark that the OLS estimand, as represented in Angrist (1998), is similar to $\tau_{ATT}$ ($\tau_{ATU}$) if propensity scores are small (large) in every stratum. This is a special case of the rule of thumb derived from corollaries 3 and 4. My rule of thumb does not impose any such restrictions on the propensity score other than the requirement that the unconditional probability of treatment is close to 0 or 1.

^{13}

In a follow-up paper, I apply these results in the study of racial gaps in test scores and wages (Słoczyński, 2020).

^{15}

In particular, the estimates discussed in section IIA are reported in column 4 of the bottom panel of table E1.1.

^{16}

I also repeat two further exercises from section IIIA. First, after I reestimate the model in equation (1) using WLS, with weights of 1 for treated and $1/k$ for untreated units, I demonstrate in figures E2.1 to E2.4 that these estimates become more positive as $k$ increases. As before, larger values of $k$ translate into larger weights on $\widehat{ATU}$, which is now greater than $\widehat{ATT}$. Second, when I use the estimates of ATT and ATU in table E2.1 to recover the hypothetical OLS weights, I obtain 22.8% as the mean of $\hat{w}_{ATT}$. This is reasonably similar to the proportion of untreated units, 12.5%.

## REFERENCES

*Econometrica*

*Annual Review of Economics*

*American Economic Review*

*Quarterly Journal of Economics*

*Econometrica*

*Mostly Harmless Econometrics: An Empiricist's Companion*

*American Journal of Political Science*

*American Economic Review*

*Journal of the European Economic Association*

*The Analysis of Household Surveys: A Microeconometric Approach to Development Policy*

*Journal of the American Statistical Association*

*Labour Economics*

*Journal of Econometrics*

*Journal of Political Economy*

*Journal of Human Resources*

*Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction*

*Journal of Economic Literature*

*American Economic Review: Papers and Proceedings*

*American Economic Review*

*Biometrika*

*Industrial and Labor Relations Review*

*Journal of Econometrics*

*Journal of Human Resources*

*Quarterly Journal of Economics*

*Econometric Analysis of Cross Section and Panel Data*

## Author notes

This paper is based on portions of my previous working paper (Słoczyński, 2018). I thank the editor and two anonymous referees for their helpful comments. I am very grateful to Alberto Abadie, Max Kasy, Pedro Sant'Anna, and Jeff Wooldridge for many comments and discussions. I also thank Arun Advani, Isaiah Andrews, Josh Angrist, Orley Ashenfelter, Richard Blundell, Stéphane Bonhomme, Carol Caetano, Marco Caliendo, Matias Cattaneo, Gary Chamberlain, Todd Elder, Alfonso Flores-Lagunes, Brigham Frandsen, Josh Goodman, Florian Gunsilius, Andreas Hagemann, James Heckman, Kei Hirano, Peter Hull, Macartan Humphreys, Guido Imbens, Krzysztof Karbownik, Shakeeb Khan, Toru Kitagawa, Pat Kline, Paweł Królikowski, Nicholas Longford, James MacKinnon, Łukasz Marć, Doug Miller, Michał Myck, Mateusz Myśliwski, Gary Solon, Jann Spiess, Michela Tincani, Alex Torgovitsky, Joanna Tyrowicz, Takuya Ura, and Rudolf Winter-Ebmer; seminar participants at BC, Brandeis, Harvard-MIT, Holy Cross, IHS Vienna, Lehigh, MSU, Potsdam, SDU Odense, SGH, Temple, UCL, Upjohn, and WZB Berlin; and many conference participants for useful feedback. I thank Mark McAvoy for his excellent assistance in developing the R package hettreatreg that implements the results in this paper. I also thank David Card, Jochen Kluve, and Andrea Weber for providing me with supplementary data on the articles surveyed in Card, Kluve, and Weber (2018). I acknowledge financial support from the National Science Centre (grant DEC-2012/05/N/HS4/00395), the Foundation for Polish Science (a “Start” scholarship), the “Weź stypendium—dla rozwoju” scholarship program, and the Theodore and Jane Norman Fund.

A supplemental appendix is available online at https://doi.org/10.1162/rest_a_00953.