## Abstract

Models with multiway fixed effects are frequently used to address selection on unobservables. The data used for estimating these models often contain few observations per value of either indexing variable (sparsely matched data). I show that this sparsity has important implications for inference and propose an asymptotically valid inference method based on subsetting. Sparsity also has important implications for point estimation when covariates or instrumental variables are sequentially exogenous (e.g., dynamic models), and I propose a new estimator for these models. Finally, I illustrate these methods by providing estimates of the effect of class size reductions on student achievement.

## I. Introduction

MODELS with several sources of unobserved heterogeneity are frequently used in empirical work as an attempt to control for potentially confounding factors or because features of the distribution of unobserved heterogeneity are of interest. In the analysis of education data, student and teacher or school fixed effects are routinely used as a way to control for selection on unobservables (Hanushek et al., 2003; Rivkin, Hanushek, & Kain, 2005; Sund, 2009; Clotfelter, Ladd, & Vigdor, 2010; Kingdon & Teal, 2010; Rothstein, 2010). In the analysis of employer-employee matched data, models with worker and firm unobserved heterogeneity are often used to study features of the distribution of worker and firm productivity (Abowd et al., 1999; Postel-Vinay & Robin, 2002; Card, Heining, & Kline, 2013; Bonhomme, Lamadon, & Manresa, 2019).^{1}

Many of these empirical applications use data with relatively few observations per value of the variables indexing unobserved heterogeneity, a structure that I call sparsely matched data.^{2} With education data, for instance, students are usually followed over only a few grades, and class size is limited by state laws or common practices, so that the number of observations per student and per teacher is generally small. As a result, neither the unobserved ability of a particular student nor the teaching effectiveness of a particular teacher can be estimated precisely. Similarly, considering an employer-employee matched data structure such that neither worker nor firm productivity can be estimated precisely is also likely to be relevant given the small number of workers per firm commonly encountered, the limited movement of workers between firms, and the likely presence of cross-sectional and serial dependence in productivity
shocks.^{3}

Here I study estimation and inference for the effect of finitely many observed covariates on an observed outcome of interest using a model with multiway fixed effects and sparsely matched data. While the estimation of features of the distribution of unobserved heterogeneity is of interest in many empirical studies, I leave the analysis of this question within the framework of this paper to ongoing work and concentrate here on the effect of covariates, for several reasons. These parameters are of primary interest in many empirical applications; they need to be estimated before estimating features of the distribution of unobserved heterogeneity; and estimating features of the distribution of unobserved heterogeneity necessitates stronger assumptions than those used in this paper.^{4}

With strictly exogenous covariates, a commonly used estimator relies on a two-way fixed-effects regression, which in the education example above would take the form of a regression on a set of indicator variables for each student and each teacher.^{5} With sparsely matched data, the estimated fixed effects from such a regression are contaminated with estimation noise that cannot be ignored. This noise will generally lead to robust variance estimation methods being inconsistent because they rely on residuals that do not approximate the error term of the model well. In addition, unlike one-way fixed-effects regression, two-way fixed-effects regression generates complex patterns of dependence in the transformed data, so that one-way or multiway cluster robust inference will also be invalid in general.

I propose basing inference on a different estimation method that relies on many small and potentially overlapping subsets of data. This estimation procedure applies a two-way fixed-effects regression transformation within each subset, and the object of interest is then estimated by a pooled regression across all subsets. The resulting dependence structure of the transformed data is sparse; the transformed data can be shown to be M-dependent, so that this estimator can be combined with M-dependence robust standard errors to conduct asymptotically valid inference.

With dynamic models (i.e., with sequentially exogenous instrumental variables), two-way fixed-effects regression using sparsely matched data would lead to inconsistent estimators of the object of interest because of an incidental parameter bias similar to the bias found in the context of dynamic models with one-way fixed effects by Nickell (1981). Here I propose a new estimation method for dynamic models with two-way fixed effects that is consistent, locally efficient, and simple to compute given existing tools. Asymptotically valid inference can also be conducted by relying on subsetting, as in the case of strictly exogenous covariates.

## II. Inference with Strictly Exogenous Covariates

where $yit$ is an observed outcome variable and $xit$ is an observed covariate.^{6}$ci$ is an unobserved effect that can vary in the cross-section but is assumed to be constant over time.^{7}$edit$ is an additional unobserved effect indexed by an observed indexing variable $dit\u2208Dn$, $Dn={1,\u2026,Nn}$. $edit$ can vary across values of $dit$, but it is assumed to be constant across all observations that share the same value of $dit$. $uit$ represents all other unobserved factors. $X$ contains all values of $xit$, $X={xit}(i,t)\u2208N\xd7T$, and $D$ contains all values of $dit$, $D={dit}(i,t)\u2208N\xd7T$.

The object of interest throughout the paper is $\beta 0$, the coefficient on the observed covariate $xit$. In the running empirical example of this paper, $yit$ would be student achievement as measured by test score, $xit$ would be student $i$'s class size in year $t$, $ci$ would be a student's unobserved learning ability, $dit$ would be student $i$'s teacher in year $t$, $ed$ would be the unobserved teaching effectiveness of teacher $d$, and $uit$ are all other unobserved shocks to learning. The advantage of the model given by equation (1) is that it does not restrict the relationship between class size and unobserved student and teacher quality, so that, for instance, the model accommodates classroom formation processes where more able students and teachers are systematically assigned to larger classrooms. On the other hand, equation (1) requires that transitory shocks $uit$ are mean independent of past, present, and future values of class size and teacher assignment.^{8}

In the empirical example of this paper, students are observed for only two time periods, and on average, teachers teach only 40 students over all time periods of observation, which is small compared to the 200,000 students observed in the data. This corresponds to a data structure frequently encountered, which I call sparsely matched, where the number of observations corresponding to each value of $i$ and of $dit$ is small or, conversely, the total number of values taken by $i$ and $dit$ is large. Define $Nn,d=|{(i,t)\u2208N\xd7T:dit=d}|$ to be the number of observations assigned to the value $d$ of $dit$, where $|.|$ denotes the cardinality of a finite set. I capture a sparsely matched data structure by using an asymptotic framework where $n$ grows to infinity while $T$ remains fixed and the number of observations for each $dit$ remains uniformly bounded: $Nn,d\u2264C$$\u2200d\u2208Dn$$\u2200n$ for a constant $C$. As a result, the total number of values that $i$ and $dit$ can take, $n$ and $Nn$, respectively, are proportional to sample size $nT$.^{9}

### A. Variance Estimation for Two-Way Fixed Effects Regression

where $x\u02d9n=Mgnxn$ and $x\u02d9it$ the element of $x\u02d9n$ corresponding to the observation $(i,t)$.

However, empirical researchers frequently wish to rely on inference methods that are robust to patterns of heteroskedasticity and correlation. With sparsely matched data, robust variance estimation methods applied to a two-way fixed-effects regression will generally not be consistent, leading to a nonstandard inference problem that originates from the dimension of the control covariates contained in $gn,1$ and $gn,2$, $n$ and $Nn$, respectively, being proportional to sample size $nT$.

where $pit,it$ is the diagonal element of $Pgn$ corresponding to observation $(i,t)$. While the second right-hand-side term above will generally converge to 0, two-way fixed-effects regression with sparsely matched data leads to the first right-hand-side term, $\u2211i,tx\u02d9it2pit,it\u2211i,tx\u02d9it2$, being nonignorable, which will lead to an asymptotic bias in the heteroskedasticity-consistent standard errors.

The term $pit,it$ can be interpreted as the contribution of observation $(i,t)$ in predicting the term $ci+edit$ based on a regression on the covariates $g1,n$, $g2,n$. Consistency of the heteroskedasticity-consistent estimated variance would generally require that this contribution becomes negligible as sample size increases. Heuristically, with sparsely matched data, the amount of information for predicting the term $ci+edit$ from observations other than $(i,t)$ is bounded in a setting where student $i$ is observed for only a few years ($T$) and teacher $dit$ is observed teaching few students ($Nn,dit$), so that the contribution of observation $(i,t)$ in predicting its own term of heterogeneity $ci+edit$ does not vanish.^{10}

Cattaneo, Jansson, and Newey (2017) have studied a similar problem of heteroskedasticity-robust inference with high-dimensional control covariates in a setting that is not specific to two-way fixed-effects regression. They propose an estimated variance, which they show can be consistent if the term $pit,it$ is not too large uniformly across all observations $(i,t)\u2208N\xd7T$. Next, I propose a new representation for two-way fixed-effects regressions based on random walks on graphs, which can be used to derive easily interpretable conditions for this condition on $pit,it$ to hold.

#### Variance estimation under conditional uncorrelatedness.

^{11}

where $\u2243$ denotes a convergence in probability at a rate fast enough to eventually obtain consistency of the estimated variance, that is, to obtain $Var^(\beta ^n|X)Var(\beta ^n|X)-1=op(1)$.

Therefore, under the regularity conditions of Cattaneo et al. (2017), a consistent estimator for the variance of the two-way fixed-effects regression estimator can be obtained under the assumption that the original error terms $uit$ are uncorrelated. One important regularity condition used to establish consistency is that the maximum absolute sum of the weights be uniformly bounded, that is, $maxi\u2208N\xd7T\u2211j\u2208N\xd7T|\kappa i,j|\u2264C$$\u2200n$ for some constant $C$. In turn, they show that a sufficient condition for this property of the weights to hold is that the minimum diagonal element of $Mgn$ be uniformly bounded above $12$, or equivalently that the maximum diagonal element of $Pgn$ be uniformly bounded below $12$. A contribution of this paper is to propose a new representation for $Mgn$ based on random walks on graphs and to show that this representation can be used to provide easily interpretable conditions for this uniform boundedness condition to hold.

Proposition ^{1} shows that the minimum diagonal element of $Mgn$ is uniformly bounded above $12$ if more than three time periods are observed and if for all pairs of values $d,d'\u2208Dn$, the number of cross-sectional observations assigned to both $d$ and $d'$ is either 0 or large enough. In the running example of this paper, if we impose the simplification that each student is taught by a particular teacher at most once, this result shows that the main regularity condition for the standard errors of Cattaneo et al. (2017) to be valid holds if $T$ is three or larger and if, whenever two teachers have taught a common set of students, they have taught at least five students in common when $T=3$, at least four students in common when $T=4$, and at least three students in common when $T\u22655$. In order to state proposition ^{1}, define $Nd(i)$ to be the number of times observation $i$ was assigned to value $d$ of the indexing variable $dit$: $Nd(i)=|{t\u2208T:dit=d}|$.

The proofs for all the results in this paper are in the online appendix. The appendix also contains additional results; in particular, it includes results for inference when $T=2$ and under other relaxations.^{12}

In many applications, it is likely that empirical researchers believe that there exists serial or cross-sectional correlation in $uit$. For instance, even conditional on a student's ability, there could be shocks to a student's learning that are persistent over time, such as making new friends or changes in the environment at home. Even conditional on a teacher's effectiveness, there could be correlated shocks among students in the same classroom, such as the introduction of a new technology, after-school homework sessions, or friendship networks nested within classrooms that lead to correlated shocks.^{13} Therefore, one might wish to use inference methods that are not only robust to heteroskedasticity but also to some forms of serial and cross-sectional correlation. The appendix shows that Cattaneo et al.'s (2017) method can be extended to the case of multiway cluster dependence but does not include results on the asymptotic properties of this extension.

### B. Robust Inference and Subsetting

Cattaneo et al.'s (2017) method relies on finding a sequence of weights such that $\u2211i\u2208N\xd7Tx\u02d9i2\u2211j\u2208N\xd7T\kappa iju\u02d9j2\u2243\u2211i\u2208N\xd7Tx\u02d9i2\u2211j\u2208N\xd7T\kappa ijE(u\u02d9j2|X)=\u2211i\u2208N\xd7Tx\u02d9i2E(ui2|X)$. Its extension to correlated errors defined in the appendix relies on a similar argument but using cross-products of the transformed error term $u\u02d9i$ in addition to squared terms. The sequence of weights ${\kappa ij}i,j\u2208N\xd7T$ is key in determining the asymptotic properties of the resulting estimated variances. Proposition ^{1} lists conditions for consistency with uncorrelated errors and two-way fixed-effects regression, but these conditions are unknown for models with correlated errors or more than two-way fixed effects. In addition, the Monte Carlo simulation results in this paper find a relatively large finite sample bias when implementing the extension defined in the appendix of Cattaneo et al.'s (2017) method to correlated errors, pointing to the sequence of weights used in that case not being sparse enough for good properties in small samples.^{14}

An alternative approach to inference with high-dimensional control covariates is to use estimated variances that are robust to the dependence that exists in the transformed error term $u\u02d9i$. The structure of this dependence is determined by the dependence in the original error term $ui$ and the form of the transformation used.

For instance, with one-way fixed-effects regression and uncorrelated errors, the transformed error term $u\u02d9i$ would be serially correlated since it can be written as a time-demeaned transformed variable, $u\u02d9i=ui-1T\u2211j\u2208I(i)uj$, where $I(i)$ collects the observations across all time periods for the same cross-sectional unit as $i$; $I(i)={(i,s):s\u2208T}$ where $i=(i,t)$. In this case, since the transformed error term is cross-sectionally uncorrelated, $E(u\u02d9iu\u02d9j|X)=0$ if $j\u2209I(i)$, cluster robust variance estimation will generally be consistent: $\u2211i\u2208N\xd7T\u2211j\u2208I(i)x\u02d9iu\u02d9iu\u02d9jx\u02d9j\u2243\u2211i,j\u2208N\xd7Tx\u02d9iE(u\u02d9iu\u02d9j|X)x\u02d9j=\u2211i\u2208N\xd7Tx\u02d9i2E(ui2|X)$ (see Wooldridge, 2010).^{15}

This approach cannot be extended to the case of two-way fixed-effects regression because in general, the transformation matrix $Mgn$ can be fully dense, so that the transformed error term $u\u02d9i$ for observation $i$ can generally be correlated with all other observations of the transformed error term even if the original error term $ui$ is uncorrelated. This can be understood using a graph formed by teachers as nodes and students as edges, formally a graph $Gn={Dn,En}$, where the set of nodes is given by the set of teachers $Dn$ and an edge exists between two teachers if they have taught at least one student in common, $(d,d')\u2208En$ if $\u2211i\u2208NNd(i)Nd'(i)>0$, where $Nd(i)$ is the number of times student $i$ was taught by teacher $d$. The graph corresponding to the assignment of students and teachers used in the Monte Carlo simulations is represented in figure 1, and the graph corresponding to the assignment used in the empirical application is represented in the appendix.

for $i,j,t,s,t',s'$ such that $dit,djt'=d$, $dis,djs'=d'$.

where $dit,djt'=d$, $dis,dks''=d'$, $djs',dkt''=d1$.

For any path of any length in $Gn$ from $dit$ to $dis$, one can form a valid moment condition for $\beta 0$ in this way by comparing the differenced term $yit-yis-\beta 0(xit-xis)$ to the sum of the differences along that path in $Gn$ from $dit$ to $dis$. The appendix shows that the two-way fixed-effects regression is a transformation that obtains a linear combination of these comparisons, so that the transformed error term $u\u02d9it$ is a weighted sum of the comparisons between ${uit-uis:s\u2208T}$ and the sums of differences in the original error term ${ujs}(j,s)\u2208N\xd7T$ along paths in $Gn$ from $dit$ to ${dis:s\u2208T}$.^{16} It is therefore possible that each observation of the original error term ${ujs}(j,s)\u2208N\xd7T$ enters the definition of the transformed error term $u\u02d9it$, possibly leading to a non-0 correlation between all observations of the transformed error term.

In this section, I propose basing inference for $\beta 0$ on a different estimator which only makes use of moment conditions for estimating $\beta 0$ obtained by comparisons that are local in $Gn$, that is, moment conditions of the same form as equation (16), ignoring additional moment conditions that are obtained by comparisons on paths of length greater than 1 such as equation (17). The shortcoming of this approach is that it purposely does not make use of all the implications of the model given by equation (1) for estimating $\beta 0$. This will generally lead to a loss in efficiency. The advantage of the approach proposed here is that by using local comparisons only, it relies on transformed data with a tractable dependence structure, so that an analog of cluster robust standard errors with one-way fixed-effects models can be defined in this context.

In order to introduce this estimator, consider the subset of data obtained by collecting the observations assigned to teachers $d$ or $d'$ and corresponding to students who have been taught by both $d$ and $d'$, that is, the subset of data defined by $S(d,d')={(i,t)\u2208N\xd7T:dit\u2208{d,d'},Nd(i)Nd'(i)>0}$, where, recall, that $Nd(i)$ is the number of times student $i$ was taught by teacher $d$. To shorten notation, write $e=(d,d')$, and define the data corresponding to this subset: $yn(e)=[yit](i,t)\u2208S(e)$, $xn(e)=[xit](i,t)\u2208S(e)$, $gn(e)=[gn,it](i,t)\u2208S(e)$, $un(e)=[uit](i,t)\u2208S(e)$, where $gn,it=[[1[i=j]]j\u2208N,[1[dit=d]]d\u2208Dn]$ is the row of $gn$ corresponding to observation $(i,t)$.

The moment condition (18) holds for any pair of teachers $d$, $d'$ who have taught at least one student in common, so that the subset $S(d,d')$ is nonempty; that is, equation (18) holds $\u2200e\u2208En$. With sparsely matched data, the number of such subsets will be proportional to sample size ($nT)$ since the number of teachers ($Nn$) is proportional to sample size, and each teacher teaches finitely many students ($Nn,d\u2264C$$\u2200d,n$), and each student has been taught by finitely many teachers ($T$ is fixed), so that there can only be finitely many edges connected to each teacher. In addition, since each teacher teaches finitely many students, the number of observations inside each subset $S(e)$, $e\u2208En$ will be finite.

^{17}

^{18}We can then write

where $u^n(e)=y\u02d9n(e)-\beta \u02dcnx\u02d9n(e)$ are the residuals from the pooled regression.

In the remainder of this section, I establish the consistency of M-dependence robust variance estimation in the more general case where some correlation exists in the original error term $uit$ of the model. The first assumption restricts the extent of this correlation.

This assumption allows for arbitrary correlation in $uit$ across observations that correspond to the same student or to the same teacher, but it imposes independence otherwise. I discuss at the end of this section possible generalizations of this assumption.

The next assumption imposes restrictions on the moments of the data.

The support of $xit$ and $uit$ is compact.

$1n\u2211e\u2208Enx\u02d9n(e)'x\u02d9n(e)\u2265c$$\u2200n\u2265C$ for constants $c>0$ and $C$ a.s.

$1nVar(\u2211e\u2208Enx\u02d9n(e)'u\u02d9n(e)|X)\u2265c$$\u2200n\u2265C$ for constants $c>0$ and $C$ a.s.

Assumption 2a is standard and could easily be replaced by higher moments of $xit$ and $uit$ being uniformly bounded. Assumption 2b imposes that the amount of identifying variation in the transformed covariate $x\u02d9n(e)$ is proportional to sample size. A sufficient condition for assumption 2b to hold is that there is some identifying variation in the transformed covariates $x\u02d9n(e)$ in a nonvanishing fraction of subsets, that is, $x\u02d9n(e)'x\u02d9n(e)\u2265c>0$ for $e\u2208En+$ where $|En+||En|\u2265c>0$$\u2200n$. This in turn would hold, for instance, if for a nonvanishing fraction of edges $(d,d')\u2208En$, we can find two students who have been taught by both $d$ and $d'$ but experienced a different change in their covariate $x$ between being taught by $d$ and $d'$.^{19}

Assumption 2c is a standard regularity condition for the application of central limit theorems and requires that the variance of the numerator of the pooled regression estimator be of order proportional to sample size. The appendix shows that this would hold if assumptions 2a and 2b hold and $\lambda min(Var(un|X))\u2265c$ for a constant $c>0$ where $\lambda min$ is the smallest eigenvalue of a matrix. This latter condition holds if, for instance, $uit$ can be written $uit=eit+\epsilon it$, where $Cov(eit,\epsilon js|X)=0$$\u2200(i,t),(j,s)$, $Cov(\epsilon it,\epsilon js|X)=0$ if $(i,t)\u2260(j,s)$, and $Var(\epsilon it|X)\u2265c$$\u2200(i,t)$. This decomposition of $uit$ uses the term $eit$ to capture all forms of dependence, while the term $\epsilon it$ captures the idiosyncratic variation in $uit$. Informally, we can say that assumption 2c holds if assumptions 2a and 2b hold and if each observation of $uit$ is potentially correlated with but not linearly dependent on other observations of $uit$.

Proposition ^{4} shows that the pooled regression estimator is asymptotically normal under assumptions ^{2} and ^{3} in a sparsely matched data asymptotic framework.

^{2}and

^{3}, as $n\u2192\u221e$ while $T$ remains fixed and $Nn,d\u2264C$$\u2200d\u2208Dn$, $\u2200n$, for a constant $C$:

^{2}allows for serial and cross-sectional correlation in the error term $uit$. As a result, the transformed error term $u\u02d9n(e)$ for the subset of data indexed by $e\u2208En$ will potentially be correlated with the transformed error term of all other subsets that share either students or teachers in common with subset $S(e)$. Formally, we redefine the dependence neighborhood of $u\u02d9n(e)$ to be

The number of subsets ($|En|$) is proportional to sample size, and as in the simpler case of uncorrelated errors discussed above, the cardinality of the new dependence neighborhoods ($|D(e)|$) is uniformly bounded, which leads to consistency of this variance estimator as shown by the next proposition:

^{2}and

^{3}, as $n\u2192\u221e$ while $T$ remains fixed and $Nn,d\u2264C$$\u2200d\u2208Dn$, $\u2200n$, for a constant $C$:

Propositions ^{4} and ^{5} show that researchers can rely on Wald tests using the pooled regression estimator ($\beta \u02dcn$) and M-dependence robust estimated variance ($v^n2$) to perform asymptotically valid inference for models with two-way fixed effects using sparsely matched data.

#### Extensions.

In some empirical applications, it might be desirable to modify assumption ^{2} in order to allow for more general patterns of dependence. For instance in the empirical example of this paper, I only impose independence for different students in different schools instead of independence for different students in different classrooms.

In addition, in some empirical applications, it is possible that subsetting at the edge level would either be needlessly computationally cumbersome or would leave out comparisons in $xit$ that one knows to account for a significant share of the identifying variation for $\beta 0$.

The method above can be extended to account for different patterns of dependence and different definitions of subsets as long as the M-dependence robust variance estimator relies on many subsets with relatively small dependence neighborhoods. In practice, an empirical researcher would calculate the number of subsets and the cardinality of each dependence neighborhoods to verify that the M-dependence robust estimated variance relies on many subsets with relatively few cross-products per subset.

In addition, note that it is straightforward to show that the two-way fixed-effects regression estimator discussed in section IIA is also $n$-consistent under assumptions ^{2} and ^{3} above and a sparsely matched data asymptotic framework. Since this estimator is frequently used in empirical work and is likely to be more precise than the pooled regression estimator, one could report the results from two-way fixed-effects regression as a preferred-point estimate and use the method discussed in this section based on the pooled regression estimator for inference only.

Finally, the method described here is easily extended to models where neither variable indexing unobserved heterogeneity corresponds to the cross-sectional index. For instance, data where observations are indexed by student and year could be used with a model with teacher and school fixed effects. In this case, one could, for instance, define subsets by collecting all observations corresponding to pairs of connected schools. With dependence indexed by students and teachers, as is considered in this section, the resulting dependence neighborhoods for each subset would still collect all subsets with either a student or a teacher in common with the subset under consideration. More generally, the method discussed can be extended to any setting with high-dimensional but sparse control covariates, that is, $yn=\beta 0xn+wn\eta n,0+un$, $E(un|xn,wn)=0$, where the number of variables in $wn$ is proportional to sample size but for each observation $wn,i$, only finitely many variables are non-0, which includes models with three or more fixed effects. An example is provided in section V.

#### Small-Sample Corrections.

While leading to asymptotically valid inference, robust standard errors are well known to exhibit biases in small samples and to lead to Wald tests with size distortions in small samples (see MacKinnon & White, 1985; Imbens & Kolesár, 2015; and Cameron & Miller, 2015, for a review in the context of uncorrelated and cluster dependent errors).

In the appendix, I show that the small sample corrections discussed in Bell and McCaffrey (2002) and Imbens and Kolesár (2015) for uncorrelated and cluster dependent errors (local debiasing and using critical values from the $t$-distribution instead of the standard normal distribution) can be extended to obtain small sample adjustments for the test relying on the pooled regression estimator and M-dependence robust variance estimation discussed above.

## III. Estimation with Dynamic Models

^{20}This would lead to a dynamic model of student achievement such as

where $Xt={xis}i\u2208N,s\u2264t$ and $xit$ is a scalar covariate. All results extend to finite-dimensional and endogenous covariates with sequentially exogenous instrumental variables in a straightforward way.

Estimation of $\beta 0$ in this model by two-way fixed-effects regression or by the pooled regression estimator discussed in the previous section would generally lead to inconsistent estimators of $\beta 0$ with sparsely matched data because the transformed error term $u\u02d9it$, where $u\u02d9n=Mgnun$ and $u\u02d9it$ is the element of $u\u02d9n$ corresponding to observation $(i,t)$, will be correlated with the covariate $xit$, and this correlation is not asymptotically ignorable with sparsely matched data. This is a similar source of bias as for dynamic models with one-way fixed effects discussed in Nickell (1981).

With one-way fixed effects, the solution discussed in Anderson and Hsiao (1981), Arellano and Bond (1991), Chamberlain (1992), and subsequent papers is to rely on recursive transformations that guarantee that the transformed error term in each time period $t$ is orthogonal to the covariate $xit$, leading to estimation by method of moments.

^{21}This estimator is given by an instrumental variable regression estimator,

Using data only from time period $t$ to $T$, calculate the residuals from a two-way fixed effects regression of $y$ and $x$ on the set of indicator variables for both indexing variables. Keep these residuals for time period $t$ only, and denote these two transformed variables by $y\u02d8it$ and $x\u02d8it$.

Using a parsimonious model, calculate an estimated version of $E(xis|Xt)$ for $s\u2265t$; denote this new variable $zis\u2605$, $s\u2265t$.

^{22}Calculate the residuals at time period $t$ from a two-way fixed-effects regression of $z\u2605$ on the set of indicator variables for both indexing variables using only data from time period $t+1$ to $T$. Denote this transformed variable by $z\u02c7it$.

The appendix discusses this estimator in more detail, and in particular provides an example for a parsimonious model of $E(xis|Xt)$, which is used in the Monte Carlo simulations and the empirical application below. Note that in practice, the model for $E(xis|Xt)$ does not need to be correctly specified, so the objective of modeling this “first stage” is to obtain exactly identifying instrumental variables, which are as predictive as possible while relying on few nuisance parameters in order to avoid incidental parameter issues known to arise in methods that rely on many moment conditions (see Alvarez & Arellano, 2003, or Windmeijer, 2005, for a discussion in the context of dynamic models with one-way fixed effects).

The appendix also shows that the estimator proposed here is locally efficient; it is asymptotically efficient in the class of estimators consistent under equation (27) when the error term $uit$ is homoskedastic and uncorrelated and when the exactly identifying instrumental variables used ($z\u02c7it$) are constructed using the true conditional mean functions $E(xis|Xt)$, a result analogous to the local efficiency, obtained by the Gauss-Markov theorem, of the two-way fixed-effects regression for estimating models with strictly exogenous covariates.

Intuitively, this result of local efficiency follows from the particular transformations applied in steps 1 and 2 above to the covariates, outcome variable, and instrumental variables. The transformation in step 1 guarantees that the transformed moment functions $(y\u02d8it-\beta 0x\u02d8it)$ are serially uncorrelated under the auxiliary assumption that the error term $uit$ is homoskedastic and uncorrelated, so that pooling over time yields an efficient estimator. The transformation in step 2 guarantees that the exactly identifying instrumental variables $z\u02c7it$ are optimal instruments for the transformed moment functions $(y\u02d8it-\beta 0x\u02d8it)$ under the auxiliary assumption, so that pooling in the cross-section yields an efficient estimator.

Note also that the estimator for dynamic models proposed here can be computed by combining results from a sequence of two-way fixed-effects regressions so that existing tools are sufficient for computation.

## IV. Monte Carlo Simulation Results

In this section I present two groups of results from a small Monte Carlo simulation study of estimation and inference for linear models of panel data with two-way unobserved heterogeneity and sparsely matched data.

The assignment ${dit}{i,t}\u2208N\xd7T$ used for the simulations of this section was obtained from a subset of the administrative data on public school students in North Carolina, which are discussed in section V. The variable $dit$ is taken to be school assignment, of which there are 205 possible values in these simulations ($Nn=205$). Observations are indexed by a student identifier denoted by $i$, which takes 508 values, and a year indicator denoted by $t$, which takes 4 values ($n=508$ and $T=4$). The form of the resulting assignment is depicted in figure 1 by a representation of the graph where schools are nodes and edges are students (the graph $Gn$ defined in section IIB).

### A. Inference

where $oit$ is an ordering of all of the observations assigned to a value $d$ of $dit$, which was set at random.

The data-generating process used here leads to an error term $uit$, which exhibits heteroskedasticity and correlation across observations that correspond to either the same student or the same school.

In this section, I evaluate the finite sample performance of the method defined in section IIB, which relies on the pooled regression estimator, M-dependence robust standard errors, and a Wald test (M-robust). With the assignment used here and depicted in figure 1, the pooled regression estimator and M-dependence robust standard errors are based on 254 subsets with dependence neighborhoods that have an average cardinality of 6.15. I also implement two small sample corrections discussed in the appendix: locally debiased standard errors (debiased) and locally debiased standard errors combined with critical values from the $t$-distribution rather than the standard normal distribution ($t$-distribution).

I compare the performance of the methods above with Wald tests using the two-way fixed-effects regression estimator and classical standard errors defined at the beginning of section II, which are not robust to heteroskedasticity or serial or cross-sectional dependence (homoskedasticity), one-way clustered standard errors along the indexing variable $i$ (cluster 1) or $dit$ (cluster 2); two-way clustered standard errors along both indexing variables $i$ and $dit$ as in Cameron et al. (2011) (CGM); the standard errors of Cattaneo et al. (2017), which are not robust to the presence of correlation in the original error term $uit$ (CJN); and the extension of CJN to two-way cluster dependence in the original error term $uit$ defined in the appendix (CJN correlated).

I also show the performance of methods that rely on grouped data combined with a $t$-test as in Ibragimov and Müller (2010; hereafter IM) or a randomization test as in Canay, Romano, and Shaikh (2017, hereafter CRS). Seven or ten groups were defined for these methods by using the SCPS clustering algorithm (Nepusz, Sasidharan, & Paccanaro, 2010) applied to the graph $Gn$ defined in section IIB. This algorithm assigned each school to one of seven or ten groups using a $k$-means clustering algorithm based on the eigenvectors of the symmetric normalized Laplacian of $Gn$. The assignment of school nodes to each group is shown in figure 1 for the case where seven groups were used, from which it is evident that there is indeed a limited amount of overlap across these groups.^{23}

Ninety-five percent confidence intervals were obtained by inverting the tests defined above. Table 1 displays, where relevant, the relative bias of each estimated variance and standard error compared to the variance and standard deviation of its corresponding estimator (either the pooled regression estimator or the two-way fixed effects regression estimator). It also displays coverage probability and the average and median of the width of 95% confidence intervals. Ten thousand replications were used to obtain these results.^{24} The appendix also contains a power plot for the methods based on subsetting proposed in this paper and the methods based on grouped data.

. | $E(Var^(\beta ^))Var(\beta ^)$ . | $E(se(\beta ^))sd(\beta ^)$ . | $P(coverage)$ . | $E(width)$ . | $Q.5(width)$ . |
---|---|---|---|---|---|

M-robust | 0.953 | 0.924 | 0.925 | 0.850 | 0.790 |

Debiased | 1.023 | 0.957 | 0.934 | 0.880 | 0.817 |

Debiased and critical values from the $t$-distribution | 1.023 | 0.957 | 0.943 | 0.918 | 0.853 |

Homoskedasticity | 0.504 | 0.691 | 0.826 | 0.629 | 0.597 |

Cluster1 | 0.435 | 0.640 | 0.795 | 0.583 | 0.553 |

Cluster2 | 0.692 | 0.791 | 0.876 | 0.720 | 0.669 |

CGM | 0.756 | 0.829 | 0.894 | 0.755 | 0.704 |

CJN | 0.473 | 0.667 | 0.816 | 0.607 | 0.576 |

CJN (correlated) | 0.774 | 0.844 | 0.902 | 0.768 | 0.719 |

IM7 | 1.000 | 10.474 | 3.077 | ||

IM10 | 1.000 | 101.296 | 3.564 | ||

CRS7 | 0.954 | 3.947 | 1.290 | ||

CRS10 | 0.936 | 24.136 | 1.135 |

. | $E(Var^(\beta ^))Var(\beta ^)$ . | $E(se(\beta ^))sd(\beta ^)$ . | $P(coverage)$ . | $E(width)$ . | $Q.5(width)$ . |
---|---|---|---|---|---|

M-robust | 0.953 | 0.924 | 0.925 | 0.850 | 0.790 |

Debiased | 1.023 | 0.957 | 0.934 | 0.880 | 0.817 |

Debiased and critical values from the $t$-distribution | 1.023 | 0.957 | 0.943 | 0.918 | 0.853 |

Homoskedasticity | 0.504 | 0.691 | 0.826 | 0.629 | 0.597 |

Cluster1 | 0.435 | 0.640 | 0.795 | 0.583 | 0.553 |

Cluster2 | 0.692 | 0.791 | 0.876 | 0.720 | 0.669 |

CGM | 0.756 | 0.829 | 0.894 | 0.755 | 0.704 |

CJN | 0.473 | 0.667 | 0.816 | 0.607 | 0.576 |

CJN (correlated) | 0.774 | 0.844 | 0.902 | 0.768 | 0.719 |

IM7 | 1.000 | 10.474 | 3.077 | ||

IM10 | 1.000 | 101.296 | 3.564 | ||

CRS7 | 0.954 | 3.947 | 1.290 | ||

CRS10 | 0.936 | 24.136 | 1.135 |

$\beta 0$ is estimated by the pooled regression estimator for the first three methods (bias = .002, standard deviation = .235). $\beta 0$ is estimated by the two-way fixed-effects regression estimator for the following six methods (bias $=-$.001, standard deviation = .232). The first three methods rely on subsetting M-dependence robust standard errors, local debiasing, critical values from a $t$-distribution. The average number of degrees of freedom used for the $t$-distribution is $29.924$. The next four methods correspond to classical standard errors, one-way clustering along either indexing variable, two-way clustering. CJN is the method of Cattaneo et al. (2017), while CJN correlated is its extension to correlated errors defined in this paper. The last three columns are for 95% confidence intervals, which were obtained by inverting Wald or $t$-tests for the first nine methods. Confidence intervals for IM and CRS rely on grouped data and invert either a $t$-test or a randomization test, respectively. IM7 and CRS7 rely on seven groups, IM10 and CRS10 on ten groups.

We see that the methods based on subsetting proposed in this paper yield standard errors with a relatively low bias and tests with moderate size distortions, particularly when small sample adjustments are applied. Classical standard errors, clustered standard errors, or the standard errors of Cattaneo et al. (2017) exhibit a large bias. The extension of the approach of Cattaneo et al. (2017) to correlated error terms leads to an improvement in bias and size of the resulting test but does not perform as well as the methods based on subsetting. Inference for grouped data using a $t$-test (IM) is very conservative and exhibits extremely low power away from the null hypothesis. Inference for grouped data using a randomization tests (CRS) exhibits good size control since the groups used here have very little overlap. However, power, while significantly higher than IM, is still very low compared to the Wald or $t$-tests using the estimator and standard errors proposed here. For instance, the average width of 95% confidence intervals using CRS and seven groups is more than four times larger than using the methods proposed here.^{25}

### B. Estimation of Dynamic Models

where $Yt-1=[yis]i=1,\u2026,n,s=1,\u2026,t-1$.

This is a dynamic model of panel data with one strictly exogenous covariate, $xit$, which will be taken to be binary and will be labeled as a “policy variable” so that policy evaluation here would amount to estimating $\beta 0$ precisely. This labeling is used only to emphasize that the results discussed in section III matter not only for estimating the coefficients on sequentially exogenous covariates ($\rho 0$ here) but also on strictly exogenous covariates ($\beta 0$ here).

I present results for the dynamic two-way fixed effects estimator defined in section III based on the full sample (DTW f.s., defined by an extension of equation [28] to multiple covariates) and the estimator defined in section III based on subsetting (DTW sub., defined by an extension of equation [29] to multiple covariates). Both estimators are discussed in more detail in the appendix. The first alternative estimator considered is the two-way fixed-effects regression estimator, which treats $yit-1$ as a strictly exogenous covariate. I call this estimator the static two-way fixed-effects estimator (STW). The second alternative estimator is an estimator as in section III but treats $[1[dit=d]]d\u2208Dn$ as a finite-dimensional vector of strictly exogenous covariates, so that only $ci$ is removed by sequential transformations of the data. I call this estimator the naive two-way fixed-effects estimator (NTW). The third alternative is a dynamic estimator as in section III but ignores $edit$ and estimates a dynamic model with cross-sectional unobserved heterogeneity only. I call this estimator the one-way fixed-effects estimator 1 (OW1). The fourth alternative estimator is obtained by a regression of $yit$ on $yit-1$, $xit$, and the indicator variables ${1[dit=d]}d\u2208Dn$ (OW2). Finally, the fifth alternative estimator ignores all forms of unobserved heterogeneity and estimates $\rho 0$ and $\beta 0$ by a regression of $yit$ on $yit-1$ and $xit$ (OLS).

In table 2, I report summary statistics from the simulated distributions of the estimators of $\rho 0$ and $\beta 0$: expected bias, mean absolute error, and the spread between the 5th and 95th percentiles of the distribution of the estimators. Ten thousand replications were used to obtain the simulated distributions of the estimators.

. | DTW (f.s.) . | DTW (sub.) . | STW . | NTW . | OW1 . | OW2 . | OLS . |
---|---|---|---|---|---|---|---|

$E(\rho ^-\rho 0)$ | −0.006 | −0.006 | −0.216 | −0.116 | −0.094 | 0.273 | 0.390 |

$E(|\rho ^-\rho 0|)$ | 0.023 | 0.024 | 0.216 | 0.116 | 0.116 | 0.273 | 0.390 |

$Q0.95(\rho ^)-Q0.05(\rho ^)$ | 0.095 | 0.096 | 0.141 | 0.161 | 0.348 | 0.049 | 0.100 |

$E(\beta ^-\beta 0)$ | −0.002 | −0.002 | −0.069 | −0.030 | 1.016 | 0.292 | 1.209 |

$E(|\beta ^-\beta 0|)$ | 0.077 | 0.079 | 0.097 | 0.095 | 1.016 | 0.293 | 1.209 |

$Q0.95(\beta ^)-Q0.05(\beta ^)$ | 0.316 | 0.328 | 0.318 | 0.377 | 0.766 | 0.301 | 0.793 |

. | DTW (f.s.) . | DTW (sub.) . | STW . | NTW . | OW1 . | OW2 . | OLS . |
---|---|---|---|---|---|---|---|

$E(\rho ^-\rho 0)$ | −0.006 | −0.006 | −0.216 | −0.116 | −0.094 | 0.273 | 0.390 |

$E(|\rho ^-\rho 0|)$ | 0.023 | 0.024 | 0.216 | 0.116 | 0.116 | 0.273 | 0.390 |

$Q0.95(\rho ^)-Q0.05(\rho ^)$ | 0.095 | 0.096 | 0.141 | 0.161 | 0.348 | 0.049 | 0.100 |

$E(\beta ^-\beta 0)$ | −0.002 | −0.002 | −0.069 | −0.030 | 1.016 | 0.292 | 1.209 |

$E(|\beta ^-\beta 0|)$ | 0.077 | 0.079 | 0.097 | 0.095 | 1.016 | 0.293 | 1.209 |

$Q0.95(\beta ^)-Q0.05(\beta ^)$ | 0.316 | 0.328 | 0.318 | 0.377 | 0.766 | 0.301 | 0.793 |

DTW (f.s.) and DTW (sub.) are the estimators proposed in section III relying on the full sample without subsetting or on subsetting, respectively. STW treats covariates as strictly exogenous, NTW treats one of the fixed effects as finite-dimensional. OW1 and OW2 each ignore one of the fixed effects. OLS ignores both fixed effects.

Looking at bias or the mean absolute error of the estimators, we see that it is not only accounting for both sources of heterogeneity that is important for estimating both $\rho 0$ and $\beta 0$ accurately (DTW outperforms OW1, OW2, and OLS), but also accounting for the fact that one of the covariates is sequentially exogenous (DTW outperforms STW), and using an estimator that treats neither $ci$ nor $ed$ as consistently estimated (DTW outperforms NTW). It is important to note that the bias in terms of policy evaluation, which here is captured by the bias in estimating $\beta 0$, can be large when using estimators that do not account for both sources of unobserved heterogeneity. In other words, when $\beta 0$ is the only object of interest, the argument stating that $yit-1$ is a proxy for either term of unobserved heterogeneity does not hold with the specific data-generating process used in this section. This is consistent with the results in section V, where estimators that do not account for several sources of unobserved heterogeneity lead to results that are significantly different from the results obtained with the estimator developed in this paper. The appendix contains additional results on estimation bias with static models.

## V. Effect of Class Size Reductions on Student Achievement

In this section I estimate the effect of class size reductions on student achievement among elementary school students in North Carolina using a dynamic model of learning that accounts for selection based on student unobserved ability and teacher and school unobserved quality.

where $yit$ is student $i$'s achievement in year $t$ as measured by a standardized test score, $gradeit$ is a student's grade in year $t$, $\alpha gradeit,t$ captures year/grade heterogeneity in student achievement, $yit-1$ is past achievement, $xit$ is a fourth-order polynomial in class size used to allow for nonlinearities in the effect of class size on achievement, $ci$ is student unobserved learning ability, $edit$ is teacher unobserved teaching effectiveness, $fschoolit$ are unobserved school-level factors, $uit$ are transitory shocks, $Yt-1$ collects all prior test scores ($Yt-1={yis}i\u2208N,s\u2264t-1$), and $X$ collects all class size assignments ($X={xit}(i,t)\u2208N\xd7T$). The appendix contains a longer discussion of this model.^{26}

The estimation sample is composed of observations on students in fourth and fifth grade between the years 2009 and 2012.^{27} In order to avoid using the polynomial in class size to extrapolate outside the effective range of class size observed in the sample, I restrict the estimation sample to observations with class size between ten and thirty students.^{28} Table 3 shows several statistics from the sample used for estimation, such as unconditional average and standard deviation of test scores, and average test scores for various values of class size. Test scores are left in their original units here. These units can be contextualized by considering percentiles of the distribution of test scores. An increase in test score by 1 would switch a 50th percentile student to the 54th percentile in mathematics and a 48th percentile student to the 53rd
percentile in reading. On average, an increase in test score by 1 is associated with moving up in the grade distribution by around 3 percentiles. The appendix contains a longer discussion of the data.

. | Mathematics . | Reading . |
---|---|---|

Students | 206,722 | |

Teachers | 11,960 | 12,332 |

Years | 2009 to 2012 | |

Grades | 4 and 5 | |

Test scores | ||

Average | 355.04 | 349.16 |

SD | 9.38 | 9.52 |

Class size | ||

Average | 21.99 | 21.44 |

SD | 4.06 | 4.53 |

Average test score by class size | ||

10–14 students | 353.60 | 347.87 |

15–19 students | 353.29 | 347.42 |

20–24 students | 354.82 | 349.03 |

24–30 students | 356.81 | 351.05 |

. | Mathematics . | Reading . |
---|---|---|

Students | 206,722 | |

Teachers | 11,960 | 12,332 |

Years | 2009 to 2012 | |

Grades | 4 and 5 | |

Test scores | ||

Average | 355.04 | 349.16 |

SD | 9.38 | 9.52 |

Class size | ||

Average | 21.99 | 21.44 |

SD | 4.06 | 4.53 |

Average test score by class size | ||

10–14 students | 353.60 | 347.87 |

15–19 students | 353.29 | 347.42 |

20–24 students | 354.82 | 349.03 |

24–30 students | 356.81 | 351.05 |

The model given by equations (37) and (38) is estimated as described in section III using parsimonious exactly identifying instrumental variables, which are detailed in the appendix, and the transformation defined in section III. As discussed in sections II and III, I report results using the estimator without subsetting, defined by an extension of equation (28) to multiple covariates, as the preferred-point estimate. As in section II, inference is based on the pooled regression estimator, which relies on subsetting, defined by an extension of equation (29) to multiple covariates. Details for all estimators
discussed in this section are provided in the appendix. Each subset corresponds to a pair of teachers who have taught at least two students in common, and the subset is composed of all observations corresponding to students who have been taught by both teachers.^{29} Standard errors used here are designed to be robust to arbitrary dependence at the school and student levels, which leads to a minor modification compared to the standard errors defined in section II. The standard errors used here rely on dependence neighborhoods that collect all subsets containing observations indexed by the same school or student, while the standard errors defined in Section II would use dependence neighborhoods that collect all subsets containing observations indexed by the same teacher or student. This procedure leads to around 25,000 subsets with dependence neighborhoods that
have an average cardinality around 37, so that the asymptotic approximation treating the number of subsets as large and the dependence neighborhoods as relatively small seems appropriate here.

In order to evaluate the difference between the results obtained with the methods introduced in this paper and existing methods, I also present results obtained with several alternative estimators. I show results obtained with an estimator for dynamic models with one-way unobserved heterogeneity indexed by student identity only, as used, for instance, in Andrabi et al. (2011), Rothstein (2010), and Verdier (2016). I also show results obtained with an estimator for models without student-level unobserved heterogeneity and which treats teacher- and school-level unobserved heterogeneity as finite-dimensional parameters that can be estimated consistently. This type of estimator is used, for instance, in Kane and Staiger (2008), Chetty, Friedman, and Rockoff (2014), and Guarino, Reckase, and Wooldridge (2015). A heuristic argument frequently found in the empirical literature for using estimators that correspond to dynamic models without student-level unobserved heterogeneity is that past test scores serve as proxies for student unobserved ability. Finally, I show results obtained with an estimator for models without unobserved heterogeneity, which is simply an ordinary least squares regression of test score on past test score, grade-year indicators, and the polynomial in class size. In the appendix, I also consider models in gains ($\rho 0$ is assumed to equal 1), models without dynamics ($\rho 0$ is assumed to equal zero), and models with measurement error where test scores are noisy measures of student achievement.

The results are presented in table 4. In mathematics, on a range of class size from 15 to 30 students, reductions in class size are estimated to have a positive effect on student achievement, and the evolution of the magnitude of this effect over the range of class size corresponds to the hypothesis of decreasing marginal returns to teacher time per pupil. The magnitude of this effect is relatively modest; for example, reducing class size from 30 to 15 students is estimated to lead to an average increase in test score of around 0.8 (i.e., around 10% of a standard deviation) or a move up in the distribution of test scores of around 3 percentiles on average. For comparison, Krueger (1999) finds that a smaller class size reduction of 24 to 16 students on average is associated with a gain in test scores of around 3 percentiles on average for third graders.^{30} In reading, reductions in class size are not estimated to have a statistically significant effect on student achievement. In mathematics, on a range of class size from 10 to 15 students, reductions in class size are estimated to lead to a decrease in student achievement. This reversal in the sign of the effect of class size reduction could be due to factors that are not controlled for here but are related to class size. In particular, ongoing work is interested in how these results could change when accounting for peer effects. It appears from these results that there is positive selection not only of students into larger class sizes, but also of teachers, since estimators that do not account for student and teacher unobserved heterogeneity simultaneously lead to smaller estimates of the effect of class size reduction on student achievement than the estimator proposed in this study.

. | Three-Way Unobserved Heterogeneity . | Only Student Unobserved Heterogeneity . | No Student Unobserved Heterogeneity . | No Unobserved Heterogeneity . |
---|---|---|---|---|

Mathematics | ||||

Persistence | 0.089 [0.070, 0.102] | 0.109 (0.007) | 0.787 (0.001) | 0.816 (0.001) |

Class size reduction from 30 to: | ||||

25 | 0.437 [0.071, 0.730] | 0.061 (0.100) | 0.291 (0.128) | −0.253 (0.140) |

20 | 0.724 [0.405, 1.020] | 0.398 (0.092) | 0.434 (0.120) | −0.309 (0.129) |

15 | 0.875 [0.519, 1.225] | 0.417 (0.107) | 0.498 (0.138) | −0.214 (0.147) |

10 | 0.610 [0.209, 1.029] | 0.070 (0.126) | 0.266 (0.164) | −0.643 (0.163) |

Reading | ||||

Persistence | 0.230 [0.213, 0.235] | 0.256 (0.004) | 0.690 (0.001) | 0.719 (0.001) |

Class size reduction from 30 to: | ||||

25 | −0.018 [−0.438, 0.240] | −0.099 (0.090) | −0.089 (0.107) | −0.438 (0.100) |

20 | 0.169 [−0.283, 0.351] | 0.102 (0.082) | −0.032 (0.100) | −0.660 (0.092) |

15 | 0.261 [−0.192, 0.511] | 0.180 (0.094) | −0.045 (0.114) | −0.637 (0.103) |

10 | 0.245 [−0.246, 0.535] | 0.125 (0.105) | 0.053 (0.122) | −0.562 (0.109) |

. | Three-Way Unobserved Heterogeneity . | Only Student Unobserved Heterogeneity . | No Student Unobserved Heterogeneity . | No Unobserved Heterogeneity . |
---|---|---|---|---|

Mathematics | ||||

Persistence | 0.089 [0.070, 0.102] | 0.109 (0.007) | 0.787 (0.001) | 0.816 (0.001) |

Class size reduction from 30 to: | ||||

25 | 0.437 [0.071, 0.730] | 0.061 (0.100) | 0.291 (0.128) | −0.253 (0.140) |

20 | 0.724 [0.405, 1.020] | 0.398 (0.092) | 0.434 (0.120) | −0.309 (0.129) |

15 | 0.875 [0.519, 1.225] | 0.417 (0.107) | 0.498 (0.138) | −0.214 (0.147) |

10 | 0.610 [0.209, 1.029] | 0.070 (0.126) | 0.266 (0.164) | −0.643 (0.163) |

Reading | ||||

Persistence | 0.230 [0.213, 0.235] | 0.256 (0.004) | 0.690 (0.001) | 0.719 (0.001) |

Class size reduction from 30 to: | ||||

25 | −0.018 [−0.438, 0.240] | −0.099 (0.090) | −0.089 (0.107) | −0.438 (0.100) |

20 | 0.169 [−0.283, 0.351] | 0.102 (0.082) | −0.032 (0.100) | −0.660 (0.092) |

15 | 0.261 [−0.192, 0.511] | 0.180 (0.094) | −0.045 (0.114) | −0.637 (0.103) |

10 | 0.245 [−0.246, 0.535] | 0.125 (0.105) | 0.053 (0.122) | −0.562 (0.109) |

Intervals in brackets are 95% confidence intervals; numbers in parentheses are standard errors. Confidence intervals for the first estimator rely on subsetting and pooled regression estimation and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.

## VI. Conclusion

This paper provides asymptotically valid methods for inference with linear models of panel data that include multiway unobserved heterogeneity even when few observations are assigned to each value of the variables that index unobserved heterogeneity and when dependence is present. It also defines a new estimator for dynamic models (models with sequentially exogenous instruments), so that practitioners do not have to make a choice between including one-way unobserved heterogeneity only, suppressing the dynamics of the model, or treating all but one term of unobserved heterogeneity as finite-dimensional parameters that can be estimated consistently.

## Notes

^{1}

Models with two-way unobserved heterogeneity are not restricted to education or worker-firm matched data. For instance, Chetty and Hendren (2018) and Finkelstein, Gentzkow, and Williams (2016) consider households and patients assigned to geographical locations and models for income and health measures with two-way unobserved heterogeneity.

^{3}

A case where the framework used here clearly does not correspond to the structure of the data is when there is an observation for each combination of values of both indexing variables (exhaustively matched data). This is the case, for instance, when a balanced longitudinal data set is used with a model that includes individual and time heterogeneity. In such cases, estimators for linear models can be defined after a simple three-way demeaning of the original data. With exhaustively matched data, complications arise only when using nonlinear models, as studied, for instance, in Charbonneau (2014), Graham (2017), and Fernández-Val and Weidner (2016).

^{4}

Bonhomme et al. (2019) estimate features of the distribution of unobserved heterogeneity while relaxing the requirement on sparsity that would be imposed when using classical theory by relying on a consistent classification of one out of two fixed effects in a first stage. While treating one of the fixed effects as a coefficient on an approximately observed finite-dimensional covariate allows for greater flexibility in the rest of the model, the restrictions imposed on the density of matching and the nature of the matching process itself are tighter than what is needed here.

^{6}

I consider here only the case where the covariate of interest is a scalar since all results are generalized to finite-dimensional covariates in a straightforward way.

^{7}

I consider the case where the first term of unobserved heterogeneity is specific to a cross-sectional observation as it is most frequently of interest in empirical work. Extending the results presented here to the case where the observed variable indexing the first term of unobserved heterogeneity is other than $i$ is straightforward. All results for inference with the proposed estimator based on subsetting would also extend to having more than two terms of unobserved heterogeneity. This is not considered in this section for simplicity.

^{8}

Strict exogeneity of $xit$ will be relaxed in the next section, while strict exogeneity of $dit$ is essential to the results obtained here. Relaxing this latter assumption within the asymptotic framework of this paper is the subject of ongoing work but falls outside the scope of this paper.

^{9}

Treating the number of teachers $Nn$ as finitely bounded (as in Abowd, Kramarz, & Margolis, 1999) would amount to analyzing a linear model with one-way unobserved heterogeneity and finite-dimensional covariates, although one can also note that the type of dependence in transitory shocks considered in most of this paper would imply that only finite sample inference could be considered. It is likely that all results that follow would generalize to $Nn,d$ or $T$ growing unboundedly but at a rate slower than $n$. This is not considered here for simplicity.

^{10}

Note that the same issue is well known to arise in the context of models with one-way fixed effects and short panels. With one-way fixed-effects models, cluster robust standard errors are consistent (see, e.g., Wooldridge, 2010), while I discuss below that with two- (or multi-) way fixed effects, the dependence structure of the transformed error term $Mgnun$ will generally not have a cluster-dependence structure, so that cluster robust standard errors cannot be used.

^{11}

Note that this definition of $\kappa n$ is equivalent to $\u2211j\u2208N\xd7T\kappa ijmjk2=1[i=k]$$\u2200i,k\u2208N\xd7T$.

^{12}

In a recent contribution, Kline, Saggio, and Sølvsten (2018) discuss a leave-out estimator for variance components with two-way fixed-effects models. One condition they impose for consistency is that $Mgn[{i,t},{i,t}]$ be uniformly bounded away from 0. Proposition ^{1} shows that this is obtained if $T\u22652$ and every pair of teachers has taught either no student in common or at least two pairs of same-student observations in common, both of which may originate from the same student. Formally, $Mgn[{i,t},{i,t}]\u2265c$$\u2200i,t\u2208N\xd7T$, $\u2200n$, for a constant $c>0$ if for all $i\u2208N$, $t,s\u2208T$ s.t. $dit\u2260dis$, we have either $Nn,dis(i),Nn,dit(i)\u22652$ or $\u2203j\u2208N$, $j\u2260i$ s.t. $Nn,dit(j)Nn,dis(j)\u22651$.

^{13}

Cameron and Miller (2015) discuss the existence of correlated shocks even after controlling for fixed effects in the context of one-way fixed effects, and their discussion extends to the case of two-way fixed-effects models.

^{14}

In addition, these methods are computationally intensive, as they require solving linear systems in multiples of $nT$ unknowns.

^{15}

The appendix also shows that in the case of one-way fixed-effects regression, Cattaneo et al.'s (2017) method extended to serially correlated errors coincides exactly with cluster robust standard errors.

^{16}

The appendix shows that we can think of $Gn$ as an electrical network and consider current flowing from $dit$ to $dis$ through $Gn$. The paths used by electricity to flow through $Gn$ are the paths for which the associated sums of differences will receive positive weight in the definition of $u\u02d9it$.

^{17}

Note that this estimator, as well as the estimated variance defined below, is simple to compute even with large data sets because it lends itself well to parallelization.

^{18}

The term *M-dependence* was originally used to refer to time-series data where draws more than $m$ time periods apart are uncorrelated or independent for some constant $m$. With spatial data, M-dependence refers to draws apart by a distance greater than $m$ being uncorrelated or independent. In the context of network dependence considered here, we have uncorrelation between $u\u02d9n(e)$ and $u\u02d9n(e')$ if the edges $e$ and $e'$ are apart by a distance greater than 1 in $Gn$.

^{19}

In the empirical application of this paper where $x$ is class size, the condition requires that fourth-grade teachers are observed in two different years, say years 1 and 2, and that for a nonvanishing fraction of fourth-grade teachers, one student in year 1 and one student in year 2 were assigned to the same fifth-grade teacher, but that the change in class size experienced by these two students was different. In practice, $x\u02d9n(e)$ is observed, so one can verify whether the subsetting of the data defined above leads to enough variation in the transformed covariate.

^{20}

See Todd and Wolpin (2003) for a discussion of the conditions under which this approach is valid.

^{22}

Note that in this simplified example, where the covariate is sequentially exogenous, $E(xis|Xt)$ is simply $xit$ for $s=t$.

^{23}

An alternative inference method could be bootstrap resampling. Even with uncorrelated errors, Bickel and Freedman (1982) have shown that bootstrap approximations are invalid for linear regression models with many covariates. This is easy to understand in the context of two-way fixed-effects regression since selecting observations at random would not preserve a match structure close to that found in the original data set. For instance, a significant fraction of students in bootstrap samples would be associated with only one time period, even if the original data set is a balanced panel. One could attempt to solve this issue by relying on residual bootstrap resampling, but as discussed, the OLS residuals and estimated fixed effects are not close approximations to the error term and unobserved heterogeneity terms of the original model with sparsely matched data.

^{24}

The bisection algorithm for inverting the CRS test did not converge within four hours per replication for 138 replications, so results are reported conditional on this algorithm having converged—9,862 replications.

^{25}

As noted above, the inference method based on subsetting and pooled regression proposed in this paper would remain asymptotically valid if $T$ and $Nn,d$ grow as sample size $n$ increases but at slow enough rates. In other words, the method proposed here remains valid for data with a moderately dense match structure. With weak serial and cross-sectional dependence and densely matched data, heteroskedasticity-autocorrelation consistent standard errors that treat the fixed effects as parameters that can be estimated precisely would also lead to asymptotically valid inference. Note that this second approach imposes stronger restrictions on dependence than assumption ^{2} does. It is unknown whether the two-way clustered standard errors of Cameron et al. (2011) can be shown to be asymptotically valid for two-way fixed-effects regression with densely matched data. A study of the performance of these methods, or of alternative methods, with varying degrees of match density is left for future work.

^{26}

An alternative model to the one used here would rely on a teacher-by-school fixed effect, with equation (37) replaced by $yit=\alpha gradeit,t+\rho 0yit-1+xit\beta 0+ci+fdit,schoolit+uit$. Here a model with three-way fixed effects is used to exemplify that the methods proposed above are extended in a straightforward way to models with more than two fixed effects.

^{27}

I selected this period of observation in order to have a unique version of the standardized tests in mathematics and reading. The state of North Carolina switches to a new version of its tests periodically, and it is unclear how test scores can be compared across versions of the test. The state's standardized tests rely on item response theory for scores to be comparable across both students and over time, but this comparison cannot be performed across versions of tests. While attempts have been made to compare scores across versions of the test by standardizing at the year-grade level (see Rothstein, 2010), there is no theoretical justification for such a procedure. For simplicity, the results shown here rely on data corresponding to only one version of the test in each subject.

^{28}

Results obtained from restricting the sample to classrooms of between five and thirty students, which nearly encompasses the entire sample, are not reported here but lead to similar results.

^{29}

Here, since $T=2$, a subset composed of only one student would not lead to any identifying variation, so that one need only consider pairs of teachers who have taught two or more students in common. Additionally, recall from note 19 that identification here requires that many fourth-grade teachers be observed teaching two or more classrooms over time, with students from both classrooms being assigned to the same fifth-grade teacher but experiencing a different change over time in class size.

## REFERENCES

## Author notes

I thank the editor, Bryan Graham, as well as three anonymous referees for very helpful comments and suggestions. I also thank Stéphane Bonhomme, Colin Cameron, Matias Cattaneo, Jane Cooley Fruehwirth, Tiago Pires, and Jeffrey Wooldridge for useful discussions and suggestions; Paul Hanselman for providing code used to process part of the data; and Chunxiao Li, Kristina Vaughan, and Josh Horvath for their research assistance.

A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/rest_a_00807.