## Abstract

Models with multiway fixed effects are frequently used to address selection on unobservables. The data used for estimating these models often contain few observations per value of either indexing variable (sparsely matched data). I show that this sparsity has important implications for inference and propose an asymptotically valid inference method based on subsetting. Sparsity also has important implications for point estimation when covariates or instrumental variables are sequentially exogenous (e.g., dynamic models), and I propose a new estimator for these models. Finally, I illustrate these methods by providing estimates of the effect of class size reductions on student achievement.

## I. Introduction

MODELS with several sources of unobserved heterogeneity are frequently used in empirical work as an attempt to control for potentially confounding factors or because features of the distribution of unobserved heterogeneity are of interest. In the analysis of education data, student and teacher or school fixed effects are routinely used as a way to control for selection on unobservables (Hanushek et al., 2003; Rivkin, Hanushek, & Kain, 2005; Sund, 2009; Clotfelter, Ladd, & Vigdor, 2010; Kingdon & Teal, 2010; Rothstein, 2010). In the analysis of employer-employee matched data, models with worker and firm unobserved heterogeneity are often used to study features of the distribution of worker and firm productivity (Abowd et al., 1999; Postel-Vinay & Robin, 2002; Card, Heining, & Kline, 2013; Bonhomme, Lamadon, & Manresa, 2019).1

Many of these empirical applications use data with relatively few observations per value of the variables indexing unobserved heterogeneity, a structure that I call sparsely matched data.2 With education data, for instance, students are usually followed over only a few grades, and class size is limited by state laws or common practices, so that the number of observations per student and per teacher is generally small. As a result, neither the unobserved ability of a particular student nor the teaching effectiveness of a particular teacher can be estimated precisely. Similarly, considering an employer-employee matched data structure such that neither worker nor firm productivity can be estimated precisely is also likely to be relevant given the small number of workers per firm commonly encountered, the limited movement of workers between firms, and the likely presence of cross-sectional and serial dependence in productivity shocks.3

Here I study estimation and inference for the effect of finitely many observed covariates on an observed outcome of interest using a model with multiway fixed effects and sparsely matched data. While the estimation of features of the distribution of unobserved heterogeneity is of interest in many empirical studies, I leave the analysis of this question within the framework of this paper to ongoing work and concentrate here on the effect of covariates, for several reasons. These parameters are of primary interest in many empirical applications; they need to be estimated before estimating features of the distribution of unobserved heterogeneity; and estimating features of the distribution of unobserved heterogeneity necessitates stronger assumptions than those used in this paper.4

With strictly exogenous covariates, a commonly used estimator relies on a two-way fixed-effects regression, which in the education example above would take the form of a regression on a set of indicator variables for each student and each teacher.5 With sparsely matched data, the estimated fixed effects from such a regression are contaminated with estimation noise that cannot be ignored. This noise will generally lead to robust variance estimation methods being inconsistent because they rely on residuals that do not approximate the error term of the model well. In addition, unlike one-way fixed-effects regression, two-way fixed-effects regression generates complex patterns of dependence in the transformed data, so that one-way or multiway cluster robust inference will also be invalid in general.

I propose basing inference on a different estimation method that relies on many small and potentially overlapping subsets of data. This estimation procedure applies a two-way fixed-effects regression transformation within each subset, and the object of interest is then estimated by a pooled regression across all subsets. The resulting dependence structure of the transformed data is sparse; the transformed data can be shown to be M-dependent, so that this estimator can be combined with M-dependence robust standard errors to conduct asymptotically valid inference.

With dynamic models (i.e., with sequentially exogenous instrumental variables), two-way fixed-effects regression using sparsely matched data would lead to inconsistent estimators of the object of interest because of an incidental parameter bias similar to the bias found in the context of dynamic models with one-way fixed effects by Nickell (1981). Here I propose a new estimation method for dynamic models with two-way fixed effects that is consistent, locally efficient, and simple to compute given existing tools. Asymptotically valid inference can also be conducted by relying on subsetting, as in the case of strictly exogenous covariates.

## II. Inference with Strictly Exogenous Covariates

For simplicity, I begin by considering the case where covariates are strictly exogenous. I assume that one has panel data, with cross-sectional observations denoted by $i∈N$, $N={1,…,n}$ and time periods $t∈T$, $T={1,…,T}$. The model of interest in this section is given by
$yit=β0xit+ci+edit+uit,E(uit|X,D)=0,∀(i,t)∈N×T,$
(1)

where $yit$ is an observed outcome variable and $xit$ is an observed covariate.6$ci$ is an unobserved effect that can vary in the cross-section but is assumed to be constant over time.7$edit$ is an additional unobserved effect indexed by an observed indexing variable $dit∈Dn$, $Dn={1,…,Nn}$. $edit$ can vary across values of $dit$, but it is assumed to be constant across all observations that share the same value of $dit$. $uit$ represents all other unobserved factors. $X$ contains all values of $xit$, $X={xit}(i,t)∈N×T$, and $D$ contains all values of $dit$, $D={dit}(i,t)∈N×T$.

The object of interest throughout the paper is $β0$, the coefficient on the observed covariate $xit$. In the running empirical example of this paper, $yit$ would be student achievement as measured by test score, $xit$ would be student $i$'s class size in year $t$, $ci$ would be a student's unobserved learning ability, $dit$ would be student $i$'s teacher in year $t$, $ed$ would be the unobserved teaching effectiveness of teacher $d$, and $uit$ are all other unobserved shocks to learning. The advantage of the model given by equation (1) is that it does not restrict the relationship between class size and unobserved student and teacher quality, so that, for instance, the model accommodates classroom formation processes where more able students and teachers are systematically assigned to larger classrooms. On the other hand, equation (1) requires that transitory shocks $uit$ are mean independent of past, present, and future values of class size and teacher assignment.8

In the empirical example of this paper, students are observed for only two time periods, and on average, teachers teach only 40 students over all time periods of observation, which is small compared to the 200,000 students observed in the data. This corresponds to a data structure frequently encountered, which I call sparsely matched, where the number of observations corresponding to each value of $i$ and of $dit$ is small or, conversely, the total number of values taken by $i$ and $dit$ is large. Define $Nn,d=|{(i,t)∈N×T:dit=d}|$ to be the number of observations assigned to the value $d$ of $dit$, where $|.|$ denotes the cardinality of a finite set. I capture a sparsely matched data structure by using an asymptotic framework where $n$ grows to infinity while $T$ remains fixed and the number of observations for each $dit$ remains uniformly bounded: $Nn,d≤C$$∀d∈Dn$$∀n$ for a constant $C$. As a result, the total number of values that $i$ and $dit$ can take, $n$ and $Nn$, respectively, are proportional to sample size $nT$.9

Before discussing estimation and inference, I define a few elements of notation that will help the exposition that follows. Stacking the data over cross-sectional observations and time, define $yn=[yit](i,t)∈N×T$, $xn=[xit](i,t)∈N×T$, $cn=[ci]i∈N$, $en=[ed]d∈Dn$, $un=[uit](i,t)∈N×T$, $gn,1=[1[i=j]](i,t)∈N×Tj∈N$, $gn,2=[1[dit=d]](i,t)∈N×Td∈Dn$, $gn=[gn,1,gn,2]$. Finally since all statements in this paper are made conditional on assignment to values of the indexing variable $dit$, I suppress this conditioning throughout the rest of the paper in order to shorten notation. Equation (1) can then be rewritten as
$yn=β0xn+gncnen+un,E(un|X)=0.$
(2)

### A. Variance Estimation for Two-Way Fixed Effects Regression

I begin by considering a natural estimator for $β0$ under the model given by equation (1), which is obtained by an OLS regression of the outcome variable $y$ on the observed covariate $x$ and the indicator variables for each cross-sectional observation and each value of the indexing variable $dit$. This estimator is generally referred to as the two-way fixed-effects regression estimator and is formally defined by
$β^n=(xn'Mgnxn)-1xn'Mgnyn,$
(3)
where $Pgn=gn(gn'gn)-gn'$ and $Mgn=InT-Pgn$ are the projection matrices corresponding to a two-way fixed-effects regression and the operator $(.)-$ denotes the generalized inverse of a matrix.
In the remainder of this section, I consider the problem of estimating the variance of the two-way fixed-effects regression estimator, which is given by
$Var(β^n|X)=(xn'Mgnxn)-2∑(i,t),(j,s)∈N×Tx˙itx˙jsE(uitujs|X)$
(4)

where $x˙n=Mgnxn$ and $x˙it$ the element of $x˙n$ corresponding to the observation $(i,t)$.

Under an additional assumption of uncorrelated and homoskedastic error term, $Var(un|X)=σ2InT$, we have $Var(β^n|X)=σ2(xn'Mgnxn)-1$. The classical estimator for $σ2$ is
$σ^2=1trace(Mgn)-1∑(i,t)∈N×Tu^it2,$
(5)
where $u^n=Mgn(yn-xnβ^n)$ denotes the residuals obtained from the two-way fixed-effects regression described above and $u^it$ is the element of $u^n$ corresponding to observations $(i,t)$. This estimator will generally be consistent for $σ2$ even with sparsely matched data, so that classical OLS variance estimation will generally be consistent under homoskedasticity and uncorrelation of the error term.

However, empirical researchers frequently wish to rely on inference methods that are robust to patterns of heteroskedasticity and correlation. With sparsely matched data, robust variance estimation methods applied to a two-way fixed-effects regression will generally not be consistent, leading to a nonstandard inference problem that originates from the dimension of the control covariates contained in $gn,1$ and $gn,2$, $n$ and $Nn$, respectively, being proportional to sample size $nT$.

For instance, the heteroskedasticity consistent estimated variance for $β^n$ is
$Var^EHW(β^n)=(xn'Mgnxn)-2∑(i,t)∈N×Tx˙it2u^it2,$
(6)
and we can easily show that it will generally be an inconsistent estimator for $Var(β^n|X)$ in the context of two-way fixed-effects regression even when the error term is uncorrelated, that is, when $Cov(uit,ujs|X)=0$ if $(i,t)≠(j,s)$, by considering the special case where $Var(un|X)=σ2InT$. Indeed, under this assumption, we have
$E(Var^EHW(β^n)|X)Var(β^n|X)-1=-∑(i,t)∈N×Tx˙it2pit,it∑(i,t)∈N×Tx˙it2+∑(i,t)∈N×Tx˙it4(∑(i,t)∈N×Tx˙it2)2$
(7)

where $pit,it$ is the diagonal element of $Pgn$ corresponding to observation $(i,t)$. While the second right-hand-side term above will generally converge to 0, two-way fixed-effects regression with sparsely matched data leads to the first right-hand-side term, $∑i,tx˙it2pit,it∑i,tx˙it2$, being nonignorable, which will lead to an asymptotic bias in the heteroskedasticity-consistent standard errors.

The term $pit,it$ can be interpreted as the contribution of observation $(i,t)$ in predicting the term $ci+edit$ based on a regression on the covariates $g1,n$, $g2,n$. Consistency of the heteroskedasticity-consistent estimated variance would generally require that this contribution becomes negligible as sample size increases. Heuristically, with sparsely matched data, the amount of information for predicting the term $ci+edit$ from observations other than $(i,t)$ is bounded in a setting where student $i$ is observed for only a few years ($T$) and teacher $dit$ is observed teaching few students ($Nn,dit$), so that the contribution of observation $(i,t)$ in predicting its own term of heterogeneity $ci+edit$ does not vanish.10

Cattaneo, Jansson, and Newey (2017) have studied a similar problem of heteroskedasticity-robust inference with high-dimensional control covariates in a setting that is not specific to two-way fixed-effects regression. They propose an estimated variance, which they show can be consistent if the term $pit,it$ is not too large uniformly across all observations $(i,t)∈N×T$. Next, I propose a new representation for two-way fixed-effects regressions based on random walks on graphs, which can be used to derive easily interpretable conditions for this condition on $pit,it$ to hold.

#### Variance estimation under conditional uncorrelatedness.

In this section, I consider heteroskedasticity-robust estimation of the variance of the two-way fixed-effects regression estimator with sparsely matched data and when the error term is uncorrelated both cross-sectionally and serially, that is, $Cov(uit,ujs|X)=0$ for $(i,t)≠(j,s)$. In order to shorten notation, I use $i=(i,t)$ to denote an observation in $N×T$. Under the assumption of conditionally uncorrelated errors, we have
$Var(β^n|X)=1(xn'Mgnxn)2∑i∈N×Tx˙i2E(ui2|X)$
(8)
Define $u˙n=Mgnun$, $u˙i$ to be the element of $u˙n$ corresponding to observation $i$, and recall $u^n=Mgn(yn-β^nxn)$. Define $mi,j$ to be the element of $Mgn$ corresponding to observations $i$ and $j$, so that $u˙i=∑j∈N×Tmi,juj$. The estimator for $Var(β^n|X)$ proposed in Cattaneo et al. (2017) is
$Var^(β^n|X)=1(xn'Mgnxn)2∑i∈N×Tx˙i2∑j∈N×Tκi,ju^j2,$
(9)
where $κi,j$ are weights chosen to be $κn=(Mgn⊙Mgn)-1$, where $⊙$ is the Hadamard (element-by-element) product operator and the matrix $κn$ collects all weights $κi,j$, that is, $κn=[κi,j]i∈N×Tj∈N×T.$11
The sequence of weights $κn$ is chosen to map the second moments of the transformed error term $u˙n$ to the second moments of the original error term $un$. Under regularity conditions on the transformed covariates $x˙n$, the error term $un$, and the sequence of weights $κn$, one obtains
$Var^(β^n|X)≃1(xn'Mgnxn)2∑i∈N×Tx˙i2∑j∈N×Tκi,jE(u˙j2|X)$
(10)

where $≃$ denotes a convergence in probability at a rate fast enough to eventually obtain consistency of the estimated variance, that is, to obtain $Var^(β^n|X)Var(β^n|X)-1=op(1)$.

From the definition of $u˙i$ and the assumption of uncorrelatedness we have
$∑i∈N×Tx˙i2∑j∈N×Tκi,jE(u˙j2|X)=∑i∈N×Tx˙i2∑j∈N×Tκi,j∑k∈N×Tmj,k2E(uk2|X),$
(11)
and the definition of the weights $κi,j$ implies
$∑i∈N×Tx˙i2∑k∈N×TE(uk2|X)∑j∈N×Tκi,jmj,k2=∑i∈N×Tx˙i2E(ui2|X).$
(12)

Therefore, under the regularity conditions of Cattaneo et al. (2017), a consistent estimator for the variance of the two-way fixed-effects regression estimator can be obtained under the assumption that the original error terms $uit$ are uncorrelated. One important regularity condition used to establish consistency is that the maximum absolute sum of the weights be uniformly bounded, that is, $maxi∈N×T∑j∈N×T|κi,j|≤C$$∀n$ for some constant $C$. In turn, they show that a sufficient condition for this property of the weights to hold is that the minimum diagonal element of $Mgn$ be uniformly bounded above $12$, or equivalently that the maximum diagonal element of $Pgn$ be uniformly bounded below $12$. A contribution of this paper is to propose a new representation for $Mgn$ based on random walks on graphs and to show that this representation can be used to provide easily interpretable conditions for this uniform boundedness condition to hold.

Proposition 1 shows that the minimum diagonal element of $Mgn$ is uniformly bounded above $12$ if more than three time periods are observed and if for all pairs of values $d,d'∈Dn$, the number of cross-sectional observations assigned to both $d$ and $d'$ is either 0 or large enough. In the running example of this paper, if we impose the simplification that each student is taught by a particular teacher at most once, this result shows that the main regularity condition for the standard errors of Cattaneo et al. (2017) to be valid holds if $T$ is three or larger and if, whenever two teachers have taught a common set of students, they have taught at least five students in common when $T=3$, at least four students in common when $T=4$, and at least three students in common when $T≥5$. In order to state proposition 1, define $Nd(i)$ to be the number of times observation $i$ was assigned to value $d$ of the indexing variable $dit$: $Nd(i)=|{t∈T:dit=d}|$.

Proposition 1.
The two-way fixed-effects regression satisfies
$Mgn[{i,t},{i,t}]≥1-1T-T-Ndit(i)Tmaxs'∈T:dis'≠ditNdis'(i)∑j∈NNdit(j)Ndis'(j)∀i,t∈N×T.$
(13)
Therefore, if $T≥3$ remains fixed and if $∀i,t∈N×T,∀n$ we have either $Ndit(i)=T$ or
$Ndis(i)∑j∈NNdit(j)Ndis(j)<12T-2T-Ndit(i)∀s∈Ts.t.dis≠dit.$
(14)
Then:
$Mgn[{i,t},{i,t}]≥12+c∀i,t∈N×T,∀n,foraconstantc>0.$
(15)

The proofs for all the results in this paper are in the online appendix. The appendix also contains additional results; in particular, it includes results for inference when $T=2$ and under other relaxations.12

In many applications, it is likely that empirical researchers believe that there exists serial or cross-sectional correlation in $uit$. For instance, even conditional on a student's ability, there could be shocks to a student's learning that are persistent over time, such as making new friends or changes in the environment at home. Even conditional on a teacher's effectiveness, there could be correlated shocks among students in the same classroom, such as the introduction of a new technology, after-school homework sessions, or friendship networks nested within classrooms that lead to correlated shocks.13 Therefore, one might wish to use inference methods that are not only robust to heteroskedasticity but also to some forms of serial and cross-sectional correlation. The appendix shows that Cattaneo et al.'s (2017) method can be extended to the case of multiway cluster dependence but does not include results on the asymptotic properties of this extension.

### B. Robust Inference and Subsetting

Cattaneo et al.'s (2017) method relies on finding a sequence of weights such that $∑i∈N×Tx˙i2∑j∈N×Tκiju˙j2≃∑i∈N×Tx˙i2∑j∈N×TκijE(u˙j2|X)=∑i∈N×Tx˙i2E(ui2|X)$. Its extension to correlated errors defined in the appendix relies on a similar argument but using cross-products of the transformed error term $u˙i$ in addition to squared terms. The sequence of weights ${κij}i,j∈N×T$ is key in determining the asymptotic properties of the resulting estimated variances. Proposition 1 lists conditions for consistency with uncorrelated errors and two-way fixed-effects regression, but these conditions are unknown for models with correlated errors or more than two-way fixed effects. In addition, the Monte Carlo simulation results in this paper find a relatively large finite sample bias when implementing the extension defined in the appendix of Cattaneo et al.'s (2017) method to correlated errors, pointing to the sequence of weights used in that case not being sparse enough for good properties in small samples.14

An alternative approach to inference with high-dimensional control covariates is to use estimated variances that are robust to the dependence that exists in the transformed error term $u˙i$. The structure of this dependence is determined by the dependence in the original error term $ui$ and the form of the transformation used.

For instance, with one-way fixed-effects regression and uncorrelated errors, the transformed error term $u˙i$ would be serially correlated since it can be written as a time-demeaned transformed variable, $u˙i=ui-1T∑j∈I(i)uj$, where $I(i)$ collects the observations across all time periods for the same cross-sectional unit as $i$; $I(i)={(i,s):s∈T}$ where $i=(i,t)$. In this case, since the transformed error term is cross-sectionally uncorrelated, $E(u˙iu˙j|X)=0$ if $j∉I(i)$, cluster robust variance estimation will generally be consistent: $∑i∈N×T∑j∈I(i)x˙iu˙iu˙jx˙j≃∑i,j∈N×Tx˙iE(u˙iu˙j|X)x˙j=∑i∈N×Tx˙i2E(ui2|X)$ (see Wooldridge, 2010).15

This approach cannot be extended to the case of two-way fixed-effects regression because in general, the transformation matrix $Mgn$ can be fully dense, so that the transformed error term $u˙i$ for observation $i$ can generally be correlated with all other observations of the transformed error term even if the original error term $ui$ is uncorrelated. This can be understood using a graph formed by teachers as nodes and students as edges, formally a graph $Gn={Dn,En}$, where the set of nodes is given by the set of teachers $Dn$ and an edge exists between two teachers if they have taught at least one student in common, $(d,d')∈En$ if $∑i∈NNd(i)Nd'(i)>0$, where $Nd(i)$ is the number of times student $i$ was taught by teacher $d$. The graph corresponding to the assignment of students and teachers used in the Monte Carlo simulations is represented in figure 1, and the graph corresponding to the assignment used in the empirical application is represented in the appendix.

Figure 1.

Graph $Gn$ for the Monte Carlo Simulations Section

Nodes are schools, and edges are given by observations on students over four years. Self-loops were removed for clarity. Largest connected component: 201 nodes out of 205, diameter = 20, radius = 10, average degree = 3.498, characteristic path length = 8.340. Nodes are represented with different patterns and shades to indicate the group membership used for grouped data inference.

Figure 1.

Graph $Gn$ for the Monte Carlo Simulations Section

Nodes are schools, and edges are given by observations on students over four years. Self-loops were removed for clarity. Largest connected component: 201 nodes out of 205, diameter = 20, radius = 10, average degree = 3.498, characteristic path length = 8.340. Nodes are represented with different patterns and shades to indicate the group membership used for grouped data inference.

Consider an existing edge $(d,d')$ in $Gn$, meaning that there is at least one student who has been taught by both teacher $d$ and teacher $d'$. If two different students have been taught by both teachers $d$ and $d'$, the model given by equation (1) leads to a moment condition for estimating $β0$ that relies on comparing the change over time in outcome ($y$) and covariate ($x$) of these two students,
$E(yit-yis-(yjt'-yjs')-β0(xit-xis-(xjt'-xjs'))|X)=0,$
(16)

for $i,j,t,s,t',s'$ such that $dit,djt'=d$, $dis,djs'=d'$.

Suppose a path of length 2 from $d$ to $d'$ exists in $Gn$; that is, there is $d1∉{d,d'}$ such that the edges $(d,d1)$ and $(d1,d')$ exist in $Gn$. Another moment condition obtained from equation (1) relies on comparing the change over time in outcome and covariate for the student who was taught by $d$ and $d'$ with the sum of the change over time in outcome and covariate for a student who was taught by $d$ and $d1$ and a student who was taught by $d1$ and $d'$:
$E(yit-yis-(yjt'-yjs')-(ykt''-yks'')-β0(xit-xis-(xjt'-xjs')-(xkt''-xks''))|X)=0$
(17)

where $dit,djt'=d$, $dis,dks''=d'$, $djs',dkt''=d1$.

For any path of any length in $Gn$ from $dit$ to $dis$, one can form a valid moment condition for $β0$ in this way by comparing the differenced term $yit-yis-β0(xit-xis)$ to the sum of the differences along that path in $Gn$ from $dit$ to $dis$. The appendix shows that the two-way fixed-effects regression is a transformation that obtains a linear combination of these comparisons, so that the transformed error term $u˙it$ is a weighted sum of the comparisons between ${uit-uis:s∈T}$ and the sums of differences in the original error term ${ujs}(j,s)∈N×T$ along paths in $Gn$ from $dit$ to ${dis:s∈T}$.16 It is therefore possible that each observation of the original error term ${ujs}(j,s)∈N×T$ enters the definition of the transformed error term $u˙it$, possibly leading to a non-0 correlation between all observations of the transformed error term.

In this section, I propose basing inference for $β0$ on a different estimator which only makes use of moment conditions for estimating $β0$ obtained by comparisons that are local in $Gn$, that is, moment conditions of the same form as equation (16), ignoring additional moment conditions that are obtained by comparisons on paths of length greater than 1 such as equation (17). The shortcoming of this approach is that it purposely does not make use of all the implications of the model given by equation (1) for estimating $β0$. This will generally lead to a loss in efficiency. The advantage of the approach proposed here is that by using local comparisons only, it relies on transformed data with a tractable dependence structure, so that an analog of cluster robust standard errors with one-way fixed-effects models can be defined in this context.

In order to introduce this estimator, consider the subset of data obtained by collecting the observations assigned to teachers $d$ or $d'$ and corresponding to students who have been taught by both $d$ and $d'$, that is, the subset of data defined by $S(d,d')={(i,t)∈N×T:dit∈{d,d'},Nd(i)Nd'(i)>0}$, where, recall, that $Nd(i)$ is the number of times student $i$ was taught by teacher $d$. To shorten notation, write $e=(d,d')$, and define the data corresponding to this subset: $yn(e)=[yit](i,t)∈S(e)$, $xn(e)=[xit](i,t)∈S(e)$, $gn(e)=[gn,it](i,t)∈S(e)$, $un(e)=[uit](i,t)∈S(e)$, where $gn,it=[[1[i=j]]j∈N,[1[dit=d]]d∈Dn]$ is the row of $gn$ corresponding to observation $(i,t)$.

We can obtain a valid moment condition for estimating $β0$ that only involves data from this subset by applying a two-way fixed-effects regression at the subset level. Define $y˙n(e)=Mgn(e)yn(e)$, $x˙n(e)=Mgn(e)xn(e)$, and $u˙n(e)=Mgn(e)un(e)$; then equation (1) implies
$E(y˙n(e)-β0x˙n(e)|X)=0.$
(18)

The moment condition (18) holds for any pair of teachers $d$, $d'$ who have taught at least one student in common, so that the subset $S(d,d')$ is nonempty; that is, equation (18) holds $∀e∈En$. With sparsely matched data, the number of such subsets will be proportional to sample size ($nT)$ since the number of teachers ($Nn$) is proportional to sample size, and each teacher teaches finitely many students ($Nn,d≤C$$∀d,n$), and each student has been taught by finitely many teachers ($T$ is fixed), so that there can only be finitely many edges connected to each teacher. In addition, since each teacher teaches finitely many students, the number of observations inside each subset $S(e)$, $e∈En$ will be finite.

Therefore, the moment condition (18) holds for many small subsets of data indexed by $e∈En$, and we can consider estimating $β0$ by a pooled regression and define17
$β˜n=∑e∈Enx˙n(e)'y˙n(e)∑e∈Enx˙n(e)'x˙n(e)=β0+∑e∈Enx˙n(e)'u˙n(e)∑e∈Enx˙n(e)'x˙n(e).$
(19)
Before considering correlated errors, I first consider for simplicity the case where the original error term $uit$ is uncorrelated: $Cov(ui,uj|X)=0$ if $i≠j$. With uncorrelated error terms, the transformed error term $u˙n(e)$ for the subset indexed by $e∈En$ will be correlated with other transformed error terms $u˙n(e')$ only if the two corresponding subsets $S(e)$ and $S(e')$ share some observation in common, that is, if $S(e)∩S(e')≠∅$. I call this pattern of dependence M-dependence and call the set of indices $D(e)={e'∈En:S(e')∩S(e)≠∅}$ the dependence neighborhood of $e$.18 We can then write
$Var(β˜n|X)=1(∑e∈Enx˙n(e)'x˙n(e))2∑e∈En,e'∈D(e)x˙n(e)'E(u˙n(e)u˙n(e')'|X)x˙n(e').$
(20)
As discussed above, the number of subsets ($|En|$) is proportional to sample size ($nT$). In addition the cardinality of the dependence neighborhoods ($|D(e)|$) is uniformly bounded because each subset $S(e)$ contains finitely many observations, and each observation belongs to finitely many subsets. Therefore, one can expect M-dependence robust variance estimation to be consistent, with the estimated variance defined by
$Var^(β˜n|X)=1(∑e∈Enx˙n(e)'x˙n(e))2∑e∈En,e'∈D(e)x˙n(e)'u^n(e)u^n(e')'x˙n(e'),$
(21)

where $u^n(e)=y˙n(e)-β˜nx˙n(e)$ are the residuals from the pooled regression.

In the remainder of this section, I establish the consistency of M-dependence robust variance estimation in the more general case where some correlation exists in the original error term $uit$ of the model. The first assumption restricts the extent of this correlation.

Assumption 1.
Let $A⊆N×T$ be any subset of observations. Then
${uit}(i,t)∈A⊥{ujs,(j,s)∈N×T:∀(i,t)∈A,i≠janddit≠djs}|Xa.s.$
(22)

This assumption allows for arbitrary correlation in $uit$ across observations that correspond to the same student or to the same teacher, but it imposes independence otherwise. I discuss at the end of this section possible generalizations of this assumption.

The next assumption imposes restrictions on the moments of the data.

Assumption 2.

1. The support of $xit$ and $uit$ is compact.

2. $1n∑e∈Enx˙n(e)'x˙n(e)≥c$$∀n≥C$ for constants $c>0$ and $C$ a.s.

3. $1nVar(∑e∈Enx˙n(e)'u˙n(e)|X)≥c$$∀n≥C$ for constants $c>0$ and $C$ a.s.

Assumption 2a is standard and could easily be replaced by higher moments of $xit$ and $uit$ being uniformly bounded. Assumption 2b imposes that the amount of identifying variation in the transformed covariate $x˙n(e)$ is proportional to sample size. A sufficient condition for assumption 2b to hold is that there is some identifying variation in the transformed covariates $x˙n(e)$ in a nonvanishing fraction of subsets, that is, $x˙n(e)'x˙n(e)≥c>0$ for $e∈En+$ where $|En+||En|≥c>0$$∀n$. This in turn would hold, for instance, if for a nonvanishing fraction of edges $(d,d')∈En$, we can find two students who have been taught by both $d$ and $d'$ but experienced a different change in their covariate $x$ between being taught by $d$ and $d'$.19

Assumption 2c is a standard regularity condition for the application of central limit theorems and requires that the variance of the numerator of the pooled regression estimator be of order proportional to sample size. The appendix shows that this would hold if assumptions 2a and 2b hold and $λmin(Var(un|X))≥c$ for a constant $c>0$ where $λmin$ is the smallest eigenvalue of a matrix. This latter condition holds if, for instance, $uit$ can be written $uit=eit+εit$, where $Cov(eit,εjs|X)=0$$∀(i,t),(j,s)$, $Cov(εit,εjs|X)=0$ if $(i,t)≠(j,s)$, and $Var(εit|X)≥c$$∀(i,t)$. This decomposition of $uit$ uses the term $eit$ to capture all forms of dependence, while the term $εit$ captures the idiosyncratic variation in $uit$. Informally, we can say that assumption 2c holds if assumptions 2a and 2b hold and if each observation of $uit$ is potentially correlated with but not linearly dependent on other observations of $uit$.

Proposition 4 shows that the pooled regression estimator is asymptotically normal under assumptions 2 and 3 in a sparsely matched data asymptotic framework.

Proposition 2.
Under equation (1) and assumptions 2 and 3, as $n→∞$ while $T$ remains fixed and $Nn,d≤C$$∀d∈Dn$, $∀n$, for a constant $C$:
$1vnn(β˜n-β0)→dN(0,1),$
(23)
where $vn2=1(1n∑e∈Enx˙n(e)'x˙n(e))21nVar(∑e∈Enx˙n(e)'u˙n(e)|X)$.
Assumption 2 allows for serial and cross-sectional correlation in the error term $uit$. As a result, the transformed error term $u˙n(e)$ for the subset of data indexed by $e∈En$ will potentially be correlated with the transformed error term of all other subsets that share either students or teachers in common with subset $S(e)$. Formally, we redefine the dependence neighborhood of $u˙n(e)$ to be
$D(e)={e'∈En:∃(i,t)∈S(e),(j,s)∈S(e')withi=jordit=djs}.$
(24)
The M-dependence robust estimated variance for $β˜n$ is then given by
$v^n2=1(∑e∈Enx˙n(e)'x˙n(e))2∑e∈En,e'∈D(e)x˙n(e)'u^n(e)u^n(e')'x˙n(e').$
(25)

The number of subsets ($|En|$) is proportional to sample size, and as in the simpler case of uncorrelated errors discussed above, the cardinality of the new dependence neighborhoods ($|D(e)|$) is uniformly bounded, which leads to consistency of this variance estimator as shown by the next proposition:

Proposition 3.
Under equation (1) and assumptions 2 and 3, as $n→∞$ while $T$ remains fixed and $Nn,d≤C$$∀d∈Dn$, $∀n$, for a constant $C$:
$vn2v^n2-1=op(1).$
(26)

Propositions 4 and 5 show that researchers can rely on Wald tests using the pooled regression estimator ($β˜n$) and M-dependence robust estimated variance ($v^n2$) to perform asymptotically valid inference for models with two-way fixed effects using sparsely matched data.

#### Extensions.

In some empirical applications, it might be desirable to modify assumption 2 in order to allow for more general patterns of dependence. For instance in the empirical example of this paper, I only impose independence for different students in different schools instead of independence for different students in different classrooms.

In addition, in some empirical applications, it is possible that subsetting at the edge level would either be needlessly computationally cumbersome or would leave out comparisons in $xit$ that one knows to account for a significant share of the identifying variation for $β0$.

The method above can be extended to account for different patterns of dependence and different definitions of subsets as long as the M-dependence robust variance estimator relies on many subsets with relatively small dependence neighborhoods. In practice, an empirical researcher would calculate the number of subsets and the cardinality of each dependence neighborhoods to verify that the M-dependence robust estimated variance relies on many subsets with relatively few cross-products per subset.

In addition, note that it is straightforward to show that the two-way fixed-effects regression estimator discussed in section IIA is also $n$-consistent under assumptions 2 and 3 above and a sparsely matched data asymptotic framework. Since this estimator is frequently used in empirical work and is likely to be more precise than the pooled regression estimator, one could report the results from two-way fixed-effects regression as a preferred-point estimate and use the method discussed in this section based on the pooled regression estimator for inference only.

Finally, the method described here is easily extended to models where neither variable indexing unobserved heterogeneity corresponds to the cross-sectional index. For instance, data where observations are indexed by student and year could be used with a model with teacher and school fixed effects. In this case, one could, for instance, define subsets by collecting all observations corresponding to pairs of connected schools. With dependence indexed by students and teachers, as is considered in this section, the resulting dependence neighborhoods for each subset would still collect all subsets with either a student or a teacher in common with the subset under consideration. More generally, the method discussed can be extended to any setting with high-dimensional but sparse control covariates, that is, $yn=β0xn+wnηn,0+un$, $E(un|xn,wn)=0$, where the number of variables in $wn$ is proportional to sample size but for each observation $wn,i$, only finitely many variables are non-0, which includes models with three or more fixed effects. An example is provided in section V.

#### Small-Sample Corrections.

While leading to asymptotically valid inference, robust standard errors are well known to exhibit biases in small samples and to lead to Wald tests with size distortions in small samples (see MacKinnon & White, 1985; Imbens & Kolesár, 2015; and Cameron & Miller, 2015, for a review in the context of uncorrelated and cluster dependent errors).

In the appendix, I show that the small sample corrections discussed in Bell and McCaffrey (2002) and Imbens and Kolesár (2015) for uncorrelated and cluster dependent errors (local debiasing and using critical values from the $t$-distribution instead of the standard normal distribution) can be extended to obtain small sample adjustments for the test relying on the pooled regression estimator and M-dependence robust variance estimation discussed above.

## III. Estimation with Dynamic Models

Dynamic models frequently arise in empirical applications where there exists a feedback mechanism from past outcomes to the current covariates. For instance, when estimating the effect of class-size reductions on student achievement, one might wish to capture the potentially confounding effect of prior learning by including lagged achievement as an additional covariate.20 This would lead to a dynamic model of student achievement such as
$yit=ρ0yit-1+γ0wit+uit,E(uit|Yt-1,W)=0,$
where $yit$ is student $i$'s achievement in year $t$, $wit$ is class size, $Yt-1$ collects all past outcomes $Yt-1={yis}i∈N,s≤t-1$, and $W$ collects all records of class size $W={wit}i,t∈N×T$. In this model, past achievement $yit-1$ captures the otherwise confounding effect of past educational inputs on current achievement, so that $γ0$ is the contemporaneous effect of increasing class size on student achievement.
With dynamic models, covariates cannot be treated as strictly exogenous, and I consider what consequences this has for estimation in this section. In the remainder of this section, I consider for simplicity models that can be written as
$yit=β0xit+ci+edit+uit,E(uit|Xt)=0∀(i,t)∈N×T,$
(27)

where $Xt={xis}i∈N,s≤t$ and $xit$ is a scalar covariate. All results extend to finite-dimensional and endogenous covariates with sequentially exogenous instrumental variables in a straightforward way.

Estimation of $β0$ in this model by two-way fixed-effects regression or by the pooled regression estimator discussed in the previous section would generally lead to inconsistent estimators of $β0$ with sparsely matched data because the transformed error term $u˙it$, where $u˙n=Mgnun$ and $u˙it$ is the element of $u˙n$ corresponding to observation $(i,t)$, will be correlated with the covariate $xit$, and this correlation is not asymptotically ignorable with sparsely matched data. This is a similar source of bias as for dynamic models with one-way fixed effects discussed in Nickell (1981).

With one-way fixed effects, the solution discussed in Anderson and Hsiao (1981), Arellano and Bond (1991), Chamberlain (1992), and subsequent papers is to rely on recursive transformations that guarantee that the transformed error term in each time period $t$ is orthogonal to the covariate $xit$, leading to estimation by method of moments.

Here I extend this approach to two-way fixed effects and propose estimating dynamic models with two- (or multi-) way fixed effects using a recursive transformation of the data, which would lead to a locally efficient estimator when combined with an auxiliary model of optimal instruments.21 This estimator is given by an instrumental variable regression estimator,
$β^n=∑i∈N,t≤T-1zˇity˘it∑i∈N,t≤T-1zˇitx˘it,$
(28)
where $y˘it$ is a transformed outcome variable, $x˘it$ is a transformed covariate, and $zˇit$ is an instrumental variable. These variables for time period $t$ are defined using the following steps:
1. Using data only from time period $t$ to $T$, calculate the residuals from a two-way fixed effects regression of $y$ and $x$ on the set of indicator variables for both indexing variables. Keep these residuals for time period $t$ only, and denote these two transformed variables by $y˘it$ and $x˘it$.

2. Using a parsimonious model, calculate an estimated version of $E(xis|Xt)$ for $s≥t$; denote this new variable $zis★$, $s≥t$.22 Calculate the residuals at time period $t$ from a two-way fixed-effects regression of $z★$ on the set of indicator variables for both indexing variables using only data from time period $t+1$ to $T$. Denote this transformed variable by $zˇit$.

The appendix discusses this estimator in more detail, and in particular provides an example for a parsimonious model of $E(xis|Xt)$, which is used in the Monte Carlo simulations and the empirical application below. Note that in practice, the model for $E(xis|Xt)$ does not need to be correctly specified, so the objective of modeling this “first stage” is to obtain exactly identifying instrumental variables, which are as predictive as possible while relying on few nuisance parameters in order to avoid incidental parameter issues known to arise in methods that rely on many moment conditions (see Alvarez & Arellano, 2003, or Windmeijer, 2005, for a discussion in the context of dynamic models with one-way fixed effects).

The appendix also shows that the estimator proposed here is locally efficient; it is asymptotically efficient in the class of estimators consistent under equation (27) when the error term $uit$ is homoskedastic and uncorrelated and when the exactly identifying instrumental variables used ($zˇit$) are constructed using the true conditional mean functions $E(xis|Xt)$, a result analogous to the local efficiency, obtained by the Gauss-Markov theorem, of the two-way fixed-effects regression for estimating models with strictly exogenous covariates.

Intuitively, this result of local efficiency follows from the particular transformations applied in steps 1 and 2 above to the covariates, outcome variable, and instrumental variables. The transformation in step 1 guarantees that the transformed moment functions $(y˘it-β0x˘it)$ are serially uncorrelated under the auxiliary assumption that the error term $uit$ is homoskedastic and uncorrelated, so that pooling over time yields an efficient estimator. The transformation in step 2 guarantees that the exactly identifying instrumental variables $zˇit$ are optimal instruments for the transformed moment functions $(y˘it-β0x˘it)$ under the auxiliary assumption, so that pooling in the cross-section yields an efficient estimator.

Note also that the estimator for dynamic models proposed here can be computed by combining results from a sequence of two-way fixed-effects regressions so that existing tools are sufficient for computation.

Local efficiency and ease of computation for this estimator make it a desirable choice for point estimation with dynamic models and multiple fixed effects. However, robust variance estimation for this estimator with sparsely matched data will suffer from the same issues discussed in section II for the two-way fixed-effects regression estimator. For the purpose of inference, I propose using the estimator based on data transformed at the subset level:
$β˜n=∑e∈En∑(i,t)∈S(e),t≤T-1zˇit(e)y˘it(e)∑e∈En∑(i,t)∈S(e),t≤T-1zˇit(e)x˘it(e)$
(29)
where the subsets of data $S(e)$, $e∈En$ are defined as in section II, and $x˘it(e)$, $y˘it(e)$, $zˇit(e)$ are defined as were $y˘it$, $x˘it$, $zˇit$ but using observations from subset $S(e)$ only to obtain these transformed variables. Robust inference for this new estimator can then be based on M-dependence robust estimated variance. Details and asymptotic results are provided in the appendix. As for the case with strictly exogenous covariates discussed in section II, one could use the estimator without subsetting ($β^n$) as a preferred-point estimator and use the estimator that relies on subsetting ($β˜n$) for inference only.

## IV. Monte Carlo Simulation Results

In this section I present two groups of results from a small Monte Carlo simulation study of estimation and inference for linear models of panel data with two-way unobserved heterogeneity and sparsely matched data.

The assignment ${dit}{i,t}∈N×T$ used for the simulations of this section was obtained from a subset of the administrative data on public school students in North Carolina, which are discussed in section V. The variable $dit$ is taken to be school assignment, of which there are 205 possible values in these simulations ($Nn=205$). Observations are indexed by a student identifier denoted by $i$, which takes 508 values, and a year indicator denoted by $t$, which takes 4 values ($n=508$ and $T=4$). The form of the resulting assignment is depicted in figure 1 by a representation of the graph where schools are nodes and edges are students (the graph $Gn$ defined in section IIB).

### A. Inference

I evaluate the finite sample properties of the inference methods discussed in section II for $β0$ in the model with a strictly exogenous covariate:
$yit=β0xit+ci+edit+uit,E(uit|X)=0$
(30)
The data-generating process used for $xit$, $ci$, $edit$is
$ed∼i.i.d.χ3,c2vi∼i.i.d.χ3,c2ci=1T∑t∈Tedit+vi$
(31)
$wit∼i.i.d.χ3,c2xit=1[ci+edit+wit-1+wit>0],$
(32)
where $χ3,c2$ is the distribution obtained by standardizing the chi-squared distribution with 3 degrees of freedom. Random variables are drawn independently wherever dependence is not shown explicitly. The data-generating process for $uit$ is obtained from $uit=(1+xit)(rit+qdit,oit+vit)$ where
$ri0=0rit=rit-1+wiwi∼i.i.d.χ3,c2,$
(33)
$pd∼i.i.d.χ3,c2qd,oit=qd,oit-1+pdvit∼i.i.d.χ3,c2$
(34)

where $oit$ is an ordering of all of the observations assigned to a value $d$ of $dit$, which was set at random.

The data-generating process used here leads to an error term $uit$, which exhibits heteroskedasticity and correlation across observations that correspond to either the same student or the same school.

In this section, I evaluate the finite sample performance of the method defined in section IIB, which relies on the pooled regression estimator, M-dependence robust standard errors, and a Wald test (M-robust). With the assignment used here and depicted in figure 1, the pooled regression estimator and M-dependence robust standard errors are based on 254 subsets with dependence neighborhoods that have an average cardinality of 6.15. I also implement two small sample corrections discussed in the appendix: locally debiased standard errors (debiased) and locally debiased standard errors combined with critical values from the $t$-distribution rather than the standard normal distribution ($t$-distribution).

I compare the performance of the methods above with Wald tests using the two-way fixed-effects regression estimator and classical standard errors defined at the beginning of section II, which are not robust to heteroskedasticity or serial or cross-sectional dependence (homoskedasticity), one-way clustered standard errors along the indexing variable $i$ (cluster 1) or $dit$ (cluster 2); two-way clustered standard errors along both indexing variables $i$ and $dit$ as in Cameron et al. (2011) (CGM); the standard errors of Cattaneo et al. (2017), which are not robust to the presence of correlation in the original error term $uit$ (CJN); and the extension of CJN to two-way cluster dependence in the original error term $uit$ defined in the appendix (CJN correlated).

I also show the performance of methods that rely on grouped data combined with a $t$-test as in Ibragimov and Müller (2010; hereafter IM) or a randomization test as in Canay, Romano, and Shaikh (2017, hereafter CRS). Seven or ten groups were defined for these methods by using the SCPS clustering algorithm (Nepusz, Sasidharan, & Paccanaro, 2010) applied to the graph $Gn$ defined in section IIB. This algorithm assigned each school to one of seven or ten groups using a $k$-means clustering algorithm based on the eigenvectors of the symmetric normalized Laplacian of $Gn$. The assignment of school nodes to each group is shown in figure 1 for the case where seven groups were used, from which it is evident that there is indeed a limited amount of overlap across these groups.23

Ninety-five percent confidence intervals were obtained by inverting the tests defined above. Table 1 displays, where relevant, the relative bias of each estimated variance and standard error compared to the variance and standard deviation of its corresponding estimator (either the pooled regression estimator or the two-way fixed effects regression estimator). It also displays coverage probability and the average and median of the width of 95% confidence intervals. Ten thousand replications were used to obtain these results.24 The appendix also contains a power plot for the methods based on subsetting proposed in this paper and the methods based on grouped data.

Table 1.
Results from Monte Carlo Simulations for Different Inference Methods for $β0=1$
$E(Var^(β^))Var(β^)$$E(se(β^))sd(β^)$$P(coverage)$$E(width)$$Q.5(width)$
M-robust 0.953 0.924 0.925 0.850 0.790
Debiased 1.023 0.957 0.934 0.880 0.817
Debiased and critical values from the $t$-distribution 1.023 0.957 0.943 0.918 0.853
Homoskedasticity 0.504 0.691 0.826 0.629 0.597
Cluster1 0.435 0.640 0.795 0.583 0.553
Cluster2 0.692 0.791 0.876 0.720 0.669
CGM 0.756 0.829 0.894 0.755 0.704
CJN 0.473 0.667 0.816 0.607 0.576
CJN (correlated) 0.774 0.844 0.902 0.768 0.719
IM7   1.000 10.474 3.077
IM10   1.000 101.296 3.564
CRS7   0.954 3.947 1.290
CRS10   0.936 24.136 1.135
$E(Var^(β^))Var(β^)$$E(se(β^))sd(β^)$$P(coverage)$$E(width)$$Q.5(width)$
M-robust 0.953 0.924 0.925 0.850 0.790
Debiased 1.023 0.957 0.934 0.880 0.817
Debiased and critical values from the $t$-distribution 1.023 0.957 0.943 0.918 0.853
Homoskedasticity 0.504 0.691 0.826 0.629 0.597
Cluster1 0.435 0.640 0.795 0.583 0.553
Cluster2 0.692 0.791 0.876 0.720 0.669
CGM 0.756 0.829 0.894 0.755 0.704
CJN 0.473 0.667 0.816 0.607 0.576
CJN (correlated) 0.774 0.844 0.902 0.768 0.719
IM7   1.000 10.474 3.077
IM10   1.000 101.296 3.564
CRS7   0.954 3.947 1.290
CRS10   0.936 24.136 1.135

$β0$ is estimated by the pooled regression estimator for the first three methods (bias = .002, standard deviation = .235). $β0$ is estimated by the two-way fixed-effects regression estimator for the following six methods (bias $=-$.001, standard deviation = .232). The first three methods rely on subsetting M-dependence robust standard errors, local debiasing, critical values from a $t$-distribution. The average number of degrees of freedom used for the $t$-distribution is $29.924$. The next four methods correspond to classical standard errors, one-way clustering along either indexing variable, two-way clustering. CJN is the method of Cattaneo et al. (2017), while CJN correlated is its extension to correlated errors defined in this paper. The last three columns are for 95% confidence intervals, which were obtained by inverting Wald or $t$-tests for the first nine methods. Confidence intervals for IM and CRS rely on grouped data and invert either a $t$-test or a randomization test, respectively. IM7 and CRS7 rely on seven groups, IM10 and CRS10 on ten groups.

We see that the methods based on subsetting proposed in this paper yield standard errors with a relatively low bias and tests with moderate size distortions, particularly when small sample adjustments are applied. Classical standard errors, clustered standard errors, or the standard errors of Cattaneo et al. (2017) exhibit a large bias. The extension of the approach of Cattaneo et al. (2017) to correlated error terms leads to an improvement in bias and size of the resulting test but does not perform as well as the methods based on subsetting. Inference for grouped data using a $t$-test (IM) is very conservative and exhibits extremely low power away from the null hypothesis. Inference for grouped data using a randomization tests (CRS) exhibits good size control since the groups used here have very little overlap. However, power, while significantly higher than IM, is still very low compared to the Wald or $t$-tests using the estimator and standard errors proposed here. For instance, the average width of 95% confidence intervals using CRS and seven groups is more than four times larger than using the methods proposed here.25

### B. Estimation of Dynamic Models

I also evaluate small sample properties of the proposed estimator for dynamic models with two-way fixed effects compared to existing methods. The coefficients of interest are $ρ0$ and $β0$ in the dynamic model given by, $∀t=1,…,T$:
$yit=ρ0yit-1+β0xit+ci+edit+uit,E(uit|Yt-1,X)=0,$
(35)

where $Yt-1=[yis]i=1,…,n,s=1,…,t-1$.

This is a dynamic model of panel data with one strictly exogenous covariate, $xit$, which will be taken to be binary and will be labeled as a “policy variable” so that policy evaluation here would amount to estimating $β0$ precisely. This labeling is used only to emphasize that the results discussed in section III matter not only for estimating the coefficients on sequentially exogenous covariates ($ρ0$ here) but also on strictly exogenous covariates ($β0$ here).

I present results for the dynamic two-way fixed effects estimator defined in section III based on the full sample (DTW f.s., defined by an extension of equation [28] to multiple covariates) and the estimator defined in section III based on subsetting (DTW sub., defined by an extension of equation [29] to multiple covariates). Both estimators are discussed in more detail in the appendix. The first alternative estimator considered is the two-way fixed-effects regression estimator, which treats $yit-1$ as a strictly exogenous covariate. I call this estimator the static two-way fixed-effects estimator (STW). The second alternative estimator is an estimator as in section III but treats $[1[dit=d]]d∈Dn$ as a finite-dimensional vector of strictly exogenous covariates, so that only $ci$ is removed by sequential transformations of the data. I call this estimator the naive two-way fixed-effects estimator (NTW). The third alternative is a dynamic estimator as in section III but ignores $edit$ and estimates a dynamic model with cross-sectional unobserved heterogeneity only. I call this estimator the one-way fixed-effects estimator 1 (OW1). The fourth alternative estimator is obtained by a regression of $yit$ on $yit-1$, $xit$, and the indicator variables ${1[dit=d]}d∈Dn$ (OW2). Finally, the fifth alternative estimator ignores all forms of unobserved heterogeneity and estimates $ρ0$ and $β0$ by a regression of $yit$ on $yit-1$ and $xit$ (OLS).

The data-generating process for this Monte Carlo analysis uses the values $ρ0=0.5$, $β0=1$. $ci$, $edit$, and $xit$ are generated as in the previous subsection. The rest of the data-generating process is given by
$ui0∼i.i.d.χ3,c2yi0=ci1-ρ0+11-ρ02ui0uit∼i.i.d.χ3,c2.$
(36)

In table 2, I report summary statistics from the simulated distributions of the estimators of $ρ0$ and $β0$: expected bias, mean absolute error, and the spread between the 5th and 95th percentiles of the distribution of the estimators. Ten thousand replications were used to obtain the simulated distributions of the estimators.

Table 2.
Results from Monte Carlo Simulations for Different Estimators of $ρ0=.5$ and $β0=1$
DTW (f.s.)DTW (sub.)STWNTWOW1OW2OLS
$E(ρ^-ρ0)$ −0.006 −0.006 −0.216 −0.116 −0.094 0.273 0.390
$E(|ρ^-ρ0|)$ 0.023 0.024 0.216 0.116 0.116 0.273 0.390
$Q0.95(ρ^)-Q0.05(ρ^)$ 0.095 0.096 0.141 0.161 0.348 0.049 0.100
$E(β^-β0)$ −0.002 −0.002 −0.069 −0.030 1.016 0.292 1.209
$E(|β^-β0|)$ 0.077 0.079 0.097 0.095 1.016 0.293 1.209
$Q0.95(β^)-Q0.05(β^)$ 0.316 0.328 0.318 0.377 0.766 0.301 0.793
DTW (f.s.)DTW (sub.)STWNTWOW1OW2OLS
$E(ρ^-ρ0)$ −0.006 −0.006 −0.216 −0.116 −0.094 0.273 0.390
$E(|ρ^-ρ0|)$ 0.023 0.024 0.216 0.116 0.116 0.273 0.390
$Q0.95(ρ^)-Q0.05(ρ^)$ 0.095 0.096 0.141 0.161 0.348 0.049 0.100
$E(β^-β0)$ −0.002 −0.002 −0.069 −0.030 1.016 0.292 1.209
$E(|β^-β0|)$ 0.077 0.079 0.097 0.095 1.016 0.293 1.209
$Q0.95(β^)-Q0.05(β^)$ 0.316 0.328 0.318 0.377 0.766 0.301 0.793

DTW (f.s.) and DTW (sub.) are the estimators proposed in section III relying on the full sample without subsetting or on subsetting, respectively. STW treats covariates as strictly exogenous, NTW treats one of the fixed effects as finite-dimensional. OW1 and OW2 each ignore one of the fixed effects. OLS ignores both fixed effects.

Looking at bias or the mean absolute error of the estimators, we see that it is not only accounting for both sources of heterogeneity that is important for estimating both $ρ0$ and $β0$ accurately (DTW outperforms OW1, OW2, and OLS), but also accounting for the fact that one of the covariates is sequentially exogenous (DTW outperforms STW), and using an estimator that treats neither $ci$ nor $ed$ as consistently estimated (DTW outperforms NTW). It is important to note that the bias in terms of policy evaluation, which here is captured by the bias in estimating $β0$, can be large when using estimators that do not account for both sources of unobserved heterogeneity. In other words, when $β0$ is the only object of interest, the argument stating that $yit-1$ is a proxy for either term of unobserved heterogeneity does not hold with the specific data-generating process used in this section. This is consistent with the results in section V, where estimators that do not account for several sources of unobserved heterogeneity lead to results that are significantly different from the results obtained with the estimator developed in this paper. The appendix contains additional results on estimation bias with static models.

## V. Effect of Class Size Reductions on Student Achievement

In this section I estimate the effect of class size reductions on student achievement among elementary school students in North Carolina using a dynamic model of learning that accounts for selection based on student unobserved ability and teacher and school unobserved quality.

Class size reduction has been a policy of interest in education for several decades, and the appendix contains a small discussion of previous empirical studies on class size reduction. Here I use student-level longitudinal data made available by the North Carolina Education Research Center to estimate a dynamic model of student learning and the effect of class size reductions on student achievement (see Todd & Wolpin, 2003, and Andrabi et al., 2011). I use the model:
$yit=αgradeit,t+ρ0yit-1+xitβ0+ci+edit+fschoolit+uit$
(37)
$E(uit|Yt-1,X)=0,$
(38)

where $yit$ is student $i$'s achievement in year $t$ as measured by a standardized test score, $gradeit$ is a student's grade in year $t$, $αgradeit,t$ captures year/grade heterogeneity in student achievement, $yit-1$ is past achievement, $xit$ is a fourth-order polynomial in class size used to allow for nonlinearities in the effect of class size on achievement, $ci$ is student unobserved learning ability, $edit$ is teacher unobserved teaching effectiveness, $fschoolit$ are unobserved school-level factors, $uit$ are transitory shocks, $Yt-1$ collects all prior test scores ($Yt-1={yis}i∈N,s≤t-1$), and $X$ collects all class size assignments ($X={xit}(i,t)∈N×T$). The appendix contains a longer discussion of this model.26

The estimation sample is composed of observations on students in fourth and fifth grade between the years 2009 and 2012.27 In order to avoid using the polynomial in class size to extrapolate outside the effective range of class size observed in the sample, I restrict the estimation sample to observations with class size between ten and thirty students.28 Table 3 shows several statistics from the sample used for estimation, such as unconditional average and standard deviation of test scores, and average test scores for various values of class size. Test scores are left in their original units here. These units can be contextualized by considering percentiles of the distribution of test scores. An increase in test score by 1 would switch a 50th percentile student to the 54th percentile in mathematics and a 48th percentile student to the 53rd percentile in reading. On average, an increase in test score by 1 is associated with moving up in the grade distribution by around 3 percentiles. The appendix contains a longer discussion of the data.

Table 3.
Description of the Estimation Sample
Students 206,722
Teachers 11,960 12,332
Years 2009 to 2012
Test scores
Average 355.04 349.16
SD 9.38 9.52
Class size
Average 21.99 21.44
SD 4.06 4.53
Average test score by class size
10–14 students 353.60 347.87
15–19 students 353.29 347.42
20–24 students 354.82 349.03
24–30 students 356.81 351.05
Students 206,722
Teachers 11,960 12,332
Years 2009 to 2012
Test scores
Average 355.04 349.16
SD 9.38 9.52
Class size
Average 21.99 21.44
SD 4.06 4.53
Average test score by class size
10–14 students 353.60 347.87
15–19 students 353.29 347.42
20–24 students 354.82 349.03
24–30 students 356.81 351.05

The model given by equations (37) and (38) is estimated as described in section III using parsimonious exactly identifying instrumental variables, which are detailed in the appendix, and the transformation defined in section III. As discussed in sections II and III, I report results using the estimator without subsetting, defined by an extension of equation (28) to multiple covariates, as the preferred-point estimate. As in section II, inference is based on the pooled regression estimator, which relies on subsetting, defined by an extension of equation (29) to multiple covariates. Details for all estimators discussed in this section are provided in the appendix. Each subset corresponds to a pair of teachers who have taught at least two students in common, and the subset is composed of all observations corresponding to students who have been taught by both teachers.29 Standard errors used here are designed to be robust to arbitrary dependence at the school and student levels, which leads to a minor modification compared to the standard errors defined in section II. The standard errors used here rely on dependence neighborhoods that collect all subsets containing observations indexed by the same school or student, while the standard errors defined in Section II would use dependence neighborhoods that collect all subsets containing observations indexed by the same teacher or student. This procedure leads to around 25,000 subsets with dependence neighborhoods that have an average cardinality around 37, so that the asymptotic approximation treating the number of subsets as large and the dependence neighborhoods as relatively small seems appropriate here.

In order to evaluate the difference between the results obtained with the methods introduced in this paper and existing methods, I also present results obtained with several alternative estimators. I show results obtained with an estimator for dynamic models with one-way unobserved heterogeneity indexed by student identity only, as used, for instance, in Andrabi et al. (2011), Rothstein (2010), and Verdier (2016). I also show results obtained with an estimator for models without student-level unobserved heterogeneity and which treats teacher- and school-level unobserved heterogeneity as finite-dimensional parameters that can be estimated consistently. This type of estimator is used, for instance, in Kane and Staiger (2008), Chetty, Friedman, and Rockoff (2014), and Guarino, Reckase, and Wooldridge (2015). A heuristic argument frequently found in the empirical literature for using estimators that correspond to dynamic models without student-level unobserved heterogeneity is that past test scores serve as proxies for student unobserved ability. Finally, I show results obtained with an estimator for models without unobserved heterogeneity, which is simply an ordinary least squares regression of test score on past test score, grade-year indicators, and the polynomial in class size. In the appendix, I also consider models in gains ($ρ0$ is assumed to equal 1), models without dynamics ($ρ0$ is assumed to equal zero), and models with measurement error where test scores are noisy measures of student achievement.

The results are presented in table 4. In mathematics, on a range of class size from 15 to 30 students, reductions in class size are estimated to have a positive effect on student achievement, and the evolution of the magnitude of this effect over the range of class size corresponds to the hypothesis of decreasing marginal returns to teacher time per pupil. The magnitude of this effect is relatively modest; for example, reducing class size from 30 to 15 students is estimated to lead to an average increase in test score of around 0.8 (i.e., around 10% of a standard deviation) or a move up in the distribution of test scores of around 3 percentiles on average. For comparison, Krueger (1999) finds that a smaller class size reduction of 24 to 16 students on average is associated with a gain in test scores of around 3 percentiles on average for third graders.30 In reading, reductions in class size are not estimated to have a statistically significant effect on student achievement. In mathematics, on a range of class size from 10 to 15 students, reductions in class size are estimated to lead to a decrease in student achievement. This reversal in the sign of the effect of class size reduction could be due to factors that are not controlled for here but are related to class size. In particular, ongoing work is interested in how these results could change when accounting for peer effects. It appears from these results that there is positive selection not only of students into larger class sizes, but also of teachers, since estimators that do not account for student and teacher unobserved heterogeneity simultaneously lead to smaller estimates of the effect of class size reduction on student achievement than the estimator proposed in this study.

Table 4.
Estimates of Persistence and of the Effect of Class Size Reductions on Student Achievement
Three-Way Unobserved HeterogeneityOnly Student Unobserved HeterogeneityNo Student Unobserved HeterogeneityNo Unobserved Heterogeneity
Mathematics
Persistence 0.089 [0.070, 0.102] 0.109 (0.007) 0.787 (0.001) 0.816 (0.001)
Class size reduction from 30 to:
25 0.437 [0.071, 0.730] 0.061 (0.100) 0.291 (0.128) −0.253 (0.140)
20 0.724 [0.405, 1.020] 0.398 (0.092) 0.434 (0.120) −0.309 (0.129)
15 0.875 [0.519, 1.225] 0.417 (0.107) 0.498 (0.138) −0.214 (0.147)
10 0.610 [0.209, 1.029] 0.070 (0.126) 0.266 (0.164) −0.643 (0.163)
Persistence 0.230 [0.213, 0.235] 0.256 (0.004) 0.690 (0.001) 0.719 (0.001)
Class size reduction from 30 to:
25 −0.018 [−0.438, 0.240] −0.099 (0.090) −0.089 (0.107) −0.438 (0.100)
20 0.169 [−0.283, 0.351] 0.102 (0.082) −0.032 (0.100) −0.660 (0.092)
15 0.261 [−0.192, 0.511] 0.180 (0.094) −0.045 (0.114) −0.637 (0.103)
10 0.245 [−0.246, 0.535] 0.125 (0.105) 0.053 (0.122) −0.562 (0.109)
Three-Way Unobserved HeterogeneityOnly Student Unobserved HeterogeneityNo Student Unobserved HeterogeneityNo Unobserved Heterogeneity
Mathematics
Persistence 0.089 [0.070, 0.102] 0.109 (0.007) 0.787 (0.001) 0.816 (0.001)
Class size reduction from 30 to:
25 0.437 [0.071, 0.730] 0.061 (0.100) 0.291 (0.128) −0.253 (0.140)
20 0.724 [0.405, 1.020] 0.398 (0.092) 0.434 (0.120) −0.309 (0.129)
15 0.875 [0.519, 1.225] 0.417 (0.107) 0.498 (0.138) −0.214 (0.147)
10 0.610 [0.209, 1.029] 0.070 (0.126) 0.266 (0.164) −0.643 (0.163)
Persistence 0.230 [0.213, 0.235] 0.256 (0.004) 0.690 (0.001) 0.719 (0.001)
Class size reduction from 30 to:
25 −0.018 [−0.438, 0.240] −0.099 (0.090) −0.089 (0.107) −0.438 (0.100)
20 0.169 [−0.283, 0.351] 0.102 (0.082) −0.032 (0.100) −0.660 (0.092)
15 0.261 [−0.192, 0.511] 0.180 (0.094) −0.045 (0.114) −0.637 (0.103)
10 0.245 [−0.246, 0.535] 0.125 (0.105) 0.053 (0.122) −0.562 (0.109)

Intervals in brackets are 95% confidence intervals; numbers in parentheses are standard errors. Confidence intervals for the first estimator rely on subsetting and pooled regression estimation and are robust to arbitrary dependence indexed by school and student. Standard errors for the rest of the estimators are two-way clustered standard errors by student and teacher. Note that it is likely that there is no asymptotic justification for using the third estimator (teacher and school unobserved heterogeneity treated as parameters that can be estimated consistently) jointly with two-way clustered standard errors. Units for the effect of class size reductions are the original test score units as described in the text.

## VI. Conclusion

This paper provides asymptotically valid methods for inference with linear models of panel data that include multiway unobserved heterogeneity even when few observations are assigned to each value of the variables that index unobserved heterogeneity and when dependence is present. It also defines a new estimator for dynamic models (models with sequentially exogenous instruments), so that practitioners do not have to make a choice between including one-way unobserved heterogeneity only, suppressing the dynamics of the model, or treating all but one term of unobserved heterogeneity as finite-dimensional parameters that can be estimated consistently.

## Notes

1

Models with two-way unobserved heterogeneity are not restricted to education or worker-firm matched data. For instance, Chetty and Hendren (2018) and Finkelstein, Gentzkow, and Williams (2016) consider households and patients assigned to geographical locations and models for income and health measures with two-way unobserved heterogeneity.

2

See Abowd et al. (2004) and Andrews et al. (2008, 2012) for a discussion of biases arising in variance decompositions with sparsely matched data.

3

A case where the framework used here clearly does not correspond to the structure of the data is when there is an observation for each combination of values of both indexing variables (exhaustively matched data). This is the case, for instance, when a balanced longitudinal data set is used with a model that includes individual and time heterogeneity. In such cases, estimators for linear models can be defined after a simple three-way demeaning of the original data. With exhaustively matched data, complications arise only when using nonlinear models, as studied, for instance, in Charbonneau (2014), Graham (2017), and Fernández-Val and Weidner (2016).

4

Bonhomme et al. (2019) estimate features of the distribution of unobserved heterogeneity while relaxing the requirement on sparsity that would be imposed when using classical theory by relying on a consistent classification of one out of two fixed effects in a first stage. While treating one of the fixed effects as a coefficient on an approximately observed finite-dimensional covariate allows for greater flexibility in the rest of the model, the restrictions imposed on the density of matching and the nature of the matching process itself are tighter than what is needed here.

5

In particular, several papers study how to compute such regressions—for example, Abowd, Creecy, and Kramarz (2002), Guimaraes and Portugal (2010), McCaffrey et al. (2012), Somaini and Wolak (2015), and Correia (2016).

6

I consider here only the case where the covariate of interest is a scalar since all results are generalized to finite-dimensional covariates in a straightforward way.

7

I consider the case where the first term of unobserved heterogeneity is specific to a cross-sectional observation as it is most frequently of interest in empirical work. Extending the results presented here to the case where the observed variable indexing the first term of unobserved heterogeneity is other than $i$ is straightforward. All results for inference with the proposed estimator based on subsetting would also extend to having more than two terms of unobserved heterogeneity. This is not considered in this section for simplicity.

8

Strict exogeneity of $xit$ will be relaxed in the next section, while strict exogeneity of $dit$ is essential to the results obtained here. Relaxing this latter assumption within the asymptotic framework of this paper is the subject of ongoing work but falls outside the scope of this paper.

9

Treating the number of teachers $Nn$ as finitely bounded (as in Abowd, Kramarz, & Margolis, 1999) would amount to analyzing a linear model with one-way unobserved heterogeneity and finite-dimensional covariates, although one can also note that the type of dependence in transitory shocks considered in most of this paper would imply that only finite sample inference could be considered. It is likely that all results that follow would generalize to $Nn,d$ or $T$ growing unboundedly but at a rate slower than $n$. This is not considered here for simplicity.

10

Note that the same issue is well known to arise in the context of models with one-way fixed effects and short panels. With one-way fixed-effects models, cluster robust standard errors are consistent (see, e.g., Wooldridge, 2010), while I discuss below that with two- (or multi-) way fixed effects, the dependence structure of the transformed error term $Mgnun$ will generally not have a cluster-dependence structure, so that cluster robust standard errors cannot be used.

11

Note that this definition of $κn$ is equivalent to $∑j∈N×Tκijmjk2=1[i=k]$$∀i,k∈N×T$.

12

In a recent contribution, Kline, Saggio, and Sølvsten (2018) discuss a leave-out estimator for variance components with two-way fixed-effects models. One condition they impose for consistency is that $Mgn[{i,t},{i,t}]$ be uniformly bounded away from 0. Proposition 1 shows that this is obtained if $T≥2$ and every pair of teachers has taught either no student in common or at least two pairs of same-student observations in common, both of which may originate from the same student. Formally, $Mgn[{i,t},{i,t}]≥c$$∀i,t∈N×T$, $∀n$, for a constant $c>0$ if for all $i∈N$, $t,s∈T$ s.t. $dit≠dis$, we have either $Nn,dis(i),Nn,dit(i)≥2$ or $∃j∈N$, $j≠i$ s.t. $Nn,dit(j)Nn,dis(j)≥1$.

13

Cameron and Miller (2015) discuss the existence of correlated shocks even after controlling for fixed effects in the context of one-way fixed effects, and their discussion extends to the case of two-way fixed-effects models.

14

In addition, these methods are computationally intensive, as they require solving linear systems in multiples of $nT$ unknowns.

15

The appendix also shows that in the case of one-way fixed-effects regression, Cattaneo et al.'s (2017) method extended to serially correlated errors coincides exactly with cluster robust standard errors.

16

The appendix shows that we can think of $Gn$ as an electrical network and consider current flowing from $dit$ to $dis$ through $Gn$. The paths used by electricity to flow through $Gn$ are the paths for which the associated sums of differences will receive positive weight in the definition of $u˙it$.

17

Note that this estimator, as well as the estimated variance defined below, is simple to compute even with large data sets because it lends itself well to parallelization.

18

The term M-dependence was originally used to refer to time-series data where draws more than $m$ time periods apart are uncorrelated or independent for some constant $m$. With spatial data, M-dependence refers to draws apart by a distance greater than $m$ being uncorrelated or independent. In the context of network dependence considered here, we have uncorrelation between $u˙n(e)$ and $u˙n(e')$ if the edges $e$ and $e'$ are apart by a distance greater than 1 in $Gn$.

19

In the empirical application of this paper where $x$ is class size, the condition requires that fourth-grade teachers are observed in two different years, say years 1 and 2, and that for a nonvanishing fraction of fourth-grade teachers, one student in year 1 and one student in year 2 were assigned to the same fifth-grade teacher, but that the change in class size experienced by these two students was different. In practice, $x˙n(e)$ is observed, so one can verify whether the subsetting of the data defined above leads to enough variation in the transformed covariate.

20

See Todd and Wolpin (2003) for a discussion of the conditions under which this approach is valid.

21

See Arellano (2016) and Verdier (2016) for a discussion of parsimonious optimal instruments in the context of dynamic models with one-way fixed effects.

22

Note that in this simplified example, where the covariate is sequentially exogenous, $E(xis|Xt)$ is simply $xit$ for $s=t$.

23

An alternative inference method could be bootstrap resampling. Even with uncorrelated errors, Bickel and Freedman (1982) have shown that bootstrap approximations are invalid for linear regression models with many covariates. This is easy to understand in the context of two-way fixed-effects regression since selecting observations at random would not preserve a match structure close to that found in the original data set. For instance, a significant fraction of students in bootstrap samples would be associated with only one time period, even if the original data set is a balanced panel. One could attempt to solve this issue by relying on residual bootstrap resampling, but as discussed, the OLS residuals and estimated fixed effects are not close approximations to the error term and unobserved heterogeneity terms of the original model with sparsely matched data.

24

The bisection algorithm for inverting the CRS test did not converge within four hours per replication for 138 replications, so results are reported conditional on this algorithm having converged—9,862 replications.

25

As noted above, the inference method based on subsetting and pooled regression proposed in this paper would remain asymptotically valid if $T$ and $Nn,d$ grow as sample size $n$ increases but at slow enough rates. In other words, the method proposed here remains valid for data with a moderately dense match structure. With weak serial and cross-sectional dependence and densely matched data, heteroskedasticity-autocorrelation consistent standard errors that treat the fixed effects as parameters that can be estimated precisely would also lead to asymptotically valid inference. Note that this second approach imposes stronger restrictions on dependence than assumption 2 does. It is unknown whether the two-way clustered standard errors of Cameron et al. (2011) can be shown to be asymptotically valid for two-way fixed-effects regression with densely matched data. A study of the performance of these methods, or of alternative methods, with varying degrees of match density is left for future work.

26

An alternative model to the one used here would rely on a teacher-by-school fixed effect, with equation (37) replaced by $yit=αgradeit,t+ρ0yit-1+xitβ0+ci+fdit,schoolit+uit$. Here a model with three-way fixed effects is used to exemplify that the methods proposed above are extended in a straightforward way to models with more than two fixed effects.

27

I selected this period of observation in order to have a unique version of the standardized tests in mathematics and reading. The state of North Carolina switches to a new version of its tests periodically, and it is unclear how test scores can be compared across versions of the test. The state's standardized tests rely on item response theory for scores to be comparable across both students and over time, but this comparison cannot be performed across versions of tests. While attempts have been made to compare scores across versions of the test by standardizing at the year-grade level (see Rothstein, 2010), there is no theoretical justification for such a procedure. For simplicity, the results shown here rely on data corresponding to only one version of the test in each subject.

28

Results obtained from restricting the sample to classrooms of between five and thirty students, which nearly encompasses the entire sample, are not reported here but lead to similar results.

29

Here, since $T=2$, a subset composed of only one student would not lead to any identifying variation, so that one need only consider pairs of teachers who have taught two or more students in common. Additionally, recall from note 19 that identification here requires that many fourth-grade teachers be observed teaching two or more classrooms over time, with students from both classrooms being assigned to the same fifth-grade teacher but experiencing a different change over time in class size.

30

See table 1, panel D in Krueger (1999). The STAR experiment does not provide results on the contemporaneous effect of class size reductions for fourth and fifth graders.

## REFERENCES

Abowd
,
J. M.
,
R. H.
Creecy
, and
F.
Kramarz
, “
Computing Person and Firm Effects Using Linked Longitudinal Employer-Employee Data
,”
U.S. Census Bureau technical paper
TP-2002-06
(
2002
).
Abowd
,
J. M.
,
F.
Kramarz
,
P.
Lengermann
, and
S.
Pérez-Duarte
, “
Are Good Workers Employed by Good Firms? A Test of a Simple Assortative Matching Model for France and the United States
,”
working paper
(
2004
).
Abowd
,
J. M.
,
F.
Kramarz
, and
D. N.
Margolis
, “
High Wage Workers and High Wage Firms,
Econometrica
67
(
1999
),
251
333
.
Alvarez
,
J.
, and
M.
Arellano
, “
The Time Series and Cross-Section Asymptotics of Dynamic Panel Data Estimators,
Econometrica
71
(
2003
),
1121
1159
.
Anderson
,
T. W.
, and
C.
Hsiao
, “
Estimation of Dynamic Models with Error Components,
Journal of the American Statistical Association
76
(
1981
),
598
606
.
Andrabi
,
T.
,
J.
Das
,
A. Ijaz
Khwaja
, and
T.
Zajonc
, “
American Economic Journal: Applied Economics
3
(
2011
),
29
54
.
Andrews
,
M. J.
,
L.
Gill
,
T.
Schank
, and
R.
Upward
, “
High Wage Workers and Low Wage Firms: Negative Assortative Matching or Limited Mobility Bias?
Journal of the Royal Statistical Society: Series A
171
(
2008
),
673
697
.
Andrews
,
M. J.
,
L.
Gill
,
T.
Schank
, and
R.
Upward
High Wage Workers Match with High Wage Firms: Clear Evidence of the Effects of Limited Mobility Bias,
Economics Letters
117
(
2012
),
824
827
.
Arellano
,
M.
, “
Modelling Optimal Instrumental Variables for Dynamic Panel Data Models,
Research in Economics
70
(
2016
),
238
261
.
Arellano
,
M.
, and
S.
Bond
, “
Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations,
Review of Economic Studies
58
(
1991
),
277
297
.
Bell
,
R. M.
, and
D. F.
McCaffrey
, “
Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples,
Survey Methodology
28
(
2002
),
169
179
.
Bickel
,
P. J.
, and
D. A.
Freedman
, “
Bootstrapping Regression Models with Many Parameters
,” Department of Statistics, University of California Berkeley
technical report 7
(
1982
).
Bonhomme
,
S.
,
T.
, and
E.
Manresa
, “
A Distributional Framework for Matched Employer Employee Data,
Econometrica
87
(
2019
),
699
739
.
Cameron
,
A. C.
,
J. B.
Gelbach
, and
D. L.
Miller
, “
Robust Inference with Multiway Clustering,
Journal of Business and Economic Statistics
29
(
2011
),
238
249
.
Cameron
,
A. C.
, and
D. L.
Miller
, “
A Practitioner's Guide to Cluster-Robust Inference,
Journal of Human Resources
50
(
2015
),
317
372
.
Canay
,
I. A.
,
J. P.
Romano
, and
A. M.
Shaikh
, “
Randomization Tests under an Approximate Symmetry Assumption,
Econometrica
85
(
2017
),
1013
1030
.
Card
,
D.
,
J.
Heining
, and
P.
Kline
, “
Workplace Heterogeneity and the Rise of West German Wage Inequality,
Quarterly Journal of Economics
128
(
2013
),
967
1015
.
Cattaneo
,
M. D.
,
M.
Jansson
, and
W. K.
Newey
, “
Inference in Linear Regression Models with Many Covariates and Heteroskedasticity,
Journal of the American Statistical Association
113
(
2017
),
1107
1162
.
Chamberlain
,
G.
, “
Comment: Sequential Moment Restrictions in Panel Data,
Journal of Business and Economic Statistics
10
(
1992
),
20
26
.
Charbonneau
,
K. B.
, “
Multiple Fixed Effects in Binary Response Panel Data Models
,”
Bank of Canada working paper 2014-17
(
2014
).
Chetty
,
R.
,
J. N.
Friedman
, and
J. E.
Rockoff
, “
Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates,
American Economic Review
104
(
2014
),
2593
2632
.
Chetty
,
R.
, and
N.
Hendren
, “
The Impacts of Neighborhoods on Intergenerational Mobility I: Childhood Exposure Effects,
Quarterly Journal of Economics
133
(
2018
),
1107
1162
.
Clotfelter
,
C. T.
,
H. F.
, and
J. L.
Vigdor
, “
Teacher Credentials and Student Achievement in High School: A Cross-Subject Analysis with Student Fixed Effects,
Journal of Human Resources
45
(
2010
),
655
681
.
Correia
,
S.
, “
Linear Models with High-Dimensional Fixed Effects: An Efficient and Feasible Estimator
,”
working paper
(
2010
).
Fernández-Val
,
I.
, and
M.
Weidner
, “
Individual and Time Effects in Nonlinear Panel Models with Large N, T.
Journal of Econometrics
192
(
2016
),
291
312
.
Finkelstein
,
A.
,
M.
Gentzkow
, and
H.
Williams
, “
Sources of Geographic Variation in Health Care: Evidence from Patient Migration,
Quarterly Journal of Economics
131
(
2016
),
1681
1726
.
Graham
,
B. S.
, “
An Econometric Model of Link Formation with Degree Heterogeneity,
Econometrica
85
(
2017
),
1033
1063
.
Guarino
,
C.
,
M.
Reckase
, and
J.
Wooldridge
, “
Can Value-Added Measures of Teacher Performance Be Trusted,
Education Finance and Policy
10
(
2015
),
117
156
.
Guimaraes
,
P.
, and
P.
Portugal
, “
A Simple Feasible Procedure to Fit Models with High-Dimensional Fixed Effects,
Stata Journal
10
(
2010
),
628
.
Hanushek
,
E. A.
,
J. F.
Kain
,
J. M.
Markman
, and
S. G.
Rivkin
, “
Does Peer Ability Affect Student Achievement?
Journal of Applied Econometrics
18
(
2003
),
527
544
.
Ibragimov
,
R.
, and
U. K.
Müller
, “
t-Statistic Based Correlation and Heterogeneity Robust Inference,
Journal of Business and Economic Statistics
28
(
2010
),
453
468
.
Imbens
,
G. W.
, and
M.
Kolesár
, “
Robust Standard Errors in Small Samples: Some Practical Advice
,” this review 98 (
2015
),
701
712
.
Kane
,
T. J.
, and
D. O.
Staiger
, “
Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation
,”
NBER working paper
14607
(
2008
).
Kingdon
,
G.
, and
F.
Teal
, “
Teacher Unions, Teacher Pay and Student Performance in India: A Pupil Fixed Effects Approach,
Journal of Development Economics
91
(
2010
),
278
288
.
Kline
,
P.
,
R.
Saggio
, and
M.
Sølvsten
, “
Leave-Out Estimation of Variance Components
,” (
2018
),
arXiv:1806.01494
.
Krueger
,
A. B.
, “
Experimental Estimates of Education Production Functions,
Quarterly Journal of Economics
114
(
1999
),
497
532
.
MacKinnon
,
J. G.
, and
H.
White
, “
Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties,
Journal of Econometrics
29
(
1985
),
305
325
.
McCaffrey
,
D. F.
,
J. R.
Lockwood
,
K.
Mihaly
, and
T. R.
Sass
, “
A Review of Stata Commands for Fixed-Effects Estimation in Normal Linear Models,
Stata Journal
12
(
2012
),
406
.
Nepusz
,
T.
,
R.
Sasidharan
, and
A.
Paccanaro
, “
SCPS: A Fast Implementation of a Spectral Method for Detecting Protein Families on a Genome-Wide Scale,
BMC Bioinformatics
11
(
2010
),
120
.
Nickell
,
S.
, “
Biases in Dynamic Models with Fixed Effects,
Econometrica
49
(
1981
),
1417
1426
.
Postel–Vinay
,
F.
, and
J.
Robin
, “
Equilibrium Wage Dispersion with Worker and Employer Heterogeneity,
Econometrica
70
(
2002
),
2295
2350
.
Rivkin
,
S. G.
,
E. A.
Hanushek
, and
J. F.
Kain
, “
Econometrica
73
(
2005
),
417
458
.
Rothstein
,
J.
, “
Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement,
Quarterly Journal of Economics
125
(
2010
),
175
214
.
Somaini
,
P.
, and
F. A.
Wolak
, “
An Algorithm to Estimate the Two-Way Fixed Effects Model,
Journal of Econometric Methods
5
(
2015
),
143
152
.
Sund
,
K.
, “
Estimating Peer Effects in Swedish High School Using School, Teacher, and Student Fixed Effects,
Economics of Education Review
28
(
2009
),
329
336
.
Todd
,
P. E.
, and
K. I.
Wolpin
, “
On the Specification and Estimation of the Production Function for Cognitive Achievement,
Economic Journal
113
(
2003
),
F3
F33
.
Verdier
,
V.
, “
Estimation of Dynamic Panel Data Models with Cross-Sectional Dependence: Using Cluster Dependence for Efficiency,
Journal of Applied Econometrics
31
(
2016
),
85
105
.
Windmeijer
,
F.
, “
A Finite Sample Correction for the Variance of Linear Efficient Two-Step GMM Estimators,
Journal of Econometrics
126
(
2005
),
25
51
.
Wooldridge
,
J. M.
,
Econometric Analysis of Cross Section and Panel Data
, 2nd ed. (
Cambridge, MA
:
MIT Press
,
2010
).

## Author notes

I thank the editor, Bryan Graham, as well as three anonymous referees for very helpful comments and suggestions. I also thank Stéphane Bonhomme, Colin Cameron, Matias Cattaneo, Jane Cooley Fruehwirth, Tiago Pires, and Jeffrey Wooldridge for useful discussions and suggestions; Paul Hanselman for providing code used to process part of the data; and Chunxiao Li, Kristina Vaughan, and Josh Horvath for their research assistance.

A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/rest_a_00807.