## Abstract

Recent evidence on teacher productivity suggests that teachers meaningfully influence non-test academic student outcomes that are commonly overlooked by narrowly focusing on test scores. Despite a large number of studies investigating the Teach For America (TFA) effect on math and English achievement, little is known about non-tested academic outcomes. Using administrative data from Miami-Dade County Public Schools, we investigate the relationship between being in a TFA classroom and five non-test student outcomes commonly found in administrative datasets: days absent, days suspended, GPA, classes failed, and grade repetition. We validate our use of non-test student academic outcomes to assess differences in teacher productivity using the quasi-experimental teacher switching methods of Chetty, Friedman, and Rockoff (2014) and fail to reject the null hypothesis of unbiasedness in most cases in elementary and middle school, although in some cases standard errors are large. We find suggestive evidence that students taught by TFA teachers in elementary and middle schools were less likely to miss school due to unexcused absences and suspensions compared with students taught by non-TFA teachers in the same school, although point estimates are very small. Other outcomes were found to be forecast-unbiased but showed no evidence of a TFA effect.

## 1. Introduction

Teach For America (TFA) is an alternative certification program that selects, trains, and places recent college graduates or other young professionals into high-need schools with a two-year commitment to teach.^{1} Much of the prior empirical research on TFA has focused on the impacts of TFA corps members on students’ standardized test scores. These studies have generally shown a positive TFA effect in math (and science, where available) relative to comparison teachers in the same schools teaching similar students, but they typically detect no significant effect in reading (e.g., Glazerman, Mayer, and Decker 2006; Xu, Hannaway, and Taylor 2011; Clark et al. 2013; Backes et al. 2016).

Focusing on test score gains alone, however, is a narrow view of TFA's effects on students and the schools where they teach. TFA has a service-oriented mission that describes corps members as leaders and mentors to disadvantaged youth who help students set high expectations and change their learning and life trajectories in meaningful ways to overcome the challenges of generational poverty. Given the careful selection of TFA corps members based on dimensions of leadership and character (e.g., Dobbie 2011), one might hypothesize that TFA corps members could have broader impacts on their students beyond simply test scores by influencing other student behaviors in a meaningful way. Recent evidence on teacher productivity suggests teachers meaningfully influence non-test academic outcomes that are commonly overlooked by narrowly focusing on student test scores (e.g., Jackson 2014).

This paper uses longitudinal data from recent years in the Miami-Dade County Public Schools (M-DCPS), spanning 2008–09 through 2013–14, to examine the impact of TFA corps members on five non-test student academic outcomes commonly found in administrative datasets: (1) days absent, (2) days suspended, (3) grade point average (GPA), (4) classes failed, and (5) grade repetition. We do this by constructing estimates of TFA effectiveness in these non-tested academic outcomes (we refer to these value-added estimates of teacher effectiveness in non-tested academic outcomes as N-VAMs). During the time span of the longitudinal data, TFA experimented with a new placement strategy where exceptionally low-performing schools in the district were specifically targeted for intensive TFA placements. At the same time, the size of the TFA corps in the district more than tripled. Together, these two changes produced large clusters of TFA corps members in the targeted schools, where on average TFA corps members made up more than 10 percent of the teacher workforce. The large infusion of the TFA corps in these schools resulted in large sample sizes and relatively precise estimates of TFA effects.

This paper presents two main findings. The first relates to N-VAMs generally, where we demonstrate that forecasts of teacher N-VAMs are able to make unbiased predictions of non-tested student academic outcomes. We then use a teacher switching quasi-experiment adopted from Chetty, Friedman, and Rockoff (2014) to provide suggestive evidence that differences in performance across teachers as measured by N-VAMs are not driven by biasing factors, such as student–teacher sorting. This first step is important because, in contrast to VAMs based on student test scores, current evidence on whether N-VAMs represent causal effects on non-tested student academic outcomes is, to our knowledge, limited to one paper: Jackson (2014). However, Jackson (2014) is largely devoted to showing that N-VAMs are predictive of long-run outcomes. Thus, Jackson's analysis is narrowly tailored: The sample includes only students in grade 9, it examines only one composite index of noncognitive outcomes, and it does not examine consistency of N-VAMs within teachers or across outcomes. To our knowledge, no other attempt to validate N-VAMs exists.

Our second main finding links TFA teachers with students’ non-tested academic outcomes, where we estimate TFA effects in a value-added framework. We find suggestive evidence that student behavior in elementary and middle school (as measured by days missed due to unexcused absences and suspensions) improves by a small degree when students are placed in a TFA classroom, compared with similar students taught by non-TFA teachers in the same schools. We also find a small increase in GPA for elementary school students in TFA classrooms.

This paper proceeds as follows. First, we discuss the background research on attempts to measure teachers’ contributions to non-test academic student outcomes. Next, we address TFA placement in M-DCPS and the data we use. Following this, we describe how we construct estimates of TFA effectiveness at improving student non-test academic outcomes and how we forecast N-VAMs for our validation procedure. We then explore the properties of our forecasted N-VAMs and then estimate TFA effectiveness on these measures. Finally, we conclude.

## 2. Teachers’ Contributions to Student Non-Test Academic Outcomes

The last fifteen years in U.S. public schools have been an era characterized by test-based school accountability, as mandated by 2001’s No Child Left Behind Act. From the time of the Department of Education's Race to the Top initiative, first announced in 2009, test-based accountability pressures have also trickled down to teachers, where student growth measures have increasingly become a part of teacher evaluation practices (Doherty and Jacobs 2015). Though high-stakes testing has long been unpopular, resistance to it has recently reached a boiling point among a diverse coalition of educators, parent groups, and other stakeholders (Taylor and Rich 2015).

In response to this growing resistance, public attention has shifted toward the viability of creating school and teacher performance measures based on non-tested student academic outcomes. This focus has been inspired in part by the growing consensus around the importance of developing noncognitive or socioemotional skills among young people and the influence of these skills on long-term outcomes, with important work done over the last decade in both economics (Heckman, Stixrud, and Urzua 2006; Kautz et al. 2014) and psychology (Duckworth et al. 2007; Duckworth, Quinn, and Seligman 2009; Eskreis-Winkler et al. 2014; Robertson-Kraft and Duckworth 2014). On the policy side, state and local leaders have begun to explore and experiment with non-tested student academic outcomes as alternatives to standardized tests, and last year's reauthorization of the Elementary and Secondary Education Act (ESSA) included language requiring states to adopt new school accountability systems that include alternate school performance indicators not directly tied to test scores (Klein 2015).^{2}

Education policy researchers have also shown a growing interest in exploring whether and how school factors, and particularly teachers, may influence these types of soft skills. For example, studies from Chetty et al. (2011) and Dee and West (2011) find that students assigned to smaller classrooms have persistent gains in noncognitive outcomes that explain future earnings increases. Strands of research examining teacher contributions to non-tested academic outcomes (i.e., N-VAMs) include showing that some teachers are more effective than others at improving these outcomes (Garcia 2013; Jackson 2014; Gershenson 2016) and documenting differential returns to teacher experience for various non-test academic outcomes (Ladd and Sorensen 2017).

It is important to highlight that the conceptualization of these non-tested student academic outcomes in the education policy literature is quite distinct from the socioemotional skills measured in the psychology literature, even if these investigations may be similarly motivated by an interest in looking beyond typical assessments of cognitive ability. Education policy research has been largely constrained to administrative databases or longitudinal surveys that were not designed to capture psychological personality factors. As a result, the non-tested academic outcomes that are captured in this literature focus on those non-test variables being recorded for other (typically administrative) purposes. In spite of this distinction in the actual outcomes measured, we feel that the focus on the non-tested academic outcomes that we present here is important, and perhaps even more critical for practitioners and policy makers for pragmatic reasons. Though the new law is not prescriptive on the form of these new accountability measures, we speculate that using administrative data on non-tested academic outcomes for students is a likely next step for many states, as such data constitute the lowest-cost mode of developing these new non-test measures at scale. In fact, to our knowledge, a consortium of districts has already begun adopting these measures into their school accountability systems, and we expect to soon see even more interest in these types of measures as states redesign their systems in response to the recent enactment of ESSA.^{3}

In this paper, we examine five different student outcomes: (1) days absent, (2) days suspended, (3) GPA, (4) classes failed, and (5) grade repetition. The outcomes are important for two reasons. First, they are simple to measure and found in most administrative databases,^{4} indicating that if teachers are found to meaningfully influence these outcomes, research examining teacher effects could easily be extended to include them. And second, each of these outcomes is an indicator of a student's academic well-being. For absences and suspensions, time out of school leads to depressed academic performance (Goodman 2014), especially for low-income students (Gershenson, Jacknowitz, and Brannegan 2017). GPA is strongly predictive of success in college (Antonovics and Backes 2014). And grade repetition is predicted by GPA and test scores (McCoy and Reynolds 1999) and possibly harmful in and of itself (Jacob and Lefgren 2009). More generally, if student improvements along these measures are driven in part by noncognitive skills (Jackson 2014), then one may expect eventual gains in college outcomes and the labor market (Heckman, Stixrud, and Urzua 2006).

Our examination here focuses specifically on whether TFA teachers impact these non-test academic outcomes. As discussed above, TFA teachers tend to be effective at improving students’ test scores in math. However, previous studies of teacher effects on non-test academic outcomes have found that the ability of a teacher to raise test scores and the ability to improve other student outcomes are largely unrelated (Jackson 2014; Gershenson 2016), meaning that a priori it is unclear whether one should expect TFA teachers to be better or worse (or no different) than the average non-TFA teacher along these measures. Two prior studies have explored TFA impacts on these types of non-test student academic outcomes (such as student absences and grade retention) but have not found any significant differences. Both Decker, Mayer, and Glazerman (2004) and Clark et al. (2013) examine the number of days absent and suspended for students randomly assigned to TFA classrooms relative to those assigned to control classrooms, with the former study conducted in elementary schools and the latter in secondary schools. Neither study finds a statistically significant relationship between TFA assignment and days absent or suspended, although their samples are substantially smaller than what is available in our M-DCPS data. In addition, our data provide information on a broader set of student non-test academic outcomes across a larger span of grade levels.

## 3. TFA Background in M-DCPS

TFA has been placing corps members in M-DCPS since 2003, beginning with thirty-five initial placements and aiming to place corps members in schools with high levels of student poverty (student bodies exceeding 70 percent eligibility for free or reduced-price lunch). Starting in the 2009–10 school year, TFA adopted a clustering strategy in which new placements were purposely assigned to schools within designated high-need communities, which contained the district's lowest-performing schools.

TFA's clustering placement strategy grew out of an interest in accelerating TFA's impact on student outcomes. The growth of the TFA corps and its density is readily apparent in the placement numbers during the six school years of data used for this analysis. Table 1 presents TFA presence in the district over time. In the 2008–09 school year, the year immediately preceding the clustering strategy, there was an average of slightly fewer than two TFA teachers in each school where they were placed. In the years following, the number of schools containing any TFA corps members dropped by nearly half and the number of TFA members in the district more than tripled, resulting in a peak of nearly ten corps members per school where there was any presence. The net result was a jump in the proportion of TFA corps members in placement schools, going from 2 to 4 percent in 2008–09 to as high as 14 to 17 percent in 2012–13. The concentrations of TFA in placement schools decreased slightly in 2013–14, due to expansion into more target schools.

| | 2008–09 | 2009–10 | 2010–11 | 2011–12 | 2012–13 | 2013–14 |
|---|---|---|---|---|---|---|
| Total TFA (active and alumni) | 91 | 110 | 163 | 244 | 301 | 354 |
| Total schools containing any TFA | 50 | 47 | 35 | 31 | 36 | 43 |
| *TFA as proportion of school teachers by school type, conditional on containing TFA* | | | | | | |
| Elementary | 3.6% | 4.4% | 8.6% | 20.4% | 13.8% | 11.8% |
| Middle | 1.4% | 7.6% | 8.5% | 16.9% | 16.9% | 13.6% |
| High | 1.7% | 4.0% | 13.6% | 15.9% | 14.9% | 12.0% |

*Notes:* Proportions of school teachers by school type are calculated among any schools containing any TFA during that school year. TFA: Teach For America.

## 4. Data

We use detailed student-level administrative data that cover M-DCPS students linked to their teachers for six school years (2008–09 through 2013–14) from kindergarten through twelfth grade. M-DCPS is the largest school district in Florida and the fourth largest in the United States. The district has large minority and disadvantaged student populations, typical of regions TFA has historically targeted; about 60 percent of its students are Hispanic, 30 percent black, and 10 percent white, and more than 60 percent of students qualify for free or reduced-price lunch (FRPL).

Our set of non-test academic outcomes includes five variables: (1) the number of unexcused absences, (2) days absent due to suspension, (3) GPA in a given year, (4) percent of classes failed, and (5) grade repetition.^{5} In addition to these outcomes, we observe a variety of student characteristics: race, gender, FRPL-eligibility, limited English proficiency (LEP) status, and whether a student is flagged as having a mental, physical, or emotional disability. In addition, all students are linked to teachers through data files that contain information on course membership.

Teacher personnel files in the M-DCPS data contain information on teachers’ experience levels and demographics. These are used as covariates for the analysis that follows. One variable included in the data is a flag on TFA teachers (representing both active corps members and TFA alumni); given the importance of this variable in the analysis, we externally validated this variable with historical corps member lists from TFA. One of the potential benefits of N-VAMs is that they can be estimated on a larger set of teachers because they are not restricted to teachers in tested grades and subjects—the number of teachers in the N-VAM sample is about twice as large as it would be if we were to restrict the sample to those teaching students with test scores. By necessity, when estimating VAMs, we restrict the sample to those in tested grades and subjects.

Basic summary statistics for TFA and non-TFA teachers by school level are presented in table 2. TFA teachers are more likely to be concentrated in schools with high shares of black and FRPL-eligible students, and they are less likely to be minority or experienced. About 80 percent of TFA teachers are in their first or second year of teaching because of high attrition rates after the second year (see Hansen, Backes, and Brady 2016).

| | Elementary: TFA | Elementary: Non-TFA | Middle: TFA | Middle: Non-TFA | High: TFA | High: Non-TFA |
|---|---|---|---|---|---|---|
| *Student descriptives* | | | | | | |
| Male | 0.49 | 0.48 | 0.49 | 0.49 | 0.51 | 0.5 |
| Black | 0.79 | 0.24 | 0.71 | 0.24 | 0.80 | 0.24 |
| Hispanic | 0.20 | 0.67 | 0.27 | 0.66 | 0.19 | 0.66 |
| FRPL | 0.97 | 0.75 | 0.95 | 0.74 | 0.86 | 0.68 |
| ELL | 0.16 | 0.25 | 0.12 | 0.11 | 0.10 | 0.10 |
| Mental disability | 0.03 | 0.06 | 0.05 | 0.07 | 0.05 | 0.07 |
| Physical disability | 0.03 | 0.03 | 0.01 | 0.02 | 0.01 | 0.02 |
| Emotional disability | 0 | 0.02 | 0.01 | 0.02 | 0.01 | 0.02 |
| Unexcused absences | 6.51 | 3.67 | 8.37 | 4.75 | 12.3 | 7.36 |
| | (7.14) | (4.98) | (10.55) | (7.40) | (12.76) | (9.92) |
| Suspension absences | 0.21 | 0.08 | 2.01 | 0.77 | 1.15 | 0.51 |
| | (1.20) | (0.81) | (5.52) | (3.45) | (3.74) | (2.50) |
| GPA | 2.96 | 3.19 | 2.29 | 2.64 | 2.34 | 2.57 |
| | (0.67) | (0.63) | (0.76) | (0.78) | (0.82) | (0.82) |
| % classes failed | 0.04 | 0.03 | 0.07 | 0.04 | 0.12 | 0.08 |
| | (0.13) | (0.11) | (0.16) | (0.12) | (0.20) | (0.17) |
| Repeated | 0.04 | 0.02 | 0.02 | 0.01 | 0.01 | 0.02 |
| Unique students | 6,856 | 322,358 | 16,262 | 227,607 | 23,298 | 282,746 |
| *Teacher descriptives* | | | | | | |
| Black | 0.09 | 0.21 | 0.09 | 0.23 | 0.03 | 0.21 |
| Hispanic | 0.08 | 0.44 | 0.04 | 0.35 | 0.02 | 0.37 |
| Class size | 13.06 | 12.9 | 19.15 | 17.76 | 16.83 | 18.14 |
| | (4.96) | (5.66) | (4.93) | (7.01) | (4.13) | (19.30) |
| Race matches student | 0.07 | 0.46 | 0.08 | 0.4 | 0.03 | 0.37 |
| No experience | 0.41 | 0.01 | 0.44 | 0.01 | 0.48 | 0 |
| 1 year experience | 0.35 | 0 | 0.35 | 0 | 0.37 | 0.02 |
| 2 years experience | 0.12 | 0.02 | 0.09 | 0.02 | 0.08 | 0.03 |
| 3–4 years experience | 0.03 | 0.06 | 0.03 | 0.06 | 0 | 0.06 |
| 5–9 years experience | 0.01 | 0.19 | 0 | 0.18 | 0.02 | 0.19 |
| 10+ years experience | 0 | 0.59 | 0 | 0.54 | 0 | 0.6 |
| Missing experience | 0 | 0.11 | 0.08 | 0.18 | 0.03 | 0.08 |
| Unique teachers | 175 | 17,141 | 199 | 10,059 | 501 | 25,727 |

*Notes:* Averages at the individual-year level, with standard deviation in parentheses. For example, a teacher observed in the first and second years of teaching would have one observation from the first year and one from the second year. ELL: English language learner; FRPL: eligible for free or reduced-price lunch; GPA: grade point average; TFA: Teach For America.

## 5. Empirical Strategy

Our analysis is motivated by an interest in quantifying TFA corps members’ influence on students’ non-tested academic outcomes commonly found in administrative databases. Based on our survey of the prior literature, we do not believe the question of whether teachers have a causal role in influencing these non-tested academic outcomes (versus these outcomes being related to socioeconomic factors and/or nonrandom assignment between students and teachers) has been sufficiently addressed. To that end, our paper has two parts: The first derives and validates the use of N-VAMs for teachers, and the second then explores the relationship between TFA teachers and these non-test academic outcomes.

### Conventional VAMs and Derivative N-VAMs

Conventional VAMs model the achievement of student *i* taught by teacher *j* in year *t* as a function of prior year achievement, demographic characteristics, an indicator variable identifying the *j*th teacher of *J* total teachers, and an error term. The coefficient on the indicator for teacher *j* is meant to capture teacher *j*’s contribution to growth in student achievement.^{6}

Models of this type are calculated in Gershenson (2016), Ladd and Sorensen (2017), and others. N-VAMs produced using this conventional methodology are assumed to take on the interpretation and analogous statistical properties of VAMs. For example, estimates are intended to be interpreted as teachers’ causal contributions to the corresponding non-test outcome in students—our validity tests described below will help evaluate this claim. N-VAMs may be estimated across various time spans in the data, producing one-year or multiyear teacher estimates, as the case may be. The reliability and variability of these estimates over time within teachers can be calculated following methods developed for test-based VAMs in McCaffrey et al. (2009) and Goldhaber and Hansen (2013).
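As a notational sketch of the specification described in this subsection, the two models can be written as follows. The symbols are assumptions chosen for illustration rather than the paper's own notation: $A_{ijt}$ is the test score and $Y_{ijt}$ a non-test outcome of student $i$ taught by teacher $j$ in year $t$, $\mathbf{A}_{i,t-1}$ is the vector of prior-year achievement, $\mathbf{X}_{it}$ collects demographic characteristics, $T^{k}_{it}$ equals one if student $i$ is taught by teacher $k$ in year $t$, and $\tau_k$ is the coefficient on that indicator.

```latex
% Conventional VAM (test-score outcome); notation assumed for illustration
A_{ijt} = \mathbf{A}_{i,t-1}\beta + \mathbf{X}_{it}\gamma
          + \sum_{k=1}^{J} \tau^{A}_{k}\, T^{k}_{it} + \varepsilon_{ijt}

% N-VAM analogue: same structure, with a non-test outcome on the left-hand side
Y_{ijt} = \mathbf{A}_{i,t-1}\beta + \mathbf{X}_{it}\gamma
          + \sum_{k=1}^{J} \tau^{Y}_{k}\, T^{k}_{it} + \varepsilon_{ijt}
```

Under this reading, the only change from a VAM to an N-VAM is the dependent variable, which is why N-VAM estimates are assumed to inherit the interpretation of test-based estimates.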

### Estimating a TFA Effect on Non-Test Academic Outcomes

^{7}

Because TFA corps members are placed nonrandomly across schools in the district (at minimum, we know selected schools had at least 70 percent FRPL-eligible students; other characteristics may have also played into the selection decision), the point estimates associated with the TFA effect may be downward-biased: schools chosen to receive TFA corps members were probably targeted precisely because they were likely to have low student performance (both on tests and on other non-test outcomes). As a result, we include both school fixed effects and controls for time-varying classroom averages, such as student demographic characteristics. The inclusion of school fixed effects ensures that TFA teachers are compared with non-TFA teachers within a given school, serving similar classrooms.
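The role of the school fixed effects can be illustrated with a small sketch. It is not the estimation used in the paper: it drops the classroom-level controls, uses made-up numbers, and all function and variable names are hypothetical. It shows only that demeaning within school compares TFA and non-TFA records inside the same school, whereas a pooled comparison mixes the TFA association with differences across schools.

```python
from collections import defaultdict


def within_school_slope(rows):
    """OLS slope on the TFA flag after demeaning within school.

    rows: list of (school_id, tfa_flag, outcome) records.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum of flag, sum of outcome, count
    for school, tfa, y in rows:
        totals[school][0] += tfa
        totals[school][1] += y
        totals[school][2] += 1
    num = den = 0.0
    for school, tfa, y in rows:
        sum_tfa, sum_y, n = totals[school]
        dx = tfa - sum_tfa / n  # deviation from the school's TFA share
        dy = y - sum_y / n      # deviation from the school's mean outcome
        num += dx * dy
        den += dx * dx
    return num / den


def pooled_slope(rows):
    """OLS slope on the TFA flag ignoring schools (for contrast)."""
    n = len(rows)
    mean_tfa = sum(r[1] for r in rows) / n
    mean_y = sum(r[2] for r in rows) / n
    num = sum((r[1] - mean_tfa) * (r[2] - mean_y) for r in rows)
    den = sum((r[1] - mean_tfa) ** 2 for r in rows)
    return num / den


# Made-up unexcused absences: school "A" has more absences overall and more TFA.
rows = [
    ("A", 1, 7.5), ("A", 1, 7.5), ("A", 1, 7.5), ("A", 0, 8.0),
    ("B", 1, 3.5), ("B", 0, 4.0), ("B", 0, 4.0), ("B", 0, 4.0),
]
within = within_school_slope(rows)
pooled = pooled_slope(rows)
```

In this toy data, TFA records have half a day fewer absences than non-TFA records in the same school, so `within` is -0.5. Because TFA is concentrated in the school with more absences overall, `pooled` is 1.5, which would wrongly suggest that TFA classrooms have more absences.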

### Forecasting Non-Test Value Added

Estimating equation 3 will yield the average change in a given outcome associated with being in a TFA classroom. However, in contrast to teachers’ value added (VA) on test scores, which has been thoroughly vetted in both experimental (Kane, Rockoff, and Staiger 2008; Kane et al. 2013) and nonexperimental (Bacher-Hicks, Kane, and Staiger 2014; Chetty, Friedman, and Rockoff 2014) studies, there is little evidence regarding whether N-VAMs represent causal estimates or merely reflect other factors, such as student–teacher sorting. For example, a positive TFA coefficient in equation 3 could be because sitting in a TFA classroom genuinely leads to better outcomes or because TFA teachers are systematically assigned to students who would have had better outcomes whether or not they were taught by TFA teachers, in ways that the control variables cannot account for. Thus, in this section, we describe how we make out-of-sample predictions of teacher effectiveness to assess the extent to which we can think of the TFA estimates as meaningful contributions to changes in student outcomes.

To both validate the causal nature of N-VAMs and explore the variability of these measures within and across teachers, we take a different approach from equation 2 for two reasons. First, equation 2, when estimated directly on the full sample, assumes the teacher contribution to the corresponding outcome is fixed over time by estimating a constant for each teacher. However, using test-based VAMs, Goldhaber and Hansen (2013) and Chetty, Friedman, and Rockoff (2014) provide evidence that this is not the case: Teacher effectiveness drifts over time. Second, to conduct our validation procedure, we need to forecast teacher effectiveness in year *t*, not estimate it directly as in equation 2. As a result, we follow the three-step process used by Chetty, Friedman, and Rockoff (2014).^{8}

First, we residualize each outcome on the control variables.^{9} The average residual for teacher *j* in year *t* can then be written as the mean of the residuals of the *n* students taught in year *t* by teacher *j*. We then obtain coefficients from the estimation of the best linear predictor of this average residual given average residuals in all other years, both past and future.^{10} Finally, we use the relationship between teachers’ average residuals across different years to forecast a teacher's value-added contribution for outcome *OUTCOME* in year *t*.

The result of this process is a forecast of each teacher's value-added contribution in each year for each outcome. In all that follows, we take this forecast to be a teacher's value-added effect in *t*. Thus, for example, when calculating the correlation between teacher effects across two outcome measures *Y1* and *Y2* in a given year, we would calculate the correlation between the forecasts for *Y1* and *Y2*. Although the forecast does not actually use data from *t*, it represents the best linear prediction of teacher performance in *t*, given data from all other years. Chetty, Friedman, and Rockoff (2014) find that these forecasts of teacher performance are predictive of student performance in math and reading. Below, we assess whether forecasts of non-test academic outcomes exhibit similar properties.
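The forecasting logic can be sketched in a few lines. This is a deliberately simplified illustration and not the procedure used in the paper: it assumes student residuals have already been computed, uses a single other year so that the best linear predictor reduces to one scalar coefficient (called `psi` here), fits no intercept, and relies on made-up numbers and hypothetical names.

```python
def mean(values):
    return sum(values) / len(values)


def teacher_year_means(residuals):
    """Average student residuals within each (teacher, year) cell."""
    return {key: mean(vals) for key, vals in residuals.items()}


def blp_coefficient(avg, year_t, other_year):
    """Slope of year_t teacher means on other_year teacher means.

    No intercept is fitted because residuals are assumed to be mean zero.
    """
    teachers = [j for (j, y) in avg if y == year_t and (j, other_year) in avg]
    num = sum(avg[(j, year_t)] * avg[(j, other_year)] for j in teachers)
    den = sum(avg[(j, other_year)] ** 2 for j in teachers)
    return num / den


def forecast_va(avg, psi, teacher, other_year):
    """Forecast for year t uses only the teacher's other-year mean residual."""
    return psi * avg[(teacher, other_year)]


# Hypothetical student residuals keyed by (teacher, year).
residuals = {
    ("a", 2013): [1.2, 0.8], ("a", 2014): [0.4, 0.6],
    ("b", 2013): [-0.9, -1.1], ("b", 2014): [-0.5, -0.5],
}
avg = teacher_year_means(residuals)
psi = blp_coefficient(avg, year_t=2014, other_year=2013)
forecast_a_2014 = forecast_va(avg, psi, "a", other_year=2013)
```

Here the 2014 means are half the size of the 2013 means, so `psi` is about 0.5 and teacher "a" is forecast at about 0.5 rather than at the 2013 mean of 1.0. This shrinkage is the reason a forecast would be exactly zero for every teacher if outcomes were uncorrelated across years.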

### Definition of Bias

We adopt the Chetty, Friedman, and Rockoff (2014) definition of forecast bias. Chetty, Friedman, and Rockoff use two tests for bias to present evidence that the estimates of teacher effectiveness are forecast-unbiased predictors of student test scores in *t* if the vector of prior achievement (from equation 2) contains lagged test scores. As we describe in the following section, we will replicate one of these tests (a teacher switching quasi-experiment) using the study's five non-test academic outcomes to determine whether value-added estimates based on these outcomes contain forecast bias.

### Assessing Validity

If we take a VAM framework and apply it to non-tested academic outcomes, do the estimates of teacher performance correspond to causal changes in actual student outcomes, as with tested outcomes? The idea behind our test is simple: If we know how students of a given teacher performed in the past, can we make out-of-sample predictions about the outcomes of future students taught by that teacher? Thus, our tests of validity are designed to generate predictive evidence about whether VAMs applied to non-test academic outcomes truly measure a teacher's ability to improve those outcomes for the students they teach. We believe predictive evidence on N-VAMs is an important first step in assessing whether they should be used in research and policy.

A regression of predicted student outcomes (from equation 5) on the forecast value-added estimates (from equation 6) is not directly informative about whether the resulting coefficient represents a causal relationship between the two, because students are not randomly assigned to teachers. In other words, a coefficient of 1 could arise either from a causal relationship or from persistent differences in unobserved factors in equation 7 (e.g., certain teachers being assigned to students with high- or low-income parents). Thus, as in Chetty, Friedman, and Rockoff (2014), we leverage variation in student exposure to teacher N-VAM contributions caused by staffing changes at the school-grade level to test whether student non-test academic outcomes change, corresponding to a 1:1 relationship.

To construct the differenced average teacher value-added contributions at the school-grade-year level, we weight teachers by the number of students they taught. To perform this test, when estimating value-added contributions for teachers in *t*, we omit observations from *t* and *t* − 1 so as not to introduce bias when using differenced outcomes on the left-hand side. In addition, we include year fixed effects and cluster at the school-cohort level.

The key driver of changes in teacher value-added contributions at the school-grade-year level is school staffing changes, where teachers are moving across schools or grades. The crucial identifying assumption is that changes in aggregate student outcomes are driven only by these changes in school staffing decisions and not other factors which influence test scores. Importantly, this test avoids the problem of nonrandom student–teacher sorting within schools because it aggregates teacher value-added contributions and student performance to school–grade cells.

In equation 8, forecast bias is defined as 1 − α. Bias is zero when α equals one, so failing to reject the null hypothesis that α equals one is consistent with changes in N-VAM estimates corresponding to equivalent changes in actual non-tested academic outcomes. If α deviates significantly from one, it would indicate that the predictive validity of the teacher estimate does not hold across schools or grades. Values less (greater) than one imply that estimated N-VAMs overpredict (underpredict) the corresponding changes in actual student outcomes as teachers flow across different schools. We test for the presence of forecast bias across the five candidate non-test academic outcomes, first as a pooled sample and then separately by levels (elementary, middle, and high school grades).
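A stripped-down version of the switching test can be sketched as follows. It is an illustration under strong simplifying assumptions, not the paper's specification: it omits the year fixed effects and the clustered standard errors, takes the school-grade-year means as given, and uses made-up numbers and hypothetical names. It regresses year-over-year changes in mean outcomes on changes in mean forecast value added and reports the slope α and the implied bias 1 − α.

```python
def first_differences(cells):
    """Year-over-year change within each (school, grade) series."""
    return {
        (s, g, t): value - cells[(s, g, t - 1)]
        for (s, g, t), value in cells.items()
        if (s, g, t - 1) in cells
    }


def ols_slope(x, y):
    """Slope from a bivariate OLS regression with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den


def switching_test(mean_va, mean_outcome):
    """Return (alpha, forecast bias = 1 - alpha) from differenced cell means."""
    d_va = first_differences(mean_va)
    d_y = first_differences(mean_outcome)
    keys = sorted(set(d_va) & set(d_y))
    alpha = ols_slope([d_va[k] for k in keys], [d_y[k] for k in keys])
    return alpha, 1.0 - alpha


# Hypothetical cell means keyed by (school, grade, year).
mean_va = {
    (1, 3, 2010): 0.0, (1, 3, 2011): 0.2, (1, 3, 2012): 0.1,
    (2, 3, 2010): 0.1, (2, 3, 2011): -0.1,
}
# Outcomes constructed to move one-for-one with mean value added.
mean_outcome = {key: 3.0 + va for key, va in mean_va.items()}
alpha, bias = switching_test(mean_va, mean_outcome)
```

Because the toy outcomes are built to track value added one-for-one, `alpha` is approximately one and `bias` approximately zero. With real data the point estimate would be judged against its standard error rather than read off directly.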

In any given year, students may be exposed to multiple teachers, especially in middle school and high school. When estimating TFA effects, we include each student–teacher link as its own observation and dosage weight each of these records.^{11} When conducting the test for forecast bias, we obtain each teacher's school-grade-year weight by dividing the sum of his or her dosages by the total sum of dosages.^{12} The mean outcome in the school-grade-year cell is simply the mean across students in that cell.
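The weighting step can be sketched as follows. This is a minimal illustration under assumed data structures, not the authors' code: the tuple layout of the student–teacher links, the function names, and the example values are all hypothetical, and a single forecast per teacher stands in for the year-specific forecasts used in the paper.

```python
from collections import defaultdict

def teacher_cell_weights(links):
    """Each teacher's share of total dosage within a school-grade-year cell.

    links: iterable of (school, grade, year, teacher_id, dosage) records,
    one per student-teacher link (hypothetical layout).
    """
    teacher_dosage = defaultdict(float)
    cell_dosage = defaultdict(float)
    for school, grade, year, teacher, dosage in links:
        cell = (school, grade, year)
        teacher_dosage[(cell, teacher)] += dosage
        cell_dosage[cell] += dosage
    return {key: d / cell_dosage[key[0]] for key, d in teacher_dosage.items()}

def cell_mean_va(weights, forecast_va):
    """Dosage-weighted mean of teacher forecasts in each cell."""
    means = defaultdict(float)
    for (cell, teacher), w in weights.items():
        means[cell] += w * forecast_va[teacher]
    return dict(means)

# Made-up example: teacher A carries 1.5 of 2.0 total dosage units.
links = [
    ("S1", 6, 2012, "A", 0.5),
    ("S1", 6, 2012, "A", 1.0),
    ("S1", 6, 2012, "B", 0.5),
]
weights = teacher_cell_weights(links)
print(weights)
print(cell_mean_va(weights, {"A": 0.2, "B": -0.2}))
```

Differencing these cell means across adjacent years would give the right-hand-side variable of the forecast-bias regression.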

## 6. Results

We begin by presenting properties of forecasted N-VAMs and then conduct the validation procedure, showing that these N-VAMs represent persistent, meaningful differences across teachers and fail to reject forecast-unbiasedness for most measures in elementary and middle school. After finding these measures to be consequential and valid, we turn to the question of whether TFA teachers perform better on these measures than comparison non-TFA teachers.

### Basic Properties of Forecasted VA

To generate out-of-sample forecasts of VA for each outcome (both tested and non-tested), the outcomes of students in a teacher's classroom in one year must be correlated with the outcomes of that teacher's students in a different year. Otherwise, we would have nothing on which to base forecasts, and every teacher would have a forecasted VA of exactly zero. Throughout this section, we consider elementary, middle, and high school teachers as separate groups because results differ across school types, though not all outcomes of interest can be estimated at each level. Because students in high school grades are not tested annually, we lack the observations needed to include tested outcomes at the high school level. In addition, because very few students repeat grades in middle school, we do not consider grade repetition at that level.

We verify that there is a persistent component of each non-test academic outcome within teachers by plotting the autocorrelation vectors by outcome and school type, shown in figure 1. The maximum separation is four years: of our six years of data, one year must be used as prior controls in the residualization process, leaving five years of data, the most distant of which are four years apart. For teachers of elementary school students, the tested outcomes (math and reading), as well as GPA and unexcused absences, have the highest year-to-year correlations, whereas suspensions and grade repetition tail off quickly. For middle school, all correlations tend to be higher, especially suspensions and the percent of classes failed.
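One simple way to compute such an autocorrelation vector is sketched below. It is an assumption-laden illustration rather than the procedure used for figure 1: the nested-dictionary input, the function name, and the unweighted Pearson correlation are all hypothetical choices (a weighted version, for example by class size, would be a natural alternative).

```python
import numpy as np

def autocorrelation_by_lag(teacher_year_means, max_lag=4):
    """Correlation of a teacher's mean residual outcome across years.

    teacher_year_means: dict mapping teacher -> {year: mean residual}.
    Returns {lag: correlation} over all teacher-year pairs `lag` years apart.
    """
    result = {}
    for lag in range(1, max_lag + 1):
        pairs = [
            (by_year[year], by_year[year + lag])
            for by_year in teacher_year_means.values()
            for year in by_year
            if year + lag in by_year
        ]
        if len(pairs) < 2:
            continue                      # too few pairs for a correlation
        earlier, later = np.array(pairs).T
        result[lag] = float(np.corrcoef(earlier, later)[0, 1])
    return result

# Made-up data: a persistent teacher effect plus a small alternating shock.
data = {
    j: {year: j + 0.1 * ((year + j) % 2) for year in range(2010, 2015)}
    for j in range(5)
}
print(autocorrelation_by_lag(data))
```

With five usable years per teacher, the longest available lag is four, matching the maximum separation described above.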

We next turn to properties of the VA forecasts themselves. Table 3 displays the standard deviations of the VA forecasts for each outcome for each school level. For reference, the standard deviations of VA forecasts for reading and math from Chetty, Friedman, and Rockoff (2014) are also presented; our comparable measures are slightly larger than theirs. As with Chetty, Friedman, and Rockoff, for tested outcomes, the dispersion of teacher effects is higher in math than English and higher in elementary school than middle school. However, in contrast, many of the non-tested academic outcomes have higher variance in middle school than elementary school. Across all grade levels, the magnitudes of the standard deviations of N-VAM forecasts are slightly larger than the standard deviations of the tested outcomes (both in our data and in Chetty, Friedman, and Rockoff 2014).^{13}

| | Elementary | Middle | High |
|---|---|---|---|
| CFR: Math | 0.12 | 0.09 | |
| CFR: English | 0.08 | 0.04 | |
| Math | 0.15 | 0.12 | |
| ELA | 0.10 | 0.08 | |
| Unexcused absences | 0.11 | 0.16 | 0.17 |
| Suspension absences | 0.09 | 0.15 | 0.11 |
| % classes failed | 0.14 | 0.15 | 0.16 |
| GPA | 0.17 | 0.17 | 0.16 |
| Repeated | 0.10 | . | 0.26 |

*Notes:* CFR values (provided from Chetty, Friedman, and Rockoff 2014) are included for comparison. ELA: English language arts; GPA: grade point average.

Even if we find that N-VAM estimates are valid and reliable, they are only useful for teachers in tested grades and subjects to the extent that they provide new information relative to VAMs. As an extreme example, if VAMs for math teachers were perfectly correlated with their N-VAMs, then there would be no reason to include both measures when measuring teacher effectiveness or designing a teacher evaluation system.^{14} One of the motivations for measuring N-VAMs is the assumption that they measure, at least in part, something other than VAMs. To explore the relationship between VAMs and N-VAMs, we calculate the correlation between math and reading VAMs and each of our five N-VAMs. The higher the correlation, the larger the estimated overlap between teacher effectiveness on tested and non-tested academic outcomes. If, for example, the correlation between VA in absences and grade retention were high, it would provide evidence that certain teachers consistently reach their students in a manner that is reflected across multiple measures.

We then turn to the degree to which estimated N-VAMs for a given teacher are correlated. Table 4 displays the correlation across outcomes for a given teacher. Perhaps unsurprisingly, for teachers who taught both math and reading, forecasted teacher VA was highly correlated across the two subjects at nearly 0.60. At the same time, math and reading VA have a notable negative correlation with unexcused absences and suspensions. Teacher effects on student absences being negatively correlated with effects on student math and reading scores is also found by Gershenson (2016) in a different dataset, although the correlations are larger here. The implication of this result is that teacher evaluation systems that commonly use test-based VA as the only student outcome measure (along with observational ratings) may be inadvertently overlooking important gains in these non-test academic outcomes that may still be important for students.

| | Math | ELA | Unexcused Absences | Suspension Absences | GPA | % Classes Failed |
|---|---|---|---|---|---|---|
| ELA | 0.59 | | | | | |
| Unexcused absences | −0.17 | −0.32 | | | | |
| Suspension absences | −0.17 | −0.24 | 0.24 | | | |
| GPA | 0.09 | 0.11 | −0.07 | −0.20 | | |
| % classes failed | −0.00 | 0.02 | 0.07 | 0.17 | −0.71 | |
| Repeated | −0.09 | −0.07 | 0.10 | 0.03 | −0.02 | 0.08 |

*Notes:* ELA: English language arts; GPA: grade point average.

Looking specifically at the correlations among the N-VAMs, most correlations are relatively low (correlation coefficients with absolute value less than 0.20), with two notable exceptions. Both absence types (unexcused and suspension absences)^{15} are modestly correlated (0.24), and, unsurprisingly, GPA and the percent of classes failed are highly negatively correlated (−0.71). Accordingly, states could consider the inclusion of these N-VAMs in evaluation systems (after further investigation) to better reflect the multidimensional nature of teaching.

### Results Using Forecasted VA

Results are shown in table 5. Panel A pools all observations across all levels, and panels B, C, and D are estimated separately on elementary, middle, and high school samples, respectively. At the elementary and middle school levels, coefficients are indeed very close to 1. Although not surprising, it is still reassuring that for non-tested academic outcomes, the outcomes of students in a given year can be predicted based on the outcomes of students taught by their teacher in other years. By contrast, the forecasts in high school are not as closely related to current student outcomes, and we reject the null hypothesis that the coefficient equals one for two outcomes (unexcused absences and the percentage of failed classes). These values exceeding 1 suggest that N-VAM estimates based on other teacher years underpredict the actual outcomes observed among students assigned to a teacher in a given year, perhaps due to student tracking or a similar sorting bias in the data. For the remaining outcomes we fail to reject the null hypothesis, but the deviations from 1 are notable, especially for grade repetition.^{16} This may be due to the low degree of variation in grade repetition in high school, making it difficult to forecast.

| | Tests | Math | ELA | Unexcused Absences | Suspension Absences | GPA | % Classes Failed | Repeater |
|---|---|---|---|---|---|---|---|---|
| Panel A: All | 1.01 | 1.01 | 1.02 | 1.04 | 1.05 | 1.00 | 1.04 | |
| | (0.02) | (0.02) | (0.02) | (0.02) | (0.03) | (0.01) | (0.01) | |
| Reject = 1 | | | | | | | x | |
| Panel B: Elementary | 1.01 | 1.00 | 1.02 | 1.01 | 1.05 | 0.96 | 0.99 | 0.95 |
| | (0.02) | (0.02) | (0.02) | (0.02) | (0.08) | (0.01) | (0.02) | (0.04) |
| Reject = 1 | | | | | | x | | |
| Panel C: Middle | 1.02 | 1.01 | 1.02 | 1.02 | 1.04 | 0.99 | 1.03 | |
| | (0.03) | (0.04) | (0.02) | (0.03) | (0.04) | (0.02) | (0.03) | |
| Reject = 1 | | | | | | | | |
| Panel D: High | | | | 1.06 | 1.06 | 1.05 | 1.08 | 0.92 |
| | | | | (0.03) | (0.05) | (0.03) | (0.02) | (0.06) |
| Reject = 1 | | | | x | | | x | |

*Notes:* Coefficient is from the regression of residualized student outcome on forecasted student outcome, based on teacher performance in other years. Standard errors clustered at the school-cohort level in parentheses. Sample sizes not provided because they are very large as the unit of observation is the student–teacher link. ELA: English language arts; GPA: grade point average.

Finally, we implement the Chetty, Friedman, and Rockoff (2014) quasi-experimental estimate of forecast bias described above: Changes in student outcomes at the school-grade-subject-year level are regressed on average forecasted teacher performance at that level. Results are shown in table 6. As in table 5, panel A presents the results on a pooled sample, and panels B, C, and D conduct the test by school level. At the elementary and middle school levels, the coefficient estimate is significantly different from 1 only for unexcused absences in elementary school. However, because our panel is short, standard errors are generally large and we cannot rule out fairly large degrees of bias. Thus, rather than viewing this as a definitive test of whether N-VAMs can be thought of as valid, causal estimates of teacher effectiveness, we use this to justify measuring TFA effects on these outcomes, at least in elementary and middle schools. For high school, many outcomes are far from 1 (the standard errors are also much larger on these high school outcomes, relative to other levels). As previously seen, four of these five values exceed one, signifying that N-VAM forecasts underpredict actual changes in student outcomes; yet, in this table the estimates are constructed to remove the possibility of tracking within schools as a potential driver. The presence of forecast bias here may suggest that these outcomes are driven more by school environment or policy than by teacher-specific factors. Given the rejection of these bias tests in high school grades, we do not estimate TFA effects for these teachers in our main results section (they are presented in Appendix table A.1 for completeness). Even for the high school outcomes for which we cannot reject a coefficient of one (classes failed and suspensions), the point estimates for bias are greater than 20 percent.

| | Tests | Math | ELA | Unexcused Absences | Suspension Absences | GPA | % Classes Failed | Repeater |
|---|---|---|---|---|---|---|---|---|
| Panel A: All | 0.93 | 0.86 | 1.20 | 1.12 | 0.92 | 1.23 | 0.94 | |
| | (0.07) | (0.08) | (0.16) | (0.09) | (0.12) | (0.07) | (0.08) | |
| | 8689 | 4306 | 4383 | 11435 | 11435 | 10029 | 10140 | |
| Reject = 1 | | | | | | x | | |
| Panel B: Elementary | 0.98 | 0.89 | 1.27 | 0.50 | 0.88 | 1.10 | 1.00 | 0.88 |
| | (0.10) | (0.12) | (0.18) | (0.09) | (0.17) | (0.06) | (0.08) | (0.13) |
| | 4857 | 2410 | 2447 | 7737 | 7737 | 6406 | 6459 | 7737 |
| Reject = 1 | | | | x | | | | |
| Panel C: Middle | 0.88 | 0.83 | 1.10 | 1.04 | 0.80 | 0.83 | 0.88 | |
| | (0.10) | (0.10) | (0.29) | (0.16) | (0.19) | (0.12) | (0.19) | |
| | 3832 | 1896 | 1936 | 2056 | 2056 | 2010 | 2056 | |
| Reject = 1 | | | | | | | | |
| Panel D: High | | | | 1.50 | 1.24 | 1.88 | 0.78 | 2.49 |
| | | | | (0.17) | (0.36) | (0.25) | (0.16) | (0.41) |
| | | | | 1642 | 1642 | 1613 | 1625 | 1642 |
| Reject = 1 | | | | x | | x | | x |

*Notes:* Coefficient from the regression of actual student outcomes on forecasted student outcomes at the school-grade-year cell, with forecasts based on teacher performance in other years. Standard errors clustered at the school-cohort level in parentheses. Number of cells provided below standard errors. See text for further details. ELA: English language arts; GPA: grade point average.
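The "Reject = 1" rows can be checked approximately from the reported coefficients and standard errors. The sketch below does this for the high school panel of table 6, reading the five entries as unexcused absences, suspension absences, GPA, percent of classes failed, and grade repetition (the alignment implied by the text). The two-sided 5 percent normal critical value and the helper name are assumptions made for illustration; the published tests may rest on a different reference distribution.

```python
def rejects_unity(coef, se, critical=1.96):
    """True if a coefficient of one lies outside coef +/- critical * se."""
    return abs(coef - 1.0) / se > critical

# (coefficient, standard error) pairs from table 6, panel D.
high_school = {
    "Unexcused absences": (1.50, 0.17),
    "Suspension absences": (1.24, 0.36),
    "GPA": (1.88, 0.25),
    "% classes failed": (0.78, 0.16),
    "Repeater": (2.49, 0.41),
}

rejected = [name for name, (c, s) in high_school.items() if rejects_unity(c, s)]
bias = {name: 1.0 - c for name, (c, _) in high_school.items()}  # bias = 1 - alpha

print(rejected)
for name in high_school:
    if name not in rejected:
        print(f"{name}: estimated bias {bias[name]:+.2f}")
```

Under these assumptions the three rejected outcomes are unexcused absences, GPA, and grade repetition, and the two outcomes that are not rejected still show estimated bias larger than 0.20 in absolute value.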

Taken together, our results from tables 5 and 6 suggest the non-test academic outcomes of students in elementary school and middle school can be systematically explained by the teachers to whom they are assigned. The one exception is unexcused absences in elementary school, which fails the quasi-experimental test of forecast bias with a degree of bias of 50 percent.

### TFA Estimates on Non-Test Academic Outcomes

Having validated—at least to the extent possible given our data—most non-test academic outcomes in elementary and middle school grades, we now move to estimating TFA effects based on a value-added specification with these non-test variables as dependent variables. Results are shown in table 7, with columns representing separate regressions for each outcome variable. In elementary school, students in classrooms taught by a TFA teacher tended to have fewer unexcused absences, fewer days of suspension, and higher GPAs than observationally similar students in non-TFA classrooms, with the latter two being at least marginally significant.^{17} However, recall that unexcused absences in elementary school failed our test of forecast bias. In middle school, students in TFA classrooms continue to be less likely to have unexcused absences or suspensions (only the former is statistically significant), and the GPA effect is no longer present.^{18}

| | Unexcused Absences | Suspension Absences | GPA | % Classes Failed | Repeater |
|---|---|---|---|---|---|
| Elementary | | | | | |
| TFA coefficient | −0.044^{a} | −0.054^{*} | 0.050^{**} | 0.001 | 0.001 |
| | (0.139) | (0.032) | (0.023) | (0.003) | (0.005) |
| Observations | 4661344 | 4661344 | 4622273 | 4661344 | 4661344 |
| R^{2} | 0.438 | 0.127 | 0.645 | 0.242 | 0.115 |
| Dependent variable mean, full sample | 4.0 | 0.07 | 3.19 | 0.028 | 0.026 |
| Dependent variable mean, students in TFA classrooms | 7.2 | 0.17 | 2.96 | 0.041 | 0.038 |
| Middle | | | | | |
| TFA coefficient | −0.347^{**} | −0.075 | −0.009 | 0.001 | |
| | (0.154) | (0.058) | (0.011) | (0.002) | |
| Observations | 3174990 | 3174990 | 3159288 | 3174990 | |
| R^{2} | 0.488 | 0.301 | 0.659 | 0.266 | |
| Dependent variable mean, full sample | 4.8 | 0.79 | 2.64 | 0.038 | 0.012 |
| Dependent variable mean, students in TFA classrooms | 8.4 | 2.01 | 2.29 | 0.067 | 0.022 |

*Notes:* Coefficients represent the average change in outcome associated with being in a classroom taught by TFA teacher. Regression controls for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. All models include school fixed effects and cluster standard errors at the school level, with standard errors shown in parentheses. GPA: grade point average; TFA: Teach For America.

^{a}Unexcused absences in elementary school fail the forecast bias test (see table 6). We display the coefficient here for completeness but urge caution in interpreting this result.

^{*}Significant at the 10% level; ^{**}significant at the 5% level.

In general, point estimates tend to be modest, with only one coefficient representing more than a 10 percent change relative to baseline values. For example, a reduction of unexcused absences in middle school by 0.347 per student corresponds to about 7 percent of the average of unexcused absences across all students in the sample (4.8). When taking into account that TFA teachers tend to be assigned relatively disadvantaged students (who tend to have more unexcused absences), the percentage change in outcomes becomes even smaller. A reduction of 0.347 absences per year corresponds to only 4 percent of the average number of absences of students taught by TFA teachers. For GPA in elementary school, an increase of 0.05 grade points corresponds to about 1.7 percent of mean GPA in TFA classrooms (2.96) and about 7 percent of a standard deviation of GPA.^{19}

The one coefficient estimate representing a large percentage change relative to baseline values is suspensions in elementary school, where the reduction of 0.05 days per year corresponds to about one-third of the average number of days suspended for students in TFA classrooms. However, the baseline value is extremely small (the average is about 1/6, meaning that on average, one student in six is suspended one day per year) and the coefficient is only marginally significant because of the relatively large standard error, so it is hard to know whether this particular result is replicable in different data.
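The percentages quoted in the two preceding paragraphs follow directly from the coefficients and dependent-variable means in table 7. The short sketch below reproduces that arithmetic; the helper name and dictionary labels are hypothetical, and the inputs are the rounded values printed in the table.

```python
def percent_of_baseline(effect, baseline):
    """Absolute effect size expressed as a percentage of a baseline mean."""
    return 100.0 * abs(effect) / baseline

comparisons = {
    # Middle school unexcused absences: coefficient of -0.347 days.
    "middle absences vs. full-sample mean (4.8)": percent_of_baseline(-0.347, 4.8),
    "middle absences vs. TFA-classroom mean (8.4)": percent_of_baseline(-0.347, 8.4),
    # Elementary GPA: coefficient of 0.050 grade points.
    "elementary GPA vs. TFA-classroom mean (2.96)": percent_of_baseline(0.050, 2.96),
    # Elementary suspensions: coefficient of -0.054 days.
    "elementary suspensions vs. TFA-classroom mean (0.17)": percent_of_baseline(-0.054, 0.17),
}

for label, pct in comparisons.items():
    print(f"{label}: {pct:.1f}%")
```

The suspension figure is the only one above 10 percent, and it rests on a very small baseline, which is why it is treated cautiously in the text.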

Overall, these results provide suggestive evidence that student behavior, as measured by days missed due to unexcused absences and suspensions, improves by a small degree when students are placed in a TFA classroom. In addition, students in TFA classrooms in elementary school had modest increases in GPA.

### Robustness Check: Aggregate Grade-School Effects

Equation 10 does not utilize classroom-specific variation in TFA assignment, but rather the intensity of TFA presence in a given school-grade-year cell. Because of the high level of within-school variation over time induced by the clustering strategy, we are able to obtain more precise estimates for these TFA intensity variables than what would otherwise be possible under typical TFA assignment practices. Results are presented in table 8. In general, in table 7, where TFA effects were found to be associated with student outcomes, these effects are also present when TFA is measured at the grade level. Furthermore, most of the outcomes have signs in the same direction in both tables 7 and 8, with the exceptions being percent of classes failed (now significantly negative) and grade repetition (negative, but not significant) in elementary school.

| | Unexcused Absences | Suspension Absences | GPA | % Classes Failed | Repeater |
|---|---|---|---|---|---|
| Elementary | −0.571^{a} | −0.179 | 0.301^{***} | −0.030^{**} | −0.018 |
| | (0.654) | (0.109) | (0.087) | (0.012) | (0.016) |
| Middle | −4.300^{**} | −0.189 | −0.063 | −0.005 | |
| | (2.080) | (0.623) | (0.126) | (0.031) | |

*Notes:* Coefficients displayed are for the share of TFA teachers at the grade-school-year level. Regression controls for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. Standard errors clustered at the school level shown in parentheses. GPA: grade point average; TFA: Teach For America.

^{a}Unexcused absences in elementary school fail the forecast bias test (see table 6). We display the coefficient here for completeness but urge caution in interpreting this result.

^{**}Significant at the 5% level; ^{***}significant at the 1% level.

Although the direction of the TFA effect is generally consistent across tables 7 and 8, the magnitudes are not. Taking the estimates at face value, the results from table 7 indicate that replacing every teacher with a TFA corps member would lower unexcused absences by 0.347 days per student, and the results from table 8 imply that replacing an entire grade with TFA teachers would lower unexcused absences by 4.3 days per student. However, there are two important differences between tables 7 and 8 that complicate this comparison. First, a TFA share of 1 is well outside the typical TFA density of schools in the sample: Of school–grade cells with any TFA in that year, the median TFA density is about 0.1 and the 90th percentile is about 0.3. Second, tables 7 and 8 are estimated from two different sources of variation. In table 7, the TFA coefficient represents the average change in student outcomes associated with being in a TFA classroom relative to a non-TFA classroom in a given school. On the other hand, table 8 compares school–grade outcomes in years with high dosage to outcomes in the same school–grade in years with low dosage. Thus, although both specifications are intended to measure the TFA contribution to student outcomes, there is little reason to expect them to have results consistent in magnitude.
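To see why a TFA share of 1 is a distant extrapolation, it can help to scale the table 8 coefficient to the densities actually observed. The sketch below does so under a strictly linear reading of the share coefficient, which is an illustrative assumption and not a claim made by the specification; the helper name is hypothetical.

```python
def implied_change(share_coefficient, tfa_share):
    """Change in the outcome implied by a linear effect of TFA share."""
    return share_coefficient * tfa_share

MIDDLE_ABSENCE_COEF = -4.300   # table 8, middle school unexcused absences
CLASSROOM_COEF = -0.347        # table 7, middle school unexcused absences

for label, share in [("median density", 0.1), ("90th percentile", 0.3), ("full grade", 1.0)]:
    days = implied_change(MIDDLE_ABSENCE_COEF, share)
    print(f"{label} (share = {share}): {days:.2f} days per student")

print(f"classroom-level estimate for comparison: {CLASSROOM_COEF:.3f} days")
```

At the median density the implied school–grade change is about 0.43 days per student, the same order of magnitude as the classroom-level estimate, although the two quantities answer different questions for the reasons given above.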

Another potential explanation for the differences between tables 7 and 8 is spillover effects, where high concentrations of TFA corps members lead to school improvements beyond their impacts in their own classrooms. In a companion paper (Backes et al. 2016), we find little evidence of spillover in math or reading. However, one interpretation of tables 7 and 8 is that the overall school–grade effect in table 8 is larger than the individual effect in table 7. In results available from the authors, we find that in a hybrid model controlling for both an individual student's teacher and the TFA share, the coefficient on TFA share remains similar to that in table 8, consistent with the spillover hypothesis. Of course, this finding could also be driven by possible unobserved correlates of TFA share, such as other school investments made at the same time that schools added TFA corps members.^{20}

## 7. Reconciliation with Prior Evidence of TFA Impacts on Non-Test Academic Outcomes

We find suggestive evidence that students taught by TFA teachers in elementary and middle school were less likely to miss school due to unexcused absences and suspensions, and that students in elementary school had slightly higher GPAs.^{21} Among the outcomes that passed the validity tests, we do not find any evidence that assignment to TFA teachers has any adverse consequences on students.^{22}

Our results stand in contrast to prior studies of TFA corps members, which have not found any significant differences in student absences and suspensions (e.g., Decker, Mayer, and Glazerman 2004; Clark et al. 2013). Decker, Mayer, and Glazerman (2004) report that students randomly assigned to TFA classrooms averaged more days absent (by 0.52 days) and days suspended (by 0.04 days) than control students, although the differences are not statistically significant, and Clark et al. (2013) find minimal differences in student absences in secondary school. Although this study does not benefit from random assignment, we do have a substantially larger sample of TFA teachers, resulting in very precise estimates. In fact, our point estimates for absences and suspensions fall well within the confidence intervals of both Decker, Mayer, and Glazerman and Clark et al. The former study has standard errors large enough that a 0.5 difference in days absent between students in TFA and non-TFA classrooms is not close to significant (*p*-value = 0.415). On the other hand, our estimate of a difference of 0.35 days absent at the middle school level is statistically significant at the 5 percent level. Thus, although Decker, Mayer, and Glazerman and Clark et al. are able to rule out large TFA effects on non-test academic outcomes, their estimates are not precise enough to detect small effects.
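The claim that our estimates fall inside the earlier confidence intervals can be illustrated with a back-of-the-envelope calculation. The sketch below backs out an approximate standard error from the reported 0.52-day difference and its *p*-value of 0.415, assuming a two-sided test against a normal reference distribution; that assumption, the helper names, and the resulting interval are illustrative only and are not figures reported by Decker, Mayer, and Glazerman (2004).

```python
from statistics import NormalDist

def se_from_p_value(estimate, p_value):
    """Standard error implied by a two-sided normal p-value (approximation)."""
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return abs(estimate) / z

def confidence_interval(estimate, se, level=0.95):
    """Symmetric normal confidence interval around an estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se

prior_estimate = 0.52          # days absent, TFA minus control
se = se_from_p_value(prior_estimate, 0.415)
low, high = confidence_interval(prior_estimate, se)

# Table 7 estimates for unexcused absences, same sign convention.
ours = {"elementary": -0.044, "middle": -0.347}
for level, estimate in ours.items():
    print(f"{level}: {estimate:+.3f} inside interval? {low <= estimate <= high}")
print(f"implied SE: {se:.2f}, interval: [{low:.2f}, {high:.2f}]")
```

Under this approximation the implied standard error is roughly 0.6 days, so an interval around the earlier estimate comfortably covers effects as small as those in table 7.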

## 8. Conclusion

The question of whether teachers influence non-test student academic outcomes has recently attracted the attention of both policy makers and researchers, with many states now considering new ways to evaluate school and teacher performance separately from test scores. This study specifically investigates whether TFA teachers influence non-tested academic outcomes commonly found in administrative databases, a likely starting point for many states as they move to experiment outside of test scores. In our review of the literature we find limited validation of the use of these non-tested student academic outcomes for teachers; hence, this paper seeks first to validate the use of such outcomes and then examines the influence of TFA teachers on these outcomes.

The analysis of N-VAMs revealed persistent differences across teachers in their influence on the non-test academic outcomes of their students. For all but one outcome in elementary and middle schools, we cannot reject the hypothesis that changes in student outcomes at the school-grade-subject-year level are fully explained by changes in the teachers assigned to those students. Although the short panel and large standard errors prevent us from making strong statements about the causality of estimated N-VAMs, most of the coefficient estimates are close enough to 1 that we feel comfortable estimating TFA effects on non-test academic outcomes in a value-added framework for elementary school and middle school. When estimating TFA effects, we find evidence of small reductions in class time missed due to absences and suspensions in elementary and middle school.

The cases in which our tests reject the validity of N-VAMs also raise some caution for prior studies of teacher effects on non-tested outcomes. For example, where Jackson's (2014) analysis presents evidence of teacher effects on an index of non-tested student academic outcomes in ninth grade, we are able to reject forecast unbiasedness of many of these same non-test outcomes in high school grades using our data. Although Jackson does conduct validity tests similar to those found in this paper, the tests are too noisy to rule out substantial levels of bias (which is consistent with this paper). Another example is Gershenson's (2016) study analyzing teacher effects on unexcused absences of elementary school students; this is the one elementary outcome for which the validity tests were rejected in our data. Although our rejection of the validity of these particular outcomes does not invalidate these prior studies’ findings (neither of which uses the data we use here), it does underscore the need to thoroughly vet new outcomes before using them as dependent variables in a value-added framework. Further research validating new student outcomes across a variety of contexts would be highly valuable before considering policy applications of these N-VAMs.

An open question regarding measures of a teacher's contribution to non-tested student academic outcomes is the extent to which differences across teachers represent true changes in student behavior or differences among teachers. Taking student suspensions as an example, we find that teachers who lead classrooms with fewer (or more) suspended students in one year tend to lead classrooms with fewer (or more) suspended students in other years, even when they switch schools or grade levels. Because these effects follow teachers when they switch schools, this pattern is unlikely to be driven by the sorting of teachers to students. It could, however, be driven either by certain teachers inducing better behavior in their students (a true positive) or by certain teachers being more lenient with student discipline (a false positive). The only evidence that we are aware of is Jackson (2014), who finds that students assigned to teachers who raise an index of non-test academic student outcomes (a weighted average of suspensions, absences, on-time grade progression, and GPA) have improved future outcomes such as dropout and SAT-taking, suggesting that differences across teachers in estimated N-VAM measures translate to real effects on students. Future research could extend Jackson's work to examine each of the components in the index and how the relationship varies across school levels, rather than on only a sample of ninth graders.

On a related note, there is an additional element of caution warranted for using the non-test academic outcomes that we analyze here for evaluation purposes, because some of these outcomes could be easily manipulated by teachers or other administrative staff in the school. For example, if states began to attach stakes to students’ unexcused and suspension absences, unscrupulous administrators could simply record them as excused absences without penalty. Though we know of several jurisdictions that are beginning to consider student absences as performance measures, we urge caution in using these data as we do not know whether these measures will be equally predictive once stakes are assigned to them. Moreover, outcomes like GPA, course failure, and grade repetition are primarily (if not exclusively) determined by the teachers themselves, and therefore are not likely to be useful for evaluation purposes—for either teachers or schools. Yet, as long as policy makers do not apply stakes directly to them, the evidence that value-added measures based on many of these outcomes are valid in elementary and secondary grades should give some confidence to researchers (for research use) and personnel managers (for monitoring staff performance) for these new insights into teacher outcomes.

Finally, it is worth noting that the array of potential non-test measures that could plausibly be used for policy purposes is far wider than those we examined here. We focus on this set of non-test student academic outcomes because they are likely convenient starting points for states beginning to experiment with non-test measures, but the new federal law is not prescriptive on what the new measures should look like. Other types of performance measures based on student or teacher surveys, psychological evaluations, or school inspections by external raters could all theoretically be used to meet this provision in the law and have their own respective literatures supporting them.^{23} As of this writing, though, we do not know of states that have formally embraced any of these other types of measures.

## Acknowledgments

This work was produced under financial support from the John S. and James L. Knight Foundation. We gratefully acknowledge the cooperation of Miami-Dade County Public Schools, especially Gisela Feild, in providing data access and research support and allowing us to interview district and school personnel regarding Teach For America (TFA). We acknowledge the cooperation of TFA's national and Miami regional offices in conducting this research. We thank Victoria Brady and Jeff Nelson for excellent research assistance, and appreciate the helpful comments from Cory Koedel, Raegen Miller, Jason Atwood, Rachel Perera, James Cowan, and participants in presentations of findings at the 2015 AEFP Conference. Any and all remaining errors are our own.

The Knight Foundation funded both the TFA clustering strategy and this external evaluation of that strategy. The Knight Foundation, TFA, and Miami-Dade County Public Schools were cooperative in facilitating this evaluation and received early drafts of these findings, but none had editorial privilege over the actual content or findings presented here. CALDER has a current contract with TFA to conduct a separate evaluation of corps member impacts in a different region of the country, which TFA procured through a competitive bid process; these evaluations are independent of each other, and TFA has no editorial influence in the presentation of findings in either the current study or this separate contract.

## References

*Evaluating teaching, leading and learning*

## Notes

In most regions where it operates, TFA is an alternative certification program and is widely recognized as such. In Miami, however, TFA does not set teachers up for permanent certification and cannot be technically considered an alternative certification program.

Note that school effects on non-test student academic outcomes like the ones we examine here are one of several plausible types of non-test school performance measures that could be adopted under the new law. These other types of non-test school measures will be discussed further in the Conclusion.

Last year, a consortium of nine large school districts in California, known as the CORE Districts, announced the development of its School Quality Improvement Index, which will include multiple measures of social and emotional learning in students and culture-climate indicators for schools. Several of the non-tested academic outcomes we explore in this paper are included as measures in this new index, and in the interim period leading up to the rollout of school climate instruments these administrative measures will exclusively be used to stand in for the non-test school performance measures. See the announcement available at http://coredistricts.org/our-work/school-quality-improvement-system/. We also found the inclusion of student absence measures for elementary and middle schools in a proposal for new accountability measures in Texas (see “2016 Accountability and Planning for House Bill 2804 Implementation,” available at https://tea.texas.gov).

We acknowledge that some of these measures may also be easily manipulated by teachers or administrators if accountability stakes were attached to them. We further address this issue in our discussion of the findings.

Grade point average is calculated using course grades from transcripts, where earning an A corresponds to 4 grade points, earning a B equals 3 grade points, etc.
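As a minimal sketch of this calculation, the Python snippet below averages grade points over a student's transcript courses. The note specifies only A = 4 and B = 3; the remaining mapping (C = 2, D = 1, F = 0) and the equal weighting of courses are assumptions made for illustration, and the name `grade_point_average` is hypothetical rather than taken from the study's code.

```python
# Grade-point mapping: A and B follow the note; C, D, and F are assumed
# to continue the same one-point steps.
GRADE_POINTS = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}


def grade_point_average(letter_grades):
    """Return the unweighted mean of grade points across course grades."""
    if not letter_grades:
        raise ValueError("at least one course grade is required")
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


# One student's transcript for a year: (4 + 3 + 3 + 2) / 4
print(grade_point_average(["A", "B", "B", "C"]))  # 3.0
```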

For ease of notation, we write as in the remainder of the text, and it is meant to apply to any candidate non-test outcome. The procedure will be performed separately for each outcome.

The vector of student characteristics includes the following: race; gender; FRPL eligibility; LEP status; and mental, physical, or emotional disability status. The vector of classroom characteristics includes class size and classroom-level averages of each of the student characteristics listed above. Teacher controls include teacher race, gender, experience, and whether the race of the teacher matches that of the student. The student characteristic, class average, and teacher demographic controls are interacted with grade indicator variables to allow differences in the influence of these variables across grades. The estimating equation additionally includes indicator variables for grades and years. Standard errors are clustered at the school level.

The discussion that follows is meant to provide a conceptual overview of how forecasts of teachers’ value-added contributions are constructed. In practice, we use the *vam.ado* program developed for use by Chetty, Friedman, and Rockoff (2014). See their online Appendix A, available at https://assets.aeaweb.org/assets/production/articles-attachments/aer/app/10409/20120127_app.pdf, for a step-by-step guide to implementing the methods developed in the paper.

In all regressions, includes FRPL eligibility, English language learner (ELL) status, gender, race, special education status, and an indicator of whether the student has an identified learning disability. includes a cubic function of the lagged outcome variable (unless the outcome is binary, in which case it is simply one indicator variable), and sometimes will include lagged values of other outcomes, as discussed below. For all outcomes other than GPA, for the validation procedure, we first transform each non-test outcome and its lag by adding 1, taking the log, and then standardizing the values to be mean zero and standard deviation 1.
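The transformation applied to the non-GPA outcomes for the validation procedure can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function name `standardized_log` is hypothetical, the natural log is assumed, and the population standard deviation is used because the note does not say which variant was applied.

```python
import math
from statistics import mean, pstdev


def standardized_log(values):
    """Add 1, take the natural log, then rescale to mean 0 and SD 1."""
    logged = [math.log(v + 1) for v in values]
    center = mean(logged)
    spread = pstdev(logged)  # assumption: population SD
    if spread == 0:
        raise ValueError("outcome has no variation to standardize")
    return [(x - center) / spread for x in logged]


# Hypothetical days-absent counts for four students.
absences = [0, 2, 5, 12]
for raw, z in zip(absences, standardized_log(absences)):
    print(raw, round(z, 3))
```

Adding 1 before taking the log keeps students with zero absences or suspensions in the sample, since the log of zero is undefined.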

Specifically, we choose the vector to minimize the mean-squared error of forecasts of test scores across all teacher–year observations in the sample: This is equivalent to regressing on the observations contained in the vector and obtaining the coefficients.
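One way to read the regression described in this note, following the best-linear-predictor idea in Chetty, Friedman, and Rockoff (2014), is sketched below. It is a simplified illustration and not the *vam.ado* implementation: it assumes each teacher–year has a mean residual outcome for the current year and for exactly two other years, fits a no-intercept least-squares regression of the former on the latter, and uses made-up numbers. The names `forecast_weights`, `lag1`, and `lag2` are hypothetical.

```python
def forecast_weights(current, lag1, lag2):
    """No-intercept OLS of current-year mean residuals on two other-year
    mean residuals, solved from the 2x2 normal equations."""
    s11 = sum(a * a for a in lag1)
    s22 = sum(b * b for b in lag2)
    s12 = sum(a * b for a, b in zip(lag1, lag2))
    s1y = sum(a * y for a, y in zip(lag1, current))
    s2y = sum(b * y for b, y in zip(lag2, current))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("other-year measures are collinear")
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


# Toy teacher-year data (invented): the current-year value is built as
# 0.5 * lag1 + 0.25 * lag2, so the fitted weights should recover those values.
lag1 = [0.2, -0.1, 0.4, 0.0]
lag2 = [0.1, 0.3, -0.2, 0.5]
current = [0.5 * a + 0.25 * b for a, b in zip(lag1, lag2)]

w1, w2 = forecast_weights(current, lag1, lag2)
# The forecast for a teacher is the weighted sum of the other-year measures.
forecast = [w1 * a + w2 * b for a, b in zip(lag1, lag2)]
print(round(w1, 3), round(w2, 3))
```

The fitted weights minimize the squared forecast error across the teacher–year observations, which is the sense in which the regression and the minimization in the note coincide.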

This method is referred to as the Full Roster Method (FRM) by Hock and Isenberg (2012).

In the paper where Chetty, Friedman, and Rockoff (2014) develop their test for forecast bias, they keep one observation per student per year; thus, there are no instances of multiple teachers per student in their data. In our case, students are linked to multiple teachers in a given year. As pointed out by Isenberg and Walsh (2015), since some students are linked to more teachers than others, they will have a larger contribution in estimating the coefficients on explanatory variables in equation 4. Isenberg and Walsh (2015) recommend the Full Roster–plus Method (FRM+), which involves creating duplicate replications. However, due to the very large number of student–teacher links in our data, this suggestion is computationally infeasible, and we do not implement it. Instead, we check robustness by removing students linked to very few or very many teachers: when the sample is restricted to students linked to between 7 and 12 teachers in a given year (about 80 percent of the sample), results for the quasi-experimental tests for forecast bias are similar. Finally, Isenberg and Walsh (2015) note that results using FRM and FRM+ are very similar.

The correlations presented here are likely larger than the true correlation in underlying teacher skills due to factors such as unobserved student heterogeneity across outcomes. However, as the outcomes of interest in this paper are the forecasts of teacher effectiveness, estimating these true correlations is beyond the scope of the paper.

Of course, the measures could still be useful for teachers not in tested grades and subjects.

According to the M-DCPS handbook, suspensions count as excused absences, not unexcused. See www.dadeschools.net/schoolboard/rules/Chapt5/5a-1.041.pdf.

We also experimented with estimating teacher effects for whether students ever graduate. We do not include results in this paper for two reasons. First, controlling for lagged graduation outcomes is not feasible (students who graduate exit the data). And second, our panel is not sufficiently long to observe graduation outcomes for many students. When applying our methods to graduation outcomes, we find very high degrees of forecast bias and no relationship between TFA and graduation.

Consistent with Ladd and Sorensen (2017), in results available from the authors, we find that teacher experience is associated with a reduction in students’ unexcused absences, especially in elementary school. We do not find a relationship between teacher experience and student outcomes for the other non-test academic outcomes that we consider.

Many TFA corps members in M-DCPS are placed in schools under the direction of the Educational Transformation Office (ETO), which oversees the district's implementation of school turnaround efforts in targeted schools. Results are very similar when adding a year × ETO interaction term, which should not be surprising because, in the presence of school fixed effects, estimates are identified from within-school variation.

One could also use the estimates of forecast bias presented in table 6 to scale the estimates in table 7 accordingly. For example, the estimated coefficient on elementary GPA is 1.10, representing a 10 percent forecasting bias over what we would assign as due to the teacher's input; reducing the TFA point estimate on elementary GPA in table 7 by 10 percent would result in a value of 0.045 rather than 0.050 points on the GPA scale. Scaling this way would not qualitatively change the results presented here.
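The arithmetic in this example can be written out as a short sketch. The function name `scale_for_forecast_bias` is hypothetical, and the rule it applies (shrinking the point estimate by the bias share) is simply the rescaling described in this note, not a general correction procedure.

```python
def scale_for_forecast_bias(point_estimate, bias_share):
    """Shrink a point estimate by the estimated share of forecast bias."""
    return point_estimate * (1.0 - bias_share)


# Elementary GPA: a coefficient of 1.10 is read in the note as 10 percent bias,
# and the TFA point estimate in table 7 is 0.050 GPA points.
scaled = scale_for_forecast_bias(0.050, 0.10)
print(round(scaled, 3))  # 0.045
```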

When estimating these additional specifications, we included interaction terms for year × ETO (see footnote 18) as additional control variables. Hence, any school-level investments that may bias these results would have to be separate from the ETO interventions.

Our TFA results are estimated on the full sample of TFA corps members and alumni. About 80 percent of the TFA sample consists of corps members (i.e., teachers in their first or second year), and thus estimating TFA impacts for alumni only yields very large standard errors.

The exception is the percent of classes failed for high school students, for which we find a statistically significant increase for students in TFA classrooms. However, although we cannot reject forecast-unbiasedness of this outcome due to large standard errors, the point estimate for bias in table 6 is 22 percent, and in table 5 we reject equality with 1. Thus, we do not place a great deal of weight on this finding, although it deserves future exploration.

For example, see Ferguson (2012) or Kane and Staiger (2012) for evidence and discussion of the use of student surveys or Kraft and Papay (2014) for the use of teacher surveys; Duckworth et al. (2007) for psychological assessments; Hussain (2015) for school inspections; and Schwartz, Stiefel, and Wiswall (2016) for school climate-based measures.

## Appendix A: Additional Data

| | Unexcused Absences | Suspension Absences | GPA | % Classes Failed | Repeater |
|---|---|---|---|---|---|
| High school | 0.126 | 0.007 | −0.025 | 0.008^{**} | 0.003^{***} |
| | (0.204) | (0.028) | (0.018) | (0.004) | (0.001) |
| Observations | 4,242,829 | 4,242,829 | 4,226,106 | 4,242,791 | 4,242,829 |
| R^{2} | 0.512 | 0.175 | 0.577 | 0.285 | 0.132 |
| Dependent variable mean, full sample | 7.59 | 0.53 | 2.57 | 0.08 | 0.028 |
| Dependent variable mean, students in TFA classrooms | 12.3 | 1.14 | 2.33 | 0.12 | 0.012 |


*Notes:* Regression controls for student-level and class average demographics and their interactions with grade. Other controls include class size and teacher race and their interactions with grade. Standard errors are in parentheses. GPA: grade point average; TFA: Teach For America.

Because all outcomes shown here have an estimated forecast bias greater than 20 percent (see table 5), *these estimates should not be taken as credible estimates of TFA effectiveness.*

^{**}Significant at the 5% level; ^{***}significant at the 1% level.