Abstract

In this policy brief we argue that, although there is little debate about the statistical properties of value-added model (VAM) estimates of teacher performance, there is little consensus about what the evidence on VAMs implies for their practical utility as part of high-stakes performance evaluation systems. A review of the evidence base that underlies the debate over VAM measures, followed by our own subjective assessments of the value of using VAMs, illustrates how different policy conclusions can easily arise even given general agreement about an existing body of evidence. We conclude the brief by offering a few thoughts about the limits of our knowledge and what those limits mean for those who do wish to integrate VAMs into their own teacher evaluation strategy.

Introduction

We contend that there is little academic disagreement about the statistical properties of value-added model (VAM) estimates currently being implemented to measure teacher job performance, or about the current state of in-service teacher evaluations.1 In particular, we believe that most researchers could sign on to five stylized facts: (1) there is important variation in teacher effectiveness that has educationally significant consequences for student achievement, as measured on standardized tests (and, new evidence suggests, for later life outcomes as well); (2) most formal teacher evaluations are far from rigorous in that they largely ignore performance differences between teachers; (3) VAM measures are likely to contain real information about teacher effectiveness that could be used to inform personnel decisions and policies, but they are subject to potential biases; (4) VAM measures are noisy, varying from year to year or class section to class section for reasons other than variation in true effectiveness; and (5) little is known about how current and potential teachers might respond to the use of VAM measures in evaluating their job performance.

As we see it, the real debate about VAMs is less about the known properties of these measures, and more about what the evidence about VAMs implies for their practical utility as part of high-stakes performance evaluation systems.2 This policy brief offers a case in point—we agree about what research has to say about teacher evaluation, the properties of value-added measures, and the ways in which VAMs could conceivably be used for workforce improvement, but we do not reach the same conclusions about the value of VAMs in making high-stakes personnel decisions. Our views are not diametrically opposed but stem from subjective assessments about the risks associated with VAMs, judgments about the alternatives, and skepticism about the degree to which education systems will change without their use. As we will show, given the same body of evidence, the VAM glass can easily be viewed as half empty or half full.

In the next section, we briefly review the evidence base that underlies the debate over VAM measures, organized loosely around the five stylized facts listed earlier. We then describe theories of action connecting VAM measures to educational improvement and the limited evidence available on how those theories have played out in practice. Successful use of VAMs in the real world requires a well-specified link between teacher measurement and human resource management actions; unfortunately, implementation in practice often proceeds without much attention to this link. The remainder of the brief is devoted to our differences of opinion about the use of VAMs, which illustrate how different conclusions can easily arise from the same body of evidence, and to concluding thoughts about the limits of our knowledge and what they mean for those who do wish to integrate VAMs into their own teacher evaluation strategy.

A Brief Review of What We Know about VAMs

A near-universal point of agreement among education researchers is that teachers matter, or, put another way, that there is important variation in teacher effectiveness that has educationally significant consequences for students. This conclusion is an outgrowth of value-added studies that find an effect size of individual teachers in the neighborhood of 0.10 to 0.25 standard deviations (Hanushek and Rivkin 2010; Goldhaber and Hansen 2012).3 To put this effect in perspective, a one standard deviation difference in teacher effectiveness amounts to about 10 to 25 percent of a grade-level's worth of growth in the elementary grades.4 The magnitude of these effects has led to the oft-used refrain that teacher quality is the most important in-school factor affecting student achievement.

Unfortunately for policy makers and school leaders, variation in achievement gains between teachers appears to be only weakly related to licensure, experience, and degree attainment, the credentials now used by most states and school districts to determine employment eligibility and compensation (Goldhaber, Brewer, and Anderson 1999; Aaronson, Barrow, and Sander 2007; McCaffrey et al. 2009).5 Moreover, existing in-service teacher evaluations fail to recognize important differences in learning gains across classrooms. In their well-known report The Widget Effect, Weisberg et al. (2009) found little to no documented variation in teacher job performance ratings in surveyed school districts.6 In the absence of variation, these ratings played little to no role in key human capital decisions such as tenure, promotion, and compensation. Together, these findings have led researchers and policy makers to consider alternative measures of job performance, such as VAMs, that aim to directly and objectively estimate teachers' impact on student achievement.

Put simply, VAMs use historical data on student achievement to quantify how a teacher's students performed relative to similar students taught by other teachers (e.g., students with the same baseline performance, socioeconomic background, educational needs, and so on). Value-added models are designed to statistically control for differences across classrooms in student inputs, such that any remaining differences can be used to infer teachers’ impacts on achievement. As is true for any statistical model, users of VAMs must be attentive to issues of attribution (bias), imprecision (noise), and model specification. No model is perfect, and thus the relevant question for the VAM debate is whether these measures as currently constructed contain enough useful information that their integration into performance evaluation systems, and the use of these performance evaluations as part of human resource management systems, is likely to lead to improvements in the quality of the teacher workforce.
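
To make the mechanics concrete, the following stylized simulation sketches the covariate-adjustment approach just described: regress current scores on prior scores plus teacher indicators, and read the centered teacher coefficients as value-added estimates. This is a minimal sketch of the general idea, not any particular state's implementation, and all parameter values (class size, effect and noise standard deviations) are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, class_size = 50, 25
n = n_teachers * class_size

# Illustrative data-generating process: each student has a prior-year
# score and one teacher; true teacher effects have an SD of 0.15,
# within the 0.10-0.25 range discussed earlier.
teacher = np.repeat(np.arange(n_teachers), class_size)
true_effect = rng.normal(0, 0.15, n_teachers)
prior = rng.normal(0, 1, n)
score = 0.7 * prior + true_effect[teacher] + rng.normal(0, 0.5, n)

# Covariate-adjustment VAM: regress current score on prior score plus
# teacher dummies; the dummy coefficients are the value-added estimates.
dummies = (teacher[:, None] == np.arange(n_teachers)).astype(float)
beta, *_ = np.linalg.lstsq(np.column_stack([prior, dummies]), score, rcond=None)
vam = beta[1:] - beta[1:].mean()   # center: VAM measures relative performance

# The estimates track the true effects closely here because the model
# exactly matches the data-generating process; they are still noisy.
print(np.corrcoef(vam, true_effect)[0, 1])
```

Even in this best case, where nothing relevant is omitted, the estimates contain sampling noise; bias enters when the assignment of students to teachers depends on factors the model does not capture, the subject of the paragraphs that follow.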

Questions of model specification, bias, and precision have been extensively studied. With respect to model specification, there appears to be general agreement that VAM estimates are not terribly sensitive to common specification choices, such as the inclusion or exclusion of student- or classroom-level covariates, but that there is greater sensitivity to whether models include school or student fixed effects.7 VAM estimates also show some sensitivity to the student outcomes (or domains on a test) employed as the metric to judge teachers (e.g., Lockwood et al. 2007; Corcoran et al. 2011; Papay 2011).
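
The flavor of such sensitivity comparisons is easy to convey by extending the simulation above with between-school differences and correlating teacher effects estimated under two specifications. The school-effect and noise magnitudes below are our illustrative assumptions, chosen only to echo the pattern described in endnote 7, not estimates from the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools, per_school, class_size = 20, 5, 25
n_teachers = n_schools * per_school

# Teachers nested in schools; between-school differences (SD 0.20, an
# illustrative value) sit on top of true teacher effects (SD 0.15).
teacher = np.repeat(np.arange(n_teachers), class_size)   # student level
school = teacher // per_school                           # student level
true_eff = rng.normal(0, 0.15, n_teachers)
school_eff = rng.normal(0, 0.20, n_schools)
prior = rng.normal(0, 1, teacher.size)
score = (0.7 * prior + school_eff[school]
         + true_eff[teacher] + rng.normal(0, 0.5, teacher.size))

def teacher_effects(y, x, t):
    """OLS of y on x plus dummies for the teacher ids in t; returns the
    centered dummy coefficients."""
    dummies = (t[:, None] == np.arange(t.max() + 1)).astype(float)
    b, *_ = np.linalg.lstsq(np.column_stack([x, dummies]), y, rcond=None)
    return b[1:] - b[1:].mean()

# Specification A: prior score only (school differences load onto teachers).
vam_a = teacher_effects(score, prior, teacher)

# Specification B: school fixed effects, approximated here by estimating
# and centering teacher effects within each school separately.
vam_b = np.empty(n_teachers)
for s in range(n_schools):
    rows = school == s
    vam_b[s * per_school:(s + 1) * per_school] = teacher_effects(
        score[rows], prior[rows], teacher[rows] - s * per_school)

# Agreement across the two specifications is only moderate, echoing the
# 0.3 to 0.5 correlations reported in endnote 7.
print(np.corrcoef(vam_a, vam_b)[0, 1])
```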

The question of whether and to what extent VAMs are biased—that is, whether test-score gains attributed to a teacher are due to student sorting or some other factor that varies with teacher assignment—is less settled. Rothstein (2010), for example, shows that in standard VAMs, the teachers students are assigned to in the future have statistically significant power in predicting those students' past achievement. This finding clearly cannot be causal, and suggests that typical models do not adequately account for the processes leading to the matching of students and teachers in classrooms.8 On the other hand, Goldhaber and Chaplin (2012) and Kinsler (2012b) find that "the Rothstein test" may not work as intended, and may suggest model misspecification where none exists. In a study based on random assignment of teachers to classrooms, Kane and Staiger (2008) showed that VAMs do not appear to be biased. Although promising, this experiment has some limitations, most notably a small sample size and the fact that it includes only teachers who were deemed eligible by their principals to teach each other's classes (i.e., teachers who are seen to be exchangeable).9 Consequently, the results may not be fully generalizable. Finally, Chetty, Friedman, and Rockoff (2011) use VAM falsification exercises different from Rothstein's and find little evidence of bias.
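
The intuition behind the falsification test can be seen in a deliberately stripped-down simulation. This is our simplification of the idea, not Rothstein's actual specification: when classes are formed by tracking on past performance, next year's teacher assignment "predicts" this year's gains, something a causal teacher effect cannot do; under random assignment the pattern disappears.

```python
import numpy as np

rng = np.random.default_rng(2)
n_students, n_teachers = 2000, 40

# Grade-4 gains contain a persistent student component that schools can
# observe (at least informally) when forming next year's classes.
ability = rng.normal(0, 0.5, n_students)
gain_g4 = ability + rng.normal(0, 0.5, n_students)

# Tracked assignment: grade-5 classes formed by ranking on grade-4 gains.
tracked = np.argsort(np.argsort(gain_g4)) * n_teachers // n_students
# Placebo: grade-5 teachers assigned at random.
randomized = rng.integers(0, n_teachers, n_students)

def future_teacher_spread(assignment):
    """SD across grade-5 teachers of their students' mean grade-4 gains:
    a large spread means the future teacher 'predicts' past achievement."""
    totals = np.bincount(assignment, gain_g4, minlength=n_teachers)
    counts = np.bincount(assignment, minlength=n_teachers)
    return (totals / counts).std()

print(future_teacher_spread(tracked))     # large: the test "fails"
print(future_teacher_spread(randomized))  # small: sampling noise only
```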

The extent of imprecision, or noise, in VAM measures is another issue that has been extensively studied. Several recent studies have found teacher value-added estimates to be only moderately reliable or stable from year to year, with adjacent-year correlations in the neighborhood of 0.3 to 0.5 (e.g., McCaffrey et al. 2009; Goldhaber and Hansen 2012). Whether or not these correlations are sufficiently strong is in the eye of the beholder. On the one hand, correlations of this magnitude result in a nontrivial proportion of teachers being rated as "ineffective" or "highly effective" in one school year (or class section) while being rated as average in the next year (or section). On the other hand, as Glazerman et al. (2011) point out, the magnitude of these stability estimates is not much different from what is observed in other occupations whose output has been quantitatively measured. The magnitude of these reliability estimates likely does not support the exclusive use of VAM measures to classify teachers into performance categories (Schochet and Chiang 2010; Goldhaber and Loeb 2013).10 Although some evaluation systems put a large weight on VAM results, to our knowledge no proposed or adopted policy relies exclusively on VAMs.
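
The arithmetic behind these stability figures is straightforward to reproduce. In the sketch below (the parameter values are our illustrative choices, not estimates from the cited studies), a stable true effect combines with classroom-level shocks and sampling noise to yield an adjacent-year correlation inside the range just described.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, class_size = 1000, 25

true_eff = rng.normal(0, 0.15, n_teachers)   # stable signal across years

def yearly_estimate():
    # Each year's estimate = truth + classroom shock + sampling noise.
    # With 25 students and residual SD 0.5, the class-mean noise SD is
    # 0.5 / sqrt(25) = 0.10; the classroom shock SD (0.15) is illustrative.
    shock = rng.normal(0, 0.15, n_teachers)
    sampling = rng.normal(0, 0.5 / np.sqrt(class_size), n_teachers)
    return true_eff + shock + sampling

year1, year2 = yearly_estimate(), yearly_estimate()
# Roughly signal variance / total variance = 0.0225 / 0.055, about 0.4,
# inside the 0.3 to 0.5 range cited in the text.
print(np.corrcoef(year1, year2)[0, 1])
```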

The validation of value-added measures against long-run student outcomes—as opposed to year-to-year test score gains—has been less extensively studied. A number of studies, however, have documented "fade out" of teacher effects—that is, gains induced by teachers in one grade dissipate in later grades.11 One possible explanation for this finding is that the heightened focus on tests may encourage teachers to emphasize strategies that lead to short-run test score gains at the expense of long-run learning. On the other hand, there are some less pernicious explanations for fade out, including variation in test content across grades and test scaling effects (Cascio and Staiger 2012). Additionally, recent evidence from Chetty, Friedman, and Rockoff (2011) finds that, despite fade out in the short run, VAM measures still predict key later-life outcomes—e.g., teen pregnancy, college-going behavior, and labor market earnings—long after students have left a teacher's classroom.

Although we know a good deal about the properties of VAMs, existing research has little to say regarding how current and potential teachers are likely to respond to the use of VAM measures in high-stakes personnel decisions, other than with regard to whether there are short-run productivity effects of pay for performance (discussed below). Much of the available evidence on teacher value added comes from low-stakes settings, in which teacher evaluations were not linked to student performance or high-stakes personnel decisions.

Taken together, the balance of the evidence indicates that VAM measures are imperfect, and questions about bias are far from settled. We argue, however, that the more important question for policy is whether VAMs can improve upon existing or alternative measures of teacher job performance. As The Widget Effect pointed out, existing evaluation systems generally fail to recognize variation in teacher performance, and even imperfect measures may improve upon the status quo. Moreover, non-test-based measures of performance, such as classroom observations, student surveys, and portfolio assessment, suffer from their own imperfections and are as untested as VAMs (Harris 2012). We believe that policy makers ought not to focus solely on whether or not a VAM measure is biased or noisy but rather on the extent of bias and imprecision, and on whether VAMs provide either a more accurate picture of teacher quality than other means of assessing teachers and/or additional information that can be used to improve instruction.

Theories of Action: Connecting VAMs to Educational Improvement

Much is known about the properties of value-added measures, but successful implementation of VAMs in practice requires a sound theory of action connecting VAM measures to educational improvement. Unfortunately, VAMs are often implemented without a well-specified link between the two, and there is little evidence available to inform such links. By their nature, VAMs are norm-referenced measures: indicators of relative performance that use historical data to rank teachers based on classroom test performance. Thus the question is how these measures can be integrated into a performance evaluation system to raise overall teacher effectiveness.

There are two primary means by which VAMs might be used to improve educational outcomes. The first focuses on improving the effectiveness of incumbent teachers, whereas the second emphasizes changes in the composition of the teacher workforce. With respect to the former, VAMs could be used by school and district leaders as a professional development tool, identifying areas of weakness and providing models of successful instruction. School systems have historically invested vast sums in professional development, but generally little of this is targeted to the specific needs of individual teachers (Rice 2009; Toch and Rothman 2008). VAM measures could aid in better targeting these investments to the teachers who need them most. Furthermore, the availability of rewards for high performance may incentivize existing teachers to increase effort levels or improve their skills by learning from successful colleagues.12

The second mechanism operates through changes in the composition of the teacher workforce. Some researchers have argued that a policy of denying tenure to, or dismissing, the lowest-performing teachers would substantially raise the quality of the teacher workforce and have large effects on the life outcomes of students (Staiger and Rockoff 2010; Chetty, Friedman, and Rockoff 2011; Hanushek 2011). A practice of recognizing and rewarding effective teachers, either through higher pay or promotion, could also improve quality by attracting potential teachers who might otherwise be drawn to fields that better reward their skills.13 Finally, the availability of rewards for high performance may help with the retention of effective teachers.

Unfortunately, with practical implementation of VAMs in its infancy, we have almost no direct evidence on these proposed mechanisms. A notable exception is a series of experimental studies of teacher performance incentives tied to VAMs, which suggest that such incentives have little impact on the productivity of practicing teachers.14 These studies remain small in number, however.

In the absence of much evidence on the mechanisms by which VAMs could improve educational outcomes, proponents and detractors of VAMs have largely relied on reasoned speculation about whether these measures are likely to improve educational outcomes. These beliefs can be grounded in the existing evidence on VAM measures but hinge on whether one views the VAM glass as half empty or half full. On the one hand, each of these theories of action is compelling. The teaching profession has a poor track record of rewarding performance, whether through higher pay, increased responsibility, or career advancement (Johnson and Papay 2009). Teaching's reward structure may discourage talented graduates from entering the profession (Hoxby and Leigh 2004), fail to retain the best teachers (Chingos and West 2012; TNTP 2012), and provide weak incentives for effort and improvement.15 VAM measures may be used as part of a system to address some of these issues, and to do so it is not necessary for VAMs to be perfect indicators of teacher effectiveness (Glazerman et al. 2011).

On the other hand, VAMs may not be well suited to support these theories of action. If VAM measures are sufficiently biased and/or unreliable, they may lead to incorrect personnel decisions and misallocated resources. Bias and instability can also undermine trust in the system, and the risk associated with employment or compensation instability could dissuade potential teachers from entering the profession rather than attract them (Rothstein 2012).

Along the same lines, even if VAM measures have acceptable properties from a statistician's point of view, their complex calculation and inherent variability can limit their face validity among practitioners. For evaluation systems built on VAMs to improve the instruction techniques of existing teachers, practitioners will need to see direct connections between their day-to-day practice and their performance evaluations. Today, most VAM-based evaluation systems only provide information about teachers’ effectiveness categories or relative ranking, not direct information that can be used to improve particular aspects of practice. This is one of the reasons it makes sense to use VAM only as a part of a well-rounded evaluation system.

Where You Stand Depends on Where You Sit

We began the brief with the contention that most researchers would agree with our stylized facts about VAM measures of teacher effectiveness and the current state of performance evaluation in the teaching profession. But agreeing on the stylized facts does not necessarily mean agreement on the practical utility of using VAMs or their potential to improve teacher effectiveness. In fact, we do not entirely see eye to eye on these issues.

Glass Half Empty: Corcoran

I (Corcoran) tend to view the VAM glass as half empty. Although I would agree that performance evaluation in teaching is lacking and sorely in need of reform, I believe the potential for VAM measures to dramatically improve teaching effectiveness and the quality of the profession tends to be overblown. Student achievement should be an important part of a new and improved system of evaluation. As statistical estimates based on historical and limited data, however, VAM measures lack transparency and are inherently limited by imprecision. If teacher quality is to be improved, teachers and school leaders need instructive, actionable information they can use to make meaningful changes to their practice sooner rather than later. A statistical prediction that relies on annual tests and multiple years of data to produce reasonably reliable value-added estimates does not, in my view, meet this requirement.

VAM measures may turn out to be useful indicators of relative performance—separating the very high- and very low-performing teachers from the rest of the pack. This information could be fruitfully used by principals as an early warning signal or (in extreme cases) as grounds for dismissal. Their utility as a job performance indicator for a significant number of teachers is another matter, however. Given the inherent instability of VAM estimates, a high-stakes system tied to VAMs would need to be conservatively designed, reserving punishment and reward only for those whose performance is demonstrably very low or very high, estimated with an acceptably low level of statistical uncertainty.16 But a VAM system that meets these conservative standards would ultimately apply only to the most extreme cases, and would provide little feedback to the bulk of teachers. This raises the question of what VAMs would add beyond the subjective evaluation of principals or other educators, who are presumably capable of identifying the very worst (or best) teachers (e.g., see Jacob and Lefgren 2008). Finally, it is important to keep in mind that "value added" is not a unidimensional concept. There are as many value-added measures as there are tests and subject areas, and VAMs have been found to be only moderately correlated across them. There is room for combining information across tests (Lefgren and Sims 2012), but policy makers may find that creating decision rules to objectively identify the "best" and "worst" teachers is easier said than done.
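
On combining information across tests, the sketch below shows the simplest version of the idea: precision-weighting two noisy measures that share a common component. This is a generic inverse-variance example in the spirit of, but far simpler than, the Lefgren and Sims estimator, and the noise levels are our illustrative assumptions. It also illustrates the caveat above: the gain materializes only to the extent that the measures really do share a common dimension of effectiveness, and only if the noise variances can be credibly estimated.

```python
import numpy as np

rng = np.random.default_rng(4)
n_teachers = 5000
quality = rng.normal(0, 1, n_teachers)   # shared effectiveness component

# Two test-based measures of the same teachers (say, math and reading
# VAMs) with different noise levels; the values are illustrative.
math_vam = quality + rng.normal(0, 0.8, n_teachers)
read_vam = quality + rng.normal(0, 1.2, n_teachers)

# Inverse-variance weighting: weight each measure by 1 / (noise variance).
# The variances are known by construction here; in practice they must be
# estimated, which is where the real difficulty lies.
w_math, w_read = 1 / 0.8**2, 1 / 1.2**2
combined = (w_math * math_vam + w_read * read_vam) / (w_math + w_read)

for name, m in [("math", math_vam), ("reading", read_vam), ("combined", combined)]:
    print(f"{name:8s} corr with true quality: {np.corrcoef(m, quality)[0, 1]:.2f}")
```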

I am encouraged by the prospect for reform in teacher evaluation and the renewed focus on student achievement. Our understanding of teacher labor markets and teacher effectiveness would not be where it is today without advances in value-added measurement and the careful linking of student-level achievement data to teachers over time. That VAM measures have proven invaluable to research, however, does not imply they will be useful as on-the-job performance measures. Inferences about teacher effects on average are quite different from inferences about individual teachers, and I tend to be more pessimistic about the latter. The attachment of high stakes to measures that are meant to be informative about the progress of students has the potential to undermine the validity of the measures themselves, to encourage teaching to the test, and (at worst) to invite cheating. With professional careers and the education of our nation's children on the line, our educators need a new evaluation system that is transparent, informative, and responsive to their needs. Although VAMs have a role to play in this system, policy makers should temper their expectations and limit their high-stakes use.

Glass Half Full: Goldhaber

I (Goldhaber) view the VAM glass as half full. I do not disagree with the technical points or potential negative incentives associated with connecting VAM measures to high-stakes decisions described above. This is one of the reasons why I think that VAM measures ought only to be a component of a well-rounded evaluation system that also, for instance, includes classroom observations.17 The concern that focusing teacher accountability on student test achievement might lead to pernicious behaviors that distort the learning process in schools—Campbell's Law—is entirely valid, which is why evaluation reform probably ought to happen in conjunction with measures designed to guard against these outcomes.18 Yet despite some wariness about VAMs and their use, I believe we should experiment with incorporating VAM as a factor in evaluations that are used for such high-stakes purposes as pay, tenure, and promotion determination.

I would argue that VAMs ought to be used for three primary reasons. First, evaluation is to some extent about determining which teachers should stay in the profession (and perhaps in which positions and at what compensation levels). Even those who believe the overwhelming majority of teachers are successful enough to merit being in the profession likely recognize that some, perhaps quite small, proportion of teachers are not very effective and ought to be dismissed. And, as we have recently witnessed, economic circumstances occasionally necessitate that some teachers lose their jobs. We cannot use performance evaluations to make personnel decisions if there is no variation in the evaluation ratings in the workforce. Unless the act of evaluating teachers itself makes them more effective,19 school systems are investing time and effort in the evaluation endeavor for little usable information.

Second, I see value added as a catalyst for broader changes to teacher evaluation. I am quite skeptical that we would be engaged in what is now almost a national experiment with new, and hopefully more rigorous, teacher evaluation were it not for the specter of VAM usage. It has been over two decades since research (Hanushek 1992) showed just how important the variation in teacher effectiveness is for determining student outcomes.20 Yet policy makers and practitioners have, by and large, been unable or unwilling to develop credible teacher evaluation systems that recognize the important differences that exist between teachers. As I noted earlier, I understand the need to be careful in making changes to the system, but surely it is not unreasonable to have expected more movement on this issue over such a long period of time.

Third, and perhaps most importantly, evidence suggests that VAM measures are better at predicting future student test achievement than the other credentials—licensure, degree, and experience levels—that are now used for high-stakes purposes (e.g., Goldhaber and Hansen 2010), and better than other means of assessing teachers (Glazerman et al. 2011; Harris 2010; Tyler et al. 2010). To the extent that evaluations are in fact used for any purpose, we would want them to be good predictors of student achievement, and, although imperfect, VAMs look pretty good compared with the other options currently out there. There is a laser focus on the known flaws of VAMs, while other methods of teacher evaluation have basically been given a pass.

There is no doubt that VAM-informed decisions about teachers will sometimes, because of bias or imprecision, lead to teacher classification errors (Goldhaber and Loeb 2013). We clearly want to be careful to limit these errors, but we also have to recognize that it is almost certainly not optimal for students to entirely eliminate the downside risk to teachers of classification errors. The reason is obvious but still merits repeating: there is an inherent trade-off, in that reducing the number of teachers who are unfairly classified as ineffective increases the number of ineffective teachers who are classified as effective. In other words, to some extent what is best for teachers may not be best for students. Again, this is an issue that exists for all evaluation systems, not just those that utilize VAM-based information.

Going Forward

As is clear from the preceding section, we do not entirely agree with one another about the extent to which policy makers ought to use VAMs for high-stakes purposes, but there is significant agreement about how we ought to move forward with some use of VAMs. First, implementation matters. It matters not only because of the support systems we discussed here, which are likely to be crucial for the integrity of an evaluation system emphasizing value-added measures, but also because we are talking about a system that will affect human beings, whose responses to the system will go a long way toward determining its effectiveness. Specifically, the behavioral response to the use of VAMs depends not only on the design of the system but also on teachers' (and other stakeholders') reactions to it. This means that clear and constant communication about why particular modeling decisions were made, how the system works, and what the consequences are for teachers who receive different evaluation ratings is essential.

Second, as we have hopefully stressed herein, our current understanding of the impact of VAMs in practice is limited because this impact depends on human beings' responses to how VAMs are used.21 This means there is a good deal of room for debate about the extent to which VAM usage, particularly for high-stakes personnel decisions, would lead to good (e.g., more feedback on effective teaching practices) or bad (e.g., a narrowing of the curriculum) outcomes. Given this, we recommend that policy makers roll out plans with an eye toward evaluation and modification.22 This is decidedly not how new education policies tend to be implemented; policy makers or practitioners who say "we are pretty sure this is a good plan, but we might have to change it soon" probably reduce the chances they will hold key positions of influence in the future. Fortunately, solving this political problem is outside the scope of our brief (and perhaps not possible).

Third, policy makers should, as best as possible, anticipate indirect and unintended consequences (not all of which are necessarily negative). For instance, attaching stakes for teachers to student test achievement will encourage cheating, so school systems should clearly be considering auditing systems to discourage cheating and detect it when it happens. As we noted earlier, increased job or compensation risk may make teaching a less desirable occupation. These are obvious examples; it is worth spending some time and effort planning for the obvious issues that will arise and anticipating some that may not be immediately obvious.

The bottom line is that we are entering a new world of teacher evaluation, a world in which recent policies dictate that VAMs will play a role. But, in implementing VAM-based reforms, we believe one should view the current policy push not as the end product but as part of the evolution toward a better system.

Notes

1. Henceforth we use the terms "teacher effectiveness" and "teacher job performance" interchangeably.

2. In fact, we speculate that any means of evaluating teachers that leads to a spread of summative teacher performance ratings, which in turn have consequential implications for teachers' jobs or compensation, would be controversial. As we briefly touch on subsequently, this may be because of teacher concerns about disruptive effects on teacher collegiality or a general resistance to moving away from the current system, where there are few job consequences for poor teaching (because poor teaching is rarely identified through a formal evaluation process).

3. The estimates are typically in the neighborhood of 0.10 to 0.15 for within-school estimates and 0.15 to 0.25 for estimates that also include between-school differences in teacher effectiveness.

4. In the lower grades a typical student gains about one standard deviation in math and reading achievement per year (Bloom et al. 2008). Thus a one standard deviation difference in teacher effectiveness—about the difference between having a teacher at the sixteenth percentile of the performance distribution and one at the fiftieth percentile—amounts to about 10 to 25 percent of a grade-level's worth of achievement in the elementary grades. The relative impact of teacher effectiveness is even larger in middle and high schools, since students gain less (in standard deviation units) from grade to grade.

5. For more evidence on the relationship between these credentials and student achievement, see Rockoff (2004), Rivkin, Hanushek, and Kain (2005), Goldhaber and Brewer (1997, 2000), Boyd et al. (2009), Clotfelter, Ladd, and Vigdor (2010), Kane et al. (2010), and Goldhaber and Hansen (2012).

7. Correlations across different models are generally high (over 0.9) for models that vary in terms of how they handle covariate adjustments for student background, classroom, and school characteristics (Ballou, Sanders, and Wright 2004; Harris and Sass 2006; Lockwood et al. 2007), but the correlations between these models and those that include student or school fixed effects are far lower, in the range of 0.3 to 0.5 (Harris and Sass 2006; Goldhaber, Gabele, and Walch 2012), with the range dependent on the subject as well as the number of years of teacher data informing the model. Importantly, models with student fixed effects are much more commonly used in research than in practice, due to their high data and computational demands.

8. Rothstein (2009) suggests that bias is likely to be quite small when VAMs include several years of prior test scores as control variables.

9. A recently released report from the well-known Measures of Effective Teaching project (Kane et al. 2013) is based on a similar experiment, with a much larger teacher and student sample, and also concludes that VAMs provide unbiased estimates of teacher effectiveness. The study is still limited, however, in that the findings might not be generalizable beyond the group of teachers whom principals deemed eligible to be randomized within schools.

10. Schochet and Chiang (2010), for instance, use simulated data based on plausible estimates of the signal-to-noise ratio in teacher effect estimates and conclude that, if three years of data are used to estimate teacher effectiveness, the probability of identifying an average teacher as "exceptional" or "ineffective" (roughly one standard deviation above or below the mean, respectively), a Type I error, is about 25 percent. Conversely, the probability that a truly exceptional teacher is not identified, a Type II error, is also about 25 percent.
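
A back-of-the-envelope simulation in the same spirit (not Schochet and Chiang's actual design; the flagging threshold, the band defining "average," and the noise levels are our illustrative choices) shows how quickly classification errors grow with the noise in a multi-year estimate:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
true = rng.normal(0, 1, n)   # true teacher effects in signal-SD units

# Rule: flag a teacher as "exceptional" when the estimate exceeds the
# mean by 1 SD. Error rates depend on how noisy the multi-year estimate
# is relative to the signal.
for noise_sd in (0.5, 1.0, 1.5):
    est = true + rng.normal(0, noise_sd, n)
    average = np.abs(true) < 0.1    # teachers who are truly about average
    top = true > 1.0                # teachers who are truly exceptional
    type1 = (est[average] > 1.0).mean()   # average teacher wrongly flagged
    type2 = (est[top] <= 1.0).mean()      # exceptional teacher missed
    print(f"noise SD {noise_sd}: Type I {type1:.2f}, Type II {type2:.2f}")
# With noise comparable to the signal or larger, both error rates become
# substantial, in the same ballpark as the roughly 25 percent figure
# reported by Schochet and Chiang.
```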

11. For instance, estimates suggest that only one third to one half of teacher value added can be identified in subsequent grades. See Jacob, Lefgren, and Sims (2010), Chetty, Friedman, and Rockoff (2011), Corcoran, Jennings, and Beveridge (2011), and Kinsler (2012a).

12. Although teachers may learn skills from their colleagues now (Jackson and Bruegmann 2009), recognition and reward systems could facilitate additional informal learning by broadly identifying teacher successes and hence making it clearer which teachers might have the most to offer in terms of informal training.

13. The impacts on the teacher applicant pool would be even more beneficial if VAM-based evaluations helped teacher training institutions learn what kinds of skills they should be developing in future teachers (Boyd et al. 2009), or helped school districts identify which of their hires tend to be more effective, leading to better future hiring decisions (Rockoff et al. 2010).

14. For example, see Springer et al. (2010) and Fryer (2011). An interesting exception is Fryer et al. (2012), who find that teachers respond to loss aversion (i.e., the threat that compensation already in hand will be taken away if student performance does not meet certain goals); it is not clear how such an incentive system could be implemented in practice, however.

15. By some measures, research finds that more academically talented teachers are more likely to leave the teaching profession than their lower-achieving counterparts (e.g., as measured using SAT scores, licensure exam scores, or college selectivity; see Stinebrickner 2001, 2002; Podgursky, Monroe, and Watson 2004; Goldhaber 2007). More recent research, however, finds that higher value-added teachers are retained at a slightly higher rate than lower value-added teachers (Hanushek et al. 2005; Boyd et al. 2008; Goldhaber, Gross, and Player 2011).

16. Some states, such as New York, have sought to accomplish this by assigning teachers to performance categories using both their VAM estimate and the level of statistical certainty associated with it.

17. Another reason is that VAMs alone typically do not provide much information that teachers can use to improve their practice.

18. This issue is just beginning to be addressed in a comprehensive way. See, for instance, Samuels (2012).

19. There is in fact some new evidence (Taylor and Tyler 2012) suggesting that a comprehensive evaluation does increase teacher effectiveness.

20. The results of this study suggest that the difference between having an effective versus an ineffective teacher can be equivalent to more than a year's worth of typical student achievement growth.

21. For instance, some of the academic debate is based on simulations (e.g., Hanushek 2009; Rothstein 2012).

22. This might, for instance, entail small-scale pilots or implementation features that permit research designs allowing for strong causal inferences.

Acknowledgments

We thank two anonymous reviewers for helpful comments on an earlier draft of this paper.

REFERENCES

Aaronson, Daniel, Lisa Barrow, and William Sander. 2007. Teachers and student achievement in the Chicago Public High Schools. Journal of Labor Economics 25(1): 95–135. doi:10.1086/508733

Ballou, Dale, William Sanders, and Paul Wright. 2004. Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics 29(1): 37–65. doi:10.3102/10769986029001037

Bloom, Howard, Carolyn Hill, Alison Black, and Mark Lipsey. 2008. Performance trajectories and performance gaps as achievement effect-size benchmarks for educational interventions. MDRC Working Paper.

Boyd, Donald, Pam Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff. 2008. Who leaves? Teacher attrition and student achievement. NBER Working Paper No. 14022.

Boyd, Donald J., Pamela L. Grossman, Hamilton Lankford, Susanna Loeb, and James Wyckoff. 2009. Teacher preparation and student achievement. Educational Evaluation and Policy Analysis 31(4): 416–40. doi:10.3102/0162373709353129

Cascio, Elizabeth U., and Douglas O. Staiger. 2012. Knowledge, tests, and fadeout in educational interventions. NBER Working Paper No. 18038.

Chetty, Raj, John N. Friedman, and Jonah E. Rockoff. 2011. The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. NBER Working Paper No. 17699.

Chingos, Matthew, and Martin West. 2012. Do more effective teachers earn more outside the classroom? Education Finance and Policy 7(1): 8–43. doi:10.1162/EDFP_a_00052

Clotfelter, C. T., H. F. Ladd, and J. L. Vigdor. 2010. Teacher credentials and student achievement in high school. Journal of Human Resources 45(3): 655–81.

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2011. Teacher effectiveness on high- and low-stakes tests. Mimeo, New York University, Institute for Education and Social Policy.

Fryer, Roland. 2011. Teacher incentives and student achievement: Evidence from New York City public schools. NBER Working Paper No. 16850.

Fryer, Roland, Steven Levitt, John List, and Sally Sadoff. 2012. Enhancing the efficacy of teacher incentives through loss aversion: A field experiment. NBER Working Paper No. 18237.

Glazerman, Steven, Susanna Loeb, Dan Goldhaber, Douglas Staiger, Steve Raudenbush, and Grover Whitehurst. 2011. Evaluating teachers: The important role of value-added. Washington, DC: Brown Center on Education Policy at Brookings.

Goldhaber, Dan. 2007. Everyone's doing it, but what does teacher testing tell us about teacher effectiveness? Journal of Human Resources 42(4): 765–94.

Goldhaber, Dan, and Dominic J. Brewer. 1997. Why don't schools and teachers seem to matter? Assessing the impact of unobservables on educational productivity. Journal of Human Resources 32(3): 505–23. doi:10.2307/146181

Goldhaber, Dan, and Dominic J. Brewer. 2000. Does teacher certification matter? High school teacher certification status and student achievement. Educational Evaluation and Policy Analysis 22(2): 129–45.

Goldhaber, Dan, Dominic J. Brewer, and Deborah J. Anderson. 1999. A three-way error components analysis of educational productivity. Education Economics 7(3): 199–208. doi:10.1080/09645299900000018

Goldhaber, Dan, and Duncan Chaplin. 2012. Assessing the "Rothstein falsification test": Does it really show teacher value-added models are biased? CEDR Working Paper No. 2012-1.3, University of Washington.

Goldhaber, Dan, Brian Gabele, and Joe Walch. 2012. Does the model matter? Exploring the relationship between different achievement-based teacher assessments. CEDR Working Paper No. 2012-6, University of Washington.

Goldhaber, Dan, Betheny Gross, and Daniel Player. 2011. Teacher career paths, teacher quality, and persistence in the classroom: Are public schools keeping their best? Journal of Policy Analysis and Management 30(1): 57–87. doi:10.1002/pam.20549

Goldhaber, Dan, and Michael L. Hansen. 2010. Using performance on the job to inform teacher tenure decisions. American Economic Review 100(2): 250–55. doi:10.1257/aer.100.2.250

Goldhaber, Dan, and Michael L. Hansen. 2012. Is it just a bad class? Assessing the stability of measured teacher performance. Economica, forthcoming. doi:10.1111/ecca.12002

Goldhaber, Dan, and Susanna Loeb. 2013. What do we know about the tradeoffs with teacher misclassification in high stakes personnel decisions? Carnegie Knowledge Network, 15 April 2013.

Hanushek, Eric A. 1992. The trade-off between child quantity and quality. Journal of Political Economy 100(1): 84–117. doi:10.1086/261808

Hanushek, Eric A. 2009. Teacher deselection. In Creating a new teaching profession, edited by Dan Goldhaber and Jane Hannaway, pp. 165–80. Washington, DC: Urban Institute Press.

Hanushek, Eric A. 2011. The economic value of higher teacher quality. Economics of Education Review 30(3): 466–79. doi:10.1016/j.econedurev.2010.12.006

Hanushek, Eric A., John F. Kain, Daniel M. O'Brien, and Steven G. Rivkin. 2005. The market for teacher quality. NBER Working Paper No. 11154.

Hanushek, Eric A., and Steven G. Rivkin. 2010. Generalizations about using value-added measures of teacher quality. American Economic Review 100(2): 267–71. doi:10.1257/aer.100.2.267

Harris, Douglas N. 2010. Clear away the smoke and mirrors of value-added. Phi Delta Kappan 91(8): 66–69.

Harris, Douglas N. 2012. How do teacher value-added indicators compare to other measures of teacher effectiveness? Carnegie Knowledge Network Brief. Available www.carnegieknowledgenetwork.org/briefs/value-added/value-added-other-measures/. Accessed 11 March 2013.

Harris, Douglas N., and Tim Sass. 2006. Value-added models and the measurement of teacher quality. Unpublished paper, Florida State University.

Hoxby, Caroline M., and Alison Leigh. 2004. Pulled away or pushed out? Explaining the decline of teacher aptitude in the United States. American Economic Review 94(2): 236–40. doi:10.1257/0002828041302073

Jackson, C. Kirabo, and Elias Bruegmann. 2009. Teaching students and teaching each other: The importance of peer learning for teachers. American Economic Journal: Applied Economics 1(4): 85–108. doi:10.1257/app.1.4.85

Jacob, Brian, and Lars Lefgren. 2008. Can principals identify effective teachers? Evidence on subjective performance evaluation in education. Journal of Labor Economics 26(1): 101–36. doi:10.1086/522974

Jacob, Brian, Lars Lefgren, and David Sims. 2010. The persistence of teacher-induced learning gains. Journal of Human Resources 45(4): 915–43. doi:10.1353/jhr.2010.0029

Johnson, Susan Moore, and John P. Papay. 2009. Redesigning teacher pay: A system for the next generation of educators. Washington, DC: Economic Policy Institute.

Kane, Thomas J., and Douglas O. Staiger. 2008. Estimating teacher impacts on student achievement: An experimental evaluation. NBER Working Paper No. 14607.

Kane, Thomas J., Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. 2013. Have we identified effective teachers? Validating measures of effective teaching using random assignment. MET Project Research Paper. Seattle, WA: Bill & Melinda Gates Foundation.

Kane, Thomas J., Eric S. Taylor, John H. Tyler, and Amy L. Wooten. 2010. Identifying effective classroom practices using student achievement data. NBER Working Paper No. 15803.

Kinsler, Joshua. 2012a. Beyond levels and growth: Estimating teacher value-added and its persistence. Journal of Human Resources 47(3): 722–53. doi:10.1353/jhr.2012.0023

Kinsler, Joshua. 2012b. Assessing Rothstein's critique of teacher value-added models. Quantitative Economics 3(2): 333–62. doi:10.3982/QE132

Lefgren, Lars, and David Sims. 2012. Using subject test scores efficiently to predict teacher value-added. Educational Evaluation and Policy Analysis 34(1): 109–21. doi:10.3102/0162373711422377

Lockwood, J. R., Daniel F. McCaffrey, Laura S. Hamilton, Brian M. Stecher, Vi-Nhuan Le, and Jose F. Martinez. 2007. The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement 44(1): 47–67. doi:10.1111/j.1745-3984.2007.00026.x

McCaffrey, Daniel F., Tim R. Sass, J. R. Lockwood, and Kata Mihaly. 2009. The intertemporal variability of teacher effect estimates. Education Finance and Policy 4(4): 572–606. doi:10.1162/edfp.2009.4.4.572

The New Teacher Project (TNTP). 2012. The irreplaceables: Understanding the real retention crisis in America's schools. New York: TNTP.

Papay, John P. 2011. Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal 48(1): 163–93. doi:10.3102/0002831210362589

Podgursky, Michael, Ryan Monroe, and Donald Watson. 2004. The academic quality of public school teachers: An analysis of entry and exit behavior. Economics of Education Review 23(5): 507–18. doi:10.1016/j.econedurev.2004.01.005

Rice, Jennifer K. 2009. Investing in human capital through teacher professional development. In Creating a new teaching profession, edited by Dan Goldhaber and Jane Hannaway, pp. 227–50. Washington, DC: Urban Institute Press.

Rivkin, Steven G., Eric A. Hanushek, and John F. Kain. 2005. Teachers, schools, and academic achievement. Econometrica 73(2): 417–58. doi:10.1111/j.1468-0262.2005.00584.x

Rockoff, Jonah. 2004. The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review 94(2): 247–52. doi:10.1257/0002828041302244

Rockoff, Jonah E., Brian A. Jacob, Thomas J. Kane, and Douglas O. Staiger. 2010. Can you recognize an effective teacher when you recruit one? Education Finance and Policy 6(1): 43–74. doi:10.1162/EDFP_a_00022

Rothstein, Jesse. 2009. Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy 4(4): 537–71. doi:10.1162/edfp.2009.4.4.537

Rothstein, Jesse. 2010. Teacher quality in educational production: Tracking, decay, and student achievement. Quarterly Journal of Economics 125(1): 175–214. doi:10.1162/qjec.2010.125.1.175

Rothstein, Jesse. 2012. Teacher quality when supply matters. NBER Working Paper No. 18419.

Samuels, Christina A. 2012. Experts outline steps to guard against cheating. Education Week, 6 March.

Schochet, Peter Z., and Hanley S. Chiang. 2010. Error rates in measuring teacher and school performance based on student test score gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.

Springer, Matthew G., Dale Ballou, Laura S. Hamilton, Vi-Nhuan Le, J. R. Lockwood, Daniel F. McCaffrey, Matthew Pepper, and Brian M. Stecher. 2010. Teacher pay for performance: Experimental evidence from the Project on Incentives in Teaching. Santa Monica, CA: RAND Corporation.

Staiger, Douglas O., and Jonah E. Rockoff. 2010. Searching for effective teachers with imperfect information. Journal of Economic Perspectives 24(3): 97–118. doi:10.1257/jep.24.3.97

Stinebrickner, Todd R. 2001. A dynamic model of teacher labor supply. Journal of Labor Economics 19(1): 196–230. doi:10.1086/209984

Stinebrickner, Todd R. 2002. An analysis of occupational change and departure from the labor force: Evidence of the reasons that teachers leave. Journal of Human Resources 37(1): 192–216. doi:10.2307/3069608

Taylor, Eric S., and John H. Tyler. 2012. The effect of evaluation on teacher performance: Evidence from longitudinal student achievement data of mid-career teachers. American Economic Review 102(7): 3628–51.

Toch, Thomas, and Robert Rothman. 2008. Rush to judgment: Teacher evaluation in public education. Washington, DC: Education Sector Reports.

Tucker, Pamela. 1997. Lake Wobegon: Where all teachers are competent (or, have we come to terms with the problem of incompetent teachers?). Journal of Personnel Evaluation in Education 11(2): 103–26. doi:10.1023/A:1007962302463

Tyler, John H., Eric S. Taylor, Thomas J. Kane, and Amy L. Wooten. 2010. Using student performance data to identify effective classroom practices. American Economic Review 100(2): 256–60. doi:10.1257/aer.100.2.256

Weisberg, Daniel, Susan Sexton, Jennifer Mulhern, and David Keeling. 2009. The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. New York: The New Teacher Project.