In this policy brief we argue that there is little debate about the statistical properties of value-added model (VAM) estimates of teacher performance, yet there is little consensus about what the evidence about VAMs implies for their practical utility as part of high-stakes performance evaluation systems. A review of the evidence base that underlies the debate over VAM measures, followed by our subjective opinions about the value of using VAMs, illustrates how different policy conclusions can easily arise even given broad agreement about an existing body of evidence. We conclude the brief by offering a few thoughts about the limits of our knowledge and what that means for those who do wish to integrate VAMs into their own teacher-evaluation strategy.
We contend that there is little academic disagreement about the statistical properties of value-added model (VAM) estimates currently being implemented to measure teacher job performance or the current state of in-service teacher evaluations.1 In particular, we believe that most researchers could sign on to five stylized facts: (1) there is important variation in teacher effectiveness that has educationally significant consequences for student achievement, as measured on standardized tests (and new evidence suggests later life outcomes as well); (2) most formal teacher evaluations are far from rigorous in that they largely ignore performance differences between teachers; (3) VAM measures are likely to contain real information about teacher effectiveness that could be used to inform personnel decisions and policies, but they are subject to potential biases; (4) VAM measures are noisy, varying from year to year or class section to class section for reasons other than variation in true effectiveness; and (5) little is known about how current and potential teachers might respond to the use of VAM measures in evaluating their job performance.
As we see it, the real debate about VAMs is less about the known properties of these measures, and more about what the evidence about VAMs implies for their practical utility as part of high-stakes performance evaluation systems.2 This policy brief offers a case in point—we agree about what research has to say about teacher evaluation, the properties of value-added measures, and the ways in which VAMs could conceivably be used for workforce improvement, but we do not reach the same conclusions about the value of VAMs in making high-stakes personnel decisions. Our views are not diametrically opposed but stem from subjective assessments about the risks associated with VAMs, judgments about the alternatives, and skepticism about the degree to which education systems will change without their use. As we will show, given the same body of evidence, the VAM glass can easily be viewed as half empty or half full.
In the next section, we briefly review the evidence base that underlies the debate over VAM measures, organized loosely around the five stylized facts listed earlier. We then describe theories of action connecting VAM measures to educational improvement and the limited available evidence on many of the theories put into practice. Successful use of VAMs in the real world requires a well-specified link between teacher measurement and human resource management actions; unfortunately, their implementation in practice is often carried out without much attention to this link. The remainder of the brief is devoted to our differences in opinion about the use of VAMs, illustrating how different conclusions can easily arise from the same body of evidence, and to concluding thoughts about the limits of our knowledge and what that means for those who do wish to integrate VAMs into their own teacher evaluation strategy.
A Brief Review of What We Know about VAMs
A near-universal point of agreement among education researchers is that teachers matter, or, put another way, that there is important variation in teacher effectiveness that has educationally significant consequences for students. This conclusion is an outgrowth of value-added studies that estimate the standard deviation of teacher effectiveness at roughly 0.10 to 0.25 student-level standard deviations (Hanushek and Rivkin 2010; Goldhaber and Hansen 2012).3 To put this effect in perspective, a one standard deviation difference in teacher effectiveness amounts to about 10 to 25 percent of a grade-level's worth of growth in the elementary grades.4 The magnitude of these effects has led to the oft-used refrain that teacher quality is the most important in-school factor affecting student achievement.
Unfortunately for policy makers and school leaders, variation in achievement gains between teachers appears to be only weakly related to licensure, experience, and degree attainment, now used by most states and school districts to determine employment eligibility and compensation (Goldhaber, Brewer, and Anderson 1999; Aaronson, Barrow, and Sander 2007; McCaffrey et al. 2009).5 Moreover, existing in-service teacher evaluations fail to recognize important differences in learning gains across classrooms. In their well-known report The Widget Effect, Weisberg et al. (2009) found little to no documented variation in teacher job performance ratings in surveyed school districts.6 In the absence of variation, these ratings played little to no role in key human capital decisions such as tenure, promotion, and compensation. Together, these findings have led researchers and policy makers to consider alternative measures of job performance, such as VAMs, that aim to directly and objectively estimate teachers’ impact on student achievement.
Put simply, VAMs use historical data on student achievement to quantify how a teacher's students performed relative to similar students taught by other teachers (e.g., students with the same baseline performance, socioeconomic background, educational needs, and so on). Value-added models are designed to statistically control for differences across classrooms in student inputs, such that any remaining differences can be used to infer teachers’ impacts on achievement. As is true for any statistical model, users of VAMs must be attentive to issues of attribution (bias), imprecision (noise), and model specification. No model is perfect, and thus the relevant question for the VAM debate is whether these measures as currently constructed contain enough useful information that their integration into performance evaluation systems, and the use of these performance evaluations as part of human resource management systems, is likely to lead to improvements in the quality of the teacher workforce.
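Concretely, a minimal sketch of one common specification (written in our own notation; it is not the exact model of any particular study or state system) regresses a student's current test score on his or her prior score, background characteristics, and a teacher effect:

```latex
A_{ijt} = \lambda A_{i,t-1} + X_{it}'\beta + \tau_{j} + \varepsilon_{ijt}
```

Here A_{ijt} is the year-t score of student i taught by teacher j; the lagged score A_{i,t-1} and the covariates X_{it} (student and sometimes classroom or school characteristics) serve as the statistical controls; and the estimated teacher effect \tau_j is the value-added measure. In this framing, bias is the concern that the error term \varepsilon_{ijt} is systematically related to which students a teacher is assigned, and noise is the sampling error that comes from estimating \tau_j on a limited number of students.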
Questions of model specification, bias, and precision have been extensively studied. With respect to model specification, there appears to be general agreement that VAM estimates are not terribly sensitive to common specification choices, such as the inclusion or exclusion of student- or classroom-level covariates, but that there is greater sensitivity to whether models include school or student fixed effects.7 VAM estimates also show some sensitivity to the student outcomes (or domains on a test) employed as the metric to judge teachers (e.g., Lockwood et al. 2007; Corcoran et al. 2011; Papay 2011).
The question of whether and to what extent VAMs are biased—that is, whether test-score gains attributed to a teacher are due to student sorting or some other factor that varies with teacher assignment—is less settled. Rothstein (2010), for example, shows that in standard VAMs, teachers assigned to students in the future have statistically significant power in predicting past student achievement. This finding clearly cannot be causal, and suggests that typical models do not adequately account for the processes leading to the matching of students and teachers in classrooms.8 On the other hand, Goldhaber and Chaplin (2012) and Kinsler (2012b) find that “the Rothstein test” may not work as intended, and may suggest model misspecification where none exists. In a study based on random assignment of teachers to classrooms, Kane and Staiger (2008) showed that VAMs do not appear to be biased. Although promising, this experiment has some limitations, most notably a small sample size and the fact that the experiment only includes teachers who were deemed eligible by their principals to teach each other's classes (i.e., teachers who are seen to be exchangeable).9 Consequently, the results may not be fully generalizable. Finally, Chetty, Friedman, and Rockoff (2011) use different VAM falsification exercises from Rothstein and find little evidence of bias.
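The logic of the falsification test described above can be made explicit in the notation of our earlier sketch (again, a stylized rendering rather than Rothstein's exact implementation): add indicators for each student's future teacher to the current-year model,

```latex
A_{ijt} = \lambda A_{i,t-1} + X_{it}'\beta + \tau_{j} + \sum_{k} \gamma_{k} D_{ik}^{t+1} + \varepsilon_{ijt}
```

where D_{ik}^{t+1} equals one if student i will be assigned to teacher k in year t+1. Because next year's teacher cannot cause this year's learning, jointly significant estimates of the \gamma_k coefficients indicate that students are sorted to teachers on factors the model fails to capture.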
The extent of imprecision, or noise, in VAM measures is another issue that has been extensively studied. Several recent studies have found teacher value-added estimates to be only moderately reliable or stable from year to year, with adjacent-year correlations in the neighborhood of 0.3 to 0.5 (e.g., McCaffrey et al. 2009; Goldhaber and Hansen 2012). Whether or not these correlations are sufficiently strong is in the eye of the beholder. On the one hand, correlations of this magnitude result in a nontrivial proportion of teachers being rated as “ineffective” or “highly effective” in one school year (or class section) while being rated as average in the next year (or section). On the other hand, as Glazerman et al. (2011) point out, the magnitude of these stability estimates is not much different from what is observed in other occupations that have been quantitatively measured. It is true that the magnitude of these reliability estimates likely does not support the exclusive use of VAM measures to classify teachers into performance categories (Schochet and Chiang 2010; Goldhaber and Loeb 2013).10 Although some evaluation systems put a large weight on VAM results, to our knowledge no proposed or adopted policy relies exclusively on VAMs.
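To make the practical meaning of these correlations concrete, the short simulation below (our own illustration, using an assumed adjacent-year correlation in the middle of the cited 0.3 to 0.5 range rather than data from any particular study) shows how much rating churn arises even when every teacher's true effectiveness is perfectly constant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000   # simulated teachers
r = 0.4       # assumed adjacent-year correlation of VAM estimates

# If each year's estimate equals a constant true effect plus independent
# annual noise, the adjacent-year correlation equals the reliability:
# r = var(true) / var(estimate).
tau = rng.normal(0.0, 1.0, n)                # true effects, SD normalized to 1
noise_sd = np.sqrt((1.0 - r) / r)            # noise SD implied by r
vam1 = tau + rng.normal(0.0, noise_sd, n)    # year 1 estimates
vam2 = tau + rng.normal(0.0, noise_sd, n)    # year 2 estimates

print(f"realized correlation: {np.corrcoef(vam1, vam2)[0, 1]:.2f}")

# Quintile ratings in each year (0 = bottom quintile, 4 = top quintile).
q1 = np.digitize(vam1, np.quantile(vam1, [0.2, 0.4, 0.6, 0.8]))
q2 = np.digitize(vam2, np.quantile(vam2, [0.2, 0.4, 0.6, 0.8]))

bottom = q1 == 0
print(f"year-1 bottom-quintile teachers still in bottom quintile in year 2: {(q2[bottom] == 0).mean():.0%}")
print(f"year-1 bottom-quintile teachers at or above the median in year 2: {(q2[bottom] >= 2).mean():.0%}")
```

Under these assumptions a nontrivial share of teachers rated in the bottom quintile one year land at or above the median the next, purely as a consequence of noise.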
The validation of value-added measures against long-run student outcomes—as opposed to year-to-year test score gains—has been less extensively studied. A number of studies, however, have documented “fade out” of teacher effects—that is, gains induced by teachers in one grade dissipate in later grades.11 One possible explanation for this finding is that the heightened focus on tests may encourage teachers to emphasize strategies that lead to short-run test score gains at the expense of long-run learning. On the other hand, there are some less pernicious explanations for fade out, which include variation in test content across grades and test scaling effects (Cascio and Staiger 2012). Additionally, recent evidence by Chetty, Friedman, and Rockoff (2011) finds that, despite effects that fade out in the short run, VAM measures still predict key later-life outcomes—e.g., teen pregnancy, college-going behavior, and labor market earnings—long after students have left a teacher's classroom.
Although we know a good deal about the properties of VAMs, existing research has little to say regarding how current and potential teachers are likely to respond to the use of VAM measures in high-stakes personnel decisions, other than with regard to whether there are short-run productivity effects of pay for performance (discussed below). Much of the available evidence on teacher value added comes from low-stakes settings, in which teacher evaluations were not linked to student performance or high-stakes personnel decisions.
Taken together, the balance of the evidence indicates that VAM measures are imperfect, and questions about bias are far from settled. We argue, however, that the more important question for policy is whether VAMs can improve upon existing or alternative measures of teacher job performance. As The Widget Effect pointed out, existing evaluation systems generally fail to recognize variation in teacher performance, and even imperfect measures may improve upon the status quo. Moreover, non-test-based measures of performance, such as classroom observations, student surveys, and portfolio assessment, suffer from their own imperfections and are as untested as VAMs (Harris 2012). We believe that policy makers ought not to focus solely on whether or not a VAM measure is biased and noisy but rather on the extent of bias and imprecision, and whether VAMs provide either a more accurate picture of teacher quality than other means of assessing teachers and/or additional information that can be used to improve instruction.
Theories of Action: Connecting VAMs to Educational Improvement
Much is known about the properties of value-added measures, but successful implementation of VAMs in practice requires a sound theory of action connecting VAM measures to educational improvement. Unfortunately, the implementation of VAMs is often done without a well-specified link between these two, and there is little evidence available to inform such links. By their nature, VAMs are norm-referenced measures, indicators of relative performance that use historical data to rank teachers based on classroom test performance. Thus the question is how these measures can be integrated into a performance evaluation system to raise overall teacher effectiveness.
There are two primary means by which VAMs might be used to improve educational outcomes. The first focuses on improving the effectiveness of incumbent teachers, whereas the second emphasizes changes in the composition of the teacher workforce. With respect to the former, VAMs could be used by school and district leaders as a professional development tool, identifying areas of weakness and providing models of successful instruction. School systems have historically invested vast sums in professional development, but generally little of this is targeted to the specific needs of individual teachers (Rice 2009; Toch and Rothman 2008). VAM measures could aid in better targeting these investments to the teachers who need them most. Furthermore, the availability of rewards for high performance may incentivize existing teachers to increase effort levels or improve their skills by learning from successful colleagues.12
The second mechanism operates through changes in the composition of the teacher workforce. Some researchers have argued that a policy of denying tenure to, or dismissing, the lowest-performing teachers would substantially raise the quality of the teacher workforce and have large effects on the life outcomes of students (Staiger and Rockoff 2010; Chetty, Friedman, and Rockoff 2011; Hanushek 2011). A practice of recognizing and rewarding effective teachers, either through higher pay or promotion, could also improve quality by attracting potential teachers who might otherwise be drawn to fields that better reward their skills.13 Finally, the availability of rewards for high performance may help with the retention of effective teachers.
Unfortunately, with practical implementation of VAMs in its infancy, we have almost no direct evidence on these proposed mechanisms. A notable exception is a series of experimental studies of teacher performance incentives tied to VAMs, which suggest such incentives have little impact on the productivity of practicing teachers.14 These studies remain small in number, however.
In the absence of much evidence on the mechanisms by which VAMs could improve educational outcomes, proponents and detractors of VAMs have largely relied on reasoned speculation over whether they believe these measures are likely to improve educational outcomes or not. These beliefs can be grounded in the existing evidence on VAM measures but hinge on whether one views the VAM glass as half empty or half full. On the one hand, each of these theories is compelling. The teaching profession has a poor track record of rewarding performance, whether through higher pay, increased responsibility, or career advancement (Johnson and Papay 2009). Teaching's reward structure may discourage talented graduates from the profession (Hoxby and Leigh 2004), fail to retain the best teachers (Chingos and West 2012; TNTP 2012), and provide weak incentives for effort and improvement.15 VAM measures may be used as part of a system to address some of these issues, and to do so it is not necessary for VAMs to be perfect indicators of teacher effectiveness (Glazerman et al. 2011).
On the other hand, VAMs may not be well suited to support these theories of action. If VAM measures are sufficiently biased and/or unreliable, they may lead to incorrect personnel decisions and misallocated resources. Bias and instability can also undermine trust in the system, and the risk associated with employment or compensation instability could dissuade potential teachers from the profession rather than attract them (Rothstein 2012).
Along the same lines, even if VAM measures have acceptable properties from a statistician's point of view, their complex calculation and inherent variability can limit their face validity among practitioners. For evaluation systems built on VAMs to improve the instruction techniques of existing teachers, practitioners will need to see direct connections between their day-to-day practice and their performance evaluations. Today, most VAM-based evaluation systems only provide information about teachers’ effectiveness categories or relative ranking, not direct information that can be used to improve particular aspects of practice. This is one of the reasons it makes sense to use VAM only as a part of a well-rounded evaluation system.
Where You Stand Depends on Where You Sit
We began the brief with the contention that most researchers would agree with our stylized facts about VAM measures of teacher effectiveness and the current state of performance evaluation in the teaching profession. But agreeing on the stylized facts does not necessarily mean agreement on the practical utility of using VAMs or their potential to improve teacher effectiveness. In fact, we do not entirely see eye to eye on these issues.
Glass Half Empty: Corcoran
I (Corcoran) tend to view the VAM glass as half empty. Although I would agree that performance evaluation in teaching is lacking and sorely in need of reform, I believe the potential for VAM measures to dramatically improve teaching effectiveness and the quality of the profession tends to be overblown. Student achievement should be an important part of a new and improved system of evaluation. As statistical estimates based on historical and limited data, however, VAM measures lack transparency and are inherently limited by imprecision. If teacher quality is to be improved, teachers and school leaders need instructive, actionable information they can use to make meaningful changes to their practice sooner rather than later. A statistical prediction that relies on annual tests and multiple years of data to produce reasonably reliable value-added estimates does not, in my view, meet this requirement.
VAM measures may turn out to be useful indicators of relative performance—separating the very high- and very low-performing teachers from the rest of the pack. This information could be fruitfully used by principals as an early warning signal or (in extreme cases) as grounds for dismissal. Their utility as a job performance indicator for the large majority of teachers is another matter, however. Given the inherent instability of VAM estimates, a high-stakes system tied to VAMs would need to be conservatively designed, reserving punishment and reward only for those with demonstrably very low or very high performance, and an acceptably low level of statistical uncertainty.16 But a VAM system that meets these conservative standards would ultimately only apply to the most extreme cases, and would provide little feedback to the bulk of teachers. This raises the question of what VAMs would add beyond the subjective evaluation of principals or other educators who are presumably capable of identifying the very worst (or best) teachers (e.g., see Jacob and Lefgren 2008). Finally, it is important to keep in mind that “value added” is not a unidimensional concept. There are as many value-added measures as there are tests and subject areas, and VAMs have been found to be only moderately correlated across these. There is room for combining information across tests (Lefgren and Sims 2012), but policy makers may find that creating decision rules to objectively identify the “best” and “worst” teachers is easier said than done.
I am encouraged by the prospect for reform in teacher evaluation and the renewed focus on student achievement. Our understanding of teacher labor markets and teacher effectiveness would not be where it is today without advances in value-added measurement and the careful linking of student-level achievement data to teachers over time. That VAM measures have proven invaluable to research, however, does not imply they will be useful as on-the-job performance measures. Inferences about teacher effects on average are quite different from inferences about individual teachers, and I tend to be more pessimistic about the latter. The attachment of high stakes to measures that are meant to be informative about the progress of students has the potential to undermine the validity of the measures themselves and to encourage teaching to the test and, at worst, cheating. With professional careers and the education of our nation's children on the line, our educators need a new evaluation system that is transparent, informative, and responsive to their needs. Although VAMs have a role to play in this system, policy makers should temper their expectations and limit their high-stakes use.
Glass Half Full: Goldhaber
I (Goldhaber) view the VAM glass as half full. I do not disagree with the technical points or potential negative incentives associated with connecting VAM measures to high-stakes decisions described above. This is one of the reasons why I think that VAM measures ought only to be a component of a well-rounded evaluation system that also, for instance, includes classroom observations.17 The concern that focusing teacher accountability on student test achievement might lead to pernicious behaviors that distort the learning process in schools—Campbell's Law—is entirely valid, which is why evaluation reform probably ought to happen in conjunction with measures designed to guard against these outcomes.18 Yet despite some wariness about VAMs and their use, I believe we should experiment with incorporating VAM as a factor in evaluations that are used for such high-stakes purposes as pay, tenure, and promotion determination.
I would argue that VAMs ought to be used for three primary reasons. First, evaluation is to some extent about determining which teachers should stay in the profession (and perhaps in which positions and at what compensation levels). Even those who believe the overwhelming majority of teachers are successful enough to merit being in the profession likely recognize that some, perhaps quite small, proportion of teachers are not very effective and ought to be dismissed. And, as we have recently witnessed, economic circumstances occasionally necessitate that some teachers will lose their jobs. We cannot use performance evaluations to make personnel decisions if there is no variation in the evaluation ratings in the workforce. Unless the act of evaluating teachers itself makes them more effective,19 school systems are investing time and effort in the evaluation endeavor for little usable information.
Second, I see value added as a catalyst for broader changes to teacher evaluation. I am quite skeptical that we would be engaged in what is now almost a national experiment with new, and hopefully more rigorous, teacher evaluation were it not for the specter of VAM usage. It has been over two decades since research (Hanushek 1992) showed just how important the variation in teacher effectiveness is for determining student outcomes.20 Yet policy makers and practitioners have, by and large, been unable or unwilling to develop credible teacher evaluation systems that recognize the important differences that exist between teachers. As I noted earlier, I understand the need to be careful in making changes to the system, but surely it is not unreasonable to have expected more movement on this issue over such a long period of time.
Third, and perhaps most importantly, evidence suggests that VAM measures are better at predicting future student test achievement than the other credentials—licensure, degree, and experience levels—that are now used for high-stakes purposes (e.g., Goldhaber and Hansen 2010), and better than other means of assessing teachers (Glazerman et al. 2011; Harris 2010; Tyler et al. 2010). To the extent that evaluations are in fact used for any purpose, we would want them to be good predictors of student achievement and, although imperfect, VAMs look pretty good compared to the other options currently out there. There is a laser focus on the known flaws of VAMs, while other methods of teacher evaluation have basically been given a pass.
There is no doubt that VAM-informed decisions about teachers will sometimes, because of bias or unreliability, lead to teacher classification errors (Goldhaber and Loeb 2013). We clearly want to be careful so as to limit these errors, but we also have to recognize that it is almost certainly not optimal for students to entirely eliminate the downside risk to teachers of classification errors. The reason is obvious but still merits repeating: there is a trade-off inherent in reducing the number of teachers who are unfairly classified as ineffective; as that number falls, the number of ineffective teachers classified as effective rises. In other words, to some extent what is best for teachers may not be best for students. Again, this is an issue that exists for all evaluation systems, not just those that utilize VAM-based information.
As is clear from the preceding section, we do not entirely agree with one another in terms of the extent to which policy makers ought to use VAMs for high-stakes purposes, but there is significant agreement about how we ought to move forward with some use of VAMs. First, implementation matters. It matters not only because of the support systems we discussed here that are likely to be crucial for the integrity of an evaluation system emphasizing value-added measures, but also because we are talking about a system that will affect human beings whose responses to the system will go a long way in determining its effectiveness. Specifically, the behavioral response to the use of VAMs depends not only on the design of the system but also on teachers’ (and other stakeholders’) reactions to it. This means it is essential to communicate clearly and constantly about why particular modeling decisions were made, how the system works, and what the consequences are for teachers who receive different evaluation ratings.
Second, as we have hopefully stressed herein, our current understanding of the impact of VAMs in practice is limited because this impact depends on human beings’ responses to how VAMs are used.21 This means there is a good deal of room for debate about the extent to which VAM-usage, particularly for high-stakes personnel decisions, would lead to good (e.g., more feedback on effective teaching practices) or bad (e.g., a narrowing of the curriculum) outcomes. Given this, we recommend that policy makers roll out plans with an eye toward evaluation and modification.22 This is decidedly not how new education policies tend to be implemented; policy makers or practitioners who say “we are pretty sure this is a good plan, but we might have to change it soon” probably reduce the chances they will hold key positions of influence in the future. Fortunately, solving this political problem is outside the scope of our brief (and perhaps not possible).
Third, policy makers should, as best as possible, anticipate indirect and unintended consequences (not all of which are necessarily negative). For instance, attaching stakes for teachers to student test achievement will encourage cheating, so clearly school systems should be considering auditing systems to discourage cheating and detect it when it happens. As we noted earlier, increased job or compensation risk may make teaching a less desirable occupation. These are obvious examples; it is worth some time and effort planning for the obvious issues that will arise and anticipating some issues that may not be immediately obvious.
The bottom line is that we are entering a new world of teacher evaluation, a world in which recent policies dictate that VAMs will play a role. But, in implementing VAM-based reforms, we believe one should view the current policy push not as the end product but as a step in the evolution toward a better system.
Notes
1. Henceforth we use the terms “teacher effectiveness” and “teacher job performance” interchangeably.
2. In fact, we speculate that any means of evaluating teachers that leads to a spread of summative teacher performance ratings, which in turn have consequential implications for teachers’ jobs or compensation, would be controversial. As we briefly touch on subsequently, this may be because of teacher concerns about disruptive effects on teacher collegiality or a general resistance to move away from the current system where there are few job consequences for poor teaching (because poor teaching is rarely identified through a formal evaluation process).
3. The estimates are typically in the neighborhood of 0.10 to 0.15 for within-school estimates and 0.15 to 0.25 for estimates that also include between-school differences in teacher effectiveness.
4. In the lower grades a typical student gains about one standard deviation in math and reading achievement per year (Bloom et al. 2008). Thus a one standard deviation difference in teacher effectiveness—about the difference between having a teacher at the sixteenth percentile of the performance distribution and one at the fiftieth percentile—amounts to about 10 to 25 percent of a grade-level's worth of achievement in the elementary grades. The relative impact of teacher effectiveness is even larger in middle and high school, since students gain less (in standard deviation units) from grade to grade.
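In symbols, the arithmetic behind the 10 to 25 percent figure is simply the ratio of the teacher-effect standard deviation to a typical year's growth:

```latex
\frac{\sigma_{\text{teacher}}}{\text{annual gain}} \approx \frac{0.10\ \text{to}\ 0.25\ \text{SD}}{1.0\ \text{SD}} = 10\ \text{to}\ 25\ \text{percent of a grade-level's growth}
```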
7. Correlations across different models are generally high (over 0.9) for models that vary in terms of how they handle covariate adjustments for student background, classroom, and school characteristics (Ballou, Sanders, and Wright 2004; Harris and Sass 2006; Lockwood et al. 2007), but the correlations between these models and those that include student or school fixed effects are far lower, in the range of 0.3 to 0.5 (Harris and Sass 2006; Goldhaber, Gabele, and Walch 2012), with the range dependent on the subject as well as the number of years of teacher data informing the model. Importantly, models with student fixed effects are much more commonly used in research than in practice, due to their high data and computational demands.
8. Rothstein (2009) suggests that bias is likely to be quite small when VAMs include several years of prior test scores as control variables.
9. A recently released report from the well-known Measures of Effective Teaching Project (Kane et al. 2013) is based on a similar experiment, with a much larger teacher and student sample, and also concludes that VAMs provide unbiased estimates of teacher effectiveness. The study is still limited, however, in that the findings might not be generalizable outside of the group of teachers who were deemed by principals suitable to be randomized within schools.
10. Schochet and Chiang (2010), for instance, used simulated data that rely on plausible estimates of the signal-to-noise ratio in teacher effect estimates and conclude that, if three years of data are used for estimating teacher effectiveness, the probability of identifying an average teacher as being “exceptional” or “ineffective” (roughly one standard deviation above or below the mean, respectively), a Type I error, is about 25 percent. Conversely, the probability that a truly exceptional teacher is not identified, a Type II error, is also about 25 percent.
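The mechanics behind figures like these can be illustrated in a few lines (a stylized calculation under an assumed signal-to-noise ratio of our choosing, not a reproduction of Schochet and Chiang's simulation): averaging more years of data shrinks the noise in a teacher's estimate, which shrinks, but never eliminates, the chance that an exactly average teacher strays past a one-standard-deviation cutoff.

```python
import numpy as np
from scipy.stats import norm

# Normalize the SD of true teacher effects to 1 and assume a single-year
# reliability near the low end of the range cited in the text (this is
# our assumption, not Schochet and Chiang's exact parameterization).
r = 0.3
noise_var = (1 - r) / r              # single-year noise variance implied by r

for years in (1, 3, 10):
    se = np.sqrt(noise_var / years)  # SE of an estimate averaged over `years`
    # Probability that an exactly average teacher's estimate lands beyond
    # the +/- 1 SD cutoffs for "exceptional" or "ineffective":
    false_flag = 2 * norm.sf(1.0 / se)
    print(f"{years:2d} year(s): SE = {se:.2f}, average teacher mislabeled {false_flag:.0%}")
```

With these assumptions the three-year error rate comes out near the 25 percent figure reported above; the exact number moves with the assumed reliability, which is the broader point: no feasible averaging window drives it to zero.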
12. Although teachers may learn skills from their colleagues now (Jackson and Bruegmann 2009), recognition and rewards systems could facilitate additional informal learning by broadly identifying teacher successes and hence making it clearer which teachers might have the most to offer in terms of informal training.
13. The impacts on the teacher applicant pool would be even more beneficial if VAM-based evaluations helped teacher training institutions learn what kinds of skills they should be developing in future teachers (Boyd et al. 2009), or helped school districts identify which of their hires tend to be more effective, leading to better future hiring decisions (Rockoff et al. 2010).
14. For example, see Springer et al. (2010), Fryer (2011), and Fryer et al. (2012). An interesting exception is Fryer et al. (2012), who find that teachers respond to loss aversion (i.e., the threat that compensation that is in hand will be taken away if student performance does not meet certain goals); it is not clear how such an incentive system could be implemented in practice, however.
15. By some measures, research finds that more academically talented teachers are more likely to leave the teaching profession than their lower achieving counterparts (e.g., as measured using SAT scores, licensure exam scores, or college selectivity measures; see Stinebrickner 2001, 2002; Podgursky, Monroe, and Watson 2004; Goldhaber 2007). More recent research, however, finds that higher value-added teachers are retained at a slightly higher rate than lower value-added teachers (Hanushek et al. 2005; Boyd et al. 2008; Goldhaber, Gross, and Player 2011).
16. Some states, such as New York, have sought to accomplish this by assigning teachers to performance categories using both their VAM estimate and their level of certainty about this estimate.
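A toy version of such a rule (hypothetical cutoffs and labels of our choosing, not New York's actual regulation) makes the idea concrete: an extreme rating is assigned only when the confidence interval around a teacher's estimate excludes average performance.

```python
def classify(vam_estimate: float, std_error: float, z: float = 1.96) -> str:
    """Rate a teacher using both the point estimate and its uncertainty.
    A toy rule with hypothetical thresholds, not any state's formula."""
    lower = vam_estimate - z * std_error
    upper = vam_estimate + z * std_error
    if lower > 0:
        return "highly effective"   # confidently above average
    if upper < 0:
        return "ineffective"        # confidently below average
    return "effective"              # cannot be distinguished from average

# Identical point estimates, different precision, different ratings:
print(classify(0.30, 0.10))  # -> highly effective
print(classify(0.30, 0.25))  # -> effective
```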
17. Another reason is that VAMs alone typically do not provide much information that teachers can use to improve their practice.
18. This issue is just beginning to be addressed in a comprehensive way. See, for instance, Samuels (2012).
19. There is in fact some new evidence (Taylor and Tyler 2012) that suggests that a comprehensive evaluation does increase teacher effectiveness.
20. The results of this study suggest that the difference between having an effective versus an ineffective teacher can be equivalent to more than a year's worth of typical student achievement growth.
22. This might, for instance, entail small-scale pilots or implementation features that permit research designs allowing for strong causal inferences.
We thank two anonymous reviewers for helpful comments on an earlier draft of this paper.