Abstract

The No Child Left Behind Act (NCLB) has been criticized for encouraging schools to neglect students whose performance exceeds the proficiency threshold or lies so far below it that there is no reasonable prospect of closing the gap during the current year. We examine this hypothesis using longitudinal data from 2002–03 through 2005–06. Our identification strategy relies on the fact that as NCLB was phased in, states had some latitude in designating which grades were to count for purposes of a school making adequate yearly progress. We compare the mathematics achievement distribution in a grade before and after it became a high-stakes grade. We find in general no evidence that gains were concentrated on students near the proficiency standard at the expense of students scoring much lower, though there are inconsistent signs of a trade-off with students at the upper end of the distribution.

1.  Introduction

The No Child Left Behind Act (NCLB) of 2001 represents a major effort by the federal government to improve the academic performance of students who have traditionally lagged behind. States have been required to set minimum proficiency standards in reading and mathematics. Sanctions of increasing severity are to be applied to schools that fail to demonstrate adequate yearly progress (AYP), determined by the percentage of students achieving the state-defined performance standard. Over time the percentage of students required to meet this standard is ratcheted upward, until by 2014 virtually all students must score proficient or better.

NCLB has been criticized for focusing narrowly on a single performance threshold rather than on gains across the spectrum of achievement. Schools under short-term pressure to raise performance (i.e., make AYP) may pay greatest attention to students near the threshold, neglecting students who are already proficient as well as students so far below the standard that there is no reasonable likelihood of bringing them to that level within the current year. In short, accountability systems with this feature are thought to encourage a form of educational triage.

That NCLB has in fact had such an effect is now widely believed by educators and others involved in the formation of education policy. Typical of these views are the following:

“I can tell you anecdotally, after visiting many states in the last several years, that focusing on the bubble kids is an explicit strategy for many districts and schools,” said Margaret Heritage, the assistant director for professional development at the National Center on Evaluation, Standards, and Student Testing, located at the University of California, Los Angeles. (Viadero 2007)

“But because schools are under pressure to make AYP, which typically means a certain percentage of students must pass state standardized tests, some teachers say they are being told to spend less time working with students who have very little chance of passing. Instead, they are being asked to direct their energies toward so-called ‘bubble kids’—students who could pass standardized testing with a little extra help.” (Hart 2010)

“Any single proficiency standard invites sabotaging the goal of teaching all children, because the only ones who matter are those with scores just below passing. Educators call them ‘bubble kids,’ a term from poker and basketball, where bubble players or teams are those just on the cusp of elimination. Explicit school policies now demand that teachers ignore already-proficient children to focus only on bubble kids, because inching the bubbles past the standard is all that matters for adequate yearly progress.” (Rothstein 2009)

In the research literature, support for the triage hypothesis has been reported by Booher-Jennings (2005), Krieg (2008, 2011), and Neal and Schanzenbach (2010). The first is a qualitative case study of instructional practices in a single public elementary school in Texas. Consistent with much of the anecdotal evidence, it finds that accountability has increased the focus on students near the proficiency threshold. The two studies by Krieg use test results from the state of Washington. Krieg (2008) finds that in schools under increased pressure to meet AYP there is a negative effect on students well above or below the proficiency cut score, and Krieg (2011) finds that students belonging to academically successful racial groups score considerably lower in schools where another racial group failed to make AYP (presumably as resources are diverted to the less successful group). The fourth study, examining the introduction of high-stakes accountability in Chicago, found greater than expected gains concentrated in the middle of the achievement distribution, with evidence of negative effects at the low end and, on occasion, at the top.1 These findings led the authors to warn that the introduction of NCLB accountability can have a negative impact on the lowest achievers.

Not all studies have reached this conclusion, however. Dee and Jacob (2011) and Loveless, Farkas, and Duffett (2008) detect more improvement in National Assessment of Educational Progress math scores at the low end of the achievement distribution than at the top.2 In an examination of North Carolina's accountability systems, Ladd and Lauen (2010) take advantage of the fact that over time the state used two different types of accountability: a pre-NCLB “growth” system that rewarded schools for the progress made by students, and a “status” or “single threshold” system after implementation of NCLB. The authors report that achievement gains were more pronounced at the lower end of the distribution than at the upper end; in the schools facing NCLB-type pressure there were no gains on average at the high end, whereas gains at the low end were substantially greater than those for other groups. In another North Carolina study, Lauen and Gaddis (2012) provide evidence that NCLB helped close the gap between the lowest achieving students and those near the middle of the distribution. Springer (2008), analyzing third- through eighth-grade test scores in an unidentified northwestern state, also reports evidence contrary to the triage hypothesis: students at all achievement levels improved.3

All but one of the quantitative studies cited use large student-level data sets (Dee and Jacob 2011, which uses state-level panel data, is the exception). In general they utilize a difference-in-differences analysis to identify the effects of NCLB, contrasting the responses to the onset of accountability across schools or student subgroups at greater or lesser risk of failing to reach performance targets required to make AYP. The research reported here resembles the literature in these two respects. We differ from the literature in several important regards, however.

Previous analyses examine the impact of NCLB on performance on the high-stakes exams used to determine AYP (Dee and Jacob 2011 is again an exception). Although one would expect to find the most salient evidence of triage there, if triage does not also affect performance on low-stakes exams that are valid measures of student learning, it is not clear why we should be particularly concerned about it: triage would then be a matter of schools “teaching to the test,” with no broader consequences for student learning. In this study we examine outcomes on low-stakes exams (a choice that also has implications for our identification strategy).

Previous researchers have invariably estimated parametric models of achievement, despite the fact that there is no clear definition of a “bubble” student. Rather, there is an imprecise notion that the bubble encompasses students who are sufficiently close to the proficiency threshold at the beginning of the year that they can attain it with extra help. In this research we present graphical evidence based on nonparametric estimates of the relationship between prior and end-of-year achievement. Readers can judge for themselves whether the relationship takes on a distinctive shape in the vicinity they deem to be “the bubble.”

The existing literature uses test results from the preceding year as a control for prior achievement. This is problematic for three reasons: (1) data on prior achievement are often missing, particularly for mobile students; (2) these controls do not capture summer learning loss, which can vary substantially across students, and which can affect which students are perceived to be “on the bubble” in the fall when schools are making important instructional decisions; and (3) different test forms are used in different years. A common solution to this problem, the inclusion of grade and year effects to control for changes in the difficulty of the exam (among other things), is not sufficient when the focus of the analysis is not mean achievement but the distribution of achievement gains, inasmuch as a new test form may be more or less difficult for some students but not for others. In this research we use a fall score as a measure of prior achievement, avoiding the first and second of these problems. Moreover, there was no change of test form between fall and spring that could affect the scores of some students but not others. Tests were administered on computers via an adaptive testing algorithm designed to maximize the information obtained on each student's progress by reducing the number of items that are either too easy or too difficult, with items drawn from a common item bank in fall and spring.

Finally, most previous studies have examined outcomes in a single district or a single state. The variety of conclusions suggests that educational practices are not the same everywhere, and not all schools have had recourse to triage (or have been equally successful in implementing it). Our findings add to the base of evidence on this question. The fact that we, too, find variation across states is also suggestive, given that we use a single, consistent research methodology across all of these sites, so that our heterogeneous results cannot be ascribed to differences in tests or in the method of analysis. We regard this as yet another contribution to the literature, further emphasizing the need for caution when generalizing from evidence drawn from a single state or district.

2.  Research Question and Identification Strategy

As we noted, it is far from obvious how to identify students who are “on the bubble.” Schools (and by implication, researchers) must forecast the scores students are likely to obtain in the spring using information available to them in the fall. On the basis of these forecasts, some schools will anticipate little or no difficulty in making AYP. Although there are students in these schools who are expected to fall short of proficiency, there is no need for them to progress at a faster rate for the school to avoid sanctions. In other schools, students who begin the year in the same position will need to make more rapid progress if the school is to make AYP. Which students and how many will vary from school to school. Moreover, differences in state tests affect the probability of reaching proficiency from a given starting point. Given the noisiness of test scores, schools cannot take for granted that a student who begins the year above the proficiency cut score will perform that well at year's end, implying some uncertainty about the upper limit of the bubble.

These considerations suggest that a parametric model relying on a particular definition of a bubble student (e.g., one beginning the year within a certain distance of the cut score) will in many instances misidentify the students (if any) who are targeted for extra attention. Preferring to make as few assumptions on this score as possible, in this paper we use a nonparametric estimator of the entire achievement distribution to investigate the triage hypothesis. Specifically, we examine achievement growth over the course of the academic year conditional on a student's beginning achievement level in the fall. Evidence of educational triage would take the form of a distortion of that distribution wherein students in the vicinity of the proficiency cut score in the fall gain more, whereas students whose initial performance is either well above that level or well below it gain less. The evidence, in short, is visual—when we look at the conditional distribution of achievement, do we see anything that looks like the effect of triage?

We depict the hypothesized relationship in stylized form in figure 1a. In the absence of high-stakes accountability, achievement gains are depicted as a downward sloping linear function of initial (fall) scores. Though neither linearity nor the negative slope is essential, regression to the mean will tend to produce a downward sloping relationship, as will compensatory instructional practices on the part of teachers (focusing on students who start the year behind to help them catch up). In this figure we assume that the introduction of high-stakes accountability, by changing incentives, alters instructional practices. Students “on the bubble” gain more than previously, evident in a flattening or bulging out of the distribution in the vicinity of the cut score (normalized here to 0). Compared with the original relationship, the new relationship between initial achievement and gains is more sinusoidal, with a redistribution of achievement gains away from the ends toward students whose initial performance is nearer the cut score.
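To make the stylized contrast concrete, the following sketch plots a low-stakes baseline against the two hypothesized high-stakes responses: a bulge near the cut score paid for at both extremes (figure 1a) and a tilt toward low achievers (figure 1b). The functional forms and parameter values are arbitrary and purely illustrative, not estimates from our data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2.5, 2.5, 500)            # fall score, cut score normalized to 0
baseline = -0.3 * x                        # downward-sloping low-stakes relationship

# Figure 1a: triage, a bulge near the cut score at the expense of both extremes
triage = baseline + 0.25 * np.exp(-x**2 / 0.5) - 0.1
# Figure 1b: a tilt favoring low achievers at the expense of high achievers
tilt = baseline - 0.15 * x

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5), sharey=True)
for ax, alt, title in [(axes[0], triage, "Triage (fig. 1a)"),
                       (axes[1], tilt, "Tilt toward low achievers (fig. 1b)")]:
    ax.plot(x, baseline, label="low stakes")
    ax.plot(x, alt, linestyle="--", label="high stakes")
    ax.axvline(0.0, color="gray", linewidth=0.8)   # proficiency cut score
    ax.set_xlabel("fall score (SD units from cut score)")
    ax.set_title(title)
axes[0].set_ylabel("fall-to-spring gain")
axes[0].legend()
plt.tight_layout()
plt.show()
```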

Figure 1a. Stylized Representation of Educational Triage.

In figure 1b we have depicted an alternative response to high-stakes accountability that boosts gains among low-performing students at the expense of high-performing students. There is a tradeoff here, but it is not triage—all students at the low end of the distribution gain, and those at the bottom gain the most. The reverse is possible—a tilt (not shown) that favors high-performing students at the expense of low performers—though it would be a surprise to find a response of this kind to NCLB. A third possibility is an across-the-board rise in scores as high-stakes accountability leads schools to make more effective use of their resources to the benefit of all students. Finally, there might be no systematic difference between the high-stakes and low-stakes achievement profiles, but rather random variation in performance of no particular pattern.

Figure 1b. Response to NCLB Favoring Low Achievers.

This characterization of the triage hypothesis is qualitative—a flattening or bulging out of the achievement distribution in the neighborhood of the cut score, combined with declines at the extremes. One might worry that this characterization is too vague to provide guidance in interpreting complicated patterns in the data. In principle this might have been so. In fact, the visual evidence is nearly always clear and unambiguous. Changes in the distribution of achievement within our sample rarely look like figure 1a (triage). Although we do find some evidence of a bulge in the vicinity of the cut score, it does not come at the expense of the extremes. More commonly the visual evidence resembles figure 1b (a tilt favoring low achievers), though we also find across-the-board improvements. In the data we examine, there is exceedingly little support for the hypothesis that triage has been a frequent—let alone dominant—response to high-stakes accountability under NCLB.

Identification Strategy

As noted earlier, previous studies have tended to use some variant of a difference-in-differences strategy to assess the effects of NCLB, typically contrasting schools or subgroups at greater or lesser risk of failing to make AYP.4 It is not obvious that successful schools or student subgroups tell us much about counterfactual achievement, absent accountability, among schools and subgroups that are struggling. In this paper we use an identification strategy that exploits a quirk in the way NCLB was implemented to avoid making such assumptions. During the phase-in of NCLB, not all grades counted as “high stakes” for purposes of determining whether a school made AYP. Whereas full implementation of NCLB requires testing in grades 3–8 and in one high school grade, in the early years (before 2005) states had the option of designating one elementary grade and one middle school grade as the grades whose test results would determine whether the school made AYP. Because of delays in implementing NCLB, in practice full implementation was sometimes not achieved until 2006.

The principal reason for this delayed phase-in was to permit states to develop assessments for grades that had not previously been tested. Before a test was available, a grade was necessarily a low-stakes grade. Once a test came on line, results for that grade generally counted in the determination of AYP. Identifying the effects of NCLB by comparing low-stakes to high-stakes grades would therefore appear to be infeasible, as the low-stakes grades are precisely those for which no achievement data are available. This is not invariably the case, however. Districts often pursue their own testing programs, a fact we exploit in order to compare outcomes from years when testing in a grade was low-stakes with later years when outcomes in that grade had a bearing on whether a school made AYP. By further restricting the sample to districts and schools that participated in testing under both regimes, we effectively control for a wide variety of otherwise unobservable factors responsible for cross-sectional variation in achievement. The approach is analogous to including school and district fixed effects in a regression model. We identify the effects of NCLB from changes that occur within grade and school when that grade switches status from low-stakes to high-stakes.

We do not claim that this strategy allays all concerns about identification. Particularly in the early years of NCLB, high-stakes grades may have received extra attention and special treatment not feasible later, when all grades counted toward AYP. For example, the best teachers may have been reassigned to the high-stakes grades, if they were not already there. (Recall these are typically grades in which a state was already conducting achievement testing.) If so, a comparison of outcomes in low- and high-stakes grades could overstate the effects of accountability. We agree this would be worrisome if our goal were to estimate the effect of accountability on mean achievement. It is less troubling given the object of this study, which is to examine the effect of accountability on the distribution of achievement. If, as some researchers have reported, schools have explicitly instructed teachers to focus their efforts on students near the proficiency threshold, or if the reason to do so is sufficiently obvious that teachers do not require explicit direction, we would expect to see evidence of such practices in the distribution of achievement, whatever might be happening to the mean.

Our identification strategy assumes a certain degree of myopia on the part of schools, that schools treat students in low-stakes grades differently from students in high-stakes grades, even though many of the former will be promoted into a high-stakes grade before leaving the school. If, instead, schools take the long view, we may fail to find significant differences between low-stakes and high-stakes grades. Nevertheless, the triage hypothesis itself rests on very much the same assumption, postulating that schools are focusing their attention on students who can be brought to the proficiency level within the current year. If schools instead reckon that high-performing students need to continue to progress in order to meet tougher standards in future grades, and that the lowest-performing students might require more than a year to catch up but that the investment will ultimately pay off, triage is much less likely to occur.

Finally, we note that we can test our identification strategy by ascertaining whether there are any differences between the distribution of outcomes in low-stakes and high-stakes years. To the extent that we find differences not consistent with triage, it would appear that the identification strategy is successful—that is, schools are responding to NCLB, merely not in the way suggested by the hypothesis.

3.  Data

The principal data for this research are drawn from the Northwest Evaluation Association's (NWEA) Growth Research database. During the period of our study, NWEA contracted with over 3,400 school districts in 45 states to conduct testing for diagnostic and formative purposes. NWEA has developed tests in reading, mathematics, language arts, and, more recently, science. Exams at different grade levels are placed on a single scale to measure student development over time and are constructed to avoid ceiling effects. We have supplemented NWEA data with information from the Common Core of Data (see Appendix A).

Most schools contracting with NWEA test at least twice a year, in the fall and spring, though not all districts contracting with NWEA test all of their students. This study uses data from three states in which comparatively large numbers of students were tested—Colorado, Idaho, and Indiana (see table 1). There were other states in which the number of tested students was as large, but those states could not be included in the sample because they had no low-stakes grades, having designated all grades as high-stakes from the inception of NCLB. We further restrict the sample to schools that tested at least 80 percent of their students.5 We include only students tested in both fall and spring in the same school.6
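A minimal sketch of these sample restrictions, in Python with pandas, is given below; the column names (school_id, term, score, n_enrolled, and so on) are assumptions rather than the actual NWEA schema.

```python
import pandas as pd

def restrict_sample(tests: pd.DataFrame, enrollment: pd.DataFrame) -> pd.DataFrame:
    """tests: one row per student/term with columns
       [state, school_id, grade, year, student_id, term, score], term in {'fall', 'spring'};
       enrollment: [school_id, grade, year, n_enrolled]."""
    # keep schools that tested at least 80 percent of enrolled students in the grade
    n_tested = (tests[tests.term == "fall"]
                .groupby(["school_id", "grade", "year"])["student_id"]
                .nunique().rename("n_tested"))
    coverage = enrollment.merge(n_tested.reset_index(), on=["school_id", "grade", "year"])
    keep = coverage[coverage.n_tested >= 0.8 * coverage.n_enrolled][["school_id", "grade", "year"]]
    tests = tests.merge(keep, on=["school_id", "grade", "year"])

    # keep students with both a fall and a spring score in the same school
    wide = tests.pivot_table(index=["state", "school_id", "grade", "year", "student_id"],
                             columns="term", values="score").reset_index()
    return wide.dropna(subset=["fall", "spring"])
```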

Table 1.
Number of Observations by State, Grade, and Year

State      Grade   2003     2004     2005     2006     All Years
Colorado   3       10,633   8,450    9,577    10,128   38,788
Colorado   4       10,848   8,292    9,135    10,020   38,295
Idaho      5       15,663   18,558   18,636   18,896   71,753
Idaho      6       15,506   18,898   18,950   19,062   72,416
Indiana    4       25,931   23,830   22,914   24,257   96,932
Indiana    5       26,249   24,455   23,212   24,548   98,464
Indiana    7       25,795   23,066   21,684   23,497   94,042

Notes: In all grades shown, 2005 and 2006 are the high-stakes years. Entries compiled from Northwest Evaluation Association's Growth Research Database.

We have already noted several advantages to using NWEA tests: (1) fewer problems with missing data due to student mobility between years; (2) controls for prior achievement that incorporate summer learning loss; (3) computer-adaptive testing that avoids problems associated with changing test forms; and (4) greater generalizability of results from low-stakes tests.7 There is an additional feature of this testing that is particularly useful for our purposes. Schools often use the results of NWEA tests to identify students who need to make extra progress in order to achieve proficiency. As a result, NWEA has conducted a series of technical studies to create crosswalks between scores on its tests in mathematics and reading, and scores on each state's high-stakes assessments. These technical studies are posted on the company's Web site and information is disseminated to school districts to aid schools in the interpretation of NWEA test results. (See Appendix B for a list of those studies pertaining to the three states in our research.) Furthermore, NWEA provides reports to classroom teachers and schools within three days of completing testing so teachers and principals know which students in their classes and school are on track to meet proficiency standards and which students may require remediation.

There are also drawbacks to using NWEA data. The mix of schools represented in the NWEA database changes over time as new districts sign contracts with NWEA and old districts allow existing contracts to elapse. The changing makeup of the population of NWEA schools is particularly problematic given that our identification strategy rests on another change that occurs over time—the switching of a grade from low- to high-stakes status. We therefore restrict the analysis to schools that administered NWEA tests in all four sample years.

One might also wonder how representative NWEA districts are of the rest of their states. If the districts signing contracts with NWEA are highly atypical of their states, questions are raised about the generalizability of our findings (though the internal validity of our analyses is unaffected). In table 2 we present characteristics of sample schools and non-sample schools for the three states included in this study, drawing on information in the National Center for Education Statistics’ Common Core of Data. We use data from the 2005–06 school year for this comparison. The non-NWEA comparison set is restricted to schools that have at least some students in the grade range examined in this study, grades 3–8.

Table 2.
Characteristics of Sample Schools and Non-Sample Schools

                                Colorado            Idaho               Indiana
                                NWEA      Other     NWEA      Other     NWEA      Other
Large city (%)                  5.26      3.89      0.00      0.00      5.34      5.43
Mid-size city (%)               2.63      2.22      5.93      12.00     6.87      6.59
Fringe of large city (%)        9.21      8.33      0.00      0.00      17.56     16.28
Fringe of mid-size city (%)     2.63      3.89      11.02     18.00     10.69     7.75
Large town (%)                  0.00      0.00      0.85      2.00      0.76      1.16
Small town (%)                  19.74     14.44     19.49     24.00     12.98     12.02
Rural, outside CBSA/MSA (%)     46.05     47.78     42.37     20.00     24.43     22.87
Rural, inside CBSA/MSA (%)      14.47     18.89     20.34     18.00     21.37     20.54
Number of schools               189       1,246     497       95        516       1,231
Charter schools                 18 80 19 15 14
Magnet schools                  15 12
Title I eligible schools        133       719       413       21        480       1,042
Pupil–teacher ratio             16.13     23.01     17.38     6.39      17.35     15.70
Total enrollment                62,820    501,016   187,536   6,025     232,020   533,005
Average enrollment              332.38    402.10    377.34    63.42     449.65    432.99

In every state except Colorado, the NWEA sample comprised at least one-fourth of all schools serving one or more of these grades. In Idaho, virtually all schools are in the NWEA sample; those that are not tend to be small schools serving special populations. Elsewhere there are minor differences between NWEA and non-NWEA schools. In all states the NWEA sample includes a higher proportion of schools eligible for Title I funds. Charter schools are also over-represented in the NWEA sample. In Colorado, the NWEA sample contains schools somewhat smaller than average, with lower pupil–teacher ratios. By and large, however, the NWEA samples are broadly representative of these states as a whole. Certainly nothing here suggests that they are highly atypical with respect to the characteristics shown in table 2.

An additional complication is created by the fact that students in the same state (and even in the same district and school) do not take NWEA tests at the same time. To control for the time that elapses between testing dates, we divide a student's test score gain by the total number of days between fall and spring administration of the NWEA test. The assumption that all days within this interval contribute equally to gains is strong. Nonetheless, a superior metric is not obvious. Moreover, as our estimator is based on within-grade changes in testing status (high-stakes vs. low-stakes), if testing dates are stable over time, the fact that they vary from district to district should not affect our findings. In table 3 we show mean fall and spring test dates by state and grade for low-stakes versus high-stakes years. Apart from Colorado, where spring test dates are approximately two weeks later during high-stakes years, differences are not pronounced. We return to this issue in our sensitivity tests in section 5, restricting our samples to schools with stable testing dates.

Table 3.
Mean Testing Dates, Sample States and Grades

State      Grade   High Stakes   Fall    Spring   Days Elapsed
Colorado   3       no            43.4    69.9     210.1
Colorado   3       yes           50.4    90.0     223.6
Colorado   4       no            43.5    70.2     210.3
Colorado   4       yes           49.8    90.5     224.6
Idaho      5       no            59.6    89.4     213.4
Idaho      5       yes           61.8    88.2     210.4
Idaho      6       no            61.2    89.5     211.8
Idaho      6       yes           60.9    87.5     210.6
Indiana    4       no            48.1    80.8     216.2
Indiana    4       yes           47.5    81.9     218.4
Indiana    5       no            48.0    80.4     215.9
Indiana    5       yes           47.7    80.8     217.1
Indiana    7       no            46.4    78.7     215.8
Indiana    7       yes           47.9    80.6     216.8

Notes: Fall test dates are measured in days from 1 July; spring test dates in days from 1 January. Days elapsed is the mean number of calendar days between fall and spring testing.

We note, finally, some idiosyncrasies in the implementation of NCLB within our states. In Colorado, separate ratings are issued for elementary and middle school grades even when these grades are housed in the same school. We treat these cases as effectively two schools for purposes of analysis. Also in Colorado, the level of achievement termed “proficiency” is not the level students need to reach for the school to make AYP under NCLB. Rather, “partially proficient” is treated as “proficient” for purposes of NCLB (though not for the state's own accountability system). All references to “proficiency” in Colorado should therefore be understood as applying to the level of achievement that the state calls “partially proficient.”

4.  Distribution of Achievement Gains: Baseline Results

We investigate the effects of NCLB on a standardized gain score (the daily rate of gain described above, standardized to have a mean of zero and standard deviation of one within each state/grade/year). We also standardize the initial level of achievement in the fall. Because attention has focused on “bubble students” near the threshold of proficiency, we center the standardized fall score on the proficiency cut score (literally, the score that NWEA has identified as the cut score equivalent on its tests). Given that these cut scores have remained constant over time in a given state and grade, and that data for each state/grade are analyzed separately, this is an innocuous normalization.8
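The construction of the outcome and conditioning variables might look like the following sketch. The exact normalization of the fall score (centered on the NWEA cut-score equivalent and scaled by the within-cell standard deviation) is our reading of the text rather than a documented formula, and the column names are assumptions.

```python
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # daily rate of gain: fall-to-spring growth divided by calendar days between tests
    df["daily_gain"] = (df.spring_score - df.fall_score) / df.days_elapsed

    grp = df.groupby(["state", "grade", "year"])
    # standardized daily gain: mean 0, SD 1 within each state/grade/year cell
    df["z_gain"] = grp.daily_gain.transform(lambda s: (s - s.mean()) / s.std())
    # fall score centered on the proficiency cut score, in within-cell SD units
    df["z_fall"] = (df.fall_score - df.cut_score) / grp.fall_score.transform("std")
    return df
```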

In this paper we report seven sets of results, defined by state and grade level: Colorado, grades 3 and 4; Idaho, grades 5 and 6; and Indiana, grades 4, 5, and 7. We have conducted additional analyses at the state level, pooling data across grades, for Minnesota, Arizona, Michigan, and Wisconsin. (The pooling is necessitated by the smaller samples.) Results are qualitatively similar to those reported here. (See the online appendix to this paper for details, available at Education Finance and Policy's Web site at www.mitpressjournals.org/doi/suppl/10.1162/EDFP_a_00189.) We restrict each analysis to a particular grade because cut scores vary by grade, affecting the probability a given student will attain proficiency and the identification of which students are “on the bubble.” We have also selected these grades because we have four years of data for each: two from a low-stakes regime (2003 and 2004), and two after the grade was designated high-stakes (2005 and 2006). This makes it possible for us to compare year-to-year changes during the low-stakes regime to changes that occurred between low- and high-stakes testing. The former serve as a placebo test. If changes between 2003 and 2004 are similar in kind and magnitude to those observed between the two accountability regimes, we would appear to have adopted a dubious identification strategy.

We use a well-known kernel regression estimator, the Nadaraya-Watson estimator, to obtain nonparametric estimates of the conditional mean of achievement growth—that is, growth from fall to spring testing, conditional on a student's score in the fall.9 For these baseline estimates, we have applied the sample restrictions discussed above. Grades must have switched from low-stakes to high-stakes within our sample period, schools must have tested at least 80 percent of their students in the grades in question, and students must have fall and spring test results in the same school. Counts of student observations in our baseline sample were presented in table 1.
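A minimal sketch of the Nadaraya-Watson estimator of the conditional mean of gains given fall scores follows; the Gaussian kernel is an assumption, and the bandwidth of 0.1 matches the value reported in section 5.

```python
import numpy as np

def nadaraya_watson(x, y, grid, bandwidth=0.1):
    """Kernel-weighted local mean of y evaluated at each point of `grid`."""
    x, y, grid = np.asarray(x), np.asarray(y), np.asarray(grid)
    u = (grid[:, None] - x[None, :]) / bandwidth   # one row per grid point
    w = np.exp(-0.5 * u**2)                        # Gaussian kernel weights
    return (w @ y) / w.sum(axis=1)

# example: conditional mean of gains for low-stakes and high-stakes years
# grid   = np.linspace(-2.5, 2.5, 101)
# m_low  = nadaraya_watson(df_low.z_fall,  df_low.z_gain,  grid)
# m_high = nadaraya_watson(df_high.z_fall, df_high.z_gain, grid)
```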

Our nonparametric estimates yield a very large number of graphs, only a few of which can be shown in the published version of this paper. Complete results for all states/grades are provided in the online appendix. Here we display results for one grade from each state, preferring graphs in which contrasts are more frequently statistically significant. Results for other grades in the same states are qualitatively if not quantitatively similar.

The distribution of gains conditional on fall scores is depicted for Colorado, grade 3, Idaho, grade 6, and Indiana, grade 5, in figures 2a–2c. Because nonparametric estimates can become quite erratic at the extremes of the fall distribution, the x-axis is truncated at 2.5 standard deviations above and below the cut score. Fewer than 2.5 percent of students fall outside these limits. Note also that the vertical scale differs from one state to the next. For each state/grade we display two graphs: pre-NCLB changes on the left (the placebo test); and a contrast of the low- and high-stakes testing regimes on the right (the treatment effect). The shaded areas in figure 2 show where 95 percent confidence intervals for the high-stakes and low-stakes estimates overlap.10 Regions in which the two curves lie outside the shaded area therefore represent statistically significant differences by this measure. Not surprisingly, there is greatest overlap at the extreme ends of these curves, where there are relatively few observations. This is less true in the vicinity of the cut score (except where the low-stakes and high-stakes curves are close to one another), suggesting that our nonparametric estimates have sufficient power to detect the presence of a bubble, if there is one.
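The pointwise confidence bands whose overlap is shaded in figure 2 could be computed along these lines. A nonparametric bootstrap over students is one option (an assumption, not necessarily the method behind the published figures), reusing the nadaraya_watson function sketched above.

```python
import numpy as np

def bootstrap_band(x, y, grid, n_boot=200, bandwidth=0.1, seed=0):
    """Pointwise 95 percent band for the kernel regression, by resampling students."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    fits = np.empty((n_boot, len(grid)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), size=len(x))           # resample with replacement
        fits[b] = nadaraya_watson(x[idx], y[idx], grid, bandwidth)
    return np.percentile(fits, [2.5, 97.5], axis=0)           # lower, upper

# the shaded region is where the low- and high-stakes bands overlap:
# lo_low,  hi_low  = bootstrap_band(df_low.z_fall,  df_low.z_gain,  grid)
# lo_high, hi_high = bootstrap_band(df_high.z_fall, df_high.z_gain, grid)
# overlap_lower = np.maximum(lo_low, lo_high); overlap_upper = np.minimum(hi_low, hi_high)
```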

Figure 2a. Math Gains as a Function of Fall Scores, Colorado.

Figure 2b. Math Gains as a Function of Fall Scores, Idaho.

Figure 2c. Math Gains as a Function of Fall Scores, Indiana.

Focusing first on the placebo tests (the leftmost graph in each figure), we find no worrisome differences between 2003 and 2004 in these grades, even though NCLB had already taken effect and tests in other grades were high stakes. The distribution of scores in 2004 does not differ systematically from the distribution in 2003. In most grades and states, the two curves representing achievement conditional on fall performance tend to wrap around each other. In most parts of the distribution each curve lies within the other's confidence interval.

The graphs depicting changes between the low- and high-stakes regimes are different but there is little evidence of triage. In Colorado, grade 3, high-stakes accountability is associated with greater gains practically across the board. In Idaho, grade 6, high-stakes testing has improved gains for students at the low end of the distribution—with gains greatest for those furthest below the cut score. In Indiana, grade 5, we see a bulging out of the high-stakes distribution in the region to the immediate left of the cut score, a plausible location for students “on the bubble,” but there are statistically significant differences in favor of students at the ends of the distribution as well. We note that our failure to find evidence of triage is not due to a more general failure of our identification strategy to find any effects of NCLB. Schools in our sample have responded to high-stakes accountability, but not by practicing triage. This conclusion holds as well for the states/grades whose results appear in our online appendix.

Did the Schools in Our Sample Have a Reason to Practice Triage?

In the next section, we report the results of several sensitivity tests, investigating whether triage holds in subsamples of our data: in other words, whether we can find evidence of triage if we look more judiciously. Before doing so, however, we first take our baseline results at face value and ask whether there are special reasons, perhaps peculiar to our sample, that these schools did not practice triage. We consider two possibilities. First, it may be that many of the lowest-achieving students in our sample have a high chance of reaching proficiency within the current year. Schools therefore have no reason to write them off. Second, schools may need many of their lowest-achieving students to reach proficiency in order for the school to make AYP. In such cases, triage simply isn't an option.

To explore these hypotheses, we begin by estimating the probability that students reach the proficiency cut score when next tested, conditional on their fall performance.11 Our model includes indicators for state, grade, time that elapses between tests, and fall scores in math and in reading interacted with grade level.12 As one would expect, fall performance is strongly predictive of future performance. In our sample, students who are already scoring at the cut score or higher are predicted to have a 95 percent chance of again reaching the cut score when next tested. Depending on state and grade, this probability drops to as low as 29 percent for students whose fall scores are 1 to 1.5 standard deviations below the cut score. Results aggregated over all states and grades are reported in table 4. For students whose fall performance is more than 1.5 standard deviations below the cut score, the probability of reaching proficiency when next tested is quite low. It is therefore not the case that virtually all students in our sample are “on the bubble”—that is, have a reasonable chance of reaching proficiency within a year. On the contrary, students whose performance in the fall is a standard deviation or more below the cut score are likely candidates to receive less attention under a system of triage (if triage is being practiced).
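A sketch of this probability model in statsmodels formula syntax follows; the variable names and the use of a logit (rather than a probit or another binary-response specification) are assumptions.

```python
import statsmodels.formula.api as smf

# df: one row per student, with reached_cut coded 0/1 and standardized fall scores
model = smf.logit(
    "reached_cut ~ C(state) + C(grade) + days_elapsed"
    " + z_fall_math:C(grade) + z_fall_reading:C(grade)",
    data=df,
).fit()

# predicted probabilities of the kind summarized in tables 4 and 5
df["p_pass"] = model.predict(df)
```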

Table 4.
Relationship of Standardized Fall Test Score to the Probability of Reaching the Proficiency Cut Score

Fall Score Less Than     and Greater Than     Mean Probability of Passing
                         −0.5                 0.73
−0.5                     −1                   0.51
−1                       −1.5                 0.29
−1.5                     −2                   0.14
−2                                            0.04

What of the second explanation, then, that schools can't afford to write these students off because they can't make AYP without them? To explore this hypothesis, we rank students within each school by fall test scores. We define the marginal student for each school as the nth student, where n students need to reach proficiency for the school to make AYP. (Only students in high-stakes grades are ranked.) We then ask where our low-performing students fall relative to the marginal student in their schools, to ascertain whether the school needs these students in order to make AYP. Students who rank far below the marginal student might well be given less attention in a system of triage, as there are many more promising candidates for the school to focus on.13
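The two marginal-student comparisons reported in table 5 could be computed as sketched below. Here df is assumed to hold one row per student in a high-stakes grade, with a school-level column n_needed giving the number of students who must reach proficiency for the school to make AYP, and p_pass the predicted probability from the model above.

```python
import pandas as pd

def marginal_student_gaps(df: pd.DataFrame) -> pd.DataFrame:
    out = []
    for _, g in df.groupby("school_id"):
        g = g.sort_values("fall_score", ascending=False).reset_index(drop=True)
        n = min(int(g.n_needed.iloc[0]), len(g))   # guard against very small schools
        marginal = g.iloc[n - 1]                   # the nth-ranked student by fall score
        # table 5, column 1: gap in predicted pass probability relative to the marginal student
        g["prob_gap"] = g.p_pass - marginal.p_pass
        # table 5, column 2: positions above (+) or below (-) the marginal student in the ranking
        g["positions_from_marginal"] = (n - 1) - g.index
        out.append(g)
    return pd.concat(out, ignore_index=True)
```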

In table 5 we report descriptive statistics for two indicators. The first is the difference between a student's own probability of reaching proficiency when next tested and the same probability for the school's marginal student. A negative value indicates that a student is less likely than the marginal student to reach this level. We give the mean values of this indicator for various ranges of the fall score. As one would expect, this indicator becomes increasingly negative the further a student's fall score is from the proficiency threshold. Students whose scores are 1 to 1.5 standard deviations below the cut score in the fall are 50 percentage points less likely to reach proficiency than the marginal student in their schools—a very substantial difference, indicating that such students are unlikely to be considered on the bubble.

Table 5.
Comparison of Individual Students to a School's Marginal Student, by Fall Score

Fall Score Range     Mean Difference in Pass Probability     Mean Number of Intervening Students
> 0                   0.09                                     63.6
(−0.5, 0)            −0.04                                     −8.2
(−1, −0.5)           −0.25                                    −36.8
(−1.5, −1)           −0.47                                    −58.2
(−2, −1.5)           −0.63                                    −70.5
< −2                 −0.74                                    −82.9

Note: The marginal student is the nth student (ranked by fall score) in a school where n students need to score proficient for the school to make AYP.

We also report a second indicator of the distance between a student and the school's marginal student—the number of intervening students in the ranking described earlier. Once again, the gap widens as fall performance drops. Students in the same category (1 to 1.5 standard deviations below the cut score in the fall) are on average 58 positions below the marginal student in their schools. (Only students in high-stakes grades are counted.) That is, if a school wants a cushion in the event the marginal student doesn't pass, on average there are 58 other students who represent better bets. By neither indicator is there much support for the hypothesis that the schools in our sample can't afford to neglect students who are a standard deviation or more below the cut score in the fall if they are to make AYP.

5.  Sensitivity Tests

In this section we report the results of a large number of sensitivity tests, looking for evidence of triage in subsamples of our baseline data. Because our purpose is to compare these results with our baseline findings, we continue to use the same bandwidth of 0.1 as before, although the sample sizes are often considerably smaller than the baseline samples and many of the curves are undersmoothed. We have done this in the knowledge that readers can compensate for undersmoothing by picturing to themselves a smoother curve, whereas they cannot similarly compensate for oversmoothing.

Title I Schools and Low-performing Schools

To this point our sample has included schools that do not face sanctions if they fail to make AYP, or that simply have a low probability of failing and are therefore under little pressure to alter instructional practices. NCLB sanctions generally apply only to schools receiving Title I funds.14 We conduct two sensitivity tests to see whether we find evidence of triage among schools under greater pressure. In the first, we limit the sample to Title I schools. In the second, we use the subset of schools in Idaho and Indiana that failed to make AYP in either 2003 or 2004 (before the switch to high-stakes testing in the grades we have analyzed). Results for these two tests are very similar (reflecting the extensive overlap between Title I schools and our low-performing schools), and we present only the results for the schools that had already failed to make AYP at least once before 2005 (figures 3a and 3b). The distribution of gains in high-stakes and low-stakes regimes is quite similar to what we found in our baseline analysis.15 Differences are less frequently statistically significant, reflecting the smaller sample sizes.

Figure 3a. Math Gains, Restricted Sample, Idaho.

Figure 3b. Math Gains, Restricted Sample, Indiana.

Terminal Grade Students

As noted earlier, the triage hypothesis rests on the assumption that schools behave myopically, focusing their attention on students who can be brought up to proficiency within the current year. If instead schools take the long view, we ought not be surprised at finding few systematic differences between low-stakes and high-stakes grades.

There is, however, one case in which a short time horizon is not particularly myopic—students who are in their final year in a school. If the low achievers in this group cannot be brought to the proficient level within the current year, the school has no long-term stake in raising their achievement, at least from the standpoint of making AYP. Likewise, if students start the year comparatively assured of reaching the proficiency threshold for that grade, the school has no long-term stake in moving them forward, even though they will be held to higher standards in the future. By that time, they will be someone else's students. Thus terminal-grade students would appear to offer the most likely setting for triage. If NCLB has induced triage anywhere, presumably it is here.

We have repeated the previous analyses, restricting our samples to students in the final grade offered by a school. Because of the configuration of grade levels within schools, we lose all of our Colorado observations. Results for Idaho and Indiana are very similar to our baseline estimates (see the online appendix).

Triage Masked by Reassignment of Effective Teachers

Because NCLB did not require high-stakes testing in all grades immediately, schools might have attempted to game the system by placing their most effective teachers in high-stakes grades. If these teachers had the ability to reach students at all levels of ability—if they didn't need to resort to triage—then their superior teaching might be the reason we fail to see a focus on “bubble” students. Had less capable teachers been assigned to the high-stakes grades, the effects of triage (so the argument goes) would have been more apparent.

We doubt this explains our failure to find stronger evidence of triage. Unlike some state accountability systems in which high-stakes tests were given in the same one or two grades every year, under NCLB all grades from 3 to 8 would count in determining AYP—if not at once, then by 2005 or 2006. Given the personal inconvenience to teachers and the probable impact on their effectiveness from frequent changes in assignment, we are skeptical that school administrators would have shifted their best teachers across grades for such uncertain and short-lived gains. Although previous research shows that effective teachers are more likely to continue teaching in high-stakes grades with the onset of accountability (Chingos and West 2011), that research does not speak to the question of whether an effective teacher in a low-stakes grade is apt to be moved into a high-stakes position. Nonetheless, it is easy enough to test the conjecture that such assignments explain why we fail to detect triage. By 2005, grades 3–8 were all supposed to be giving high-stakes tests. Even allowing for the fact that this target was not met in some states, by 2006 all grades in our sample were designated high-stakes, making it impossible for an administrator to game the system by assigning the school's best teachers to the high-stakes grades.

Accordingly, we have re-estimated the achievement relationship using restricted samples in which high-stakes data are drawn only from the final year, 2006. (We continue to use all the data from the two low-stakes years.) Results are very similar to our baseline estimates (again, see the online appendix).

Changing Test Dates

As we noted, there is variation between and even within schools in the dates on which NWEA tests are taken. Within-school variation in a given year does not appear to be particularly problematic. Most students at a particular grade-level take the test within a few days of one another. Between schools and within a school over time, variation is greater. We have controlled for this by putting fall-to-spring gains on a per day basis, although it may be that this solution is too crude: The effectiveness of instruction might depend on when within the school year it is given, and variation in the latter could conceivably be correlated with the switch from a low-stakes to high-stakes testing regime.

Accordingly, we have identified a set of schools where testing dates varied little over the sample period. Because there is no single test date within a school, we measured stability as follows. First, we found the point in the school year (counting from 1 July for the fall and from 1 January for the spring) by which 10 percent of a school's students (at a particular grade level) had been tested. We then found the date by which 90 percent of the students in that grade had been tested. We then required that neither of these two dates vary by more than two weeks over the four years of our study period. Schools meeting these criteria were deemed to have stable testing regimes. Depending on grade and state, between a quarter and a half of the original student sample is found in these schools.
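A sketch of this stable-testing-date screen follows; column names are assumptions, and test_day is counted from 1 July in the fall and from 1 January in the spring, as in table 3.

```python
import pandas as pd

def stable_schools(tests: pd.DataFrame, max_range_days: int = 14) -> pd.Index:
    """tests: one row per student/term with columns
    [school_id, grade, year, term, test_day]; term is 'fall' or 'spring'."""
    # dates by which 10% and 90% of a school's students in a grade had tested, per term and year
    q = (tests.groupby(["school_id", "grade", "term", "year"])
              .test_day.quantile([0.10, 0.90]).unstack())
    # range of each of those dates across the sample years
    spread = q.groupby(["school_id", "grade", "term"]).agg(lambda c: c.max() - c.min())
    # a school/grade is "stable" if every such date moved by no more than two weeks
    ok = (spread <= max_range_days).all(axis=1)
    stable = ok.groupby(["school_id", "grade"]).all()
    return stable[stable].index
```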

Estimates obtained using this sample are reported in the online appendix. Overall they are quite similar to our baseline results. There is nothing that resembles triage.

Generalizability Beyond These Three States

Finally, we turn to the possibility that our findings reflect outcomes in a small number of atypical states. Our ability to test this hypothesis is limited. Although there are other states in which NWEA tests are widely used (e.g., California, South Carolina), these states designated all grades 3–8 as high-stakes when NCLB first took effect and therefore do not lend themselves to our identification strategy. There are other states that phased in high-stakes accountability, though NWEA testing in those states during our sample period was much less extensive than in the three states we have examined in this paper. We have conducted analyses identical to those reported here on four of these states in which our samples are largest: Arizona, Michigan, Minnesota, and Wisconsin. Results for samples pooled across grades and including all tested schools, regardless of when they signed contracts with NWEA, are depicted in figure 4. In Arizona and Wisconsin, we find that the onset of high-stakes accountability was accompanied by a decline in gains among the lowest achieving students. As there were no offsetting gains elsewhere, it is not evident that a focus on bubble students is responsible. We also note that the negative impact was limited to the extreme left-hand tail, where there are few students. In Michigan and Minnesota, high-stakes accountability has been accompanied by a modest outward shift of achievement at all levels.

Figure 4. Math Gains, Additional States.

6.  Discussion

This study has investigated whether NCLB has led schools to practice educational triage, focusing instructional efforts on students near the proficiency threshold to the detriment of those well above and below it. We find essentially no support for this hypothesis. Much more commonly, we find that accountability has increased gains among the students who began the year substantially below the proficiency cut score. Evidence on the impact of accountability on high achievers is mixed, with a negative impact in some places (Idaho and Indiana, grade 4) and a positive one in others (Colorado and Indiana, grades 5 and 7).

Although our findings are not unprecedented (as noted in the Introduction, evidence for the practice of triage is mixed), the extent to which they are due to our identification and estimation strategy may be questioned. We pursue this matter by applying to our data the approach used by Neal and Schanzenbach (2010) in their study of accountability in Chicago. Their study relied on two years of data for each of two cohorts, the first year supplying controls for prior achievement, and the second year outcomes that were contrasted across cohorts. High-stakes accountability was initiated during cohort B's second year. Thus, B experienced a change in accountability regime that A did not. Students in A were divided into deciles based on scores from a prior year. Within each decile, these prior scores were then used to predict achievement at the end of the current year. The same coefficients were applied to prior scores of cohort B to generate a counterfactual prediction—expected achievement in the absence of a change in the accountability regime (assuming no other differences across cohorts). The difference between these predictions and the actual cohort B outcomes represented the treatment effect. Comparing results across deciles, Neal and Schanzenbach concluded that the onset of high-stakes accountability was damaging to the lowest-achieving students and most helpful to those in the middle, with mixed results (depending on grade level) for those at the high end.
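In outline, the Neal and Schanzenbach (2010) procedure we replicate can be sketched as follows; the within-decile regression is shown here as a simple linear fit of current on prior scores, which is our reading of their approach rather than their exact specification, and column names are assumptions.

```python
import numpy as np
import pandas as pd

def decile_effects(cohort_a: pd.DataFrame, cohort_b: pd.DataFrame) -> pd.Series:
    # deciles defined on cohort A's prior-year score distribution
    edges = np.quantile(cohort_a.prior_score, np.linspace(0, 1, 11))
    a_dec = np.clip(np.searchsorted(edges, cohort_a.prior_score, side="right") - 1, 0, 9)
    b_dec = np.clip(np.searchsorted(edges, cohort_b.prior_score, side="right") - 1, 0, 9)

    effects = {}
    for d in range(10):
        a, b = cohort_a[a_dec == d], cohort_b[b_dec == d]
        # cohort A: within-decile regression of end-of-year score on prior score
        slope, intercept = np.polyfit(a.prior_score, a.current_score, 1)
        # cohort B: counterfactual prediction under the pre-accountability relationship
        counterfactual = intercept + slope * b.prior_score
        effects[d] = (b.current_score - counterfactual).mean()
    return pd.Series(effects, name="treatment_effect_by_decile")
```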

We have replicated this approach in those states and cohorts for which we have two consecutive years of data before the onset of high-stakes testing and another two years after (two fourth-grade cohorts in Colorado, two sixth-grade cohorts in Idaho, and two fifth-grade cohorts in Indiana). Results are shown in figures 5b–5d, with Neal and Schanzenbach's (2010) results in figure 5a. We find a different pattern in each state, but none of them resembles the inverted-U shape obtained by Neal and Schanzenbach. These results are broadly consistent with those shown in figures 2a–2c: in Colorado and Indiana, most students gain from the onset of high-stakes testing; in Idaho low achievers gain and achievement declines among high performers. The quantitative differences between these results and those reported in figures 2a–2c are largely due to the following: (1) when we implemented the Neal and Schanzenbach approach, we limited the analysis sample to one pre- and one post-high-stakes year for cohort B, whereas the analyses reported earlier relied on two pre-NCLB and two post-NCLB cohorts; and (2) the Neal and Schanzenbach replication used prior-year scores as the control for previous achievement, whereas we previously used the fall score from the current year. Decile assignments based on prior-year scores differ substantially from those based on the current fall score. Although it can be debated which is more informative for purposes of predicting spring outcomes, for our purposes—an investigation of triage—we believe that the fall scores are superior, particularly as the use of these fall tests to identify students at risk of failing to reach proficiency is an important selling point when NWEA markets its exams.

Figure 5a. Fifth-Grade Results for 2002 in Neal and Schanzenbach (2010).

Figure 5b. Fourth-Grade Results for 2005 in Colorado.

Figure 5c. Sixth-Grade Results for 2005 in Idaho.

Figure 5d. Fifth-Grade Results for 2005 in Indiana.

Replacing our method of analysis with that of Neal and Schanzenbach (2010) does not, then, alter our qualitative conclusion. Though the distribution of gains changes somewhat, we still find no evidence that high-stakes accountability has been harmful to the lowest-performing students (nor, in two of our three states, has it hurt the highest achievers). Why, then, do our findings differ from those of Neal and Schanzenbach (2010) and others who have found evidence of triage? One possible explanation is that instructional practices are heterogeneous and the response of Chicago teachers to high-stakes accountability was not that of most teachers in our samples. There is another possibility—the data used by Neal and Schanzenbach do not contain much information about responses to NCLB.

Illinois’ NCLB policy was first implemented during the 2002–03 school year—one year after the second year of data on Neal and Schanzenbach's cohort B.16 Neal and Schanzenbach (2010) argue that 2001–02 was effectively a high-stakes year based on district policy. They note that in February 2002, the Chicago Public Schools (CPS) system “declared that the 2002 ISAT [Illinois Standards Achievement Test] exams would be high-stakes exams” (p. 269). This suggests that district policy (including district sanctions that exceeded those of NCLB) drove changes in teacher behavior between 2001 and 2002.17

This may be conceding too much, however. The 2002 ISAT was administered in April, roughly 35 school days after the CPS announcement, giving teachers little time to respond. Moreover, conflicting signals were sent by the state. Illinois’ Consolidated NCLB Application, submitted by the Illinois State Board of Education, stated that the 2002 “state assessment data will be used to establish the official baseline and AYP expectations. The new AYP formula will be applied for the first time to the 2003 assessment” (Illinois State Board of Education 2002). Given that the 2002 ISAT would be used to establish the performance baseline, poor results on that assessment would make future performance standards more easily attainable. Psychological and behavioral research documents the distortion of incentives when participants can influence the standard-setting process (e.g., Murphy 2000; Heneman et al. 2007). In short, there are multiple reasons why instructional practices in Chicago in 2002 might differ from those adopted elsewhere two and three years after NCLB was enacted, the time frame of our analysis.

We suspect that the persuasiveness of the triage hypothesis is largely due to its intuitive appeal. We therefore suggest some reasons why we have failed to find a focus on “bubble” students. First, it may be that administrators have directed teachers to focus their efforts in this way, but that teachers have not complied, adhering to professional norms that make it difficult to deliberately slight some students in favor of others. To the extent that teachers cannot avoid decisions involving instructional trade-offs, many have responded by focusing their instruction primarily on students in the middle. It is likely that they find it difficult, if not objectionable, to deviate from this practice. It is also possible that for many teachers, focusing on bubble students is not that different from focusing on students in the middle, particularly if tracking and other forms of ability grouping draw off the students at the very top and bottom.

A focus on bubble students may manifest itself primarily in preparing for specific tests. If so, its effects may be apparent on high-stakes tests but not on the low-stakes tests that are the basis of the analysis here. But although this explanation might apply to Colorado and Indiana, it cannot account for our failure to find evidence of triage in Idaho, where NWEA tests were the high-stakes exams during our sample period.

NCLB holds schools responsible not just for student performance as a whole, but for the performance of specific subgroups. It is possible that schools focused extra help not on low performers in general but only on those who belonged to a critical subgroup. Unfortunately, our data do not contain indicators of subgroup membership except for student race. Samples comprising minority students are small and the size of the resulting confidence intervals makes it difficult to say whether this did or did not occur. Nevertheless, a focus on bubble students in certain subgroups (as opposed to all students near the cut score) cannot explain our finding that low-achieving students in general have benefitted from the switch to high-stakes accountability. If only certain small subgroups must raise their performance for a school to make AYP, the school has all the more reason to divert resources from students who do not belong to these subgroups (low and high performers alike).

Finally, it is possible that teachers have been trying to focus on students on the bubble, but that the practices they have adopted have had the unintended consequence of benefiting most low-performing students. That is, teachers would practice triage if they knew how, but they miscalculate.

Although we do not find much evidence of triage, we do find some suggestion of trade-offs wherein achievement profiles tilt in favor of low-performing students at the expense of students at the upper end of the distribution (most notably in Idaho). It is not obvious that this has to happen. It may be possible to reconfigure classes to protect above-average students from these effects. In other states (Colorado and to a lesser extent Indiana) the best students have also experienced greater gains in test scores with the onset of high-stakes accountability.

The pronounced gains for low-performing students in Idaho require special explanation. We think it unlikely that Idaho teachers are philosophically more averse to triage than teachers elsewhere, or that they are more dedicated to raising the achievement of the weakest students. The most likely explanation for the difference between Idaho and other states is that NWEA tests were used as Idaho's high-stakes statewide examination during our sample years. If triage were the dominant response to high-stakes accountability, particularly with respect to performance on the high-stakes assessment, we should have seen evidence to that effect in Idaho. Instead, we find indications of a change in instructional practices that benefited all low-performing students, with substantial gains for those furthest behind.

This paper has focused on the early years of NCLB. It is possible that the pressure to practice triage has increased in recent years—though we note that claims about triage go back to the earliest years of NCLB (and, indeed, even earlier, having been applied to state and district accountability systems before NCLB was launched). Although in principle it would be possible to extend our analysis to include data from beyond the 2005–06 school year, given the identification strategy used here, we have not done so. By 2005–06, all grades from 3 to 8 were high stakes. Thus, additional data on the high-stakes regime would necessarily have been drawn from years increasingly remote from the period when testing was low-stakes. Moreover, alternative and arguably superior identification strategies are available for later years, notably the use of the Annual Measurable Objective (the percentage of students who must score proficient for a school to make AYP, which has been ratcheted up over time) as an instrument for past failures to make AYP and the level of sanctions to which schools will be subject if performance does not improve.

Much has changed since the early years of NCLB. It may be wondered whether the findings reported here are of continued relevance to the making of policy. We believe they are, for the following reasons.

In the first place, the bubble/triage hypothesis is still cited in discussions of accountability, where it has acquired the status of a received truth. We began this paper with several excerpts from this public discussion in the media. More recent contributions could be cited (Provini 2012; Stecher 2013). Studies that go back further than ours (e.g., Booher-Jennings 2005; Neal and Schanzenbach 2010) continue to be cited in support of this belief.

The substitution of new targets for NCLB's original goal of 100 percent proficiency has not fundamentally altered the incentive of schools to focus on bubble students. This includes the adoption of so-called growth models, wherein students are not required to attain proficiency in a single year but must make progress sufficient to attain proficiency within a specified time frame. Substituting one target for another changes the definition of who is on the bubble, but it does not alter the fact that some students will need to make more progress in a given year than others, that some will appear to be within reasonable striking distance of that goal while others will not, and that schools can satisfy federal waiver requirements if a sufficient percentage of students are on track.

Finally, there is an unresolved question at the heart of discussions of accountability—whether targets for student achievement should be expressed in terms of proficiency or progress, where the latter is taken to be a year's satisfactory growth whether or not growth at that rate ever enables the student to attain a particular end goal. Proponents of the progress standard argue that proficiency targets hold schools accountable for factors beyond their control, whereas a focus on value added is ostensibly fairer to schools and teachers. Replacing an accountability system based on proficiency with one based on value added may be politically unacceptable, however. It gives the appearance of letting educators off the hook, of acquiescing in what former President George W. Bush termed “the soft bigotry of low expectations” (Bush 2000). For this reason we believe it likely that one-size-fits-all targets like NCLB proficiency will continue to be part of accountability systems. To the extent that they are, the argument will surely be advanced that such systems create incentives for schools to focus their efforts on students who are not too far from attaining that target and to do less for others. Consequently, the debate over triage will continue to be with us.

7.  Conclusion

This study has investigated a widely held belief that high-stakes accountability under NCLB has led schools to focus on students at the margin of achieving proficiency (“bubble students”) at the expense of others. Using an identification strategy and data that offer advantages over previous studies, we have found no support for this hypothesis. The inception of high-stakes accountability has not worsened outcomes for the lowest-achieving students in the states and grades we have examined. On the contrary, the evidence indicates that they benefitted. More commonly, we find that high-stakes accountability has been associated with gains very nearly across the board, at all points in the distribution (with the notable exception of Idaho). Gains were not always large, and progress has been slow. But the claim that high-stakes accountability was harmful to the most vulnerable students appears to have no empirical foundation in these states during the period of this study.

Notes

1 

In the most dramatic example offered by Neal and Schanzenbach (2010), for the fifth-grade cohort of 1998, reading scores in the bottom decile fell by a full month of achievement, as large a change in the negative direction as any of the positive effects in the higher deciles. Similarly, Krieg (2011) equates the magnitude of the differential impact of NCLB on racial groups to the conditional impact of switching schools mid-year and to the conditional achievement differences between students with and without computers at home.

2 

Dee and Jacob (2011) report that the share of students at the “basic” level (up from “below basic”) in fourth-grade math increased by 9 percentage points, and the percentage at the next highest level increased by 5 percentage points. In eighth-grade math, the increase in the percentage scoring basic (5 percentage points) was statistically significant, though the overall effect was not. Because the National Assessment of Educational Progress's basic level is close to “proficiency” as defined by many states, these results can be read in more than one way: They may lend support to the hypothesis that schools are focusing on bubble students, but they could also have been produced by gains lower in the distribution.

3 

Mixed results continue to characterize recent studies. Reback, Rockoff, and Schwartz (2011) find that accountability pressure from NCLB lowers teachers’ perceptions of job security and causes untenured teachers in high-stakes grades to work longer hours than their peers. Krieg (2011) demonstrates that students in nonfailing racial subgroups at low-performing schools have smaller test score gains than comparable peers in nonfailing schools. Deming et al.'s (2013) study of high school accountability in Texas in the 1990s, a precursor to NCLB, found a mixed response depending on the nature of the pressure schools faced. In schools facing sanctions for poor performance, achievement was boosted most for low achievers. However, low-achieving students did not benefit from accountability in schools that were competing for recognition.

4 

Some studies use regression discontinuity analysis, comparing schools that just failed to make AYP under NCLB with those that barely succeeded. This is a problematic approach to the study of an ongoing accountability system, in which schools in the latter group know that they will run the same gauntlet again the following year. As a result, the behavior of schools that have barely passed may differ only slightly from that of schools that have barely failed. That some researchers find a difference demonstrates that the comparison is not wholly uninformative, but this difference likely falls far short of the difference between outcomes under high-stakes accountability and those that would obtain in its absence.

5 

Some districts contract with NWEA to test subpopulations of students (for example, at-risk students). Districts that meet the 80 percent threshold are typically those giving the test to everyone (the average test participation rate in our sample exceeds 90 percent). Enrollment figures were obtained from the National Center for Education Statistics’ Common Core of Data.

6 

Students who switch schools mid-year do not count when determining a school's AYP status.

7 

Idaho was an exception. During the years of our study, the state used NWEA as its high-stakes exam. We return to this fact and the influence it may have had on educational practices in the Discussion section.

8 

In fact, NWEA's equivalent cut scores for Indiana changed in 2005 and again in 2006. These changes were in response to NWEA's perception that the difficulty of Indiana's standardized tests had changed. Given that the changing cut scores were intended to represent a single unchanging level of difficulty, we have used the revised values in the years in question. We have also conducted our analyses ignoring these changes, retaining the original cut scores defined for 2003 throughout. Results are very similar and the implications for the triage hypothesis are unaffected.

9 

We use a Gaussian kernel with bandwidth equal to 0.1.

10 

We compute the variance of the Nadaraya-Watson estimate at x using the asymptotic approximation (Härdle et al. 2004) $\frac{1}{nh}\,\frac{\sigma^2}{\hat{f}(x)}\,\|K\|_2^2$, where $n$ is the sample size, $h$ is the bandwidth (0.1), $\sigma^2$ is the error variance, $\hat{f}(x)$ is the nonparametric estimate of the density of $x$, and $\|K\|_2^2$ is the squared $L_2$ norm of the (Gaussian) kernel.
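
For concreteness, the following Python sketch computes the Nadaraya-Watson estimate with a Gaussian kernel of bandwidth 0.1 (see note 9) together with this asymptotic variance. The kernel-weighted residual estimate of the error variance is our own simplifying assumption; the note does not specify how $\sigma^2$ was estimated.

```python
import numpy as np

def nadaraya_watson(x0, x, y, h=0.1):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel of
    bandwidth h, plus the asymptotic variance described in note 10."""
    u = (x - x0) / h
    k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel weights
    m_hat = np.sum(k * y) / np.sum(k)              # regression estimate at x0

    n = len(x)
    f_hat = np.mean(k) / h                         # kernel density estimate of f(x0)
    resid = y - m_hat
    sigma2 = np.sum(k * resid**2) / np.sum(k)      # local (assumed) estimate of the error variance
    k_norm_sq = 1.0 / (2.0 * np.sqrt(np.pi))       # ||K||_2^2 for the Gaussian kernel
    var_hat = sigma2 * k_norm_sq / (n * h * f_hat) # (1/nh) * sigma^2 / f(x0) * ||K||_2^2
    return m_hat, var_hat
```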

11 

In states that administer high-stakes tests in the spring, “next tested” means the spring administration of the NWEA math test. A student is deemed to have reached proficiency if they reach the score that NWEA has identified as the equivalent of the proficiency threshold on the state's high-stakes assessment. In Indiana, where high-stakes tests are given in the fall, “next tested” means the administration of the NWEA test the following fall.

12 

To enhance the precision of these predictions, we use all of our sample data when estimating the probit models, not distinguishing between observations in high-stakes and low-stakes regimes. Very similar predictions are obtained using data only from low-stakes years.

13 

By identifying a particular student in each school as the “marginal” student, we do not mean to imply that schools have no incentive to raise the achievement of lower-ranked students. As the marginal student is not assured of passing, even a school practicing triage will presumably focus instructional effort on some students who are below the marginal student, as insurance.
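
The following sketch (Python, with hypothetical column names school_id, fall_score, and proficient_next) illustrates one way the pooled probit of note 12 and the "marginal" student of this note could be operationalized. It is illustrative only; the precise ranking rule used in the paper may differ.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical columns: 'school_id', 'fall_score', and 'proficient_next'
# (1 if the student's next tested score reached the NWEA equivalent of the
# state proficiency cut, 0 otherwise).
def flag_marginal_students(df: pd.DataFrame) -> pd.DataFrame:
    """Fit a probit of reaching proficiency on the fall score (pooling all
    observations, as in note 12), then flag in each school the lowest-scoring
    student whose predicted probability of passing is at least 0.5: one
    simplified way to operationalize the 'marginal' student of note 13."""
    X = sm.add_constant(df["fall_score"])
    probit = sm.Probit(df["proficient_next"], X).fit(disp=0)
    df = df.assign(p_pass=probit.predict(X))

    def mark(group: pd.DataFrame) -> pd.DataFrame:
        eligible = group[group["p_pass"] >= 0.5]
        marginal_idx = eligible["fall_score"].idxmin() if not eligible.empty else None
        return group.assign(marginal=group.index == marginal_idx)

    return df.groupby("school_id", group_keys=False).apply(mark)
```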

14 

States have the option of applying these sanctions to non–Title I schools. Most do not. Colorado, Indiana, and Minnesota do not hold non–Title I schools accountable under NCLB. Chronically low-performing non–Title I schools in Indiana are held accountable under the state's Public Law 221, which preceded NCLB by two years but does not carry NCLB-style sanctions. Idaho holds non–Title I schools accountable under NCLB.

15 

We omit Colorado schools for the reason noted earlier—the unit held accountable under NCLB is not the entire school. This makes it quite problematic to match historical data on AYP to schools.

16 

All states were required to submit proposed accountability programs to the U.S. Department of Education (USDOE) in the form of consolidated state applications. Illinois submitted its proposal to USDOE on 10 June 2002 (see www.isbe.net/nclb/csa/application.pdf).

17 

We have met former Chicago public school teachers who lost their jobs because of poor performance following the inception of high-stakes accountability there.

Acknowledgments

We thank the Northwest Evaluation Association for providing data for this study and the Smith-Richardson Foundation and the federally funded National Center on School Choice at Vanderbilt University for research support. We also thank Yanqin Fan, Adam Gamoran, Steve Rivkin, and Kim Rueben for their helpful comments and insights in developing this work, as well as seminar participants at the American Education Finance Association, American Educational Research Association, Association for Public Policy Analysis and Management, and Amherst College. Special thanks are due to Xiao Peng for his research assistance on this project.

REFERENCES

Booher-Jennings, Jennifer. 2005. Below the bubble: “Educational triage” and the Texas accountability system. American Educational Research Journal 42(2): 231–268. doi:10.3102/00028312042002231.

Bush, George W. 2000. Text: George W. Bush's speech to the NAACP. Available www.washingtonpost.com/wp-srv/onpolitics/elections/bushtext071000.htm. Accessed 25 January 2016.

Chingos, Matthew M., and Martin R. West. 2011. Promotion and reassignment in public school districts: How do schools respond to differences in teacher effectiveness? Economics of Education Review 30(3): 419–433. doi:10.1016/j.econedurev.2010.12.011.

Dee, Thomas S., and Brian A. Jacob. 2011. The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management 30(3): 418–446. doi:10.1002/pam.20586.

Deming, David J., Sarah Cohodes, Jennifer Jennings, and Christopher Jencks. 2013. School accountability, postsecondary attainment and earnings. NBER Working Paper No. 19444.

Härdle, Wolfgang, Marlene Müller, Stefan Sperlich, and Axel Werwatz. 2004. Nonparametric and semiparametric models. Berlin: Springer-Verlag. doi:10.1007/978-3-642-17146-8.

Hart, Kevin. 2010. Is NCLB intentionally leaving some kids behind? Focus on high-stakes testing means some lower-performing students may get less attention. Available www.nea.org/home/37772.htm. Accessed 27 May 2011.

Heneman, Herbert G., Anthony Milanowski, and Steven Kimball. 2007. Teacher performance pay: Synthesis of plans, research, and guidelines for practice. CPRE Policy Briefs No. RB-46, Philadelphia, PA.

Illinois State Board of Education. 2002. Consolidated State Application, No Child Left Behind Act 2001. Available www.isbe.net/nclb/csa/application.pdf. Accessed 20 October 2016.

Krieg, John M. 2008. Are students left behind? The distributional effects of the No Child Left Behind Act. Education Finance and Policy 3(2): 250–281. doi:10.1162/edfp.2008.3.2.250.

Krieg, John M. 2011. Which students are left behind? The racial impacts of the No Child Left Behind Act. Economics of Education Review 30(4): 654–664. doi:10.1016/j.econedurev.2011.02.004.

Ladd, Helen F., and Douglas L. Lauen. 2010. Status versus growth: The distributional effects of school accountability policies. Journal of Policy Analysis and Management 29(3): 426–450. doi:10.1002/pam.20504.

Lauen, Douglas L., and S. Michael Gaddis. 2012. Shining a light or fumbling in the dark? The effect of NCLB's subgroup-specific accountability on student achievement. Educational Evaluation and Policy Analysis 34(2): 185–208. doi:10.3102/0162373711429989.

Loveless, Tom, Steve Farkas, and Ann Duffett. 2008. High-achievement students in the era of NCLB. Washington, DC: Thomas B. Fordham Institute.

Murphy, Kevin J. 2000. Performance standards in incentive contracts. Journal of Accounting and Economics 30(3): 245–278. doi:10.1016/S0165-4101(01)00013-1.

Neal, Derek A., and Diane W. Schanzenbach. 2010. Left behind by design: Proficiency counts and test-based accountability. Review of Economics and Statistics 92(2): 263–283. doi:10.1162/rest.2010.12318.

Provini, Celine. 2012. No Child Left Behind: 10 years later. Available www.educationworld.com/a_admin/no-child-left-behind-10-years-later.shtml. Accessed 18 December 2014.

Reback, Randall, Jonah Rockoff, and Heather L. Schwartz. 2011. Under pressure: Job security, resource allocation, and productivity under NCLB. NBER Working Paper No. 16745.

Rothstein, Richard. 2009. No Child Left Behind has failed and should be abandoned. Available http://dhs.wikispaces.com/file/view/No+Child+Left+Behind-failed.pdf. Accessed 27 May 2011.

Springer, Matthew G. 2008. The influence of an NCLB accountability plan on the distribution of student test score gains. Economics of Education Review 27(5): 556–563. doi:10.1016/j.econedurev.2007.06.004.

Stecher, Brian M. 2013. Letters: No Child Left Behind, act II. Philadelphia Inquirer, 14 January.

Viadero, Deborah. 2007. Study finds no ‘educational triage’ driven by NCLB. Available www.edweek.org/ew/articles/2007/10/31/10triage.h27.html. Accessed 27 May 2011.

Appendix A:  Data Sources

The principal source of data for this research was Northwest Evaluation Association's Growth Research Database. These data are proprietary. The descriptive statistics reported in table 2 were obtained from the Common Core of Data maintained by the National Center for Education Statistics. The Common Core was likewise the source of information on grades served by each school, used to conduct the terminal grade analyses depicted in the online appendix.

Appendix B:  NWEA Score Alignment Studies

Although all of these studies were at one time posted on the Web by NWEA, many can no longer be found at their original URL. Readers who wish to see these reports should contact us at matthew.g.springer@vanderbilt.edu.

Arizona

Cronin, J. 2003. Aligning the NWEA RIT Scale with Arizona's Instrument to Measure Standards (AIMS). Lake Oswego, OR: Northwest Evaluation Association Research Report 2003.3.

Cronin, J., and B. Bowe. 2003. A Study of the Ongoing Alignment of the NWEA RIT Scale with the Arizona Instrument to Measure Standards (AIMS). Lake Oswego, OR: Northwest Evaluation Association Research Report.

Cronin, J., and M. Dahlin. 2007. A Study of the Alignment of the NWEA RIT Scale with the Arizona Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Colorado

Cronin, J. 2003. Aligning the NWEA RIT Scale with the Colorado Student Assessment Program (CSAP) Tests. Lake Oswego, OR: Northwest Evaluation Association Research Report 2003.2.

Bowe, B., and J. Cronin. 2006. A Study of the Ongoing Alignment of the NWEA RIT Scale with the Colorado Student Assessment Program (CSAP). Lake Oswego, OR: Northwest Evaluation Association Research Report.

Cronin, J. 2007. A Study of the Alignment of the NWEA RIT Scale with the Colorado Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Idaho

The NWEA test was the high-stakes assessment during the period under study, so no separate alignment study applies.

Indiana

Cronin, J. 2003. Aligning the NWEA RIT Scale with the Indiana Statewide Testing for Educational Progress Plus (ISTEP+). Lake Oswego, OR: Northwest Evaluation Association Research Report 2003.3.

Cronin, J., and B. Bowe. 2005. A Study of the Ongoing Alignment of the NWEA RIT Scale with the Indiana Statewide Testing for Educational Progress Plus (ISTEP+). Lake Oswego, OR: Northwest Evaluation Association Research Report.

Cronin, J. 2007. A Study of the Alignment of the NWEA RIT Scale with the Indiana Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Michigan

Bowe, B. 2006. Aligning the NWEA RIT Scale with the Michigan Educational Assessment Program. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Cronin, J. 2007. A Study of the Alignment of the NWEA RIT Scale with the Michigan Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Minnesota

Cronin, J. 2004. Adjustments Made to the Results of the NWEA RIT Scale Minnesota Comprehensive Assessment Alignment Study. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Cronin, J. 2007. A Study of the Alignment of the NWEA RIT Scale with the Minnesota Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Wisconsin

Cronin, J. 2004. Aligning the NWEA RIT Scale with the Wisconsin Knowledge and Concepts Exams. Lake Oswego, OR: Northwest Evaluation Association Research Report.

Adkins, D. 2007. A Study of the Alignment of the NWEA RIT Scale with the Wisconsin Assessment System. Lake Oswego, OR: Northwest Evaluation Association Research Report.
