## Abstract

This paper reports the results of two randomized field experiments, each offering different populations of Chicago youth a supported summer job. The program consistently reduces violent-crime arrests, even after the summer, without improving employment, schooling, or other arrests; if anything, property crime increases over two to three years. Using a new machine learning method, we uncover heterogeneity in employment impacts that standard methods would miss, describe who benefits, and leverage the heterogeneity to explore mechanisms. We conclude that brief youth employment programs can generate important behavioral change, but for different outcomes, youth, and reasons than those most often considered in the literature.

## I. Introduction

For at least half a century, social scientists and policymakers have argued that a combination of job training, search assistance, remedial course work, and subsidized work can improve employment and wages by developing human capital and reducing search costs (LaLonde, 2003). Improving employment may in turn increase the opportunity cost of crime or improve other social outcomes, though nonemployment outcomes are often treated as ancillary benefits of improved employment in the literature (Crépon & van den Berg, 2016). Reviews of the evidence on whether employment programs cost-effectively achieve these goals in the United States, at least among the disadvantaged youth who are the focus of this paper, lean fairly negative, though they vary in their level of pessimism. Most conclude that only very intensive and expensive training programs improve labor market outcomes, while only a tiny handful reduce crime, largely limited to the period of the program itself.1

Recent studies on summer youth employment programs (SYEPs), however, show starkly different results from other job training programs. Despite being neither intensive nor expensive, short programs in Chicago, New York, and Boston dramatically reduce violent crime and mortality, even after the program has ended (Gelber, Isen, & Kessler, 2016; Heller, 2014; Modestino, 2019). The programs do this without improving average employment outcomes (if anything, some youth have lower future earnings) and with small, if any, effects on education outcomes (Gelber et al., 2016; Valentine et al., 2017; Leos-Urbel, 2014; Schwartz, Leos-Urbel, & Wiswall, 2015; Heller, 2014). This pattern of results—big postprogram crime declines with no indication of improved human capital or increased opportunity costs—raises questions about what mechanisms are at work and why effects are so different from the youth training literature.

This paper tests whether treatment effect heterogeneity is part of the answer. In theory, SYEPs might improve human capital, increase employment, and thereby decrease crime among a subset of youth, while others either do not respond or allow the program to crowd out better opportunities. If so, zero average employment effects could mask heterogeneity that explains net crime declines. In addition to being important for program targeting, such heterogeneity might also explain why SYEPs' effects differ from those of other youth employment programs: they serve a younger, more school-attached population than the disconnected youth targeted by other training programs, and prevention (reaching youth before they leave school) might be easier than remediation (reaching them after unemployment spells).2 But because there has been so little overlap in participants of summer and other employment interventions, it has been impossible to separate program from population differences until now.

We use two randomized controlled trials (RCTs) of a Chicago summer jobs program, along with a new machine learning technique for estimating treatment heterogeneity, to test these ideas. The first RCT in 2012 offered an eight-week, part-time summer job at minimum wage ($8.25/hour) and an adult job mentor to a population of disadvantaged high school students.3 The second RCT in 2013 offered a similar six-week program but purposefully expanded the eligibility criteria to include disconnected, out-of-school youth more like those targeted by other youth employment interventions. Most youth also participated in a curriculum built on cognitive behavioral therapy principles aimed at helping them manage their cognitive and emotional responses to conflict, as well as encouraging them to set and achieve personal goals.

We track youth in administrative data through 2015 from the Chicago Public Schools, the Illinois State Police, and the Illinois Department of Employment Security. In both study years, a supported summer job generates dramatic and robust reductions in violent-crime arrests in the year after random assignment: the local average treatment effect is a 42% decline in the first study and a 33% decline in the second (4.1 and 7.9 fewer arrests per 100 participants, respectively). Across the whole sample, the effect is still significant after adjustments for multiple hypothesis testing. The pooled sample shows a 26% decline in violent-crime arrests even after removing the program months from the data ($p=.067$), meaning the behavior change is not simply a mechanical result of keeping youth busy over the summer that disappears as soon as the job ends. The program also does not seem to increase the overall opportunity cost of crime or keep youth out of trouble more generally: participants' total number of arrests does not change, and if anything, property crime increases in later years.
Neither employment outcomes nor other indicators of human capital such as schooling improve, at least on average.

We then estimate treatment heterogeneity based on observable characteristics. Tests for heterogeneity typically involve interacting a treatment indicator with a series of baseline covariates, one at a time. But each additional hypothesis test raises the probability of spurious findings. And if heterogeneity is driven by the interaction of more than one characteristic at a time (or a nonlinear function of a continuous variable), typical interaction tests may miss substantively important variation in treatment effects. To more flexibly estimate treatment heterogeneity, we use a causal forest (Athey & Imbens, 2016; Wager & Athey, 2018), predicting treatment effects based on high-dimensional, nonlinear functions of observables and mining the data for responsive subgroups in a principled way. We develop tests for whether the predicted heterogeneity succeeds in detecting actual heterogeneity, describe who benefits, and use the patterns of heterogeneity to assess potential mechanisms. The causal forest identifies significant heterogeneity in employment impacts, which standard interaction-based approaches would miss. We identify a subgroup whose postprogram employment improves by 15 percentage points (44%), which on its own is an important result for policy. That subgroup is younger, more engaged in school, more Hispanic, more female, and less likely to have an arrest record. In other words, the employment benefiters are not the disconnected youth whom other employment programs typically target. Although the results are imprecise, we also show that the drop in violence is not concentrated among the employment benefiters. If anything, nonviolent crime increases among those with better employment. These findings offer little support for the traditional theory that improved human capital and increased opportunity cost explain crime declines.
They do, however, emphasize the potential gains from more flexible approaches to treatment heterogeneity; we calculate that targeting the program using the causal forest could generate employment impacts four times larger than targeting using a more standard approach. We find no significant heterogeneity in effects on violent-crime arrests or school persistence. The fact that we cannot distinguish variation in violence impacts, at least within our disadvantaged urban population, suggests there is value to targeting SYEPs toward youth who are at risk of violent crime, a group that traditional training programs often screen out.4 The question of why violence declines for everyone, seemingly independent of changes in employment, is important. Expanded prosocial attitudes, improved beliefs about the future, or general “staying busy” explanations are not entirely satisfactory given that property crime increases in later follow-up years and that bigger employment gains seem to accompany increases in nonviolent crime. But more nuanced crime theory highlights the role of opportunity: a program that introduces youth to richer areas and new peers may increase opportunities for theft and drug purchases but decrease opportunities to fight, even without changing labor market outcomes (Cohen & Felson, 1979; Cook, 1986; Clarke, 1995). Anecdotal evidence from employers provides another hypothesis for why violence, which by definition involves conflict with other people, may change: employers report helping youth develop self-regulation and respond positively to criticism, which could reduce conflicts outside the workplace as well. There could also be a role for unmeasured informal sector work, peer networks, income, or violence-specific attitudes, norms, or beliefs. Further research is needed to sort out exact mechanisms. 
In the meantime, we show the potential of SYEPs to reduce violence among a new population, as well as the potential of machine learning to help policymakers rethink who benefits from what kind of employment program and why.

## II. Program Description and Experimental Design

Chicago's Department of Family and Support Services (DFSS) designed One Summer Chicago Plus (OSC+) primarily as a violence-reduction intervention. The program structure was similar across both summers: youth were offered a summer job, five hours per day and five days a week, at minimum wage ($8.25 per hour) for eight weeks in 2012 and six weeks in 2013. All youth were assigned a job mentor—an adult to assist them in learning to be successful employees and to help them deal with barriers to employment—at a ratio of about 10 to 1. DFSS administered the program through contracts with local nonprofit agencies. These agencies recruited applicants, offered participating youth brief training, hired the mentors, recruited employers, placed youth in summer jobs, provided daily lunch and bus passes when appropriate, monitored the participants' progress over the course of the summer, and if youth were fired, worked with them to find an alternative placement.5

In the first year of the program, youth ages 14 to 21 were recruited from thirteen high-violence Chicago public high schools. A total of 1,634 youth (about 13% of the prior year's student population in these schools) applied for the 700 available program slots. The research team randomly assigned treatment slots at the individual level within school and gender blocks. Youth worked at a range of nonprofit and government employers on tasks such as supervising younger youth at summer camps, clearing lots to plant a community garden, improving infrastructure at local schools, and providing administrative support at aldermen's offices. Because of restrictions imposed by a funder, there were no private sector jobs.

In the second year of the program, 16- to 22-year-old male youth in one of two applicant pools could apply. The first pool ($n=2,127$) was invited to voluntarily apply directly from the criminal justice system (from probation offices, juvenile detention or prison, or a center to serve justice-involved youth). The rest ($n=3,089$) had applied to Chicago's broader summer programming; those who were ages 16 to 20, lived in one of the thirty highest-violence community areas, and included a social security number (SSN) on their application entered the lottery. Notably, participants were no longer required to be in school. The resulting 5,216 boys were individually randomly assigned to treatment or control groups within applicant pool-age-geography blocks (2,634 treatment and 2,582 control), with each block assigned to a specific service agency. Because of the time-constrained recruiting process, the number of youth assigned to the treatment group far exceeds the number of available slots (1,000).6 One important implication is that the maximum possible take-up rate, even if the first thousand youth were immediately located and agreed to participate, is 38%. This is by design and should not be interpreted as indicating low demand for the program among the treatment group. Private sector jobs were included in this program year.

In 2013, DFSS also encouraged treatment youth to keep participating in programming offered by the community service agencies after the summer ended, including a mix of additional social-emotional learning activities, job mentoring, and social outings such as sporting events and DJ classes. These activities were much lower intensity than the summer programming, and participants received a small stipend (approximately $200) rather than an hourly wage. Appendix B reports additional details about the program, randomization, and recruitment.

## III. Data and Descriptive Statistics

We match study youth to existing administrative data sets from a variety of government sources. Program application and participation records come from DFSS. We measure crime with Illinois State Police (ISP) arrest records, which combine police records from departments across the state.7 We use the description of each offense to categorize offenses as violent, property, drug, or other (e.g., vandalism, trespassing, outstanding warrants). The data cover both juvenile and adult arrests from 2001 through two (2013 cohort) or three (2012 cohort) years postrandom assignment. Youth who have never been arrested will not be in the ISP records, so we assign 0 arrests for individuals not matched to the data.

We use student-level administrative records from Chicago Public Schools (CPS) to capture schooling outcomes. These data include enrollment status, grade level, course grades, and attendance from the beginning of CPS enrollment through the 2015–16 academic year.8 Our main analysis excludes preprogram graduates ($n=1,422$) as well as anyone who never appeared in the CPS records (and so likely always attended school outside the district, $n=435$).9 Since these are both baseline characteristics, the exclusion should not undermine the integrity of random assignment (see appendix table A1 for balance tests on this subsample).
We focus on the school year following the program, since missing GPA and attendance data become a bigger problem over time as more students graduate, drop out, or transfer. To assess longer-term performance, we define a school persistence measure that is available for everyone in the CPS data regardless of missing attendance and GPA data in future years: an indicator that equals 1 if the youth has graduated from CPS in the first two postprogram school years or is still attending school in the third postprogram school year.

We measure employment using quarterly Unemployment Insurance (UI) records, which include earnings and employer for each formal sector UI-covered job. We obtain SSNs for matching to UI data through school records, although the school district did not require students to report SSNs. As such, our main employment analysis excludes youth who could not be matched due to missing SSNs (26% of the sample either never enrolled in CPS or were missing SSNs in the school records). This approach assumes that SSNs are missing completely at random; appendix table A9 shows the results are robust to different approaches to missing data. A subset of OSC+ providers did not report program earnings to the UI system. For youth attending these providers, we impute program quarter earnings as the sum of earnings at other employers and their reported program hours times $8.25. For youth with SSNs but no UI data, we assign 0s for employment and earnings, assuming anyone not found in the matching process never worked in the formal sector. Appendix C reports additional details on all data sources, matching procedures, and variable definitions.
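The imputation rule for nonreporting providers can be sketched as follows (a minimal illustration; the function name and inputs are hypothetical, not from the paper):

```python
# Hypothetical sketch of the earnings imputation described above, for youth
# whose OSC+ provider did not report program wages to the UI system.
MINIMUM_WAGE = 8.25  # Illinois minimum wage during the study years

def impute_program_quarter_earnings(other_employer_earnings, program_hours):
    """Impute a program-quarter total as earnings at non-program employers
    plus reported program hours paid at the program wage."""
    return other_employer_earnings + program_hours * MINIMUM_WAGE
```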

Table 1 shows select baseline characteristics for 2012 and 2013 control groups, as well as tests of treatment-control balance for each covariate conditional on randomization block fixed effects. No more of the differences are significant than would be expected by chance, and tests of joint significance suggest that randomization successfully balanced the two groups (pooling both samples together and testing balance using our full set of covariates, $F(69,6709)=0.85$ with $p=0.81$). (See appendix C.4 for other descriptive statistics and balance tests.)

Table 1.
Descriptive Statistics and Baseline Balance
| Characteristic | 2012: $N$ | Control Mean | Treatment Coefficient | SE | 2013: $N$ | Control Mean | Treatment Coefficient | SE |
|---|---|---|---|---|---|---|---|---|
| Age at program start | 1,634 | 16.30 | −0.05 | (0.07) | 5,216 | 18.42 | 0.03 | (0.02) |
| Black | 1,634 | 0.96 | 0.00 | (0.01) | 5,216 | 0.91 | 0.01 | (0.01) |
| Any baseline arrest | 1,634 | 0.20 | 0.01 | (0.02) | 5,216 | 0.47 | 0.02 | (0.01) |
| In CPS data | 1,634 | 1.00 | 0.00 | (0.00) | 5,216 | 0.91 | 0.00 | (0.01) |
| Engaged in CPS in June (if ever in CPS) | 1,634 | 0.99 | 0.00 | (0.01) | 4,781 | 0.51 | 0.00 | (0.01) |
| Days attended (if any attendance) | 1,629 | 136.93 | 0.69 | (1.40) | 2,930 | 122.78 | 2.54 | (1.82) |
| Has SSN | 1,634 | 0.81 | 0.02 | (0.02) | 5,216 | 0.71 | 0.01 | (0.01) |
| Worked in prior year (if has SSN) | 1,334 | 0.07 | −0.02 | (0.01) | 3,742 | 0.22 | 0.00 | (0.01) |
| Census tract: Median income | 1,634 | 35,665 | −347 | (660) | 5,216 | 33,759 | −175 | (360) |
| Census tract: Unemployment rate | 1,634 | 19.07 | −0.03 | (0.42) | 5,216 | 12.81 | 0.14 | (0.12) |

Balance test shows treatment coefficient and robust standard error from a regression of each characteristic on a treatment indicator, block fixed effects, and duplicate indicators. Gender not included in the table since it is collinear with randomization blocks. The 2012 sample was 38.5% male; the 2013 sample was all male. $^{*}p<0.1$, $^{**}p<0.05$, and $^{***}p<0.01$.

Youth in both cohorts are over 90% African American and largely from poor, highly disadvantaged neighborhoods: median neighborhood income is $33,000 to $36,000, with local unemployment rates around 13% to 19%. Thirty-eight percent of the 2012 cohort and all of the 2013 cohort are male. Recall that the eligibility rules changed across program years, in part to test for heterogeneous program effects on a broader population of youth. As a result, the 2013 cohort is older (18.4 versus 16.3 years old), more criminally involved (47% versus 20% have an arrest record), and less engaged in school (51% versus 99% still engaged in school before the program, and accounting for the longer school year in 2013, missing three months versus six weeks of the prior school year, conditional on any attendance). The 2013 youth are also more likely to have been employed in the prior year (22% versus 7%).

## IV. Analytical Methods

To make results easier to compare across two study cohorts with different take-up rates, we focus on local average treatment effects (LATEs), or the effect of participating on compliers (Angrist, Imbens, & Rubin, 1996). With almost no control crossover in our setting, these estimates should be quite close to the treatment-on-the-treated. Intent-to-treat (ITT) results are in appendix D. We estimate LATEs using random assignment as an instrument for any program participation, including block fixed effects and the controls discussed in appendix E (which also shows that results are similar without covariates). To help judge the magnitude of the LATEs, we estimate average outcomes for the control youth who would have participated had they been assigned to treatment—the “control complier mean” (CCM; see Heller et al., 2017; Katz, Kling, & Liebman, 2001).10 We report heteroskedasticity-robust standard errors, clustered on individuals when using the pooled sample to account for the 140 youth in both study cohorts. Appendix table A4 shows similar $p$-values from randomization inference (permuting treatment assignment 10,000 times to approximate Fisher's exact test). This tests the sharp null of no treatment effects for anyone and avoids relying on modeling assumptions and large-sample approximations that may not hold in finite samples (Athey & Imbens, 2017).
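With a binary offer as the instrument and virtually no control crossover, the LATE in this design is (up to block fixed effects and covariates) a Wald ratio: the ITT effect on the outcome divided by the first-stage effect of assignment on participation. A minimal sketch with hypothetical input lists, not the paper's full 2SLS specification:

```python
from statistics import mean

def wald_late(assigned, participated, outcome):
    """Wald/IV estimator of the LATE (Angrist, Imbens & Rubin, 1996):
    ITT effect on the outcome divided by the first-stage effect of random
    assignment on program participation. Inputs are parallel lists with one
    entry per youth; block fixed effects and covariates are omitted."""
    t = [i for i, z in enumerate(assigned) if z == 1]
    c = [i for i, z in enumerate(assigned) if z == 0]
    itt = mean(outcome[i] for i in t) - mean(outcome[i] for i in c)
    first_stage = (mean(participated[i] for i in t)
                   - mean(participated[i] for i in c))
    return itt / first_stage
```

For instance, with 50% take-up, no control crossover, and an ITT of −0.5 arrests, the implied effect on compliers is −1.0 arrests per participant.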

In any experiment testing program effects on multiple outcomes, not to mention heterogeneous treatment effects by subgroup, one might worry that the probability of type I error increases with the number of tests conducted. We take a number of steps to ensure that our results are not just the result of data mining. First, we note that because DFSS built the program and recruiting strategy mainly to reduce youth violence, the impact on violent-crime arrests was the first prespecified outcome of interest.

Second, we present two inference adjustments to account for multiple hypothesis testing. The first uses a free step-down resampling method to control the family-wise error rate (FWER), the probability that at least one of the true null hypotheses in a family of hypothesis tests is rejected (Anderson, 2008; Westfall & Young, 1993). The second shows the q-value, or the smallest level at which we can control the false-discovery rate (FDR) in a group of hypotheses and still reject the null for that outcome (Benjamini & Hochberg, 1995). This adjustment increases the power of individual tests in exchange for allowing some specified proportion of rejections to be false. We define our families as: (a) the four types of crime separately for each follow-up year (excluding total arrests since it is a linear combination of the rest); (b) the four main schooling outcomes for youth with a CPS record who had not yet graduated prior to the program (enrollment, days present, GPA, and school persistence); (c) total earnings and overall, provider, and nonprovider employment in program quarters; and (d) total earnings and overall, provider, and nonprovider employment in post-program quarters. (Appendix G provides implementation details.)
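For concreteness, the FDR adjustment can be sketched as a standard Benjamini & Hochberg (1995) step-up computation of q-values (a generic illustration; the paper's exact implementation is in its appendix G):

```python
def bh_qvalues(pvalues):
    """Benjamini & Hochberg (1995) step-up q-values: q[i] is the smallest
    FDR level at which hypothesis i would be rejected. Generic textbook
    version, not the paper's exact procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down, enforcing monotone q-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q
```

Within a family of four crime outcomes, for example, a p-value of 0.01 that is the smallest of {0.01, 0.03, 0.04, 0.20} receives a q-value of 0.04.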

Third, we eschew the standard approach to treatment heterogeneity: choosing several one-way interactions a priori to test for heterogeneity (or, worse, searching over a large number of interaction effects for particularly responsive subgroups, which risks overfitting and detecting spurious effects). Instead, we implement a version of Wager and Athey's (2018) causal forest algorithm, which predicts treatment effects based on an individual's covariates. For this prediction, we focus on estimating conditional ITT effects, $E(Y_1 - Y_0 \mid X = x)$, which capture differences in both youths' responses to the program and their propensity to participate if offered the program.11 This method allows flexible, high-dimensional combinations of covariates to identify who gains from the program in a way that researcher-determined interaction effects would typically miss.

For example, suppose $\tau_{i,y} = Y_{1i} - Y_{0i}$ is the true treatment effect on outcome $y$ for individual $i$. Typical approaches might estimate and compare $E(\tau_{i,y} \mid \text{male}=1)$ to $E(\tau_{i,y} \mid \text{male}=0)$, or perhaps $E(\tau_{i,y} \mid \text{male}=1, \text{African American}=1)$, based on what the researcher specifies. If the true treatment heterogeneity is more complex than differences by gender or race (e.g., only African American males with more than three prior arrests who live in neighborhoods with less than 12% unemployment rates benefit from the program), then researcher-specified interactions will miss it. But in theory, the causal forest can capture this pattern by searching over all values of all the covariates to isolate the combination of covariate values that predicts the most heterogeneity in effects. The goal becomes predicting heterogeneity in $E(\tau_{i,y} \mid X=x)$ using all the available information on the $X$s rather than testing whether particular $X$s are associated with significantly different treatment effects.12

Our methodology for estimating causal forests, based on Athey and Imbens (2016) and Wager and Athey (2018), is described in Davis and Heller (2017). We give an intuitive explanation of the method here, attempting to avoid machine learning jargon to make the discussion accessible. (Technical details are in appendix H.) The basic goal is to divide the sample into bins that share similar covariates and use the within-bin treatment effect as the predicted treatment effect for anyone with that bin's Xs. However, using the same observations to bin the data and predict the treatment effects within bins could induce overfitting, so the procedure uses different subsamples for binning and for effect estimation. It repeats the procedure over many subsamples and averages the predictions to reduce variance.

To predict ITT effects conditional on covariates for a particular outcome, we repeat the following procedure. First, draw a 20% subsample without replacement from the data. Using a random half of the subsample, use a regression tree–based algorithm to bin the observations by values of $X$.13 The algorithm recursively searches over possible ways to break the data into bins based on values of covariates, choosing the divisions that maximize the variance of treatment effects across bins subject to a penalty for within-bin variance (see appendix H for algorithm details).14 Once the bins are formed, switch to the other half of the subsample and sort the new observations into the same bins. Calculate the treatment effect ($\hat{\tau}_b = \bar{y}_{T,b} - \bar{y}_{C,b}$, or the difference in mean outcomes between treatment and control observations) using the new observations within each bin $b$.

Next, switch to the other 80% of the sample (observations that are not part of the subsample), figure out in which bin each observation would belong based on its $X$s, and assign that bin's $\hat{\tau}_b$ as the predicted treatment effect.15 Predictions averaged across many trees have better predictive accuracy than estimates from a single tree given the high variance of a single tree's predictions (James et al., 2013). We repeat this process with 100,000 subsamples (the causal parallel of a random forest rather than a single regression tree), averaging an observation's prediction across iterations to obtain a single predicted treatment effect. We find that increasing the number of trees from 25,000 to 100,000 dramatically increases the stability of our estimates across different random seeds.
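To make the sample-splitting logic concrete, the steps above can be sketched as a stripped-down toy: a depth-one "honest" tree fit on half of each subsample, with within-bin effects estimated on the other half and predictions averaged over many subsamples. This is an illustrative sketch under strong simplifications (one covariate, a single split, no variance penalty), not the authors' implementation:

```python
import random
from statistics import mean

def honest_stump(x, w, y, split_idx, est_idx):
    """Toy depth-one 'honest' causal tree over a single covariate x: pick the
    cut on the split half that maximizes the gap in treatment effects between
    the two bins, then re-estimate each bin's effect on the held-out half."""
    def effect(idx):
        treated = [y[i] for i in idx if w[i] == 1]
        control = [y[i] for i in idx if w[i] == 0]
        if not treated or not control:
            return None  # cannot estimate an effect in this bin
        return mean(treated) - mean(control)

    best_cut, best_gap = None, -1.0
    for cut in sorted(set(x[i] for i in split_idx))[:-1]:
        tau_l = effect([i for i in split_idx if x[i] <= cut])
        tau_r = effect([i for i in split_idx if x[i] > cut])
        if tau_l is None or tau_r is None:
            continue
        if (tau_l - tau_r) ** 2 > best_gap:
            best_cut, best_gap = cut, (tau_l - tau_r) ** 2
    if best_cut is None:
        return None
    # "honest" step: effects come from observations not used to pick the cut
    tau_left = effect([i for i in est_idx if x[i] <= best_cut])
    tau_right = effect([i for i in est_idx if x[i] > best_cut])
    return best_cut, tau_left, tau_right

def forest_predict(x, w, y, n_trees=200, seed=0):
    """Repeat the subsample/split/estimate procedure many times, averaging
    each out-of-subsample observation's predicted effect across trees."""
    rng = random.Random(seed)
    n = len(y)
    preds = [[] for _ in range(n)]
    for _ in range(n_trees):
        sub = rng.sample(range(n), max(4, n // 5))  # 20% subsample
        half = len(sub) // 2
        fit = honest_stump(x, w, y, sub[:half], sub[half:])
        if fit is None:
            continue
        cut, tau_left, tau_right = fit
        if tau_left is None or tau_right is None:
            continue
        in_sub = set(sub)
        for i in (j for j in range(n) if j not in in_sub):
            preds[i].append(tau_left if x[i] <= cut else tau_right)
    return [mean(p) if p else None for p in preds]
```

On simulated data where the true effect depends on a single binary covariate, the averaged out-of-subsample predictions recover the two group effects; the actual algorithm generalizes this to many covariates, deeper trees, a within-bin variance penalty, and 100,000 subsamples.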

## V. Participation

In the first program year, 75% of youth offered the program actually participated, and participants averaged 35 days of work out of a possible 40. In the second program year, when the maximum possible take-up rate was 38% by construction (see section II), actual program take-up was 30%. Participants worked an average of 18 days out of a possible 30, reflecting in part the greater challenge of recruiting and retaining a more disconnected and criminally active population in the second year. There was no control crossover in the first cohort; ten control youth in the second cohort (0.4%) participated in the program. Twenty percent of the 2013 treatment group participated in any postsummer programming. On average, these participants attended about 18.5 days of additional programming over about a nine-month period. Across both cohorts, the $F$-statistic on the first-stage regression measuring any OSC+ participation is 2,205. (See appendix section I.1 for additional participation and first-stage details.)

To give a better sense of the counterfactual over the summer16 for those with employment data available, appendix table A3 shows rates at which youth worked in a formal sector job, worked in OSC+, or did not work at all. Few youth worked in other UI-covered jobs: in 2012, about 8% of the treatment group and 15% of the control group worked outside of OSC+, with 17% of the treatment group and 23% of the control group working nonprogram jobs in 2013. The treatment-control differences in nonprogram employment suggest that OSC+ generates a small amount of crowd-out, though the program still dramatically increases the overall proportion of youth who work over the summer. The treatment-control difference in having no job is 68 percentage points in 2012 (from a control mean of 84%) and 24 percentage points in 2013 (from a control mean of 75%).

## VI. Main Results

Panels A and B of table 2 show LATE estimates of the program's impact on crime separately by cohort (coefficients and standard errors are multiplied by 100, representing the effect of any program participation per 100 compliers). (Appendix D shows the ITT crime results.) The patterns of behavioral change are remarkably similar across studies: both cohorts show large and statistically significant declines in violent-crime arrests during the first postlottery year, followed in later years by increases in property-crime arrests that are statistically significant in the 2012 cohort. Given that the main goal of the program was violence reduction, the magnitude of the results is quite promising: the first study shows that the program causes 4.1 fewer violent-crime arrests per 100 participants in the first postprogram year, a 42% decline relative to the control complier mean (CCM). That finding is replicated in the second study, where the absolute magnitude of the change is somewhat larger (7.9 fewer violent-crime arrests) but proportionally slightly smaller (a 33% decline). This pattern is consistent with the fact that the second cohort was more criminally active (more crime to prevent) but worked fewer hours (slightly smaller proportional change). The substantively similar and statistically indistinguishable results across the two studies are important on their own, suggesting that the first study's results were not just a statistical fluke.

Table 2.
Local Average Treatment Effect on Number of Arrests ($×$ 100) by Cohort and Year
**A. 2012 Program**

| Arrest Type | Year 1: LATE | SE | CCM | Year 2: LATE | SE | CCM | Year 3: LATE | SE | CCM |
|---|---|---|---|---|---|---|---|---|---|
| Violent | −4.13** | (1.97) | 9.83 | −0.20 | (1.70) | 5.17 | 0.15 | (1.63) | 5.37 |
| Property | 1.64 | (1.39) | 3.14 | 3.80** | (1.73) | 1.34 | 2.75** | (1.33) | 0.74 |
| Drugs | 0.52 | (2.16) | 3.89 | −2.38 | (1.86) | 8.08 | −2.87 | (2.08) | 7.65 |
| Other | 1.36 | (2.71) | 10.77 | 1.12 | (2.63) | 9.36 | 1.69 | (2.67) | 10.45 |

**B. 2013 Program**

| Arrest Type | Year 1: LATE | SE | CCM | Year 2: LATE | SE | CCM |
|---|---|---|---|---|---|---|
| Violent | −7.90** | (3.77) | 24.20 | 1.31 | (3.25) | 12.71 |
| Property | 1.76 | (3.09) | 11.62 | 2.16 | (3.11) | 6.37 |
| Drugs | 4.4 | (5.02) | 19.92 | −7.92 | (5.02) | 26.65 |
| Other | −11.6 | (8.21) | 56.18 | 2.82 | (7.75) | 36.41 |

**C. Pooled**

| Arrest Type | Year 1: LATE | SE | CCM | Year 2: LATE | SE | CCM | Years 1 and 2: LATE | SE | CCM |
|---|---|---|---|---|---|---|---|---|---|
| Violent | −6.34*** | (2.24) | 18.31 | 0.76 | (1.94) | 9.55 | −5.58* | (3.16) | 27.85 |
| Property | 1.68 | (1.80) | 8.18 | 2.97 | (1.89) | 4.19 | 4.65 | (2.86) | 12.36 |
| Drugs | 2.35 | (2.90) | 13.84 | −5.26* | (2.86) | 18.66 | −2.91 | (4.42) | 32.50 |
| Other | −5.05 | (4.62) | 36.36 | 2.44 | (4.38) | 25.03 | −2.61 | (7.21) | 61.39 |

Coefficients, standard errors, and control complier means (CCMs) are multiplied by 100. All regressions are estimated using 2SLS, including block fixed effects, duplicate-application indicators, and the baseline covariates listed in the appendix. Huber-White standard errors in parentheses, clustered on individual in panel C. *p < 0.1, **p < 0.05, and ***p < 0.01.

Given the similarity in results across cohorts for crime and other outcomes (see appendix tables A5 and A6), we focus the remainder of our discussion on results pooling the two cohorts, using two years of follow-up data to be comparable across study years. Table 2, panel C shows the crime results with the pooled sample. The drop in violent-crime arrests during the first year is statistically significant and substantively large: 6.3 fewer violent-crime arrests per 100 participants, a 35% decline. The drop is not limited to the program summer, when youth are mechanically kept busier; excluding program months, violent-crime arrests decline by 26% (not shown, LATE = −3.5 per 100 youth, $p=0.067$). We also see positive but not statistically significant point estimates for property-crime arrests in all years and a marginally significant decline in drug arrests in year 2. As a result, there are no significant changes in the number of total arrests (see appendix table A7).
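The LATE estimates in table 2 come from 2SLS with the lottery offer instrumenting for program participation. A minimal sketch of that logic on simulated data follows; the take-up rate, effect size, and arrest rate are illustrative assumptions, and the paper's block fixed effects and baseline covariates are omitted for clarity.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Hypothetical simulation of the lottery design (illustrative numbers, not the
# paper's data): Z is the randomized offer, D is actual participation.
z = rng.integers(0, 2, n)
complier = rng.random(n) < 0.40                    # assumed 40% take-up if offered
d = (z * complier).astype(float)

tau = -0.06                                        # assumed effect: 6 fewer arrests per 100 compliers
y = rng.poisson(0.18, n) + tau * d                 # violent-crime arrests, year 1

# Just-identified 2SLS collapses to the Wald ratio (block fixed effects and
# covariates from the paper's specification are omitted here):
itt = y[z == 1].mean() - y[z == 0].mean()          # reduced form (ITT)
first_stage = d[z == 1].mean() - d[z == 0].mean()  # effect of the offer on participation
late = itt / first_stage

print(f"LATE x 100: {100 * late:.1f}")             # close to -6.0 by construction
```

Because take-up is imperfect, the ITT understates the effect on participants; dividing by the first stage recovers the complier-level effect, which is why the per-100-complier scaling in the tables is the natural unit.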

Panel A of table 3 shows $p$-values for the pooled crime results adjusted for multiple hypothesis testing. The first two columns show the LATE and standard error. The third column shows the $q$-value from Benjamini & Hochberg's (1995) FDR control procedure, or the smallest level of $q$ at which the null hypothesis would be rejected (where $q$ is the expected proportion of false rejections within the family). Column 4 shows $p$-values that control the FWER with a free step-down resampling method, followed by the CCM and sample size for each outcome. (See appendix G for details.) The reduction in year 1 arrests for violent crime remains significant after adjusting inference to control the FWER across the four crime categories ($p=0.018$) or to control the FDR ($q=0.019$).
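The q-values in column 3 follow the standard Benjamini-Hochberg step-up construction: scale each sorted p-value by the family size over its rank, then enforce monotonicity. The sketch below uses made-up p-values, not the paper's inputs, but reproduces the qualitative pattern in table 3 (one strong rejection plus several null-ish outcomes sharing a common q-value).

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg (1995) q-values: for each p-value, the smallest FDR
    level q at which the step-up procedure would reject that hypothesis."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)           # p_(i) * m / i
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(q_sorted, 0.0, 1.0)
    return q

# Illustrative four-outcome family (hypothetical p-values):
print(bh_qvalues([0.005, 0.35, 0.42, 0.27]))  # -> [0.02 0.42 0.42 0.42]
```

The FWER-adjusted p-values in column 4 instead come from the Westfall-Young free step-down resampling procedure, which resamples the data to account for dependence across outcomes and so is not reproducible from the p-values alone.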

Table 3.
IV Program Impacts and MHT Adjustments for Both Cohorts, Pooled
| Outcome | LATE (SE) | FDR q | FWER p | CCM | N |
|---|---|---|---|---|---|
| **A. Arrests (× 100)** | | | | | |
| Violent, year 1 | −6.34*** (2.24) | 0.019 | 0.018 | 18.31 | 6,850 |
| Property, year 1 | 1.68 (1.80) | 0.417 | 0.583 | 8.18 | 6,850 |
| Drugs, year 1 | 2.35 (2.90) | 0.417 | 0.420 | 13.84 | 6,850 |
| Other, year 1 | −5.05 (4.62) | 0.417 | 0.623 | 36.36 | 6,850 |
| Violent, year 2 | 0.76 (1.94) | 0.695 | 0.696 | 9.55 | 6,850 |
| Property, year 2 | 2.97 (1.89) | 0.232 | 0.307 | 4.19 | 6,850 |
| Drugs, year 2 | −5.26* (2.86) | 0.232 | 0.239 | 18.66 | 6,850 |
| Other, year 2 | 2.44 (4.38) | 0.695 | 0.826 | 25.03 | 6,850 |
| **B. Schooling** | | | | | |
| Any enrollment, year 1 | 0.01 (0.02) | 0.82 | 0.982 | 0.74 | 4,993 |
| Days present, year 1 | −0.59 (2.60) | 0.82 | 0.827 | 91.53 | 4,993 |
| GPA, year 1 | 0.01 (0.05) | 0.82 | 0.969 | 1.95 | 2,447 |
| Persistence through year 3 | −0.01 (0.02) | 0.82 | 0.960 | 0.62 | 4,993 |
| **C. Employment and earnings** | | | | | |
| Any employment, program quarter | 0.88*** (0.02) | 0.001 | <0.001 | 0.12 | 5,076 |
| Provider employment, program quarter | 1.04*** (0.01) | 0.001 | <0.001 | 0.00 | 5,076 |
| Nonprovider employment, program quarter | −0.05** (0.02) | 0.030 | 0.033 | 0.16 | 5,076 |
| Earnings, program quarter | 739.49*** (66.6) | 0.001 | <0.001 | 70.71 | 5,076 |
| Any employment, postprogram | 0.03 (0.03) | 0.553 | 0.526 | 0.47 | 5,076 |
| Provider employment, postprogram quarter | 0.11*** (0.02) | 0.566 | <0.001 | 0.04 | 5,076 |
| Nonprovider employment, postprogram quarter | −0.02 (0.03) | 0.001 | 0.620 | 0.44 | 5,076 |
| Quarterly earnings, postprogram | 56.39 (70.55) | 0.620 | 0.665 | 336.88 | 5,076 |

Table shows estimated LATEs pooling both years of the program and inference controlling the false discovery rate (FDR) and the familywise error rate (FWER). See text for details about the multiple hypothesis testing adjustments. Negative control complier means (CCMs) are set to 0. Asterisks based on conventional inference: *p < 0.1, **p < 0.05, and ***p < 0.01.

The decline in violent-crime arrests does not continue in the second year. Fade-out is almost universal in social interventions, though it is worth noting that the program occurred at a high-violence moment in the youths' trajectories: the violent-crime control complier means in year 2 are about half the size of those in year 1. This pattern suggests that part of the fade-out may stem from well-timed program delivery, after which youth begin aging out of violent crime. It is also possible that more control compliers were incarcerated for their earlier violent crimes during the first year, which could mechanically lower the control group's crime rates in year 2. However, the CCMs for drug-crime arrests are higher in year 2 than in year 1, which is not entirely consistent with the idea that control youth simply have less free time in which to offend. Even if incapacitation explains part of the fade-out, providing a SYEP is likely a less socially costly way to reduce crime than incarceration, and cumulatively across years, the violence drop remains significant ($p=0.078$).

We see a marginally significant decline in drug crimes during year 2 and an imprecise but substantively large increase in property crime (which is statistically significant if we add the third year of outcome data for the 2012 cohort: 5.8 more property-crime arrests per 100 participants, a 46% increase, $p=0.053$). Program effects that go in opposite directions for violent and property crime are fairly common in the literature (Kling, Ludwig, & Katz, 2005; Deming, 2011; Jacob & Lefgren, 2003); in fact, a short-term violence decline followed by a longer-term property crime increase is notably similar to the pattern of results in the Moving-to-Opportunity study. An increase in property crime might be expected if youth are spending more time traveling or working, since they have more access to better things to steal (Clarke, 1995). However, the changes in nonviolent crime are less robust to multiple hypothesis testing adjustments, so we interpret them more cautiously. The fact that violence is so much more socially costly than other types of crime highlights the importance of analyzing crime types separately rather than aggregating the differences away.

One possible explanation for the violence decline could be that participants learn about the returns to schooling, or develop motivation, self-efficacy, or other prosocial beliefs, and so spend more time engaged in school in the year after the program. The schooling results in panel B, however, suggest this is not the case: we find no significant changes in CPS enrollment, days present, or GPA during the school year after the program, and the confidence interval in the pooled sample rules out more than a four- to five-day increase in attendance.17 The conclusions are unchanged after adjusting inference for multiple hypothesis testing. These results focus on the year after the program, since missing data become a larger problem as youth age (more graduation and dropout in later years). To capture longer-term school engagement, the last row of panel B shows the program's impact on whether a youth persists in school (remains enrolled or graduates) through the start of the third year after random assignment. The point estimate is small, negative, and statistically insignificant. Overall, there is little evidence of changes in schooling outcomes.

## IX. Conclusion

This paper shows that a supported summer jobs program in Chicago generates large one-year declines in violent-crime arrests, both in an initial study (42% decline) and in an expansion study with more disconnected youth (33%). The drop in violence continues after the program summer and remains substantively large after two to three years, though it stops accruing after the first year. And it occurs despite no detectable improvements in schooling, UI-covered employment, or other types of crime during the follow-up period. If anything, property crime increases in future years, though the large social cost of violence means that social benefits may still outweigh the program's administrative costs (see appendix J).

Using a new supervised machine learning method called the causal forest, we show that the zero average employment effect masks a group whose formal-sector employment improves by 15 percentage points (44%). We show that this subgroup is younger and more engaged in school than the group with no employment gains—fairly different from the out-of-school and out-of-work young people usually targeted by youth employment programs. However, the employment benefiters do not seem to drive the crime decline. Predicted employment impacts are almost completely uncorrelated with the impact on violent-crime arrests. And if anything, the impact on nonviolent arrests is positively correlated with employment gains. This is not consistent with the idea that changes in opportunity cost explain the crime effects. But it is consistent with other crime theory: better employment provides more opportunity for theft and more money for drug markets.

We do not find any detectable heterogeneity in program impacts on violence. Although this may be a question of power, it suggests that everyone—at least everyone in our disadvantaged, urban sample—benefits on this outcome. This finding highlights another reason why SYEPs may have different effects from other youth training programs. To reduce violence, a program must serve youth at risk of violence. But programs like Job Corps and Year Up screen out youth with certain criminal backgrounds, so may not have much room to make big strides on violence.

There tends to be a fair amount of pessimism in the youth employment literature about how difficult and costly it is to improve youth outcomes. The evidence we present here, combined with growing evidence from programs in other cities, suggests that this pessimism may stem in part from mistaken beliefs about what these programs achieve and for whom. Rethinking what youth training programs do and how to target them, as well as further exploring why SYEPs decrease violence, may help better direct limited government resources and improve our understanding of youth behavior.

## Notes

1

See appendix A for a summary of these reviews and the youth job training literature, including recent more positive findings.

2

Appendix A documents how standard youth job programs serve almost exclusively out-of-school, out-of-work youth but often screen for criminal involvement, while SYEPs serve mostly high school students, who may be closer to the peak of the age-crime curve.

3

Crime results from within Chicago over the first sixteen months and one-year schooling outcomes for this RCT were reported in Heller (2014). This paper adds two more years of school data, two more years of crime data that now include all arrests statewide, and previously unreported employment outcomes, as well as the entire second study in 2013.

4

This is not to argue for serving exclusively this population without additional research, since changing peer composition could change program impacts.

5

In 2012, program providers were Sinai Community Institute, St. Sabina Employment Resource Center, and Phalanx Family Services. In 2013, they were the Black Star Project, Blue Sky Inn, Kleo Community Family Life Center, Phalanx Family Services, St. Sabina Employment Resource Center, Westside Health Authority, and Youth Outreach Services.

6

Because the program serves a very mobile and arrest-prone population, it was clear that filling all the available slots would take considerable time. To speed up recruiting, we gave providers upfront lists containing hundreds more youth than there were available program slots. We count everyone on these lists as treatment, since we could not enforce the rule that providers work down each list in order.

7

The prior study on the first cohort (Heller, 2014) used Chicago Police Department data rather than statewide data. That study included only arrests within Chicago and covered a somewhat different time period, so the amount of crime reported here is slightly different.

8

CPS underwent a major reform of how they recorded disciplinary incidents during this time, so it is not clear how comparable recording is across or even within schools. Therefore, we do not use the disciplinary data as outcome measures.

9

Appendix table A8 shows that the results are similar if we impute data for students who never appear in the CPS data. Appendix I.3 shows other missing data approaches.

10

The 2012 study had two treatment arms that differed by the provision of a social-emotional learning curriculum. Because the differences between treatment arms are generally not statistically significant, we focus the main text on the overall treatment-control contrast; results by treatment arm are in appendix F.

11

It is not clear that the causal forest works directly with an IV involving noncompliance. Take-up rates within leaves may be 0 or close to 0 because of the small samples in each leaf. This will make the LATE either incalculable or huge in the leaves resulting from some potential splits. But the causal forest implements the splits that maximize the variance of treatment effects across leaves; if some treatment effects are enormous because of small-sample variation in take-up rates, the key Athey and Imbens result—that an objective function maximizing treatment effect variance is equivalent to minimizing the expected mean squared error of the unobservable prediction error—may not hold. We report how take-up rates vary across predicted ITTs to assess how much heterogeneity is from differences in participation. Alternative strategies might involve the generalized random forest (Athey, Tibshirani, & Wager, 2019), which does not estimate LATEs within leaves, or running separate causal forests to predict an individual's ITT and take-up rate, then constructing individual LATE estimates as the ratio of the two predictions.

12

The causal forest's flexibility in searching for benefiters while still avoiding overfitting is desirable because our key research question is whether any subgroup benefits. Other regression tree–based approaches, including Bayesian additive regression trees, share this flexibility and may have different stability and regularization properties. If the question of interest is instead whether (or which of) a small number of Xs predict heterogeneity, alternative approaches like Lasso could be more appropriate.

13

We use a subset of covariates that are available for nearly everyone in the sample, including demographics (age in years and indicator variables for being male, black, or Hispanic), neighborhood characteristics from the ACS (census tract unemployment rate, median income, proportion with at least a high school diploma, and proportion who rents their home), prior arrests (number of prerandomization arrests for violent crime, property crime, drug crime, and other crime), prior schooling (indicator variables for having graduated from CPS prior to the program, being enrolled in CPS in the school year prior to the program, not being enrolled in the year prior to the program despite having a prior CPS record, and not being in the CPS data at all), and prior employment (indicator variables for having worked in the year prior to the quarter of randomization, for having not worked in the year prior to the quarter of randomization despite having a valid SSN, and for not having a valid SSN).

14

The variance penalty comes from Athey and Imbens (2016). We also use inverse probability weights to deal with different treatment probabilities across randomization blocks.

15

This step is a slight deviation from Wager and Athey, who assign $\tau^b$ to the entire sample rather than to the 80% excluded from the initial subsample. We find that this adjustment, using only "out-of-bag" estimates, reduces overfitting in our finite-sample setting, although it may require adjusted theoretical justification (Davis & Heller, 2017).

16

UI data are quarterly, and the 2012 program started in the last week of June. So we define the “summer” program period as quarters 2 and 3 of 2012 (April to September) in the first study year and quarter 3 only (July to September) in the second study year, when the program started at the beginning of July.

17

Section III explains how we treat missing data in this table, with more details in appendix C.2. Appendix I.3 shows that the results are generally robust to other treatments of missing data, including logical imputation that accounts for transfers out of the district; multiple imputation, which relaxes the MCAR assumption in this sample; and the inclusion of multiply imputed data for youth who were never in CPS records.

18

Youth are in this sample if they have a valid SSN. Appendix I.4 shows that results using various imputation techniques for missing data do not change the pattern of results.

19

Some coefficients are greater than 1 in part because we are using a linear probability model. Appendix table A10 shows estimated average marginal effects using a probit, which are substantively very similar.

20

In theory, any optimal targeting strategy should maximize net social welfare, not just behavioral benefits. Youth may generate heterogeneous program costs if some individuals require more resources to recruit and serve or have heterogeneous private valuations of the program. And policymakers may place value on equity or particular distributional consequences of a targeted program. Taking a stand on the social welfare function is beyond the scope of this paper, so we focus on estimating who benefits most, which is one crucial input to decisions about optimal allocation.

21

In a recent working paper, Chernozhukov et al. (2018) suggest a different functional form for this test. The results from their specification, shown in appendix I.5, yield identical conclusions.

22

Appendix I.5.2 details this regression and shows the results are not sensitive to different divisions of the predictions. Both Davis and Heller (2017) and, in a different setting, Athey and Wager (2019) show a related exercise using an above/below median split. The former uses a split-sample comparison with fewer trees, which produces results that are not entirely stable across different splits of the sample. Since the goal here is to learn from the predictions rather than assess the method, we use our full sample to increase stability, relying on the “adjusted honest” approach. We also increase the number of trees we use from 25,000 to 100,000. The predictions themselves are generally similar in both cases (correlations across the two different sets of predictions are over 0.99 for all three of our outcomes). But since we are using a quartile cutoff to test for treatment heterogeneity, Monte Carlo error can generate small changes in predictions around the cutoff, which in turn changes the composition of our subgroups. The additional trees reduce the number of observations switching quartile across two different sets of predictions by 50% to 75%.
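As a toy sketch of the quartile-based heterogeneity check described above (all data simulated; the paper's actual test is the regression specification detailed in appendix I.5.2, and the simple top-quartile-vs-rest contrast here is a stand-in for treatment interacted with prediction-group indicators):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 8_000

# Toy setup (all numbers assumed): causal-forest-style predictions where only
# the top quartile of youth has a true employment effect.
pred = rng.normal(0.0, 0.05, n)                    # predicted employment ITT
treat = rng.integers(0, 2, n)
top = pred > np.quantile(pred, 0.75)
true_effect = np.where(top, 0.15, 0.0)             # 15 pp gain for top quartile only
y = 0.30 + true_effect * treat + rng.normal(0, 0.4, n)

# Treatment-control contrast within each prediction group:
def contrast(mask):
    return y[mask & (treat == 1)].mean() - y[mask & (treat == 0)].mean()

print(f"top quartile:     {contrast(top):.3f}")    # near 0.15
print(f"bottom quartiles: {contrast(~top):.3f}")   # near 0.00
```

The footnote's point about Monte Carlo error is visible in this framing: observations with predictions just above or below the quartile cutoff can switch groups across forest runs, which is why additional trees stabilize the subgroup composition.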

23

For the slope test, there is an argument that prediction error should not matter: we are testing a question about the predictions, not the construct the predictions are trying to capture, and the predictions exactly define themselves. Prediction error is a larger issue below when we use the predictions as a measure of true underlying heterogeneity.

24

We exclude preprogram graduates from the persistence column since the program could not change high school outcomes for this group. We report causal forest results for other outcomes in appendix I.5.4; none successfully predicts heterogeneity.

25

Appendix I.5 explores the relationship between the employment predictions and other outcomes.

26

If instead we pool all nonviolent crime into one outcome, $p=0.108$.

27

The most socially beneficial targeting might be to target those with the biggest impacts on violent crime, since violence is so socially costly. But since we find no heterogeneity in violence impacts, we focus this exercise on employment.

28

Since average take-up does not vary much across employment predictions, we assume for simplicity that take-up probability is uncorrelated with predicted treatment effects (i.e., participants have the same average treatment effect as those offered a slot).

## REFERENCES

Anderson, Michael L., "Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects," *Journal of the American Statistical Association* 103 (2008), 1481–1495.

Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin, "Identification of Causal Effects Using Instrumental Variables," *Journal of the American Statistical Association* 91 (1996), 444–455.

Athey, Susan, and Guido Imbens, "Recursive Partitioning for Heterogeneous Causal Effects," *Proceedings of the National Academy of Sciences* 113 (2016), 7353–7360.

Athey, Susan, and Guido Imbens, "The Econometrics of Randomized Experiments" (vol. 1, pp. 73–140), in Abhijit Vinayak Banerjee and Esther Duflo, eds., *Handbook of Field Experiments* (Amsterdam: North-Holland, 2017).

Athey, Susan, Julie Tibshirani, and Stefan Wager, "Generalized Random Forests," *Annals of Statistics* 47 (2019), 1148–1178.

Athey, Susan, and Stefan Wager, "Estimating Treatment Effects with Causal Forests: An Application" (2019), arXiv:1902.07409.

Benjamini, Yoav, and Yosef Hochberg, "Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing," *Journal of the Royal Statistical Society, Series B (Methodological)* 57 (1995), 289–300.

Chernozhukov, Victor, M. Demirer, E. Duflo, and I. Fernandez-Val, "Generic Machine Learning Algorithms for Heterogeneous Treatment Effects in Randomized Experiments" (2018), arXiv:1712.04802.

Clarke, Ronald V., "Situational Crime Prevention" (pp. 91–150), in M. Tonry and D. Farrington, eds., *Building a Safer Society: Strategic Approaches to Crime Prevention*, vol. 19 of M. Tonry, ed., *Crime and Justice: A Review of Research* (Chicago: University of Chicago Press, 1995).

Cohen, Lawrence E., and Marcus Felson, "Social Change and Crime Rate Trends: A Routine Activity Approach," *American Sociological Review* 44 (1979), 588–608.

Cook, Phillip J., "The Demand and Supply of Criminal Opportunities," *Crime and Justice* 7 (1986), 1–27.

Crépon, Bruno, and Gerard J. van den Berg, "Active Labor Market Policies," *Annual Review of Economics* 8 (2016), 521–546.

Davis, Jonathan M. V., and Sara B. Heller, "Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs," *American Economic Review: Papers and Proceedings* 107 (2017), 546–550.

Deming, David J., "Better Schools, Less Crime," *Quarterly Journal of Economics* 126 (2011), 2063–2115.

Gelber, Alexander, Adam Isen, and Judd B. Kessler, "The Effects of Youth Employment: Evidence from New York City Lotteries," *Quarterly Journal of Economics* 133 (2016), 423–460.

Heller, Sara B., "Summer Jobs Reduce Violence among Disadvantaged Youth," *Science* 346:6214 (2014), 1219–1223.

Heller, Sara B., Anuj K. Shah, Jonathan Guryan, Jens Ludwig, Sendhil Mullainathan, and Harold A. Pollack, "Thinking, Fast and Slow? Some Field Experiments to Reduce Crime and Dropout in Chicago," *Quarterly Journal of Economics* 132:1 (2017), 1–54.

Jacob, Brian, and Lars Lefgren, "Are Idle Hands the Devil's Workshop? Incapacitation, Concentration and Juvenile Crime," *American Economic Review* 93 (2003), 1560–1577.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, *An Introduction to Statistical Learning* (New York: Springer, 2013).

Katz, Lawrence F., Jeffrey R. Kling, and Jeffrey B. Liebman, "Moving to Opportunity in Boston: Early Results of a Randomized Mobility Experiment," *Quarterly Journal of Economics* 116 (2001), 607–654.

Kling, Jeffrey R., Jens Ludwig, and Lawrence F. Katz, "Neighborhood Effects on Crime for Female and Male Youth: Evidence from a Randomized Housing Voucher Experiment," *Quarterly Journal of Economics* 120:1 (2005), 87–130.

LaLonde, Robert J., "Employment and Training Programs" (pp. 517–586), in Robert A. Moffitt, ed., *Means-Tested Transfer Programs in the United States* (Chicago: University of Chicago Press, 2003).

Leos-Urbel, Jacob, "What Is a Summer Job Worth? The Impact of Summer Youth Employment on Academic Outcomes," *Journal of Policy Analysis and Management* 33 (2014), 891–911.

Modestino, Alicia Sasser, "How Do Summer Youth Employment Programs Improve Criminal Justice Outcomes, and for Whom?" *Journal of Policy Analysis and Management* 38 (2019), 600–628.

Schwartz, Amy Ellen, Jacob Leos-Urbel, and Matthew Wiswall, "Making Summer Matter: The Impact of Youth Employment on Academic Performance," NBER working paper 21470 (2015).

Valentine, Erin Jacobs, Chloe Anderson, Farhana Hossain, and Rebecca Unterman, "An Introduction to the World of Work: A Study of the Implementation and Impacts of New York City's Summer Youth Employment Program," MDRC report (2017).

Wager, Stefan, and Susan Athey, "Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests," *Journal of the American Statistical Association* 113 (2018), 1228–1242.

Westfall, Peter H., and S. Stanley Young, *Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment* (New York: Wiley-Interscience, 1993).

## Author notes

This research was generously supported by contract B139634411 and a Scholars award from the US Department of Labor, grant 2012MIJ-FX-002 from the Office of Juvenile Justice and Delinquency Prevention, Office of Justice Programs, US Department of Justice, and graduate research fellowship 2014-IJ-CX-0011 from the National Institute of Justice. The 2012 study was preregistered at clinicaltrials.gov. Both studies are registered in the American Economic Association Registry under trial numbers 1472 and 2222. For helpful comments, we thank Stephane Bonhomme, Eric Janofsky, Avi Feller, Jon Guryan, Kelly Hallberg, Jens Ludwig, Parag Pathak, Harold Pollack, Guillaume Pouliot, Sebastian Sotelo, Alexander Volfovsky, and numerous seminar participants. We are grateful to the staff of the University of Chicago Crime and Poverty Labs (especially Roseanna Ander) and the Department of Family and Support Services for supporting and facilitating the research, to Susan Athey for providing the beta causal forest code, and to Valerie Michelman and Stuart Hean for research assistance. We thank Chicago Public Schools, the Department of Family and Support Services, the Illinois Department of Employment Security, and the Illinois State Police via the Illinois Criminal Justice Information Authority for providing data. The analysis and opinions here do not represent the views of any of these agencies, and any further use of the data must be approved by each agency. Any errors are our own.

A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/rest_a_00850.