Abstract

This paper reports the results of two randomized field experiments, each offering different populations of Chicago youth a supported summer job. The program consistently reduces violent-crime arrests, even after the summer, without improving employment, schooling, or other arrests; if anything, property crime increases over two to three years. Using a new machine learning method, we uncover heterogeneity in employment impacts that standard methods would miss, describe who benefits, and leverage the heterogeneity to explore mechanisms. We conclude that brief youth employment programs can generate important behavioral change, but for different outcomes, youth, and reasons than those most often considered in the literature.

I. Introduction

For at least half a century, social scientists and policymakers have argued that a combination of job training, search assistance, remedial course work, and subsidized work can improve employment and wages by developing human capital and reducing search costs (LaLonde, 2003). Improving employment may in turn increase the opportunity cost of crime or improve other social outcomes, though nonemployment outcomes are often treated as ancillary benefits of improved employment in the literature (Crépon & van den Berg, 2016). Reviews of the evidence on whether employment programs cost-effectively achieve these goals in the United States, at least among the disadvantaged youth who are the focus of this paper, lean fairly negative, though vary in their level of pessimism. Most conclude that only very intensive and expensive training programs improve labor market outcomes, while only a tiny handful reduce crime, largely limited to the period of the program itself.1

Recent studies on summer youth employment programs (SYEPs), however, show starkly different results from other job training programs. Despite being neither intensive nor expensive, short programs in Chicago, New York, and Boston dramatically reduce violent crime and mortality, even after the program has ended (Gelber, Isen, & Kessler, 2016; Heller, 2014; Modestino, 2019). The programs do this without improving average employment outcomes (if anything, some youth have lower future earnings) and with small, if any, effects on education outcomes (Gelber et al., 2016; Valentine et al., 2017; Leos-Urbel, 2014; Schwartz, Leos-Urbel, & Wiswall, 2015; Heller, 2014). This pattern of results—big postprogram crime declines with no indication of improved human capital or increased opportunity costs—raises questions about what mechanisms are at work and why effects are so different from the youth training literature.

This paper tests whether treatment effect heterogeneity is part of the answer. In theory, SYEPs might improve human capital, increase employment, and thereby decrease crime among a subset of youth, while others either do not respond or allow the program to crowd out better opportunities. If so, zero average employment effects could mask heterogeneity that explains net crime declines. In addition to being important for program targeting, such heterogeneity might also explain why SYEPs' effects differ from those of other youth employment programs: they serve a younger, more school-attached population than the disconnected youth targeted by other training programs, and prevention (reaching youth before they leave school) might be easier than remediation (reaching them after unemployment spells).2 But because there has been so little overlap in participants of summer and other employment interventions, it has been impossible to separate program from population differences until now.

We use two randomized controlled trials (RCTs) of a Chicago summer jobs program, along with a new machine learning technique for estimating treatment heterogeneity, to test these ideas. The first RCT in 2012 offered an eight-week, part-time summer job at minimum wage ($8.25/hour) and an adult job mentor to a population of disadvantaged high school students.3 The second RCT in 2013 offered a similar six-week program but purposefully expanded the eligibility criteria to include disconnected, out-of-school youth more like those targeted by other youth employment interventions. Most youth also participated in a curriculum built on cognitive behavioral therapy principles aimed at helping them manage their cognitive and emotional responses to conflict, as well as encouraging them to set and achieve personal goals. We track youth in administrative data through 2015 from the Chicago Public Schools, the Illinois State Police, and the Illinois Department of Employment Security.

In both study years, a supported summer job generates dramatic and robust reductions in violent-crime arrests in the year after random assignment: the local average treatment effect is a 42% decline in the first study and a 33% decline in the second (4.1 and 7.9 fewer arrests per 100 participants, respectively). Across the whole sample, the effect is still significant after adjustments for multiple hypothesis testing. The pooled sample shows a 26% decline in violent-crime arrests even after removing the program months from the data (p=0.067), meaning the behavior change is not simply a mechanical result of keeping youth busy over the summer that disappears as soon as the job ends. The program also does not seem to increase the overall opportunity cost of crime or keep youth out of trouble more generally: participants' total number of arrests does not change, and if anything, property crime increases in later years. Neither employment outcomes nor other indicators of human capital such as schooling improve, at least on average.

We then estimate treatment heterogeneity based on observable characteristics. Tests for heterogeneity typically involve interacting a treatment indicator with a series of baseline covariates, one at a time. But each additional hypothesis test raises the probability of spurious findings. And if heterogeneity is driven by the interaction of more than one characteristic at a time (or a nonlinear function of a continuous variable), typical interaction tests may miss substantively important variation in treatment effects. To more flexibly estimate treatment heterogeneity, we use a causal forest (Athey & Imbens, 2016; Wager & Athey, 2018), predicting treatment effects based on high-dimensional, nonlinear functions of observables and mining the data for responsive subgroups in a principled way. We develop tests for whether the predictions succeed in detecting actual heterogeneity, describe who benefits, and use the patterns of heterogeneity to assess potential mechanisms.

The causal forest identifies significant heterogeneity in employment impacts, which standard interaction-based approaches would miss. We identify a subgroup whose postprogram employment improves by 15 percentage points (44%), which on its own is an important result for policy. That subgroup is younger, more engaged in school, more Hispanic, more female, and less likely to have an arrest record. In other words, the employment benefiters are not the disconnected youth whom other employment programs typically target. Although the results are imprecise, we also show that the drop in violence is not concentrated among the employment benefiters. If anything, nonviolent crime increases among those with better employment. These findings offer little support for the traditional theory that improved human capital and increased opportunity cost explain crime declines. They do, however, emphasize the potential gains from more flexible approaches to treatment heterogeneity; we calculate that targeting the program using the causal forest could generate employment impacts four times larger than targeting using a more standard approach.

We find no significant heterogeneity in effects on violent-crime arrests or school persistence. The fact that we cannot distinguish variation in violence impacts, at least within our disadvantaged urban population, suggests there is value to targeting SYEPs toward youth who are at risk of violent crime, a group that traditional training programs often screen out.4 The question of why violence declines for everyone, seemingly independent of changes in employment, is important. Expanded prosocial attitudes, improved beliefs about the future, or general “staying busy” explanations are not entirely satisfactory given that property crime increases in later follow-up years and that bigger employment gains seem to accompany increases in nonviolent crime. But more nuanced crime theory highlights the role of opportunity: a program that introduces youth to richer areas and new peers may increase opportunities for theft and drug purchases but decrease opportunities to fight, even without changing labor market outcomes (Cohen & Felson, 1979; Cook, 1986; Clarke, 1995).

Anecdotal evidence from employers provides another hypothesis for why violence, which by definition involves conflict with other people, may change: employers report helping youth develop self-regulation and respond positively to criticism, which could reduce conflicts outside the workplace as well. There could also be a role for unmeasured informal sector work, peer networks, income, or violence-specific attitudes, norms, or beliefs. Further research is needed to sort out exact mechanisms. In the meantime, we show the potential of SYEPs to reduce violence among a new population, as well as the potential of machine learning to help policymakers rethink who benefits from what kind of employment program and why.

II. Program Description and Experimental Design

Chicago's Department of Family and Support Services (DFSS) designed One Summer Chicago Plus (OSC+) primarily as a violence-reduction intervention. The program structure was similar across both summers: youth were offered a summer job, five hours per day and five days a week, at minimum wage ($8.25 per hour) for eight weeks in 2012 and six weeks in 2013. All youth were assigned a job mentor—an adult to assist them in learning to be successful employees and to help them deal with barriers to employment—at a ratio of about 10 to 1. DFSS administered the program through contracts with local nonprofit agencies. These agencies recruited applicants, offered participating youth brief training, hired the mentors, recruited employers, placed youth in summer jobs, provided daily lunch and bus passes when appropriate, monitored the participants' progress over the course of the summer, and if youth were fired, worked with them to find an alternative placement.5

In the first year of the program, youth ages 14 to 21 were recruited from thirteen high-violence Chicago public high schools. A total of 1,634 youth (about 13% of the prior year's student population in these schools) applied for the 700 available program slots. The research team randomly assigned treatment slots at the individual level within school and gender blocks. Youth worked at a range of nonprofit and government employers on tasks such as supervising younger youth at summer camps, clearing lots to plant a community garden, improving infrastructure at local schools, and providing administrative support at aldermen's offices. Because of restrictions imposed by a funder, there were no private sector jobs.

In the second year of the program, 16- to 22-year-old male youth in one of two applicant pools could apply. The first pool (n=2,127) was invited to voluntarily apply directly from the criminal justice system (from probation offices, juvenile detention or prison, or a center to serve justice-involved youth). The rest (n=3,089) had applied to Chicago's broader summer programming; those who were ages 16 to 20, lived in one of the thirty highest-violence community areas, and included a social security number (SSN) on their application entered the lottery. Notably, participants were no longer required to be in school. The resulting 5,216 boys were individually randomly assigned to treatment or control groups within applicant pool-age-geography blocks (2,634 treatment and 2,582 control), with each block assigned to a specific service agency. Because of the time-constrained recruiting process, the number of youth assigned to the treatment group far exceeds the number of available slots (1,000).6 One important implication is that the maximum possible take-up rate, even if the first thousand youth were immediately located and agreed to participate, is 38%. This is by design and should not be interpreted as indicating low demand for the program among the treatment group. Private sector jobs were included in this program year.

In 2013, DFSS also encouraged treatment youth to keep participating in programming offered by the community service agencies after the summer ended, including a mix of additional social-emotional learning activities, job mentoring, and social outings such as sporting events and DJ classes. These activities were much lower intensity than the summer programming, and participants received a small stipend (approximately $200) rather than an hourly wage. Appendix B reports additional details about the program, randomization, and recruitment.

III. Data and Descriptive Statistics

We match study youth to existing administrative data sets from a variety of government sources. Program application and participation records come from DFSS. We measure crime with Illinois State Police (ISP) arrest records, which combine police records from departments across the state.7 We use the description of each offense to categorize offenses as violent, property, drug, or other (e.g., vandalism, trespassing, outstanding warrants). The data cover both juvenile and adult arrests from 2001 through two (2013 cohort) or three (2012 cohort) years postrandom assignment. Youth who have never been arrested will not be in the ISP records, so we assign 0 arrests for individuals not matched to the data.

We use student-level administrative records from Chicago Public Schools (CPS) to capture schooling outcomes. These data include enrollment status, grade level, course grades, and attendance from the beginning of CPS enrollment through the 2015–16 academic year.8 Our main analysis excludes preprogram graduates (n=1,422) as well as anyone who never appeared in the CPS records (and so likely always attended school outside the district, n=435).9 Since these are both baseline characteristics, the exclusion should not undermine the integrity of random assignment (see appendix table A1 for balance tests on this subsample). We focus on the school year following the program, since missing GPA and attendance data become a bigger problem over time as more students graduate, drop out, or transfer. To assess longer-term performance, we define a school persistence measure that is available for everyone in the CPS data regardless of missing attendance and GPA data in future years: an indicator that equals 1 if the youth has graduated from CPS in the first two postprogram school years or is still attending school in the third postprogram school year.
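
For concreteness, a minimal pandas sketch of how such a persistence indicator could be built from enrollment flags (the column names are hypothetical, not the CPS fields):

```python
import pandas as pd

def school_persistence(cps: pd.DataFrame) -> pd.Series:
    """1 if the youth graduated in the first two postprogram school years or is
    still enrolled in the third postprogram school year; column names are hypothetical."""
    graduated = cps["graduated_year1"] | cps["graduated_year2"]
    return (graduated | cps["enrolled_year3"]).astype(int)
```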

We measure employment using quarterly Unemployment Insurance (UI) records, which include earnings and employer for each formal sector UI-covered job. We obtain SSNs for matching to UI data through school records, although the school district did not require students to report SSNs. As such, our main employment analysis excludes youth who could not be matched due to missing SSNs (26% of the sample either never enrolled in CPS or were missing SSN in the school records). This approach assumes that SSNs are missing completely at random; appendix table A9 shows the results are robust to different approaches to missing data. A subset of OSC+ providers did not report program earnings to the UI system. For youth attending these providers, we impute program quarter earnings as the sum of earnings at other employers and their reported program hours times $8.25. For youth with SSNs but no UI data, we assign 0s for employment and earnings, assuming anyone not found in the matching process never worked in the formal sector. Appendix C reports additional details on all data sources, matching procedures, and variable definitions.
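
A minimal sketch of that imputation step, assuming flag and earnings columns like those named below (placeholders, not the administrative fields):

```python
import pandas as pd

PROGRAM_WAGE = 8.25  # hourly program wage

def impute_program_quarter_earnings(ui: pd.DataFrame) -> pd.DataFrame:
    """For youth at providers that did not report to UI, set program-quarter earnings to
    earnings at other employers plus reported program hours times the program wage.
    All column names are placeholders, not the administrative files' field names."""
    ui = ui.copy()
    missing = ~ui["provider_reports_to_ui"]
    ui.loc[missing, "program_qtr_earnings"] = (
        ui.loc[missing, "nonprovider_qtr_earnings"]
        + ui.loc[missing, "program_hours"] * PROGRAM_WAGE
    )
    return ui
```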

Table 1 shows select baseline characteristics for 2012 and 2013 control groups, as well as tests of treatment-control balance for each covariate conditional on randomization block fixed effects. No more of the differences are significant than would be expected by chance, and tests of joint significance suggest that randomization successfully balanced the two groups (pooling both samples together and testing balance using our full set of covariates, F(69,6709)=0.85 with p=0.81). (See appendix C.4 for other descriptive statistics and balance tests.)

Table 1.
Descriptive Statistics and Baseline Balance
Program Year: 2012 2013 
 N Control Mean Treatment Coefficient SE N Control Mean Treatment Coefficient SE 
Age at program start 1,634 16.30 -0.05 (0.07) 5,216 18.42 0.03 (0.02) 
Black 1,634 0.96 0.00 (0.01) 5,216 0.91 0.01 (0.01) 
Any baseline arrest 1,634 0.20 0.01 (0.02) 5,216 0.47 0.02 (0.01) 
In CPS data 1,634 1.00 0.00 (0.00) 5,216 0.91 0.00 (0.01) 
Engaged in CPS in June (if ever in CPS) 1,634 0.99 0.00 (0.01) 4,781 0.51 0.00 (0.01) 
Days attended (if any attendance) 1,629 136.93 0.69 (1.40) 2,930 122.78 2.54 (1.82) 
Has SSN 1,634 0.81 0.02 (0.02) 5,216 0.71 0.01 (0.01) 
Worked in prior year (if has SSN) 1,334 0.07 −0.02 (0.01) 3,742 0.22 0.00 (0.01) 
Census tract: Median income 1,634 35,665 −347 (660) 5,216 33,759 −175 (360) 
Census tract: Unemployment rate 1,634 19.07 −0.03 (0.42) 5,216 12.81 0.14 (0.12) 

The balance test shows the treatment coefficient and robust standard error from a regression of each characteristic on a treatment indicator, block fixed effects, and duplicate indicators. Gender is not included in the table since it is collinear with randomization blocks. The 2012 sample was 38.5% male; the 2013 sample was all male. *p < 0.1, **p < 0.05, and ***p < 0.01.

Youth in both cohorts are over 90% African American and largely from poor, highly disadvantaged neighborhoods: median neighborhood income is $33,000 to $36,000 with local unemployment rates around 13% to 19%. Thirty-eight percent of the 2012 cohort and all of the 2013 cohort are male. Recall that in part to test for heterogeneous program effects on a broader population of youth, the eligibility rules across program years changed. As a result, the 2013 cohort is older (18.4 versus 16.3 years old), more criminally involved (47% versus 20% have an arrest record), and less engaged in school (51% versus 99% still engaged in school before the program, and accounting for the longer school year in 2013, missing three months versus six weeks of the prior school year, conditional on any attendance). The 2013 youth are also more likely to have been employed in the prior year (22% versus 7%).

IV. Analytical Methods

To make results easier to compare across two study cohorts with different take-up rates, we focus on local average treatment effects (LATEs), or the effect of participating on compliers (Angrist, Imbens, & Rubin, 1996). With almost no control crossover in our setting, these estimates should be quite close to the treatment-on-the-treated. Intent-to-treat (ITT) results are in appendix D. We estimate LATEs using random assignment as an instrument for any program participation, including block fixed effects and the controls discussed in appendix E (which also shows that results are similar without covariates). To help judge the magnitude of the LATEs, we estimate average outcomes for the control youth who would have participated had they been assigned to treatment—the “control complier mean” (CCM; see Heller et al., 2017; Katz, Kling, & Liebman, 2001).10 We report heteroskedasticity-robust standard errors, clustered on individuals when using the pooled sample to account for the 140 youth in both study cohorts. Appendix table A4 shows similar p-values from randomization inference (permuting treatment assignment 10,000 times to approximate Fisher's exact test). This tests the sharp null of no treatment effects for anyone and avoids relying on modeling assumptions and large-sample approximations that may not hold in finite samples (Athey & Imbens, 2017).
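
For readers who want the mechanics, here is a minimal numpy sketch of the 2SLS calculation behind the LATEs, with illustrative variable names; block fixed effects would enter through the controls matrix, and it omits the robust/clustered standard errors we report:

```python
import numpy as np

def late_2sls(y, participated, assigned, controls):
    """Minimal 2SLS for the LATE: instrument participation with random assignment.
    `controls` should hold the block dummies and baseline covariates; all names are
    illustrative. Standard errors require the usual 2SLS (robust/clustered) formulas."""
    exog = np.column_stack([np.ones(len(y)), controls])
    Z = np.column_stack([assigned, exog])  # instruments: assignment + exogenous regressors
    # First stage: fitted participation from a regression on the instruments
    part_hat = Z @ np.linalg.lstsq(Z, participated, rcond=None)[0]
    # Second stage: regress the outcome on fitted participation and the exogenous regressors
    beta = np.linalg.lstsq(np.column_stack([part_hat, exog]), y, rcond=None)[0]
    return beta[0]  # coefficient on participation = LATE
```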

In any experiment testing program effects on multiple outcomes, not to mention heterogeneous treatment effects by subgroup, one might worry that the probability of type I error increases with the number of tests conducted. We take a number of steps to ensure that our results are not just the result of data mining. First, we note that because DFSS built the program and recruiting strategy mainly to reduce youth violence, the impact on violent-crime arrests was the first prespecified outcome of interest.

Second, we present two inference adjustments to account for multiple hypothesis testing. The first uses a free step-down resampling method to control the family-wise error rate (FWER), the probability that at least one of the true null hypotheses in a family of hypothesis tests is rejected (Anderson, 2008; Westfall & Young, 1993). The second shows the q-value, or the smallest level at which we can control the false-discovery rate (FDR) in a group of hypotheses and still reject the null for that outcome (Benjamini & Hochberg, 1995). This adjustment increases the power of individual tests in exchange for allowing some specified proportion of rejections to be false. We define our families as: (a) the four types of crime separately for each follow-up year (excluding total arrests since it is a linear combination of the rest); (b) the four main schooling outcomes for youth with a CPS record who had not yet graduated prior to the program (enrollment, days present, GPA, and school persistence); (c) total earnings and overall, provider, and nonprovider employment in program quarters; and (d) total earnings and overall, provider, and nonprovider employment in post-program quarters. (Appendix G provides implementation details.)
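
As an illustration of the FDR adjustment, Benjamini-Hochberg q-values can be computed directly from a family's p-values (the Westfall-Young FWER step-down additionally requires resampling and is not shown). This sketch feeds in approximate p-values implied by the year-1 crime estimates in table 3:

```python
from statsmodels.stats.multitest import multipletests

def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values: the smallest FDR level q at which each
    hypothesis in the family would be rejected."""
    _, qvals, _, _ = multipletests(pvalues, method="fdr_bh")
    return qvals

# Approximate two-sided p-values implied by the year-1 crime LATEs and SEs in table 3;
# the resulting q-values are close to the FDR column reported there.
print(bh_qvalues([0.005, 0.351, 0.418, 0.274]))  # roughly [0.02, 0.42, 0.42, 0.42]
```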

Third, we eschew the standard approach to treatment heterogeneity: choosing several one-way interactions a priori to test for heterogeneity (or, worse, searching over a large number of interaction effects for particularly responsive subgroups, which risks overfitting and detecting spurious effects). Instead, we implement a version of Wager and Athey's (2018) causal forest algorithm, which predicts treatment effects based on an individual's covariates. For this prediction, we focus on estimating conditional ITT effects ($E(Y_1 - Y_0 \mid X = x)$), which capture differences in both youths' responses to the program and their propensity to participate if offered the program.11 This method allows flexible, high-dimensional combinations of covariates to identify who gains from the program in a way that researcher-specified interaction effects would typically miss.

For example, suppose $\tau_{i,y} = Y_{1i} - Y_{0i}$ is the true treatment effect on outcome $y$ for individual $i$. Typical approaches might estimate and compare $E(\tau_{i,y} \mid \text{male}=1)$ to $E(\tau_{i,y} \mid \text{male}=0)$, or perhaps $E(\tau_{i,y} \mid \text{male}=1, \text{African-American}=1)$, based on what the researcher specifies. If the true treatment heterogeneity is more complex than differences by gender or race (e.g., only African American males with more than three prior arrests who live in neighborhoods with less than 12% unemployment rates benefit from the program), then researcher-specified interactions will miss it. But in theory, the causal forest can capture this pattern by searching over all values of all the covariates to isolate the combination of covariate values that predict the most heterogeneity in effects. The goal becomes predicting heterogeneity in $E(\tau_{i,y} \mid X = x)$ using all the available information on Xs rather than testing whether particular Xs are associated with significantly different treatment effects.12

Our methodology for estimating causal forests, based on Athey and Imbens (2016) and Wager and Athey (2018), is described in Davis and Heller (2017). We give an intuitive explanation of the method here, attempting to avoid machine learning jargon to make the discussion accessible. (Technical details are in appendix H.) The basic goal is to divide the sample into bins that share similar covariates and use the within-bin treatment effect as the predicted treatment effect for anyone with that bin's Xs. However, using the same observations to bin the data and predict the treatment effects within bins could induce overfitting, so the procedure uses different subsamples for binning and for effect estimation. It repeats the procedure over many subsamples and averages the predictions to reduce variance.

To predict ITT effects conditional on covariates for a particular outcome, we repeat the following procedure. First, draw a 20% subsample without replacement from the data. Using a random half of the subsample, use a regression tree–based algorithm to bin the observations by values of X.13 The algorithm recursively searches over possible ways to break the data into bins based on values of covariates, choosing the divisions that maximize the variance of treatment effects across bins subject to a penalty for within-bin variance (see appendix H for algorithm details).14 Once the bins are formed, switch to the other half of the subsample and sort the new observations into the same bins. Calculate the treatment effect ($\hat{\tau}_b = \bar{y}_{T,b} - \bar{y}_{C,b}$, or the difference in mean outcomes between treatment and control observations) using the new observations within each bin $b$.

Next, switch to the other 80% of the sample (observations that are not part of the subsample), figure out in which bin each observation would belong based on its Xs, and assign that bin's $\hat{\tau}_b$ as the predicted treatment effect.15 Predictions averaged across many trees have better predictive accuracy than estimates from a single tree given the high variance of a single tree's predictions (James et al., 2013). We repeat this process with 100,000 subsamples (the causal parallel of a random forest rather than a single regression tree), averaging an observation's prediction across iterations to obtain a single predicted treatment effect. We find that increasing the number of trees from 25,000 to 100,000 dramatically increases the stability of our estimates across different random seeds.
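
To make the procedure concrete, the sketch below implements a simplified honest forest under stated assumptions: it substitutes a standard regression tree fit to a transformed outcome for the causal-tree splitting criterion, assumes a constant assignment probability, and uses illustrative parameter values, so it is not the exact algorithm described in appendix H:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_forest_predictions(X, y, t, n_trees=2000, subsample=0.2, seed=0):
    """Average out-of-bag predicted treatment effects over many honest trees.
    Splitting uses a transformed outcome whose conditional mean given X is E[Y1 - Y0 | X];
    the within-bin effects are then computed on a separate "estimation" half (honesty)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    sums, counts = np.zeros(n), np.zeros(n)
    p = t.mean()  # assumed constant assignment probability (the real design blocks randomization)
    for _ in range(n_trees):
        sub = rng.permutation(rng.choice(n, size=int(subsample * n), replace=False))
        build, est = sub[: len(sub) // 2], sub[len(sub) // 2:]
        oob = np.setdiff1d(np.arange(n), sub)
        # Transformed outcome used only to choose the bins on the "build" half
        y_star = y[build] * (t[build] - p) / (p * (1 - p))
        tree = DecisionTreeRegressor(max_leaf_nodes=8, min_samples_leaf=25, random_state=0)
        tree.fit(X[build], y_star)
        # Honest step: within-leaf treatment effects computed on the estimation half
        leaves_est, leaves_oob = tree.apply(X[est]), tree.apply(X[oob])
        for leaf in np.unique(leaves_oob):
            members = est[leaves_est == leaf]
            treated, control = members[t[members] == 1], members[t[members] == 0]
            if len(treated) and len(control):
                tau_hat = y[treated].mean() - y[control].mean()
                hit = oob[leaves_oob == leaf]
                sums[hit] += tau_hat
                counts[hit] += 1
    return sums / np.maximum(counts, 1)  # out-of-bag prediction for each observation
```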

V. Participation

In the first program year, 75% of youth offered the program actually participated, and participants averaged 35 days of work out of a possible 40. In the second program year, when the maximum possible take-up rate was 38% by construction (see section II), actual program take-up was 30%. Participants worked an average of 18 days out of a possible 30, reflecting in part the greater challenge of recruiting and retaining a more disconnected and criminally active population in the second year. There was no control crossover in the first cohort; ten control youth in the second cohort (0.4%) participated in the program. Twenty percent of the 2013 treatment group participated in any postsummer programming. On average, these participants attended about 18.5 days of additional programming over about a nine-month period. Across both cohorts, the F-statistic on the first-stage regression measuring any OSC+ participation is 2,205. (See appendix section I.1 for additional participation and first-stage details.)

To give a better sense of the counterfactual over the summer16 for those with employment data available, appendix table A3 shows rates at which youth worked in a formal sector job, worked in OSC+, or did not work at all. Few youth worked in other UI-covered jobs: in 2012, about 8% of the treatment group and 15% of the control group worked outside of OSC+, with 17% of the treatment group and 23% of the control group working nonprogram jobs in 2013. The treatment-control differences in nonprogram employment suggest that OSC+ generates a small amount of crowd-out, though the program still dramatically increases the overall proportion of youth who work over the summer. The treatment-control difference in having no job is 68 percentage points in 2012 (from a control mean of 84%) and 24 percentage points in 2013 (from a control mean of 75%).

VI. Main Results

Panels A and B of table 2 show LATE estimates of the program's impact on crime separately by cohort (coefficients and standard errors are multiplied by 100, representing the effect of any program participation per 100 compliers). (Appendix D shows the ITT crime results.) The patterns of behavioral change are remarkably similar across studies: both cohorts show large and statistically significant declines in violent-crime arrests during the first postlottery year, followed in later years by increases in property-crime arrests that are statistically significant in the 2012 cohort. Given that the main goal of the program was violence reduction, the magnitude of the results is quite promising: the first study shows that the program causes 4.1 fewer violent-crime arrests per 100 participants in the first postprogram year, a 42% decline relative to the control complier mean (CCM). That finding is replicated in the second study, where the absolute magnitude of the change is somewhat larger (7.9 fewer violent-crime arrests) but proportionally slightly smaller (a 33% decline). This pattern is consistent with the fact that the second cohort was more criminally active (more crime to prevent) but worked fewer hours (slightly smaller proportional change). The substantively similar and statistically indistinguishable results across the two studies are important on their own, suggesting that the first study's results were not just a statistical fluke.

Table 2.
Local Average Treatment Effect on Number of Arrests (× 100) by Cohort and Year
 Year 1 Year 2 Year 3 
 LATE SE CCM LATE SE CCM LATE SE CCM 
Arrest Type A. 2012 Program 
Violent −4.13** (1.97) 9.83 −0.20 (1.70) 5.17 0.15 (1.63) 5.37 
Property 1.64 (1.39) 3.14 3.80** (1.73) 1.34 2.75** (1.33) 0.74 
Drugs 0.52 (2.16) 3.89 −2.38 (1.86) 8.08 −2.87 (2.08) 7.65 
Other 1.36 (2.71) 10.77 1.12 (2.63) 9.36 1.69 (2.67) 10.45 
 B. 2013 Program 
Violent −7.90** (3.77) 24.20 1.31 (3.25) 12.71    
Property 1.76 (3.09) 11.62 2.16 (3.11) 6.37    
Drugs 4.4 (5.02) 19.92 −7.92 (5.02) 26.65    
Other −11.6 (8.21) 56.18 2.82 (7.75) 36.41    
 C. Pooled 
 Year 1 Year 2 Year 1 and Year 2 
Violent −6.34*** (2.24) 18.31 0.76 (1.94) 9.55 −5.58* (3.16) 27.85 
Property 1.68 (1.80) 8.18 2.97 (1.89) 4.19 4.65 (2.86) 12.36 
Drugs 2.35 (2.90) 13.84 −5.26* (2.86) 18.66 −2.91 (4.42) 32.50 
Other −5.05 (4.62) 36.36 2.44 (4.38) 25.03 −2.61 (7.21) 61.39 

Coefficients, standard errors, and control complier means (CCMs) multiplied by 100. All regressions estimated using 2SLS including block fixed effects, duplicate application indicators, and the baseline covariates listed in the appendix. Huber-White standard errors in parentheses, clustered on individual in panel C. *p < 0.1, **p < 0.05, and ***p < 0.01.

Given the similarity in results across cohorts for crime and other outcomes (see appendix tables A5 and A6), we focus the remainder of our discussion on results pooling the two cohorts, using two years of follow-up data to be comparable across study years. Table 2, panel C shows the crime results with the pooled sample. The drop in violent-crime arrests during the first year is statistically significant and substantively large: 6.3 fewer violent-crime arrests per 100 participants, a 35% decline. The drop is not limited to the program summer, when youth are mechanically kept busier; excluding program months, violent-crime arrests decline by 26% (not shown, LATE = −3.5 per 100 youth, p=0.067). We also see positive but not statistically significant point estimates for property-crime arrests in all years and a marginally significant decline in drug arrests in year 2. As a result, there are no significant changes in the number of total arrests (see appendix table A7).

Panel A of table 3 shows p-values for the pooled crime results adjusted for multiple hypothesis testing. The first two columns show the LATE and standard error. The third column shows the q-value from Benjamini & Hochberg's (1995) FDR control procedure, or the smallest level of q at which the null hypothesis would be rejected (where q is the expected proportion of false rejections within the family). Column 4 shows p-values that control the FWER with a free step-down resampling method, followed by the CCM and sample size for each outcome. (See appendix G for details.) The reduction in year 1 arrests for violent crime remains significant after adjusting inference to control the FWER across the four crime categories (p=0.018) or to control the FDR (q=0.019).

Table 3.
IV Program Impacts and MHT Adjustments for Both Cohorts, Pooled
 LATE SE FDR FWER CCM N 
A. Arrests (×100) 
Violent, year 1 −6.34*** (2.24) 0.019 0.018 18.31 6,850 
Property, year 1 1.68 (1.80) 0.417 0.583 8.18 6,850 
Drugs, year 1 2.35 (2.90) 0.417 0.420 13.84 6,850 
Other, year 1 −5.05 (4.62) 0.417 0.623 36.36 6,850 
Violent, year 2 0.76 (1.94) 0.695 0.696 9.55 6,850 
Property, year 2 2.97 (1.89) 0.232 0.307 4.19 6,850 
Drugs, year 2 −5.26* (2.86) 0.232 0.239 18.66 6,850 
Other, year 2 2.44 (4.38) 0.695 0.826 25.03 6,850 
B. Schooling 
Any enrollment, year 1 0.01 (0.02) 0.82 0.982 0.74 4,993 
Days present, year 1 −0.59 (2.60) 0.82 0.827 91.53 4,993 
GPA, year 1 0.01 (0.05) 0.82 0.969 1.95 2,447 
Persistence through year 3 −0.01 (0.02) 0.82 0.960 0.62 4,993 
C. Employment and earnings 
Any employment, program quarter 0.88*** (0.02) 0.001 <0.001 0.12 5,076 
Provider employment, program quarter 1.04*** (0.01) 0.001 <0.001 0.00 5,076 
Nonprovider employment, program quarter −0.05** (0.02) 0.030 0.033 0.16 5,076 
Earnings, program quarter 739.49*** (66.6) 0.001 <0.001 70.71 5,076 
Any employment, postprogram 0.03 (0.03) 0.553 0.526 0.47 5,076 
Provider employment, postprogram quarter 0.11*** (0.02) 0.566 <0.001 0.04 5,076 
Nonprovider employment, postprogram quarter −0.02 (0.03) 0.001 0.620 0.44 5,076 
Quarterly earnings, postprogram 56.39 (70.55) 0.620 0.665 336.88 5,076 

The table shows estimated LATEs pooling both years of the program, with inference controlling for the false discovery rate (FDR) and the family-wise error rate (FWER). See text for details about the multiple hypothesis testing adjustments. Negative control complier means (CCMs) are set to 0. Asterisks are based on conventional inference: *p < 0.1, **p < 0.05, and ***p < 0.01.

The decline in violent-crime arrests does not continue in the second year. Fade-out is almost universal in social interventions, though it is also worth noting that the program occurred at a high-violence moment in the youths' trajectories: the violence control complier means in year 2 are about half the size of those in year 1. This pattern suggests that part of the fade-out may stem from well-timed program delivery, after which youth start aging out of violent crime. It is also possible that more control compliers were incarcerated for their earlier violent crimes during the first year of the program, which could mechanically lead to lower crime rates among the control group during year 2. However, the CCMs for drug-crime arrests are higher in year 2 than in year 1, which is not entirely consistent with the idea that the control youth simply have less free time to offend. Even if so, providing a SYEP is likely a less socially costly way to reduce crime than incarceration, and cumulatively across years, the violence drop remains significant (p=0.078).

We see a marginally significant decline in drug crimes during year 2 and an imprecise but substantively large increase in property crime (which is statistically significant if we add the third year of outcome data for the 2012 cohort: 5.8 more property-crime arrests per 100 participants, a 46% increase, p=0.053). Program effects that go in opposite directions for violent and property crime are fairly common in the literature (Kling, Ludwig, & Katz, 2005; Deming, 2011; Jacob & Lefgren, 2003); in fact, a short-term violence decline followed by a longer-term property crime increase is notably similar to the pattern of results in the Moving-to-Opportunity study. An increase in property crime might be expected if youth are spending more time traveling or working, since they have more access to better things to steal (Clarke, 1995). However, the changes in nonviolent crime are less robust to multiple hypothesis testing adjustments, so we interpret them more cautiously. The fact that violence is so much more socially costly than other types of crime highlights the importance of analyzing crime types separately rather than aggregating the differences away.

One possible explanation for the violence decline could be that participants learn about the returns to schooling, or develop motivation, self-efficacy, or other prosocial beliefs, and so spend more time engaged in school in the year after the program. The schooling results in panel B, however, suggest this is not the case: we find no significant changes in CPS enrollment, days present, or GPA during the school year after the program, and the confidence interval in the pooled sample rules out more than a four- to five-day increase in attendance.17 The conclusions are unchanged after adjusting inference for multiple hypothesis testing. These results focus on the year after the program, since missing data become a larger problem as youth age (more graduation and dropout in later years). To capture longer-term school engagement, the last row of panel B shows the program's impact on whether a youth persists in school (remains enrolled or graduates) through the start of the third year after random assignment. The point estimate is small, negative, and statistically insignificant. Overall, there is little evidence of changes in schooling outcomes.

Panel C of table 3 shows estimated program effects on the probability of being employed and on quarterly earnings for the sample of youth we can match to UI records.18 As expected, there is a large increase in formal employment during the program quarters driven by greater employment at program providers.19 There is also a small amount of crowd-out (employment outside the program falls by about 5 percentage points), although participants' overall employment rates are about eight times higher than among their control counterparts, leading to total summer quarter earnings that are about $740 greater.

To exclude the mechanical program effect over the summer, we show employment during the six quarters after the program. The impact on both any postprogram employment and quarterly earnings is small and statistically insignificant. Youth appear to have formed relationships with program providers that continued through the second follow-up year (LATE = 0.11, p<0.01), but not enough to change overall employment. The increase in program-quarter employment and the increase in postprogram employment at program providers remain statistically significant after adjustments for MHT. The confidence intervals on overall employment include what might be substantively interesting effects (e.g., we cannot rule out up to an 8.8 percentage point, or 19%, increase in postprogram employment). But both the crowd-out and the lack of an employment increase are quite similar to the findings from New York City (Gelber, Isen, & Kessler, 2016), providing some additional support for the lack of improvement (and some signs of decline) in postsummer employment outcomes.

VII. Treatment Heterogeneity with the Causal Forest

We are interested in estimating treatment heterogeneity for three reasons. First, knowing who benefits most can help direct limited SYEP funding to those with the largest potential gains.20 Second, it helps predict external validity. Knowing which characteristics are correlated with which responses to treatment could not only suggest other settings where similar programs should (or should not) be tested, but also assess whether observable differences in youth populations across summer jobs and classical youth training programs help explain the differing program effects. Third, analyzing treatment heterogeneity across outcomes may help sort out the mechanisms driving the results. For example, it is possible that one subgroup benefits on employment, which drives a decrease in crime, while a different subgroup experiences less employment from crowd-out. But if crime benefits are not concentrated among the subgroup that benefits most from employment, then employment would seem unlikely to explain the overall violence decline.

As explained in section IV and appendix H.3, we use a causal forest to predict an individual's expected treatment effect for each outcome based on covariates. We estimate heterogeneity in the cumulative effects of the program for the pooled sample, using the number of violent-crime arrests in the first two postrandomization years, an indicator for any formal sector employment within six postprogram quarters, and school persistence through two postprogram years (still attending school or having graduated by the third postprogram school year) as outcomes. This section answers three questions: (a) whether the causal forest identifies treatment heterogeneity for each of these outcomes; (b) if so, who benefits; and (c) whether benefits are concentrated among subgroups in a way that helps explain mechanisms.

We start by testing whether the causal forest detects any treatment heterogeneity. Although the method will essentially always predict some variation in program effects in a finite sample (see appendix H.4), we need to distinguish actual heterogeneity from noise. Figure 1 shows how the “out-of-bag” predictions relate to actual treatment effects. Along the x-axis, we bin observations into twenty groups by percentile of predicted treatment effect. We calculate the actual ITT within each bin, then plot the actual versus predicted effects for our three main cumulative outcomes. If the predictions were perfect, we would expect the points to line up on the 45-degree line. If they were just noise, we would see a flat relationship. The employment predictions appear to do quite well on average, with the fitted line close to the 45-degree line. For the other two outcomes, it does not appear that the observables consistently predict actual variation in treatment effects.
Figure 1.

Predicted versus Actual Effects

For each of twenty bins defined by percentile of predicted intent-to-treat effects, figures show predicted versus actual intent-to-treat effects on the specified outcome for individuals in that bin. The dashed line shows the linear relationship between the actual and predicted effects. The solid line is the 45-degree line, which is included for reference.
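
A minimal sketch of how the points in such a figure could be constructed, using the raw within-bin treatment-control difference as the "actual" effect (our actual ITTs also adjust for blocks and covariates):

```python
import pandas as pd

def calibration_points(pred, y, treat, n_bins=20):
    """Bin observations by ventile of the predicted effect and compute the raw
    treatment-control difference in each bin; returns one (predicted, actual) point per bin.
    Inputs are arrays of forest predictions, outcomes, and treatment indicators."""
    df = pd.DataFrame({"pred": pred, "y": y, "t": treat})
    df["bin"] = pd.qcut(df["pred"], n_bins, labels=False, duplicates="drop")
    return df.groupby("bin").apply(
        lambda g: pd.Series({
            "predicted": g["pred"].mean(),
            "actual": g.loc[g["t"] == 1, "y"].mean() - g.loc[g["t"] == 0, "y"].mean(),
        })
    )
```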

There is no single statistical test for whether the variation in predictions represents meaningful treatment heterogeneity in the data (i.e., whether the predictions “work”). Our approach is to show several variants of tests that answer slightly different questions, all of which lead to similar substantive conclusions. First, we do the equivalent of testing the slope of the scatterplots in figure 1 by running the regression $Y = \alpha + X\beta + \gamma T + \mu\hat{\tau} + \eta T\hat{\tau} + \varepsilon$, where $X$ is a vector of our normal covariates, $T$ is a treatment indicator, and $\hat{\tau}$ is the causal forest prediction of the treatment effect on $Y$. As appendix I.5.1 explains, this equation implies that $E(Y_1 - Y_0 \mid \hat{\tau}, X) = \gamma + \eta\hat{\tau}$. So the interaction coefficient η, like the slopes in figure 1, shows how $E(Y_1 - Y_0 \mid \hat{\tau}, X)$ changes when the predicted effect increases by 1. If the true η is 0, it would mean either that treatment effects are homogeneous or that any heterogeneity in effects is not related to $\hat{\tau}$. If the estimated η is 0, it could also be because the predictions have too much noise relative to signal in our sample. So rejecting that η = 0 means (a) there is treatment heterogeneity, (b) our Xs are associated with that heterogeneity, and (c) the predictions are capturing that heterogeneity (although if the coefficient is negative, the information in the predictions would be getting it backward; one might prefer a one-sided test). If $\hat{\eta}$ is statistically indistinguishable from 1, then with enough power, we would conclude that a one-unit change in the predictions is associated with a one-unit change in $E(Y_1 - Y_0 \mid \hat{\tau}, X)$ on average.21
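
In code, the slope test amounts to a single OLS interaction; the sketch below is an ITT-style version with placeholder column names rather than the exact specification in appendix I.5.1:

```python
import statsmodels.formula.api as smf

def slope_test(df, controls):
    """Regress the outcome on treatment, the forest prediction, their interaction, and
    controls; the coefficient on treat:pred is the slope (eta) discussed in the text.
    Column names ('y', 'treat', 'pred') and the controls list are placeholders."""
    rhs = " + ".join(["treat * pred"] + list(controls))  # expands to treat + pred + treat:pred
    fit = smf.ols(f"y ~ {rhs}", data=df).fit(cov_type="HC1")
    return fit.params["treat:pred"], fit.bse["treat:pred"]
```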

The idea of testing the slope η is intuitive, but it is also potentially unsatisfying for two reasons. First, predictions that do not provide a good linear fit across the whole distribution may still be of interest. For example, suppose only a small group benefits from the program, which is predictable based on observables, but observables do not predict heterogeneity among the remainder of the sample (i.e., the true interaction between treatment and prediction is highly nonlinear). Then the slope test may not reject 0, but the predictions could still uncover a group of benefiters worth identifying. To test the less parametric question of whether the causal forest identifies any group that benefits, we create an indicator for whether a youth has a predicted treatment effect in the largest quartile of predictions (the most positive quartile for employment and school persistence and the most negative quartile for arrests). We estimate separate LATEs for this group of “predicted big responders” and the rest of the sample and test the difference between the two groups.22
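
The quartile comparison can be sketched the same way, splitting on the top quartile of predictions (for arrests, the most negative quartile); again an illustrative ITT version rather than the instrumented estimates we report:

```python
import statsmodels.formula.api as smf

def quartile_difference_test(df, controls):
    """Effect for the top quartile of predicted responders versus the rest of the sample.
    For arrests, flip the sign of the prediction (or take the bottom quartile) first.
    Column names and the controls list are placeholders."""
    df = df.assign(top_q=(df["pred"] >= df["pred"].quantile(0.75)).astype(int))
    rhs = " + ".join(["treat * top_q"] + list(controls))
    fit = smf.ols(f"y ~ {rhs}", data=df).fit(cov_type="HC1")
    # 'treat' is the effect for the rest of the sample; 'treat:top_q' is the difference
    return fit.params["treat"], fit.params["treat:top_q"], fit.bse["treat:top_q"]
```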

The second reason the slope test may be unsatisfying is that there is not yet a consensus on calculating uniformly valid standard errors on causal forest predictions. We use a prediction that contains error, and that comes from a procedure using randomness, on the right-hand side,23 but bootstrap and subsampling procedures are computationally infeasible with 100,000 trees. We may also worry that using other observations in generating an individual's predicted treatment effect will induce correlation in errors.

We implement two other tests to ensure our conclusions are not sensitive to potentially problematic standard errors. We first report confidence intervals on the slope coefficient and quartile difference tests from the split-sample procedure proposed in Chernozhukov et al. (2018). This procedure generates an empirical distribution of each test statistic across many hold-out samples rather than relying entirely on asymptotic standard error results, while also incorporating the uncertainty generated by sample splitting (see appendix I.5.3 for details). For computational feasibility, we use 1,000 instead of 100,000 trees in each iteration, meaning the confidence intervals are likely conservative (they could get smaller with more iterations).

We then use a novel permutation test to ask a slightly different question: whether there is more information in the predictions than we would expect by chance if the Xs were actually independent of treatment effects. To implement this test, we randomly permute an observation's set of baseline covariates across observations 1,000 times (enforcing the null that covariates are independent of treatment effects by assigning each observation someone else's covariates), estimate a new causal forest within each permutation using 1,000 trees (for computational feasibility), and save the R2 from a regression of each outcome on treatment interacted with the predicted effect for that outcome in that permutation, treatment, the prediction, our usual set of controls, and block fixed effects (the same regression underlying the slope test). Where the true R2 falls in the distribution of permuted estimates provides a p-value for the hypothesis that the predictions are no more informative about how Y varies within the treatment group than they would be by chance if Xs were unrelated to τs.
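
A compact sketch of the permutation logic, with fit_forest_predictions standing in for the causal forest step (for example, the honest-forest sketch in section IV) and a simplified version of the regression:

```python
import numpy as np
import statsmodels.formula.api as smf

def r2_permutation_pvalue(df, x_cols, fit_forest_predictions, n_perm=1000, seed=0):
    """Permute covariates across observations, refit the forest each time, and compare the
    true R^2 from the outcome ~ treat * pred regression to the permuted distribution.
    The paper's version also includes controls and block fixed effects in the regression."""
    rng = np.random.default_rng(seed)

    def r2_for(data):
        pred = fit_forest_predictions(
            data[x_cols].to_numpy(), data["y"].to_numpy(), data["treat"].to_numpy()
        )
        return smf.ols("y ~ treat * pred", data=data.assign(pred=pred)).fit().rsquared

    true_r2 = r2_for(df)
    permuted = []
    for _ in range(n_perm):
        shuffled = df.copy()
        shuffled[x_cols] = df[x_cols].to_numpy()[rng.permutation(len(df))]  # break the X-effect link
        permuted.append(r2_for(shuffled))
    return float(np.mean(np.array(permuted) >= true_r2))  # one-sided p-value
```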

Table 4 shows the results of these tests for the same three cumulative outcomes as in figure 1.24 Column 1 uses an indicator for any employment over the six postprogram quarters as the dependent variable, where all the tests suggest that the causal forest picks up important heterogeneity in treatment effects. In panel A, the η coefficient on the treatment-by-prediction interaction is 1.43 with a 95% confidence interval ranging from 0.65 to 2.21. The conservative split-sample approach yields a median point estimate of 3.75 with a 95% confidence interval ranging from 1.21 to 6.29. Both rule out 0. The permutation test also rejects the null that the predictions are no more informative than they would be if the covariates were independent of treatment heterogeneity (p=0.03). Additionally, panel B shows that the group predicted to have the largest employment response has a significant 15 percentage point increase in employment, which is significantly different from the effect in the rest of the sample (p-value of difference = 0.02, split-sample 95% confidence interval [0.09, 0.46]). In other words, the predictions successfully locate youth with large, positive treatment effects. We can conclude that there is treatment heterogeneity in employment impacts related to observables and that the causal forest successfully predicts that heterogeneity. As shown in appendix I.6 and discussed in section IX, a typical researcher-specified interaction approach that involved some data mining with adjustments for multiple testing would have missed this employment heterogeneity entirely.

Table 4.
Assessing the Causal Forest Predictions
 Any Postprogram Formal Employment Number of Violent Crime Arrests School Persistence 
Panel A 
Treatment × Prediction (η) 1.43*** −0.69 0.19 
 (0.40) (0.78) (0.56) 
Split sample 95% CI [1.21, 6.29] [−3.21, 3.87] [−2.00, 3.78] 
p-value, R2 permutation test 0.033 0.260 0.771 
Panel B 
Largest quartile of predicted responders 0.15*** −5.08 0.03 
 (0.06) (8.62) (0.05) 
Rest of sample −0.01 −5.74* −0.02 
 (0.04) (3.07) (0.02) 
Difference 0.16** 0.66 0.05 
 (0.07) (9.13) (0.05) 
Split sample 95% CI for difference [0.09, 0.46] [−23.78, 12.74] [−0.04, 0.21] 
CCM largest quartile of predicted responders 0.34 49.44 0.61 
CCM rest of sample 0.52 19.68 0.62 
N 5,076 6,850 4,993 

Panel A shows estimates of η, the coefficient on the treatment-by-prediction interaction discussed in section VII. Panel B shows LATE estimates for the largest quartile of predicted responders and the rest of the sample. All regressions include block fixed effects and standard controls. See text for details. *p < 0.1, **p < 0.05, and ***p < 0.01.

For the other outcomes, however, we do not find significant treatment heterogeneity. The estimated interaction coefficient η for violent-crime arrests is −0.69. With traditional standard errors, we cannot reject that the slope is 0, but we can reject that it is 1 (95% confidence interval [−2.22, 0.84], although the split-sample approach is less precise). The permutation test also fails to reject the null (p=0.26). And the group predicted to have the largest decline in violent crime has a point estimate similar to that of the rest of the sample, albeit with large standard errors on the test of the difference. In other words, observables do not seem to predict treatment heterogeneity in violent-crime arrests: everyone in our sample benefits. For school persistence, by contrast, all of the tests suggest that no one benefits (minimum detectable effect between quartiles = 9.8 percentage points).

As with more standard interaction tests, the failure to predict treatment heterogeneity for two outcomes could be because treatment effects are actually homogeneous, because heterogeneity is not related to observables, because the form of our tests obscures true variability in treatment effects, or because we are underpowered. Heterogeneity analysis always requires a lot of data, and the standard errors on our tests are large enough to merit some caution. For violent-crime arrests, the minimum detectable difference using traditional standard errors between the top quartile and rest of the sample is 17.2. If the violence decline were entirely concentrated among the top quartile, a −17.2 coefficient would be a 35% decline in violence among that group (which is not completely unreasonable since that is about how much overall violence declines in the sample). And when we use our full sample, we have the power to rule out a slope of 1 on the interaction test; we do not have enough data to do so with the split-sample approach, so we consider this suggestive, but not definitive, evidence of a lack of heterogeneity in violence impacts based on our Xs. The school persistence results are noisier, with a standard slope confidence interval between −0.91 and 1.2, making it harder to rule out the possibility that our tests are underpowered. Nonetheless, we can conclude that the causal forest identifies a group that benefits from the treatment in terms of employment. Both individual interaction tests with adjustments for multiple testing and a split-sample approach that uses a fully interacted model to predict out-of-sample heterogeneity would miss the differences in employment impacts (see appendix I.6).

Having found variation in employment effects, we use it in two ways: to describe who benefits and to explore mechanisms. Table 5 shows preprogram descriptive statistics broken down by quartile of predicted employment treatment effects. The top row shows the mean ITT predicted effect in each quartile. The second row shows that the variation in ITT effects is not simply driven by differences in take-up rates. Although the participation rate increases a small amount across the quartiles of predicted employment effects, it is not enough to explain the differences in the ITT predictions (e.g., unadjusted for randomization block, the implied LATE for quartile 3 is 0.05 compared to 0.11 for quartile 4). The rest of the table suggests that the youth with the largest predicted employment benefits are more likely to be from the 2012 cohort (implying the positive effects are not driven solely by the low-intensity, year-round programming offered only to the 2013 cohort), somewhat younger, more Hispanic, more female, and less criminally involved than those who are not predicted to have a positive employment response (although almost a third of the biggest benefiters still have a preprogram arrest on record). The biggest responders also live in neighborhoods with somewhat lower unemployment rates, perhaps consistent with the possibility that labor demand plays a role in youths' ability to capitalize on their summer experience, although the patterns here are not necessarily causal.

Table 5.
Summary Statistics by Quartile of Predicted Employment Impact
                                              Quartile of Predicted Employment Impact
Variable                                        Q1        Q2        Q3        Q4
Prediction, any postprogram employment          −0.03     0.00      0.02      0.05
Take-up rate                                    0.39      0.41      0.43      0.46
In 2012 cohort                                  0.20      0.22      0.25      0.38
Age at program start                            18.66     18.48     18.00     16.89
Hispanic                                        0.01      0.02      0.06      0.16
Male                                            0.89      0.85      0.84      0.77
Any baseline arrest                             0.59      0.54      0.51      0.31
Graduated preprogram                            0.35      0.30      0.24      0.10
Engaged in CPS in June                          0.45      0.48      0.64      0.85
Days attended in prior school year (if any)     109.69    119.51    122.50    138.50
GPA                                             1.97      2.11      2.06      2.17
Worked in prior year                            0.19      0.22      0.20      0.09
Census tract unemployment rate                  17.32     14.72     12.89     12.26

The table shows mean baseline characteristics for each quartile of predicted employment treatment impacts. Predictions are from a causal forest.
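As an illustration of how a profile like table 5 can be assembled from the forest's output, the sketch below groups youth by quartile of predicted employment effect and reports baseline means. The data frame and column names (pred, participated, and the baseline covariates) are hypothetical; the implied LATE in the last row is simply the mean predicted ITT divided by the quartile's take-up rate, which is the unadjusted calculation behind figures like 0.05/0.46 ≈ 0.11 for the top quartile.

```python
# Hypothetical sketch: baseline means by quartile of predicted employment impact.
import pandas as pd

def quartile_profile(df, baseline_cols):
    """df is assumed to contain 'pred' (predicted ITT on postprogram employment)
    and 'participated' (program take-up indicator)."""
    df = df.assign(quartile=pd.qcut(df["pred"], 4, labels=["Q1", "Q2", "Q3", "Q4"]))
    summary = df.groupby("quartile")[["pred", "participated"] + baseline_cols].mean()
    # Implied LATE: predicted ITT scaled by the quartile's take-up rate.
    summary["implied_late"] = summary["pred"] / summary["participated"]
    return summary.T  # variables as rows, quartiles as columns, as in table 5

# Example usage with assumed covariate names:
# quartile_profile(df, ["age", "male", "hispanic", "any_baseline_arrest"])
```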

The employment benefiters are also more engaged in school. Eighty-five percent of youth in the top quartile were still in school the year before the program, attending an average of 139 days of school, with 10% having already graduated. In the bottom quartile, by contrast, only 45% of youth were still in school, attending an average of 110 days of school, with 35% already having graduated. So on average, those who benefited most in terms of employment were more likely to be in school, and attending more days, than those who did not show improved employment.

This descriptive exercise highlights two points. First, although the program seems to have little employment impact overall, a subset of participants become more engaged in the formal labor force. But they are not the youth whom other employment programs typically target. Most existing training programs focus on out-of-school, out-of-work youth; by contrast, the people whose employment outcomes improved in our sample, at least over the six postprogram quarters in our data, tend to be younger and more engaged in school. Second, targeting the biggest benefiters is not as simple as limiting program eligibility to the characteristics that are more common among big responders, such as still being a high school student or being Hispanic. High school students are more likely to benefit, but almost half of the youth in the lowest quartile of predicted employment responders are still in school. The top quartile is more likely to be Hispanic than the bottom quartile, but 84% of the top quartile is African American. So simply targeting one or two characteristics may result in slightly larger gains on average but would generally not maximize the gains from the program. The next section quantifies the difference in targeting methods.

Our last task is to use the causal forest to assess mechanisms, in particular whether employment impacts could be driving the observed changes in criminal behavior. Figure 2 plots employment causal forest predictions on the x-axis (again binned into twenty groups by percentile of predicted treatment effect on postprogram formal employment, as in figure 1) against actual ITT crime effects on our four main crime outcomes within each bin.25 Each crime outcome is the cumulative number of arrests in the two years after random assignment. If better employment were driving the crime declines, we would expect negative slopes on these figures; crime declines would be concentrated among employment benefiters.
Figure 2. Predicted Employment versus Actual Crime Effects

For each of twenty bins defined by percentile of predicted intent-to-treat effects on postprogram employment, the figures show predicted employment effects versus the actual intent-to-treat effect on the specified crime outcome for individuals in that bin. The dashed line shows the linear relationship between the actual and predicted effects.
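A minimal sketch of the binning exercise behind figure 2, under assumed column names (pred for the predicted employment ITT, treat for the treatment offer, crime for a cumulative arrest count): split the sample into twenty prediction bins, estimate the crime ITT within each bin, and fit a line through the bin-level points. The paper's bin-level estimates additionally adjust for randomization blocks; the raw difference in means here is only illustrative.

```python
# Illustrative sketch of the bin-level scatter and slope in figure 2.
import pandas as pd
import statsmodels.api as sm

def binned_crime_effects(df, n_bins=20):
    # Rank-based qcut avoids problems with ties in the predictions.
    df = df.assign(bin=pd.qcut(df["pred"].rank(method="first"), n_bins, labels=False))
    rows = []
    for b, g in df.groupby("bin"):
        itt = g.loc[g["treat"] == 1, "crime"].mean() - g.loc[g["treat"] == 0, "crime"].mean()
        rows.append({"bin": b, "pred_mean": g["pred"].mean(), "crime_itt": itt})
    binned = pd.DataFrame(rows)
    # Slope of actual crime ITTs on predicted employment effects (the dashed line).
    fit = sm.OLS(binned["crime_itt"], sm.add_constant(binned["pred_mean"])).fit()
    return binned, fit.params["pred_mean"]
```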

The left panel shows that the relationship between predicted employment effects and actual violent crime effects is an almost perfectly flat line. In contrast, the property, drug, and other crime slopes are all positive, meaning youth who benefit on employment are more likely to be arrested for these crimes. Appendix I.5 shows other tests consistent with the findings that youth with no changes in employment (quartiles 1 to 3 of the predictions) also experience a violence decline and that other arrests go up among those who are working more. This pattern is not consistent with the idea that crime benefits are a result of the increased opportunity cost of crime from better employment. The results are more consistent with the idea that better employment generates more opportunities for theft and more income for drugs, while changes in violent crime are driven by mechanisms unrelated to employment.

All these tests, however, are quite imprecise. The 95% confidence interval on the violent-crime arrest slope implies that a 1 standard deviation increase in the predicted employment effect is associated with somewhere between a −2.85 and 2.48 change in violent-crime arrests per 100 treated youth. While all the other slope estimates are positive, only the slope between drug crime and the employment predictions is marginally statistically different from 0 (p=0.087).26 Still, here we are using predictions as imperfect proxies for underlying true heterogeneity, so prediction error becomes a bigger issue with these tests. And the power issue is amplified when using a more conservative split-sample approach for inference (see appendix I.5.3). Because of the imprecision and inexact standard errors in these tests, we emphasize the basic pattern of results—no clear indication that crime effects are concentrated among employment benefiters—rather than any single result's statistical significance. More broadly, we view the causal forest as a useful way to generate new hypotheses for future testing in new settings rather than a tool for establishing definitive truths.

VIII. Potential Gains from Program Targeting

We conclude with a back-of-the-envelope exercise to help quantify the gains from using the causal forest. Suppose policymakers were still interested in making 3,364 treatment offers to fill 1,349 slots, but they wanted to target particular youth to maximize the impact on postprogram employment.27 To compare different targeting strategies, we need a way to estimate each person's treatment effect. We use the causal forest predictions to stand in for the true treatment effects; the predictions are clearly imperfect, but they provide a consistent way to compare average effects across allocation choices, not a measure of ground truth.

A more standard way to target without machine learning might be to use a set of linear baseline covariate-treatment interactions in one regression to predict program effects on employment using the full sample, then offer slots to the 3,364 youth with the highest predicted treatment effects. If we use seven covariates that ex ante seemed like they could matter (which include many of the biggest differences across quartiles in table 5; see appendix I.7), this version of targeting would yield an average ITT impact on employment of 0.009.28 This average effect is identical to the average predicted impact from randomly assigning spots; the linear interactions do no better than chance. If we tried to refine the approach by using some rough model selection, rerunning the interacted regression and keeping only the interactions that were initially statistically significant, the average employment effect among those offered slots would actually be negative, −0.004.

But if policymakers instead used the causal forest predictions to target treatment, the average ITT impact would be 0.036, nearly four times higher than either random assignment or the interactions. In other words, the causal-forest-predicted heterogeneity outperforms a naive interaction-based approach for program targeting by a factor of four. One might worry that improving the employment targeting could increase the social damage from crime, since employment benefiters show some imprecisely estimated increase in nonviolent crime. But if we use causal forest predictions of program effects on the social cost of crime to stand in for the true impacts, we see this is not the case: using the causal forest to target employment benefiters generates about twice the social savings from reduced crime as the more standard linear interaction approach (−$6,697 versus −$3,496). Ignoring confidence intervals and assuming the cost of serving the same number of youth would be about equal across targeting schemes, this implies that the benefit-cost ratio with causal forest targeting would be nearly twice as large as with targeting based on one-way interactions (see appendixes I.7 and J for details on this exercise and the benefit-cost calculations).
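A compact way to summarize the targeting exercise just described: rank youth by some predicted employment effect, offer slots to the top 3,364, and score the rule by the mean causal forest prediction (standing in for the truth) among those offered. The sketch below is an illustration with hypothetical column names (cf_pred for the causal forest prediction, lin_pred for the linear-interaction prediction), not the authors' implementation.

```python
# Hedged sketch of the targeting comparison; df is assumed to be a pandas
# DataFrame with one row per applicant.
def score_targeting(df, ranking_col, n_offers=3364, truth_col="cf_pred"):
    """Offer slots to the n_offers youth ranked highest on ranking_col and
    report the mean stand-in treatment effect among those offered."""
    offered = df.nlargest(n_offers, ranking_col)
    return offered[truth_col].mean()

# Example usage:
# score_targeting(df, "lin_pred")                     # linear-interaction targeting
# score_targeting(df, "cf_pred")                      # causal forest targeting
# df.sample(3364, random_state=0)["cf_pred"].mean()   # random-assignment benchmark
```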

IX. Conclusion

This paper shows that a supported summer jobs program in Chicago generates large one-year declines in violent-crime arrests, both in an initial study (a 42% decline) and in an expansion study with more disconnected youth (33%). The drop in violence continues after the program summer and remains substantively large after two to three years, though it stops accruing after the first year. And it occurs despite no detectable improvement in schooling or UI-covered employment and no decline in other types of arrests during the follow-up period. If anything, property crime increases in future years, though the large social cost of violence means that social benefits may still outweigh the program's administrative costs (see appendix J).

Using a new supervised machine learning method called the causal forest, we show that the zero average employment effect masks a group whose formal-sector employment improves by 15 percentage points (44%). This subgroup is younger and more engaged in school than the group with no employment gains—fairly different from the out-of-school, out-of-work young people usually targeted by youth employment programs. However, the employment benefiters do not seem to drive the crime decline. Predicted employment impacts are almost completely uncorrelated with the impact on violent-crime arrests. And if anything, the impact on nonviolent arrests is positively correlated with employment gains. This is not consistent with the idea that changes in opportunity cost explain the crime effects. But it is consistent with other crime theory: better employment provides more opportunity for theft and more money for drug markets.

We do not find any detectable heterogeneity in program impacts on violence. Although this may be a question of power, it suggests that everyone—at least everyone in our disadvantaged, urban sample—benefits on this outcome. This finding highlights another reason why SYEPs may have different effects from other youth training programs. To reduce violence, a program must serve youth at risk of violence. But programs like Job Corps and Year Up screen out youth with certain criminal backgrounds, and so may not have much room to make big strides on violence.

There tends to be a fair amount of pessimism in the youth employment literature about how difficult and costly it is to improve youth outcomes. The evidence we present here, combined with growing evidence from programs in other cities, suggests that this pessimism may stem in part from mistaken beliefs about what these programs achieve and for whom. Rethinking what youth training programs do and how to target them, as well as further exploring why SYEPs decrease violence, may help better direct limited government resources and improve our understanding of youth behavior.

Notes

1. See appendix A for a summary of these reviews and the youth job training literature, including recent more positive findings.

2. Appendix A documents how standard youth job programs serve almost exclusively out-of-school, out-of-work youth but often screen for criminal involvement, while SYEPs serve mostly high school students, who may be closer to the peak of the age-crime curve.

3. Crime results from within Chicago over the first sixteen months and one-year schooling outcomes for this RCT were reported in Heller (2014). This paper adds two more years of school data, two more years of crime data that now include all arrests statewide, and previously unreported employment outcomes, as well as the entire second study in 2013.

4. This is not to argue for serving exclusively this population without additional research, since changing peer composition could change program impacts.

5. In 2012, program providers were Sinai Community Institute, St. Sabina Employment Resource Center, and Phalanx Family Services. In 2013, they were the Black Star Project, Blue Sky Inn, Kleo Community Family Life Center, Phalanx Family Services, St. Sabina Employment Resource Center, Westside Health Authority, and Youth Outreach Services.

6. In serving a very mobile and arrest-prone population, it was clear that filling all the available slots would take considerable time. To speed up recruiting, we gave providers lists of hundreds more youth than available program slots upfront. We count everyone on the list as treatment, since we could not enforce the rule that providers work down the list in order.

7. The prior study on the first cohort (Heller, 2014) used Chicago Police Department data rather than statewide data. That study included only arrests within Chicago and covered a somewhat different time period, so the amount of crime reported here is slightly different.

8. CPS underwent a major reform of how it recorded disciplinary incidents during this time, so it is not clear how comparable recording is across or even within schools. Therefore, we do not use the disciplinary data as outcome measures.

9. Appendix table A8 shows that the results are similar if we impute data for students who never appear in the CPS data. Appendix I.3 shows other missing data approaches.

10. The 2012 study had two treatment arms that differed by the provision of a social-emotional learning curriculum. Because the differences between treatment arms are generally not statistically significant, we focus the main text on the overall treatment-control contrast; results by treatment arm are in appendix F.

11. It is not clear that the causal forest works directly with an IV involving noncompliance. Take-up rates within leaves may be 0 or close to 0 because of the small samples in each leaf. This will make the LATE either incalculable or huge in the leaves resulting from some potential splits. But the causal forest implements the splits that maximize the variance of treatment effects across leaves; if some treatment effects are enormous because of small-sample variation in take-up rates, the key Athey and Imbens result—that an objective function maximizing treatment effect variance is equivalent to minimizing the expected mean squared error of the unobservable prediction error—may not hold. We report how take-up rates vary across predicted ITTs to assess how much heterogeneity is from differences in participation. Alternative strategies might involve the generalized random forest (Athey, Tibshirani, & Wager, 2019), which does not estimate LATEs within leaves, or running separate causal forests to predict an individual's ITT and take-up rate, then constructing individual LATE estimates as the ratio of the two predictions.

12. The causal forest's flexibility in searching for benefiters while still avoiding overfitting is desirable because our key research question is whether any subgroup benefits. Other regression tree–based approaches, including Bayesian additive regression trees, share this flexibility and may have different stability and regularization properties. If the question of interest is instead whether (or which of) a small number of Xs predict heterogeneity, alternative approaches like Lasso could be more appropriate.

13. We use a subset of covariates that are available for nearly everyone in the sample, including demographics (age in years and indicator variables for being male, black, or Hispanic), neighborhood characteristics from the ACS (census tract unemployment rate, median income, proportion with at least a high school diploma, and proportion who rent their home), prior arrests (number of prerandomization arrests for violent crime, property crime, drug crime, and other crime), prior schooling (indicator variables for having graduated from CPS prior to the program, being enrolled in CPS in the school year prior to the program, not being enrolled in the year prior to the program despite having a prior CPS record, and not being in the CPS data at all), and prior employment (indicator variables for having worked in the year prior to the quarter of randomization, for having not worked in the year prior to the quarter of randomization despite having a valid SSN, and for not having a valid SSN).

14. The variance penalty comes from Athey and Imbens (2016). We also use inverse probability weights to deal with different treatment probabilities across randomization blocks.

15. This step is a slight deviation from Wager and Athey, who assign τ̂_b to the entire sample rather than the 80% excluded from the initial subsample. We find that this adjustment, using only “out-of-bag” estimates, reduces overfitting in our finite-sample setting, although it may require adjusted theoretical justification (Davis & Heller, 2017).

16. UI data are quarterly, and the 2012 program started in the last week of June. So we define the “summer” program period as quarters 2 and 3 of 2012 (April to September) in the first study year and quarter 3 only (July to September) in the second study year, when the program started at the beginning of July.

17. Section III explains how we treat missing data in this table, with more details in appendix C.2. Appendix I.3 shows that the results are generally robust to other treatments of missing data, including logical imputation that accounts for transfers out of the district; multiple imputation, which relaxes the MCAR assumption in this sample; and the inclusion of multiply imputed data for youth who were never in CPS records.

18. Youth are in this sample if they have a valid SSN. Appendix I.4 shows that results using various imputation techniques for missing data do not change the pattern of results.

19. Some coefficients are greater than 1 in part because we are using a linear probability model. Appendix table A10 shows estimated average marginal effects using a probit, which are substantively very similar.

20. In theory, any optimal targeting strategy should maximize net social welfare, not just behavioral benefits. Youth may generate heterogeneous program costs if some individuals require more resources to recruit and serve or have heterogeneous private valuations of the program. And policymakers may place value on equity or particular distributional consequences of a targeted program. Taking a stand on the social welfare function is beyond the scope of this paper, so we focus on estimating who benefits most, which is one crucial input to decisions about optimal allocation.

21. In a recent working paper, Chernozhukov et al. (2018) suggest a different functional form for this test. The results from their specification, shown in appendix I.5, yield identical conclusions.

22. Appendix I.5.2 details this regression and shows the results are not sensitive to different divisions of the predictions. Both Davis and Heller (2017) and, in a different setting, Athey and Wager (2019) show a related exercise using an above/below median split. The former uses a split-sample comparison with fewer trees, which produces results that are not entirely stable across different splits of the sample. Since the goal here is to learn from the predictions rather than assess the method, we use our full sample to increase stability, relying on the “adjusted honest” approach. We also increase the number of trees we use from 25,000 to 100,000. The predictions themselves are generally similar in both cases (correlations across the two different sets of predictions are over 0.99 for all three of our outcomes). But since we are using a quartile cutoff to test for treatment heterogeneity, Monte Carlo error can generate small changes in predictions around the cutoff, which in turn changes the composition of our subgroups. The additional trees reduce the number of observations switching quartiles across two different sets of predictions by 50% to 75%.

23. For the slope test, there is an argument that prediction error should not matter: we are testing a question about the predictions, not the construct the predictions are trying to capture, and the predictions exactly define themselves. Prediction error is a larger issue below when we use the predictions as a measure of true underlying heterogeneity.

24. We exclude preprogram graduates from the persistence column since the program could not change high school outcomes for this group. We report causal forest results for other outcomes in appendix I.5.4; none successfully predicts heterogeneity.

25. Appendix I.5 explores the relationship between the employment predictions and other outcomes.

26. If instead we pool all nonviolent crime into one outcome, p = 0.108.

27. The most socially beneficial targeting might be to target those with the biggest impacts on violent crime, since violence is so socially costly. But since we find no heterogeneity in violence impacts, we focus this exercise on employment.

28. Since average take-up does not vary much across employment predictions, we assume for simplicity that take-up probability is uncorrelated with predicted treatment effects (i.e., participants have the same average treatment effect as those offered a slot).

REFERENCES

Anderson, Michael L., “Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects,” Journal of the American Statistical Association 103 (2008), 1481–1495.

Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin, “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association 91 (1996), 444–455.

Athey, Susan, and Guido Imbens, “Recursive Partitioning for Heterogeneous Causal Effects,” Proceedings of the National Academy of Sciences 113 (2016), 7353–7360.

Athey, Susan, and Guido Imbens, “The Econometrics of Randomized Experiments” (vol. 1, pp. 73–140), in Abhijit Vinayak Banerjee and Esther Duflo, eds., Handbook of Field Experiments (Amsterdam: North-Holland, 2017).

Athey, Susan, Julie Tibshirani, and Stefan Wager, “Generalized Random Forests,” Annals of Statistics 47 (2019), 1148–1178.

Athey, Susan, and Stefan Wager, “Estimating Treatment Effects with Causal Forests: An Application” (2019), arXiv:1902.07409.

Benjamini, Yoav, and Yosef Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing,” Journal of the Royal Statistical Society, Series B (Methodological) 57 (1995), 289–300.

Chernozhukov, Victor, M. Demirer, E. Duflo, and I. Fernandez-Val, “Generic Machine Learning Algorithms for Heterogeneous Treatment Effects in Randomized Experiments” (2018), arXiv:1712.04802.

Clarke, Ronald V., “Situational Crime Prevention” (pp. 91–150), in M. Tonry and D. Farrington, eds., Building a Safer Society: Strategic Approaches to Crime Prevention, vol. 19 of M. Tonry, ed., Crime and Justice: A Review of Research (Chicago: University of Chicago Press, 1995).

Cohen, Lawrence E., and Marcus Felson, “Social Change and Crime Rate Trends: A Routine Activity Approach,” American Sociological Review 44 (1979), 588–608.

Cook, Phillip J., “The Demand and Supply of Criminal Opportunities,” Crime and Justice 7 (1986), 1–27.

Crépon, Bruno, and Gerard J. van den Berg, “Active Labor Market Policies,” Annual Review of Economics 8 (2016), 521–546.

Davis, Jonathan M. V., and Sara B. Heller, “Using Causal Forests to Predict Treatment Heterogeneity: An Application to Summer Jobs,” American Economic Review: Papers and Proceedings 107 (2017), 546–550.

Deming, David J., “Better Schools, Less Crime,” Quarterly Journal of Economics 126 (2011), 2063–2115.

Gelber, Alexander, Adam Isen, and Judd B. Kessler, “The Effects of Youth Employment: Evidence from New York City Lotteries,” Quarterly Journal of Economics 133 (2016), 423–460.

Heller, Sara B., “Summer Jobs Reduce Violence among Disadvantaged Youth,” Science 346:6214 (2014), 1219–1223.

Heller, Sara B., Anuj K. Shah, Jonathan Guryan, Jens Ludwig, Sendhil Mullainathan, and Harold A. Pollack, “Thinking, Fast and Slow? Some Field Experiments to Reduce Crime and Dropout in Chicago,” Quarterly Journal of Economics 132:1 (2017), 1–54.

Jacob, Brian, and Lars Lefgren, “Are Idle Hands the Devil's Workshop? Incapacitation, Concentration and Juvenile Crime,” American Economic Review 93 (2003), 1560–1577.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning (New York: Springer, 2013).

Katz, Lawrence F., Jeffrey R. Kling, and Jeffrey B. Liebman, “Moving to Opportunity in Boston: Early Results of a Randomized Mobility Experiment,” Quarterly Journal of Economics 116 (2001), 607–654.

Kling, Jeffrey R., Jens Ludwig, and Lawrence F. Katz, “Neighborhood Effects on Crime for Female and Male Youth: Evidence from a Randomized Housing Voucher Experiment,” Quarterly Journal of Economics 120:1 (2005), 87–130.

LaLonde, Robert J., “Employment and Training Programs” (pp. 517–586), in Robert A. Moffitt, ed., Means-Tested Transfer Programs in the United States (Chicago: University of Chicago Press, 2003).

Leos-Urbel, Jacob, “What Is a Summer Job Worth? The Impact of Summer Youth Employment on Academic Outcomes,” Journal of Policy Analysis and Management 33 (2014), 891–911.

Modestino, Alicia Sasser, “How Do Summer Youth Employment Programs Improve Criminal Justice Outcomes, and for Whom?” Journal of Policy Analysis and Management 38 (2019), 600–628.

Schwartz, Amy Ellen, Jacob Leos-Urbel, and Matthew Wiswall, “Making Summer Matter: The Impact of Youth Employment on Academic Performance,” NBER working paper 21470 (2015).

Valentine, Erin Jacobs, Chloe Anderson, Farhana Hossain, and Rebecca Unterman, “An Introduction to the World of Work: A Study of the Implementation and Impacts of New York City's Summer Youth Employment Program,” MDRC report (2017).

Wager, Stefan, and Susan Athey, “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests,” Journal of the American Statistical Association 113 (2018), 1228–1242.

Westfall, Peter H., and S. Stanley Young, Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment (New York: Wiley-Interscience, 1993).


Author notes

This research was generously supported by contract B139634411 and a Scholars award from the US Department of Labor grant 2012MIJ-FX-002 from the Office of Juvenile Justice and Delinquency Prevention, Office of Justice Programs, US Department of Justice, and graduate research fellowship 2014-IJ-CX-0011 from the National Institute of Justice. The 2012 study was preregistered at clinicaltrials.gov. Both studies are registered in the American Economic Association Registry under trial numbers 1472 and 2222. For helpful comments, we thank Stephane Bonhomme, Eric Janofsky, Avi Feller, Jon Guryan, Kelly Hallberg, Jens Ludwig, Parag Pathak, Harold Pollack, Guillaume Pouliot, Sebastian Sotelo, Alexander Volfovsky, and numerous seminar participants. We are grateful to the staff of the University of Chicago Crime and Poverty Labs (especially Roseanna Ander) and the Department of Family and Support Services for supporting and facilitating the research, to Susan Athey for providing the beta causal forest code, and to Valerie Michelman and Stuart Hean for research assistance. We thank Chicago Public Schools, the Department of Family and Support Services, the Illinois Department of Employment Security, and the Illinois State Police via the Illinois Criminal Justice Information Authority for providing data. The analysis and opinions here do not represent the views of any of these agencies, and any further use of the data must be approved by each agency. Any errors are our own.

A supplemental appendix is available online at http://www.mitpressjournals.org/doi/suppl/10.1162/rest_a_00850.
