Eight-Month-Old Infants Meta-Learn by Downweighting Irrelevant Evidence

Abstract Infants learn to navigate the complexity of the physical and social world at an outstanding pace, but how they accomplish this learning is still largely unknown. Recent advances in human and artificial intelligence research propose that a key feature enabling quick and efficient learning is meta-learning, the ability to make use of prior experiences to learn how to learn better in the future. Here we show that 8-month-old infants successfully engage in meta-learning within very short timespans after being exposed to a new learning environment. We developed a Bayesian model that captures how infants attribute informativity to incoming events, and how this process is optimized by the meta-parameters of their hierarchical models over the task structure. We fitted the model to infants’ gaze behavior during a learning task. Our results reveal how infants actively use past experiences to generate new inductive biases that allow future learning to proceed faster.


We addressed the first problem by checking the extent to which infants watched fewer trials in later sequences (Poisson generalized linear model). We found that, although data for later sequences were fewer (β = -0.04, 94% HDI = [-0.06, -0.03]), they still comprised a considerable number of trials (see Supplementary Figure 2 for expected means), which should allow for good estimates for later sequences as well. To adjust for these changes across trials and sequences, we will consider trial and sequence number when estimating the effects of interest.
We addressed the second problem by checking whether infants with faster saccadic latencies or shorter looking times watched a higher number of trials (Poisson generalized linear models). We found no evidence for a relationship between sequence number and either individuals' average saccadic latency (β = -0.00006, 94% HDI = [0, 0]) or looking time (β = -0.0001, 94% HDI = [0, 0]).
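Both checks above are Poisson regressions with a log link. As a minimal sketch of how such a coefficient is estimated (on simulated stand-in data, not the infant dataset), a Poisson GLM can be fitted with a few Newton–Raphson steps:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data: number of trials watched per sequence for 40
# hypothetical infants, with a small negative effect of sequence number
# (comparable in size to the reported beta of about -0.04).
seq = np.repeat(np.arange(1, 16), 40)              # sequence number
true_beta = np.array([2.0, -0.04])                 # intercept, slope (log link)
X = np.column_stack([np.ones_like(seq, float), seq])
y = rng.poisson(np.exp(X @ true_beta))             # observed trial counts

# Poisson GLM with log link, fitted by Newton-Raphson (equivalent to IRLS).
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)                          # E[y] under current estimate
    grad = X.T @ (y - mu)                          # score vector
    fisher = X.T @ (X * mu[:, None])               # Fisher information
    beta = beta + np.linalg.solve(fisher, grad)

print(beta)  # slope estimate should land near -0.04
```

This recovers the generative slope; in the actual analyses the same model structure would relate trial counts to sequence number, average saccadic latency, or looking time.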
In summary, the current assessment of data randomness, together with the control of time-varying covariates and the model-comparison procedures described below, indicates that the model results cannot be ascribed to the non-random missingness of data alone.

Model priors.
We report the prior distributions of all the parameters in the model. Note that we always preferred vague priors (i.e., priors with high variance), allowing the model to be informed predominantly by the data.

The parameters α₀, α₁, γ₀, and γ₁ are assumed to follow a half-Normal distribution, which constrains them to take only values greater than or equal to zero:

α₀, α₁, γ₀, γ₁ ~ HalfNormal(2)

This constraint reflects the fact that KL-divergence is a non-negative score, and thus it must be scaled by non-negative terms. The half-Normal distribution was chosen to ease the interpretation of these parameters as regression coefficients.

Similarly, the hazard λ (which is proportional to a probability) cannot be negative, and thus it was given a Gamma prior, which can only take positive values:

λ_s ~ Gamma(κ, θ)

where a value of 0 would indicate that infants never look away from the screen, and the likelihood of looking away increases as λ increases. A different value of λ was estimated for each sequence s, and the general values across sequences were captured by the hyperpriors κ and θ:

κ ~ Gamma(1, 1)
θ ~ Gamma(1, 1)

We chose Gamma distributions following Ibrahim, Chen, and Sinha (2001). This prior requires partitioning the time range in question into intervals, and it thus fits well with the trial-by-trial structure of our design.

All β* coefficients for the regressions follow normal distributions, and they are thus free to detect both positive and negative relationships between the independent variable (i.e., information gain) and the data:

β* ~ Normal(μ*, σ*)

The values of β* were estimated hierarchically: the mean estimate for each participant was inferred from the hyperparameters μ*, which also followed a normal distribution:

μ* ~ Normal(0, σ_μ)

The standard deviations σ* and σ_μ were given half-Cauchy priors. We chose half-Cauchy distributions because they are recommended by Gelman (2006) for modelling standard deviations, given the advantage of being approximately uniform in the tail and weak near zero.

Model comparison.
Here, we assess additional models that favour alternative explanations of the results and/or control for potential issues with the multivariate data.

First, we checked whether the effect of meta-learning could be observed over and above the effect of changes in saccadic latency and looking time across sequences. We added a covariate to both regressions, thus obtaining the following:

SL_{i,s,t} ~ Normal(β₀ + β₁ IG_{i,s,t} + β₂ t + β₃ s, σ)
LT_{i,s,t} ~ Student-t(β₀ + β₁ IG_{i,s,t} + β₂ t + β₃ s, σ, ν = 15)

where the additional coefficient β₃ controls for changes across sequences. When running the full model, we still find an effect of down-weighting (mean = 0.052, SD = 0.019, 94% HDI = [0.019, 0.087]). Moreover, this full model has better model fit than a model that assumes no meta-learning, both in terms of WAIC (11,142 vs. 11,156) and LOO (11,141 vs. 11,156).
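The WAIC values above are on the deviance scale (lower is better). As a minimal sketch of how WAIC is computed from a matrix of pointwise log-likelihoods over posterior draws (using toy Normal data, not the infant dataset):

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale (lower indicates better expected fit).

    log_lik: array of shape (n_samples, n_observations) holding the
    log-likelihood of each observation under each posterior draw.
    """
    # log pointwise predictive density: log of the mean likelihood per observation
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
    # effective number of parameters: variance of the log-likelihood across draws
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# Toy check: Normal data with posterior draws for the mean parameter.
rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 200)
mu_draws = rng.normal(0.0, 0.1, 1000)            # stand-in posterior samples
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2
print(waic(log_lik))
```

For models with very small pointwise likelihoods, replacing the log-of-mean-of-exp with a numerically stable logsumexp is advisable; LOO is computed from the same log-likelihood matrix via Pareto-smoothed importance sampling.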
Second, some of our dependent variables were correlated. Specifically, looking time is negatively correlated with saccadic latency and look-away probability. Two reviewers observed that this might add erroneous precision to the model. To make sure that the effect of meta-learning was not observed for this reason, we ran a reduced model that included saccadic latency and look-away probability, but not looking time. We could still observe the down-weighting effect (mean = 0.037, SD = 0.015, 94% HDI = [0.009, 0.065]), confirming that the correlation between measures was not driving the effect.
Third, data missingness is non-random in our model, as data are missing especially at the end of the sequences and at the end of the task. To make sure that the effects we observe are not solely due to data missing non-randomly, we ran an additional model in which we only included the data from the first 10 sequences. Results remained the same, with an effect of meta-learning showing up in terms of down-weighting (mean = 0.064, SD = 0.019, 94% HDI = [0.031, 0.100]). This suggests that the data from the last sequences, which are missing in greater numbers and are potentially noisier, are not driving the effect we observe.