Neural correlates of cognitive ability and visuo-motor speed: Validation of IDoCT on UK Biobank Data

Abstract Automated online and App-based cognitive assessment tasks are becoming increasingly popular in large-scale cohorts and biobanks due to advantages in affordability, scalability, and repeatability. However, the summary scores that such tasks generate typically conflate the cognitive processes that are the intended focus of assessment with basic visuo-motor speeds, testing device latencies, and speed-accuracy tradeoffs. This lack of precision presents a fundamental limitation when studying brain-behaviour associations. Previously, we developed a novel modelling approach that leverages continuous performance recordings from large-cohort studies to achieve an iterative decomposition of cognitive tasks (IDoCT), which outputs data-driven estimates of cognitive abilities, and device and visuo-motor latencies, whilst recalibrating trial-difficulty scales. Here, we further validate the IDoCT approach with UK BioBank imaging data. First, we examine whether IDoCT can improve ability distributions and trial-difficulty scales from an adaptive picture-vocabulary task (PVT). Then, we confirm that the resultant visuo-motor and cognitive estimates associate more robustly with age and education than the original PVT scores. Finally, we conduct a multimodal brain-wide association study with free-text analysis to test whether the brain regions that predict the IDoCT estimates have the expected differential relationships with visuo-motor versus language and memory labels within the broader imaging literature. Our results support the view that the rich performance timecourses recorded during computerised cognitive assessments can be leveraged with modelling frameworks like IDoCT to provide estimates of human cognitive abilities that have superior distributions, re-test reliabilities, and brain-wide associations.


INTRODUCTION
Automated and app-based assessment technologies provide a scalable, cost-effective, and reliable way to measure different aspects of cognitive abilities (1) and to monitor cognitive changes in clinical populations (2)(3)(4). Consequently, this technology is becoming popular in large-scale citizen science projects (5,6), cohorts (7) and registers (8). Building on the resultant big data, a major research drive has been to map associations between the summary scores that these computerised cognitive tasks output and features of brain structure and function from large-scale imaging cohort studies (9,10). However, it is common practice to summarise a participant's performance by estimating or contrasting average accuracy and reaction times (RT) across task conditions (11). The resultant scores relate not only to individual differences in abilities to process the specific cognitive demands that are the intended target of the task (12,13), but also to confounding factors such as visuo-motor processing speeds and the latency of the devices that people are assessed with. This lack of cognitive precision in summary score estimates is a non-trivial limitation for both the strength and the specificity of the associations that can be achieved.
A commonly overlooked advantage of computerised cognitive tasks is that, unlike classic pen-and-paper assessment scales, they record every stimulus and response in a detailed performance timecourse. These performance timecourses can be modelled in more sophisticated ways to obtain ability estimates that have greater reliability and process specificity than simple contrasts or averaging. Previously, we reported development of one such modelling framework, IDoCT (Iterative Decomposition of Cognitive Tasks; see Box 1), which we designed to disentangle individuals' cognitive abilities from other confounding factors in a manner that is robust, computationally inexpensive and sufficiently flexible to be adapted for practically any task that manipulates cognitive difficulty across trials (14).
More specifically, IDoCT leverages trial-by-trial performance data from large cohorts to (a) estimate individuals' abilities to specifically cope with higher cognitive difficulty across trials (AS) and (b) estimate their visuo-motor delay times (DT), while accounting for individual speed-accuracy tradeoffs. Notably, the approach concurrently recalculates the relative difficulty assigned to trials across task conditions using a fixed-point iterative process that handles the circularity of simultaneously defining individual performance from trial difficulty and trial difficulty from individual performance (14). This is achieved in a manner that accounts for any bias towards sampling more difficult conditions in more able individuals, as is the case in some adaptive designs, resulting in a robust data-driven recalibration of trial-difficulty scales.
We initially validated IDoCT by applying it to data from >400,000 participants who undertook 12 cognitive tasks during the Great British Intelligence Test (6). The results showed a successful decomposition of cognitive vs. device and visuomotor latencies, as gauged by superior sociodemographic associations and AS estimates that were not dwarfed by an inflated global intelligence factor (14). This combination of sensitivity and decorrelation holds promise for achieving stronger and more process-specific functional-anatomical mappings in behavioural-imaging association studies.
Here, we further validate IDoCT in the context of an adaptive Picture Vocabulary Task (PVT) (15,16) that was designed to measure 'crystallised' comprehension and reading decoding abilities (8,17), and that was deployed with 34,927 participants as part of the UK Biobank (UKB) imaging extension (18). Extracting summary measures for this PVT presents a notable challenge because it applied an adaptive staircase sampling algorithm to efficiently measure each participant's ability level, but the word-picture difficulty scale that the sampling algorithm traversed proved to be sub-optimally calibrated for the assessed population. This resulted in aberrant sampling trajectories with accuracy ceiling effects and an unexpected bimodal distribution of population performance scores that is not ideal for analysis of associations (18). In theory, IDoCT should be able to resolve this ill-posed problem by recalibrating the trial-difficulty scale whilst using both speed and accuracy to produce more precise estimates of participant performance, with applications in functional-brain mapping studies.
To test this theory, we first confirm the expected improvements in distributions and test-retest reliability of PVT performance estimates for IDoCT relative to the original summary scores. Next, we evaluate whether the IDoCT PVT performance estimates associate more robustly, and in an interpretable manner, with participant age and education. Then, we use a simple linear machine learning pipeline to test the hypothesis that the IDoCT PVT estimates can be more reliably predicted from four distinct feature sets of the UKB structural imaging database. Finally, we conduct free text mining across the imaging literature to determine whether the brain regions that predict IDoCT PVT estimates of visuomotor and cognitive abilities are differentially associated in the neuroscience literature with visual and motor functions vs. language and memory functions respectively.

Data driven assessment of trial's difficulty D(t)
The performance P(i,t) of participant i in trial t is calculated based on the RT of participant i in trial t and the difficulty D(t) of trial t, if the answer is correct. For incorrect answers, the performance is equal to 0.

Measure of scaled trial difficulty DS(i)
The trial difficulty D(t) is scaled according to the specific ability AS(i) of all the N participants i that completed the trial t.

MATERIALS AND METHODS
Our full analysis pipeline consisted of 9 steps, summarized in Figure 1, namely 1) data curation, 2) IDoCT modelling, 3) evaluation of distributions, 4) estimation of trials' difficulty trajectories, 5) assessment of test-retest reliability, 6) association with age and education, 7) imaging associations, 8) automated literature extraction and preprocessing, and 9) free text analysis.

Study design and participants
The data analysed in this study were collected as part of UKB, a population-based prospective study that recruited >500,000 participants, aged between 40 and 69, in the 2010-2016 timeframe. The aim of UKB is to understand the genetic and non-genetic factors that contribute to different diseases that mainly affect the middle-aged and older population. As part of the study, a subset of individuals undertook imaging, genetic, as well as health and demographics measures longitudinally, across 22 assessment centers throughout the United Kingdom (18). All participants provided informed, written consent. Detailed information on the study design is available in the original paper (10) or online at https://www.ukbiobank.ac.uk/. In the current study, demographics (i.e. age and education), imaging (i.e. fractional anisotropy from diffusion weighted images, and volume, intensity and thickness measures from structural MRI), and behavioural data (i.e. performance in PVT) were used. In total, 34927 participants completed the PVT during at least one timepoint. Of these, 33890 had complete picture vocabulary task data at baseline, of whom 20593 had data for all imaging measures of interest, and 4962 had cognitive data at follow up. Figure 2 shows the exact numbers of participants used for each step of the analysis and the exclusion criteria.

Picture Vocabulary task (PVT)
The PVT used for the UK Biobank was adapted from the NIH Toolbox Picture Vocabulary Test (15,16,19). This task was designed to assess a person's semantic/language functions by measuring their ability to match words to pictures. At each trial, participants were presented with a written word along with a set of 4 images, and they were required to select the picture that matched the word. Every participant was started on the same word, which was at an easy vocabulary level. The difficulty of the trials was then adaptively changed, using an Item Response Theory (IRT) model, in response to the participant's performance, according to a maximum likelihood estimate (MLE) of their vocabulary level, calculated from their answers so far. For example, if a participant selected the correct picture at trial n, then the word-picture combination presented at trial n+1 was at a higher level of difficulty, unless the supply of more difficult words had run out. Each participant's sequence of trials lasted for at least 20 words, and was terminated when either the accuracy of the estimate was close enough (using the standard error value), or a maximum of 30 trials had been reached.
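The adaptive logic described above can be illustrated with a minimal sketch. It assumes a one-parameter (Rasch) IRT model, which the text does not specify, and a hypothetical standard-error target of 0.3; only the 20-trial minimum and 30-trial maximum come from the task description. The function names are illustrative and are not taken from the UKB or NIH Toolbox implementations.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct answer for ability theta and item difficulty b
    under a one-parameter (Rasch) model (assumed here, not stated in the source)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def mle_ability(difficulties, responses, iters=25):
    """Newton-Raphson maximum likelihood estimate of ability.

    Returns (theta, standard_error). A finite estimate needs at least one
    correct and one incorrect response among the answers so far.
    """
    theta = 0.0
    for _ in range(iters):
        p = [rasch_p(theta, b) for b in difficulties]
        grad = sum(x - pi for x, pi in zip(responses, p))   # score function
        info = sum(pi * (1.0 - pi) for pi in p)             # Fisher information
        step = grad / info
        theta += max(-1.0, min(1.0, step))                  # clamp for stability
    p = [rasch_p(theta, b) for b in difficulties]
    info = sum(pi * (1.0 - pi) for pi in p)
    return theta, 1.0 / math.sqrt(info)

def should_stop(n_trials, se, min_trials=20, max_trials=30, se_target=0.3):
    """Stop after max_trials, or once min_trials are done and the estimate is
    precise enough. se_target is a hypothetical threshold."""
    if n_trials >= max_trials:
        return True
    return n_trials >= min_trials and se <= se_target

# Illustrative responses: two easier items correct, two harder items incorrect.
theta, se = mle_ability([-1.0, 0.0, 1.0, 2.0], [1, 1, 0, 0])
```

In an adaptive run, the next word would then be chosen relative to the current estimate `theta`, and `should_stop` would be checked after every response.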
The similarities and differences from the NIH toolbox test were as follows. The UK Biobank version used a touchscreen to present the words in written form only, unlike the NIH version administered by staff who also pronounced the words. The dataset of pictures, words, and their associated difficulty levels was the same as the English-language version of the NIH PVT, but with modification for obvious difficulties: words differing between UK and US English were changed or removed, for example, "intersection" (US) was replaced with "crossroads" (UK), and the word "minute" was removed due to ambiguity of its written meaning, as the picture of a clock could have detracted from its intended meaning of a small size. This resulted in a dataset of 340 words. The same methods for estimating vocabulary levels and algorithmic adaptation for the next question were used in both tests. However, the starting word and the maximum number of trials differed for the UK Biobank test.

Processing of UKB imaging data
The imaging data consisted of 26 measures of fractional anisotropy (FA) from diffusion weighted images (DWI), and 139 volume, 43 intensity and 60 cortical thickness measures from T1-weighted structural magnetic resonance images (MRI). All these imaging features were provided by UKB. Detailed information on the imaging processing is available in (20) and UKB feature labels are provided in the Supplementary Table A1. In brief, DICOM images were converted to NIFTI, fully anonymized using a defacing mask, and corrected for gradient distortion (GD). Specifically, for T1 images, the field of view (FOV) was cut down, the images were non-linearly registered to the standard MNI152 space (1mm resolution) and the brain was extracted. Finally, tissue segmentation was conducted to identify the different tissues and subcortical structures, and final volume measures were extracted after correcting for total brain size. For diffusion weighted images (dMRI), the first step of the processing consisted of the correction for head motion and eddy currents, followed by gradient distortion correction. Then, measures of fractional anisotropy (FA) and mean diffusivity were extracted.

IDoCT: extraction of measures of visuo-motor latency and cognitive ability
IDoCT is an iterative method (14) that takes as input trial-by-trial measures of reaction time (RT) and of accuracy for each participant, which can be binary (i.e. whether the participant replied correctly or not in each trial) or continuous (i.e. how correct each individual response was). In addition, it requires condition labels for each trial in the timecourse, which are defined based on the design of the task. Here, for example, the words presented at each trial are assumed to vary in difficulty; therefore, they are each assigned their own unique condition label. Then, through two separate iterative processes, the model returns measures of trial difficulties (D), of specific ability (AS) and of basic visuo-motor response speed (DT) for each participant. In a further step, it calculates scaled measures of trial difficulties (DS) that account for cases in which the trial assignment across participants of differing abilities was biased towards different difficulty levels. This occurs, for example, when the most difficult trials are presented exclusively or predominantly to the most able participants, which should be the case here if the PVT sampling algorithm has sampled across an ordered difficulty scale. DS is obtained by scaling D based on the ability of the participants that completed the specific trials. Details on the implementation of IDoCT are available in the original publication (14) and are summarised in Box 1. In the case of the PVT, the accuracy measures are binary, and each trial is defined according to the word that is presented to the participants. Therefore, a given trial could occur just once per participant per session, with different participants completing different combinations of trials.

Test-Retest reliability
IDoCT estimates were computed on the follow up data to assess the test-retest reliability of the model. The results were compared to the reliability of the original ability scale (AB). The model computation was the same as for the baseline data, except that the model parameters D, RTmax and ATmax obtained from the baseline analysis were used when estimating abilities at the second timepoint. Pearson correlations and Bland-Altman plots were used to compare and visualize the reliability of DT, AS and AB estimates across sessions, as well as any trend towards improved or worsened ability scores across time.
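As a minimal sketch of the two reliability summaries named above, the following computes a Pearson correlation and the quantities usually drawn on a Bland-Altman plot. The 1.96 SD limits of agreement are the conventional choice and are an assumption here, as are the function names and the example scores, which are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between paired baseline and follow-up scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def bland_altman(baseline, follow_up):
    """Return (bias, lower limit, upper limit) of follow-up minus baseline.

    The bias captures any systematic improvement or worsening across time;
    the limits use the conventional 1.96 SD of the differences (assumed).
    """
    diffs = [b - a for a, b in zip(baseline, follow_up)]
    n = len(diffs)
    bias = sum(diffs) / n
    sd = math.sqrt(sum((d - bias) ** 2 for d in diffs) / (n - 1))
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Made-up scores for five participants at two timepoints.
baseline = [0.30, 0.35, 0.40, 0.45, 0.50]
follow_up = [0.32, 0.34, 0.43, 0.44, 0.52]
r = pearson_r(baseline, follow_up)
bias, lower, upper = bland_altman(baseline, follow_up)
```

A positive `bias` would indicate higher scores at follow up on average, while wide limits indicate poor agreement between sessions even when the correlation is high.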

Age and education association
Multiple linear regression was used to compare the sensitivity of the IDoCT AS and DT estimates and the original AB scores to age and education. Age was converted into 5-year age bins and one-hot encoded. Education was one-hot encoded into 8 categories (College or University Degree, A levels/AS levels or equivalent, O levels/GCSEs or equivalent, CSEs or equivalent, NVQ or HND or HNC or equivalent, Other professional qualifications, None of the above). Individuals selecting "Prefer not to answer" were treated as having missing information. An alpha threshold of p<0.01 was used to determine significance.
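The encoding step can be sketched as follows. This is an illustration under stated assumptions: the age range of the bins, the choice of the youngest bin and "None of the above" as reference categories, and all function names are hypothetical; the education labels are those listed above.

```python
AGE_LEVELS = list(range(45, 85, 5))          # assumed bin edges: 45, 50, ..., 80
EDU_LEVELS = [
    "College or University Degree",
    "A levels/AS levels or equivalent",
    "O levels/GCSEs or equivalent",
    "CSEs or equivalent",
    "NVQ or HND or HNC or equivalent",
    "Other professional qualifications",
    "None of the above",
]

def age_bin(age, width=5):
    """Lower edge of the 5-year bin containing age (e.g. 47 -> 45)."""
    return int(age // width) * width

def one_hot(value, levels, reference):
    """Dummy-code value against levels, dropping the reference level."""
    return [1 if value == lvl else 0 for lvl in levels if lvl != reference]

def design_row(age, education):
    """Intercept, age dummies, education dummies; None marks missing data."""
    if education == "Prefer not to answer":
        return None                           # treated as missing information
    return ([1]
            + one_hot(age_bin(age), AGE_LEVELS, AGE_LEVELS[0])
            + one_hot(education, EDU_LEVELS, EDU_LEVELS[-1]))

row = design_row(52, "A levels/AS levels or equivalent")
```

Stacking such rows gives the predictor matrix for an ordinary least squares fit; each coefficient is then read as a difference from the reference category.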

Neural correlates of AS and DT: feature selection
A summary of the imaging association pipeline is provided in Figure 3. Models were optimised and fitted separately for each imaging modality (i.e. measures of volume, cortical thickness, fractional anisotropy and intensity) and for each set of cognitive summary scores (AS, AB and DT). First, the dataset was split into a train set and a held-out test set with a 75/25 split. Then, age was regressed out of the imaging and cognitive feature vectors. The bivariate Spearman's correlation between each imaging feature and the target cognitive summary score was computed for the train set only, and the features were ranked according to the magnitude of the obtained correlation coefficients r. Next, the combination of features that yielded the best prediction of the target variable was identified in a simple stepwise process using multiple linear regression with five-fold cross validation. Specifically, models were trained and evaluated across the five folds, iteratively removing the imaging feature with the lowest magnitude correlation coefficient r until just one remained within the predictor matrix. The optimal number of features was defined as the one producing models with the highest mean R² across the validation folds. The model was refitted to all of the train data using that optimal number of features. The train-test sets and folds were the same for all models, to enable cross comparison of model performance. The relative predictability of the different cognitive score estimates from the imaging data was evaluated by comparing the R² values obtained when the optimal trained models for the different imaging modality and cognitive score combinations were applied to the held-out test data.
Figure 3 Imaging analysis summary. The imaging analysis was computed separately for each dataset (FA, Volume, Intensity and Thickness) and each potential outcome (AS, AB and DT). The analysis can be summarised in 6 main steps: 1) the data were split into train and test sets, 2) the imaging features were ranked in ascending order based on their correlation coefficient r with the outcome variable (AS, AB or DT), 3) N models were trained using 5-fold cross validation (with N equal to the number of imaging features), where in each model the feature with the next lowest coefficient r was progressively excluded from the analysis, 4) the model with the highest mean R² on the validation set was selected as the best model and the features used to train it as the best features, 5) the selected features were used to train a final multiple linear regression model using the full train set, which was then tested on the held-out test set that was not used in the selection process, and 6) the eta squared of the significant features was calculated, and the top 15 features with the highest eta squared as well as the top 15 features with the highest correlation coefficient r with AS or DT were identified as the best neural correlates of the outcome of interest.
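The ranking and stepwise elimination steps can be sketched in NumPy as below. This is a simplified illustration, not the study code: the helper names are hypothetical, ties in the Spearman ranks are not averaged (adequate for continuous measures), the fold assignment uses an arbitrary seed, and the 75/25 split and final held-out evaluation are left to the caller.

```python
import numpy as np

def rank(v):
    """Ranks of a 1-D array (ties broken by order, assumed rare)."""
    return np.argsort(np.argsort(v)).astype(float)

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return np.corrcoef(rank(x), rank(y))[0, 1]

def residualise(v, age):
    """Remove the linear effect of age from a vector."""
    A = np.column_stack([np.ones(len(age)), age])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

def r2(X, y, Xv, yv):
    """Fit ordinary least squares on (X, y); return R^2 on (Xv, yv)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = np.column_stack([np.ones(len(Xv)), Xv]) @ beta
    return 1.0 - np.sum((yv - pred) ** 2) / np.sum((yv - yv.mean()) ** 2)

def select_features(X, y, n_folds=5, seed=0):
    """Return column indices of the subset with the best mean validation R^2."""
    strength = [-abs(spearman(X[:, j], y)) for j in range(X.shape[1])]
    order = np.argsort(strength)                 # strongest feature first
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    best_score, best_k = -np.inf, 1
    for k in range(len(order), 0, -1):           # drop the weakest feature each step
        cols = order[:k]
        scores = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            scores.append(r2(X[np.ix_(train, cols)], y[train],
                             X[np.ix_(fold, cols)], y[fold]))
        if np.mean(scores) > best_score:
            best_score, best_k = np.mean(scores), k
    return order[:best_k]
```

Reusing the same `seed` for every modality and outcome mirrors the requirement that folds be identical across models so that their R² values are comparable.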

Literature review using natural language processing
The imaging features that contributed the most to the prediction of AS and DT were selected for further investigation using a Natural Language Processing (NLP) (21) pipeline in order to confirm whether they mapped onto the expected cognitive and visuomotor systems. To select features, both univariate and multivariate approaches were used. First, the Eta² values of the significant features (alpha threshold at 0.05) were calculated from the best fit models as described in the previous section. The features that were both significant and among the top 15 features with the highest Eta² were selected for further analysis (multivariate approach, derived from the multiple regression beta coefficients). Separately, for the univariate approach, the top 15 features with the highest magnitude of Pearson correlation with AS and DT were selected. The resultant feature labels were pooled across imaging modalities, producing AS and DT lists. We used an NLP literature search approach to summarise previously published literature in order to minimise author bias when choosing and interpreting papers related to the different brain regions.
Research papers were identified via multiple advanced search criteria based on the brain feature labels, provided in detail in the Supplementary Table A2. In general, all papers with the name of the brain region in the title and/or in the abstract, and with the words cognition and/or cognitive function in the body of the main text, were selected. The addition of the "cognitive function" criterion was necessary to avoid analysing papers that were mainly related to the cellular and biological aspects of the different brain areas. All papers that matched the advanced search criteria and that could be downloaded in HTML format from PubMed Central were included in the analysis. Papers that were only available as a PDF version, which mostly corresponded to papers published prior to 2000, were excluded, as were papers that were not open access. In total, across all brain regions, 1602 papers were analysed. Detailed numbers on how many papers were analysed for each individual brain region are provided in the Supplementary Table A2.
The HTML documents were pre-processed using the Auto-CORPus pipeline (Automated pipeline for Consistent Outputs from Research Publications) (22), an NLP tool that converts publications with an HTML structure into a BioC JSON format (23). The BioC JSON format uses a standard structure developed to allow for the interoperability of text mining outputs across different systems. Concretely, it consists of collections of documents, extracted from a corpus, characterized by different elements that contain the actual text as well as additional information about the original document. The text is automatically divided into common sections (e.g. abstracts, results ...) and paragraphs, and each section is associated with the respective, and unique, Information Artifact Ontology (IAO) annotation (24). IAO annotations enable identification of the same sections across different publications, even when they appear under different titles (e.g. Methods and Methodology). Specifically, for our NLP analysis, only the sections corresponding to the Abstract, Introduction, and Discussion were used, which corresponded respectively to the IAO annotations IAO:0000315, IAO:0000316, and IAO:0000319.
The extracted paragraphs were cleaned by converting all words to lower case, by removing words that were less than 2 characters or more than 20 characters long, by removing all special characters and punctuation, and by removing all stop words, which correspond to commonly used words in the English language, such as "you", "an", or "in". In addition, words that are related to the different brain structures (e.g. gyrus, nucleus, cerebellum…), or that are commonly associated with the study design (e.g. controls, patients, humans…), were excluded. After the data cleaning, the frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalized between 0 and 1, with 1 corresponding to the most frequent word.
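A minimal sketch of this cleaning and counting step is given below. The stop-word and exclusion lists are deliberately tiny placeholders rather than the lists used in the study, the function names are hypothetical, and dividing each count by the maximum count is one reading of "normalized between 0 and 1" (a min-max scaling would be another).

```python
import re
from collections import Counter

# Placeholder lists for illustration only; the study lists are not reproduced here.
STOP_WORDS = {"you", "an", "in", "the", "of", "and", "is", "to"}
EXCLUDED = {"gyrus", "nucleus", "cerebellum", "controls", "patients", "humans"}

def clean(text):
    """Lower-case, strip special characters and punctuation, then drop words
    that are too short, too long, stop words, or excluded terms."""
    words = re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()
    return [w for w in words
            if 2 <= len(w) <= 20 and w not in STOP_WORDS and w not in EXCLUDED]

def normalised_frequencies(paragraphs):
    """Word counts across all paragraphs for one brain region, scaled so that
    the most frequent word scores 1."""
    counts = Counter(w for p in paragraphs for w in clean(p))
    if not counts:
        return {}
    top = max(counts.values())
    return {w: c / top for w, c in counts.items()}

freqs = normalised_frequencies(["Memory and language in the gyrus.",
                                "Memory, memory!"])
```

Here `freqs` maps "memory" to 1.0 and "language" to one third, while "gyrus" and the stop words are removed.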
In order to assess the main brain functions associated with AS and DT, the frequency scores of the 5 words with the highest values were summed separately across all the brain regions related to AS and to DT. In this way, if a word appeared frequently in multiple brain regions, then it was assigned a higher frequency score than a frequent word appearing in only one brain region.
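The pooling across regions can be sketched as follows, assuming each region already has a dictionary of normalised word frequencies. The function name, region labels and scores are hypothetical.

```python
from collections import defaultdict

def pooled_top_words(region_freqs, top_n=5):
    """Sum the normalised frequencies of each region's top_n words across
    regions, and return the totals sorted from highest to lowest."""
    totals = defaultdict(float)
    for freqs in region_freqs.values():
        top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for word, score in top:
            totals[word] += score
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))

# Made-up frequencies for two regions on a hypothetical AS list.
as_regions = {
    "region_a": {"memory": 1.0, "language": 0.5},
    "region_b": {"memory": 0.8, "motor": 1.0},
}
totals = pooled_top_words(as_regions, top_n=2)
```

In this toy case "memory" scores 1.8 because it is frequent in both regions, ahead of "motor" at 1.0, which is frequent in only one.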

Samples characteristics
The full sample consisted of 34927 participants at baseline, of whom only 33890 fully completed the PVT. The mean age was 64.7 ± 7.8 and 53% of the participants had an education level comparable to A/AS levels, or higher. Among these 34927 participants, 4962 people (mean age: 61.7 ± 7.2) completed the PVT at a follow up timepoint. Full details on the sample demographics are available in Table 1.
Table 1 (fragment). Values are n (%): baseline, follow up.
Prefer not to answer: 116 (0.3%), 54 (1%)
Missing information: 419 (1%), 0 (0%)

Measures of individual trial difficulty, ability and visuo-motor speed from IDoCT
The IDoCT model converged after 250 iterations when estimating trial difficulty measures (D), reaching a mean percent change in word difficulty that tended to 0. In the second iterative process, which is necessary to determine AS and DT, the model converged in 10 iterations while defining a measure of ability and performance. DS was calculated in a final step of the model computation.
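The convergence check described above can be illustrated with a generic fixed-point loop. This sketch does not implement the IDoCT update equations, which are given in the original publication (14); the `update` argument is a placeholder for them, the tolerance is hypothetical, and the toy update used at the end exists only to show the loop terminating.

```python
def mean_percent_change(old, new):
    """Mean absolute percent change between two difficulty vectors
    (assumes all old values are non-zero)."""
    return 100.0 * sum(abs(n - o) / abs(o) for o, n in zip(old, new)) / len(old)

def iterate_to_fixed_point(update, start, tol=1e-6, max_iter=250):
    """Apply update repeatedly until the mean percent change falls below tol.

    Returns (values, number of iterations used). The tolerance is a
    hypothetical choice; max_iter caps the number of iterations.
    """
    current = list(start)
    for i in range(1, max_iter + 1):
        nxt = update(current)
        change = mean_percent_change(current, nxt)
        current = nxt
        if change < tol:
            return current, i
    return current, max_iter

# Toy contraction with fixed point 0.6; a stand-in, not the IDoCT update.
values, n_iter = iterate_to_fixed_point(lambda d: [0.5 * x + 0.3 for x in d],
                                        [0.5, 0.9])
```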
The distributions of D (unscaled difficulty) and DS (scaled difficulty) are shown in Figure 4A and 4B, together with the original difficulty scale (Dold) (Figure 4C) and the association between DS and Dold (Figure 4D). The mean D, DS and Dold were respectively 0.60 ± 0.07, 0.72 ± 0.06, and 0.51 ± 0.24. The full list of the words presented and of their associated measures of D, DS and Dold is available in the Supplementary Table A3. In brief, according to the IDoCT D scale, the five most difficult words, in ascending order of difficulty, were: glower, malefactor, pachyderm, matron and plethora. The five easiest words, in ascending order of difficulty, were: calm, weld, herd, desolate and engraved. According to DS, on the other hand, the five hardest words were buffet, trivet, prodigious, bucolic and truncate, and the five easiest were fabricate, angry, monarch, fly and run. In general, most of the words that were assigned higher difficulty scores were of Latin, Greek or French origin. The change in assigned difficulty level per word-picture pair between D and DS is presented in Figure 5. In brief, the difficulty of words that were presented exclusively to participants with lower abilities tended to decrease after the scaling. On the other hand, words that were assigned only to participants with higher abilities tended to increase in difficulty after scaling. This pattern of results accords with the method working to correct for sampling bias.
Figure 5 The data-driven measures of trial difficulty D were scaled according to the ability of the participants to which the trials were presented, in order to account for potential biased sampling in the task design, meaning cases in which words were presented exclusively to participants with high/low ability. In the figure, red and blue lines correspond respectively to an increase and a decrease in difficulty score for each word after the scaling. Each dot corresponds to one word, the size of the dot indicates the number of participants that were presented that word (i.e. bigger dots mean that a word was presented to a higher number of participants) and the color is related to the average AS of the participants that were presented a specific word (or trial).
The mean AS predicted by IDoCT across the cohort was 0.39 ± 0.08, while the mean DT was 3160 ± 1081. The mean AB was 0.85 ± 0.09. The distributions of AS, DT and AB were compared, as well as their associations with the raw median RT and the number of correct answers, as presented in Figure 6. As can be observed, AB was characterized by an atypical bimodal distribution centered around high ability scores (~0.85), which supports the hypothesis of a ceiling effect. On the other hand, both DT and AS had the expected near-Gaussian distribution, which, in the case of AS, was centered around average scores (~0.4). Comparing the associations of AB and AS with the number of correct replies and the median RT (Figure 7) shows the expected pattern for AS, with more correct answers and faster RTs being related to higher AS. On the other hand, in the case of AB, some participants were assigned a low AB score despite giving correct answers in the majority of cases and despite having low median RTs.

Figure 7 AS and AB association with median RT and number of correct answers. A) Association between AS and number of correct answers, B) Association between AB and number of correct answers, C) Association between AS and median RT, D) Association between AB and median RT, E) Association between DT and number of correct answers, F) Association between DT and median RT, G) Association between AS and AB, and H) Association between DT and AB.
In the case of DT, no association with the number of correct answers was observed (Figure 7E), which is expected considering that visuo-motor latency should not affect how correctly each participant replies. On the other hand, DT was almost linearly associated with the median RT (Figure 7F), with higher median RTs for participants with longer DT. This is also expected, since participants with longer visuo-motor latency should be slower overall. No clear association was observed between AS and AB or between DT and AB (Figure 7G and 7H). The latter is not surprising considering that DT is not supposed to capture cognitive performance like AB. On the other hand, the association between AS and AB resembles that of AB and the number of correct answers, which is expected considering that AS is almost linearly associated with accuracy.

Trial-by-trial difficulty trajectories
To validate that Dold led to ceiling effects, trial difficulty trajectories were plotted, showing how the difficulty of the sampled word-picture combinations changed sequentially through the task. Separate trajectories were computed for D, DS and Dold, and for participants with different ability levels.
Participants were divided into 10 groups based on AB (0-0.1, 0.1-0.2, … 0.9-1.0), and their mean trajectories were obtained by averaging the difficulty scores of the words presented at each trial, as shown in Figure 8. If the adaptive staircase approach was working as intended, then a gradual increase in difficulty should have been observed for all participants with medium-high ability (AB > 0.3). Instead, within the first 5 trials, Dold either reached a plateau or started to decrease, which suggests that no words with higher difficulty as defined by Dold were available after that point, forcing the algorithm to provide words of either equal or lower difficulty. Furthermore, the difficulty level as defined by D and DS appeared not to increase across time. Together, these results support the hypothesis that the original difficulty scale was not optimally calibrated for the assessed cohort.
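The grouping and averaging can be sketched as below. The input layout (one AB score and one ordered word list per participant), the function names and the example difficulty values are assumptions made for illustration; the same function would be called once per difficulty scale (D, DS or Dold).

```python
from collections import defaultdict

def ability_group(ab, n_groups=10):
    """Index of the AB bin (0-0.1 -> 0, ..., 0.9-1.0 -> 9)."""
    return min(int(ab * n_groups), n_groups - 1)

def mean_trajectories(sessions, difficulty):
    """Average word difficulty at each trial position within each AB group.

    sessions: list of (AB score, [words in presentation order]).
    difficulty: dict mapping each word to its score on one difficulty scale.
    Returns {group index: [mean difficulty at trial 1, trial 2, ...]}.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ab, words in sessions:
        group = ability_group(ab)
        for pos, word in enumerate(words):
            sums[group][pos] += difficulty[word]
            counts[group][pos] += 1
    return {g: [sums[g][p] / counts[g][p] for p in sorted(sums[g])]
            for g in sums}

# Made-up difficulty scores and sessions for three participants.
difficulty = {"calm": 0.2, "herd": 0.4, "plethora": 0.6}
sessions = [(0.85, ["calm", "herd"]),
            (0.88, ["calm", "plethora"]),
            (0.15, ["calm"])]
trajectories = mean_trajectories(sessions, difficulty)
```

A well-calibrated staircase would give trajectories that rise across trial positions for the higher AB groups; a flat or falling trajectory signals that the scale has run out of harder items.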

AS has better test-retest reliability compared to AB
IDoCT was applied to the follow up data to evaluate retest reliability. The model converged after 250 and 10 iterations respectively when predicting D and AS/DT. The distributions of the predicted AS and DT, as well as of AB, are available in Figure 9. The mean AS, DT and AB at follow up were respectively 0.4 ± 0.08, 3009 ± 1035 and 0.83 ± 0.09. Similar to the baseline, both AS and DT were characterised by Gaussian-shaped distributions, with AS being centered around average values (~0.4), while AB again consisted of a bimodal distribution centered around high ability scores.
Comparing the IDoCT predictions at baseline and follow up, both AS and DT had good test-retest reliability (r = 0.77 for AS, and r = 0.57 for DT) across all ranges of scores (Figure 10). The test-retest reliability of AB (r = 0.66) was good but substantially lower than that of AS. Furthermore, the Bland-Altman plots for AB showed a poor spread, with only high AB scores being consistent across timepoints.

Older and better educated participants have higher cognitive abilities scores
When predicting AS from age decade and education level, a significant regression equation was found (F(13, 33876) = 710.6, p < 0.001) with an R² of 0.21. The regression equation for the original AB score was also significant (F(13, 33864) = 561.2, p < 0.001) but with a lower R² of 0.17. Both age and education were significant predictors of AS and AB. Regarding education, the strongest predictors were having a college/University degree (ed1) and A/AS level (ed2), which had respectively a positive effect size in standard deviation (SD) units of 0.66 and 0.35 for AS, and somewhat less at 0.57 and 0.33 SDs for AB, compared to the reference category. In the case of age, older participants had higher AS compared to participants aged 45. The increase was gradual until age 70, at which point a plateau was reached, and participants reached a 0.48 SD increase in AS vs a 0.41 SD increase in AB (for participants aged 75 compared to the reference category) (Figure 11).

Older participants have longer visuo-motor latency times
In the case of DT, a significant regression equation was found (F(10, 33876) = 39.41, p < 0.001) with an R² of 0.015. Both age and education were significant predictors of DT, but the effect size of education was consistently small to negligible (below 0.1 SD units). The relationship between age and DT was comparable to the association with AS, with DT, or visuo-motor latency time, increasing gradually with age. More specifically, participants aged 75 and 80 showed increases in DT of 0.38 SD and 0.52 SD, respectively, compared to participants aged 45 (Figure 11).

Imaging analysis feature selection
The results of the fine-tuning are reported in Figure 12 for AS and Figure 13 for DT. In total, P models were trained for each dataset (i.e. FA, Thickness, Volume and Intensity), where P corresponds to the number of features available. Model0 corresponds to the model trained on all features except for the feature with the lowest correlation coefficient with the predicted variable. For each of the following models, the feature with the lowest coefficient r among those remaining was dropped, until Model(P-2), which was trained exclusively on the last feature available (i.e. the feature with the highest correlation coefficient r). The R² was calculated on the train and validation sets, and the train/validation R² (and their standard deviation across 5 folds) are reported in the figures. The red line represents the best performing model, namely the model that yielded the highest R² on the validation set.
When predicting AS, Model0 was the best performing model for FA and Thickness; it included all features except, respectively, the FA in the right cingulate gyrus (part of cingulum) and the thickness in the right pars triangularis. Similarly, Model1 was the best performing for Volume, meaning that only two features (i.e. the Brain Stem and Crus I Cerebellum vermis) were dropped when predicting AS. Finally, in the case of Intensity, Model11 was the one that yielded the highest R² (see Supplementary Tables A4-A7).
When predicting DT, Model12 was the best FA model, which resulted from dropping 13 features (i.e. FA in the left/right superior thalamic radiation, left/right corticospinal tract, right posterior thalamic radiation, left inferior longitudinal fasciculus, left superior longitudinal fasciculus, left parahippocampal part of cingulum, middle cerebellar peduncle, right/left inferior fronto-occipital fasciculus, left anterior thalamic radiation and forceps minor). For Volume and Thickness, Model97 and Model45 were the best performing; as a result, only 41 out of 139 and 13 out of 60 features were kept, respectively. When fine-tuning the models on the intensity dataset, almost all models had R² ≤ 0: the only model with a positive R² (R² = 0.001) was obtained after dropping all the features except one. Due to the low performance of the intensity-based models, no features were extracted from the intensity dataset during the feature selection step. More detailed information on which features were dropped during the fine-tuning, as well as the R² obtained on the train and validation set of each model, is available in Supplementary Tables A8-A11.
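The correlation-ranked backward elimination described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: features are ranked once by absolute Pearson correlation computed on all rows (the text does not say whether the sign was ignored or whether ranking was done within folds), only the validation R² is recorded, and the data, fold assignment and function names (`backward_selection`, `fit_predict`, `r2_score`) are hypothetical.

```python
import numpy as np


def r2_score(y, y_hat):
    """Coefficient of determination."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_predict(X_train, y_train, X_eval):
    """Fit linear regression with an intercept and predict on X_eval."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_eval)), X_eval]) @ beta


def backward_selection(X, y, n_folds=5, seed=0):
    """Drop the weakest-correlated feature one at a time; score each model by CV."""
    n_features = X.shape[1]
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    order = np.argsort(r)  # weakest feature first
    shuffled = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(shuffled, n_folds)

    results = []
    for n_dropped in range(1, n_features):
        kept = np.sort(order[n_dropped:])
        scores = []
        for k in range(n_folds):
            val = folds[k]
            train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
            pred = fit_predict(X[train][:, kept], y[train], X[val][:, kept])
            scores.append(r2_score(y[val], pred))
        results.append((kept, float(np.mean(scores)), float(np.std(scores))))

    # Best model: highest mean validation R^2 across folds.
    best = max(results, key=lambda item: item[1])
    return results, best


# Synthetic example: only features 0 and 1 carry signal (made-up data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=300)
results, best = backward_selection(X, y)
best_features, best_mean_r2, best_sd_r2 = best
```

Because each feature's univariate correlation with the target does not depend on which other features remain, ranking once and removing features in that order gives the same sequence as re-ranking at every step.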

Figure 13. DT feature selection. P models were trained using the imaging data in each individual dataset (A) FA, B) Thickness, C) Volume and D) Intensity) as regressors and DT as the predicted variable, where P corresponds to the number of features in each dataset. In each iteration, the feature (among those still available) with the lowest correlation coefficient with the predicted variable was excluded from the regressors, meaning that Model1 was trained on all features except one (i.e. the feature with the lowest r), and Model(P-2) was trained on only one feature (i.e. the feature with the highest r). The R² was calculated on the train and validation sets. The train/validation R² (and their standard deviation across 5 folds) are reported in the figure. The red line represents the best performing model, namely the model that yielded the highest R² on the validation set.
Results of the feature selection process when applying the same train-validation pipeline to AB are reported in Figure 14.

Identification of neural correlates of AS and DT: univariate and multivariate analysis
Figure 15. Multiple linear regression using imaging features. The imaging features identified during the feature selection step were used as regressors in multiple linear regression models to predict AS (A, B, C, D) and DT (E, F, G). The eta squared of the features identified as significant in the models is reported in the figure. The color and asterisk represent the significance level of each feature. The red dotted line separates the top 15 features, which were selected as neural correlates of AS/DT, from the other features.
The selected features were used as regressors in multiple linear regression models trained on the full train set and tested on the held-out test set. Two models were trained for each dataset, predicting either AS or DT. Overall, significant regression models were found for all analyzed datasets, with average train and test R² of 0.02 ± 0.01 and 0.016 ± 0.005, respectively, across all datasets for AS, and 0.004 ± 0.002 and 0.002 ± 0.001 across all datasets for DT. The model for the intensity dataset and DT was not computed due to the low performance identified at the feature selection step of the analysis. For AB, significant regression models were found for all analyzed datasets, with average train and test R² of 0.018 ± 0.01 and 0.013 ± 0.005, respectively. Full results are available in Table 2. Therefore, while the models predicting AS had modest R² values, they performed numerically better than the models predicting AB across all four imaging datasets.
The significant features of the AS and DT models, and their respective eta squared (η²), are presented in Figure 15.
The 15 significant features with the highest η² across all data modalities were identified as the best neural correlates of AS and DT and used in the next steps of the analysis. The identified neural correlates of AS were: hippocampus, superior and transverse temporal gyrus, medial lemniscus, anterior corpus callosum, medial frontal cortex, uncinate fasciculus, nucleus accumbens, caudal middle frontal gyrus, inferior temporal gyrus, superior frontal gyrus, and parahippocampal part of cingulum. The identified neural correlates of DT were: forceps major, medial lemniscus, posterior thalamic radiation, middle temporal gyrus, lateral orbitofrontal cortex, occipital fusiform gyrus, and lateral occipital cortex.
To assess the robustness of the results, a second set of neural correlates was identified through a univariate analysis. In this case, the selected features were ranked according to the magnitude of the Pearson correlation coefficient between each feature and either AS or DT. The top 15 features across all data modalities with a significant p-value (p < 0.001) were selected as neural correlates of AS and DT. The identified neural correlates of AS were: amygdala, hippocampus, frontal pole, insular cortex, cerebellum, superior temporal gyrus, and temporal fusiform cortex. The identified neural correlates of DT were: middle temporal area, cerebellum, inferior temporal gyrus, intracalcarine cortex, occipital pole, and superior temporal and lateral orbito-frontal areas.
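The univariate ranking step can be sketched as below. The feature names, data and function `univariate_ranking` are hypothetical. The p-value is computed with the large-sample Fisher z approximation so that the example needs only NumPy and the standard library; this is a stand-in chosen for the sketch, and the test actually used in the study is not specified here.

```python
import math

import numpy as np


def univariate_ranking(X, y, names, top_k=15, alpha=0.001):
    """Rank features by |Pearson r| with y, keeping only those with p < alpha."""
    n = len(y)
    rows = []
    for j, name in enumerate(names):
        r = float(np.corrcoef(X[:, j], y)[0, 1])
        # Two-sided p-value from the Fisher z transform (large-sample approximation).
        z = math.atanh(r) * math.sqrt(n - 3)
        p = math.erfc(abs(z) / math.sqrt(2.0))
        rows.append((name, r, p))
    significant = [row for row in rows if row[2] < alpha]
    significant.sort(key=lambda row: abs(row[1]), reverse=True)
    return significant[:top_k]


# Synthetic example: only feature_2 is related to the outcome (made-up data).
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = 0.5 * X[:, 2] + rng.normal(size=1000)
names = [f"feature_{j}" for j in range(5)]
ranking = univariate_ranking(X, y, names)
```

Each entry of `ranking` is a `(name, r, p)` tuple, ordered from the strongest to the weakest absolute correlation among the significant features.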

DT is associated more strongly with brain regions with visuo-motor functions and AS with regions with memory and language functions
The full list of the words with the highest frequency of occurrence in the articles that use the anatomical labels associated with AS and DT is available in Supplementary Tables A12-A15, for both the univariate and multivariate analyses. A summary of the results is presented in Figure 16.
In  16) for respectively the multivariate and univariate analysis.
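The word-frequency aggregation behind these results, as described in the captions of Supplementary Tables A12-A15, can be sketched as follows. The tokenisation rule, the tie-breaking order, the region names and the toy texts are all hypothetical, and normalising by dividing by the count of the most frequent word is one reading of "normalized between 0 and 1"; a min-max scaling would be an equally plausible alternative.

```python
import re
from collections import Counter


def normalised_frequencies(papers, stop_words=frozenset()):
    """Word counts across all papers of one region, scaled so the top word is 1."""
    counts = Counter()
    for text in papers:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}


def aggregate_top_words(papers_by_region, top_n=5, stop_words=frozenset()):
    """Sum the normalised scores of each region's top_n words across regions."""
    totals = Counter()
    for papers in papers_by_region.values():
        freqs = normalised_frequencies(papers, stop_words)
        # Sort by descending score, then alphabetically to break ties.
        best = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        for word, score in best:
            totals[word] += score
    return totals


# Toy, made-up texts standing in for the papers retrieved per brain region.
papers_by_region = {
    "region_a": ["visual motion visual", "visual motor motor"],
    "region_b": ["motor control motor visual"],
}
totals = aggregate_top_words(papers_by_region, top_n=2)
```

The resulting totals are the kind of per-word scores that a word cloud would scale its font sizes by.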

DISCUSSION
IDoCT is a flexible method designed to fractionate the detailed timecourses that are collected during performance of a computerised cognitive task into components that can be explained by inter-subject variability in basic visuomotor processing speed and device latency on the one hand, and the specific cognitive abilities that the task was intended to manipulate on the other (14). This is achieved in a simple, robust, and data-driven manner that iteratively re-estimates individuals' abilities and trial-difficulty scales whilst handling the speed-accuracy tradeoff.
This method was initially applied to improve the precision of performance estimates from the Cognitron library of online tasks (14). However, the approach is flexible and can be adapted to practically any computerised task that varies dimensions of cognitive difficulty across trials and where performance recordings are available for large numbers of individuals. The results presented here provide further evidence of the utility of IDoCT in the context of data derived from an independently designed UK Biobank task, including improvements in the trial-difficulty scale, participant score distributions, retest statistics, demographic correlations, and imaging associations.
More specifically, a critical limitation of the PVT dataset is that the original summary score distributions are malformed due to the dynamic sampling algorithm having operated across a suboptimal trial-difficulty scale (Dold). Our analysis of sampling trajectories highlights the basis of that limitation. Specifically, the difficulty (Dold) of sampled trials for participants with moderate to good original scores (AB) either reaches a plateau or decreases after just 5 out of 30 available steps.
Replotting the sampling trajectories using data-driven trial-difficulty estimates (D and DS) indicates that the intended difficulty increments across time were not achieved. Furthermore, the correlations between the original scores and accuracy or response time measures are also hard to interpret, suggesting that ceiling effects were not the only problem: the ordering of the scale may also be sub-optimal for the UK Biobank population, perhaps due to it having been developed with a United States (US) population in mind. This resulted in the original summary scores (AB) having a malformed bimodal distribution with a high mean estimate (0.8, where 1 is the maximum), which is unlikely to reflect the true underlying population distribution of crystallised intelligence abilities (25).
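The sampling-trajectory analysis referred to above (mean difficulty of the presented trial at each step, per ability group) can be illustrated with a minimal sketch. The data are synthetic, the function name `difficulty_trajectories` is hypothetical, and splitting participants into ten equally sized quantile groups is an assumption; the text only states that ten groups were formed from the AB measures.

```python
import numpy as np


def difficulty_trajectories(ability, difficulty, n_groups=10):
    """Mean trial difficulty per step for participants grouped by ability.

    ability: shape (n_participants,)
    difficulty: shape (n_participants, n_steps), difficulty of the trial
    presented to each participant at each step.
    """
    edges = np.quantile(ability, np.linspace(0.0, 1.0, n_groups + 1))
    # Map each participant to a group index in 0..n_groups-1.
    group = np.searchsorted(edges, ability, side="right") - 1
    group = np.clip(group, 0, n_groups - 1)
    return np.array([difficulty[group == g].mean(axis=0) for g in range(n_groups)])


# Made-up adaptive task: difficulty rises with ability for 5 steps, then plateaus.
rng = np.random.default_rng(3)
n_participants, n_steps = 500, 30
ability = rng.uniform(0.0, 1.0, size=n_participants)
ramp = np.minimum(np.arange(1, n_steps + 1), 5) / 5.0
noise = rng.normal(0.0, 0.01, size=(n_participants, n_steps))
difficulty = ability[:, None] * ramp[None, :] + noise
trajectories = difficulty_trajectories(ability, difficulty)
```

Plotting each row of `trajectories` against the step index would show the kind of early plateau described for Dold.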
IDoCT operates by (i) leveraging the population's accuracy and response time measures while (ii) factoring in the component of performance variance that is better explained by basic visuomotor response times and (iii) taking into account potential (and here intended) bias due to more difficult items being sampled for higher-performing participants. Taken together, these characteristics enable recalculation of the trial-difficulty scale in a data-driven manner. The resultant DS scale can potentially be used in future studies as the basis for the adaptive sampling algorithm, though it may still be advisable to include more high-difficulty items given the observed ceiling effects.
More importantly, analysis of the data distributions demonstrates that IDoCT was successful in addressing the issues with the original summary scores. For example, AS has the expected Gaussian distribution (25), supporting the view that IDoCT could overcome the ceiling effect limitation generated by Dold and better capture the underlying crystallised language ability that is the target of the task. Plotting the original AB and re-estimated AS scores against the basic mean reaction time and total correct response measures shows distorted distributions for the former but not the latter. Furthermore, comparing the word difficulty measures before and after the scaling (Figure 5) makes it evident that IDoCT properly addresses the biased sampling issue, increasing the difficulty of words that were presented exclusively to participants of higher ability and decreasing it for words presented only to low-performing individuals. Moreover, the Bland-Altman retest plots show a more homogeneous spread for AS, with a lower SD difference and a higher cross-session correlation. In sum, all analyses indicate that the performance score distributions from IDoCT are superior to those output by the original task.
The obtained measures of DT and AS are further validated by their associations with age and education. Specifically, although DT shows the expected increase with age (26), the effect size of its relationship with education level is negligible. This supports the view that DT captures individual differences in fundamental visuo-motor latencies, as opposed to knowledge of the meaning of words. Conversely, AS improves with age, which is the expected pattern of results as knowledge of words improves and 'crystallises' throughout the lifespan (27-30), but it also improves with exposure to higher education, where there is a higher likelihood of learning unusual words (31,32). Notably, both of these associations are numerically stronger for AS than for the original AB score, together resulting in an improved R² in the linear regression analysis (AB: 0.17, AS: 0.21).
The above results confirm that IDoCT is successful in recalculating the difficulty scale, and in fractionating performance into distinct cognitive and visuomotor components that have superior re-test properties and improved demographic predictive validity. These findings align with our previous applications of this technique to tasks from the Cognitron library (14). The more novel question pertains to whether these advantages in the precision of the task performance estimates extend to improvements in imaging associations.
The machine learning pipeline addresses this by measuring how accurately the original AB measure can be predicted from data of different imaging modalities and using this as a baseline for comparing the AS and DT performance estimates. Overall, the behavioural-imaging associations are in the small range, albeit statistically significant. This may not be unexpected given the simplicity of the linear regression machine learning approach applied and the recent literature on the scale of such associations when estimated within well-powered datasets (33). More importantly, the associations with AS are consistently stronger than those with AB across all analysed imaging modalities. Among the four datasets studied, the volume measures led to models with the highest R² on the test set for both AS and DT, suggesting that these measures might be more informative when trying to predict cognitive ability and visuo-motor latency.
The NLP analysis provides an unbiased, data-driven way to qualitatively evaluate the functional specificity of the AS and DT measures, by determining the most common functional terms that their associated brain regions co-occur with in the literature. The results have face validity, with the brain regions identified as the best predictors of DT mainly relating to visual and motor functions, such as the intracalcarine cortex (34) and the cerebellum (35). This is expected, considering that DT is intended to measure visuo-motor latency times. Conversely, AS is mainly associated with brain regions involved in memory, language and auditory functions, such as the hippocampus (36), uncinate fasciculus (37) and inferior temporal gyrus (38). Considering the nature of the task, which requires participants to associate the meaning of spoken/written words with different images, the identified brain functions are as expected.
A strength of this study is the sample size, which allows firmer and more precise conclusions to be drawn. An important aspect to consider, however, is that the UK Biobank sample includes mainly middle-aged to older individuals (<1% below 50 years old), who are not necessarily representative of the general population in terms of health, physical, and lifestyle aspects (39). This does not undermine our validation of the IDoCT approach, but it should be noted that the data-driven measures of trial difficulty might change if younger individuals were included in the analysis, as different words are more likely to be learnt at different stages of the lifespan (e.g., school vs work). A further strength of the study is that four different kinds of features (imaging-derived phenotypes) from two imaging data modalities were analysed, in combination with both behavioural and demographic data, which provides better insight into the best imaging predictors of AS and DT.
Despite these strengths, there are some limitations. First, IDoCT requires a description of each trial as input in order to extract data-driven measures of trial difficulty. In this study, the word presented at each trial is used as that description because it was the only information available. However, in the case of PVT, the difficulty of a trial is influenced not only by the word presented, but also by the set of pictures from which participants are required to choose. Without this information, it is difficult to interpret why specific words were assigned higher or lower difficulty scores, as the cause could be the difficulty of the word itself, of the figures presented, or a combination of the two. However, a general observation is that words with higher DS values mainly have a Latin, French or Greek etymology. This pattern of results again has some face validity, because English is a Germanic rather than a Romance language (40), but it supports only a limited interpretation due to the lack of information about the word-figure pairs.
Further limitations are related to the NLP analysis. First, it was conducted exclusively on open access papers, which limits the literature search to previous studies that are freely accessible. In addition, although the NLP analysis conducted in this study was sufficient to achieve the intended aim, it consisted exclusively of the estimation of word frequency across different papers. Further research could extend this by using more advanced NLP approaches, such as topic modelling to extract common topics (41), or by implementing digraphs and dependency parsing to identify related and connected words (42). The latter approach could provide additional information on the meaning of the words within the context of the sentence in which they appear and help to reduce the noise of the one-to-many mappings between brain structures and cognitive functions.
Finally, there are two limitations of the imaging analysis that could be addressed by further research. First, the study is exclusively focused on associations with structural measures. However, cognitive processes generally result from complex brain networks (43), rather than individual discrete contributions of brain regions (44). Analyses using graph theoretic approaches to quantify the information processing properties of networks, or focused on network dynamics from functional MRI, might provide better and more detailed insights into the neural correlates of cognitive task performance. Relatedly, the machine learning approaches used here are mainly state-of-the-art shallow linear models, and they are applied independently to each imaging modality. The strength of associations could improve if more advanced methods, such as deep learning and the combination of multi-modal features, were applied. Nonetheless, the fact that association strengths increase for AS vs AB across all imaging modalities confirms our primary hypothesis that IDoCT can provide superior cognitive ability measures for future association studies, including those investigating more advanced imaging analysis methods.
In conclusion, we successfully apply IDoCT to the PVT data collected as part of the UK Biobank imaging extension, and obtain superior prediction of subject-level cognitive ability and visuo-motor latency, as well as an optimised data-driven word difficulty scale calibrated on the UK population. Our results further validate IDoCT by showing the improved relationship between the performance metrics and age and education, as well as brain imaging metrics, in terms of the strength and functional specificity of associations.

Figure 2. Sample and data flowchart. From the 502536 participants in the UK Biobank, 33890 participants had complete data for the Picture Vocabulary data, among which 20593 had data for all imaging modalities of interest, which consisted of

Figure 4. D, DS and Dold distributions. Comparison of the distributions of IDoCT measures of A) unscaled trial difficulty D, B) scaled trial difficulty DS and C) original difficulty Dold.

Figure 5. Change in word difficulty IDoCT scores before and after scaling. The data-driven measures of trial difficulty D were scaled according to the ability of the participants to whom the trials were presented, in order to account for potential biased sampling in the task design, meaning cases in which words were presented exclusively to participants with high/low ability. In the figure, red and blue lines correspond respectively to an increase and a decrease in the difficulty score of each word after the scaling. Each dot corresponds to one word; the size of the dot indicates the number of participants who were presented with that word (i.e. bigger dots mean that a word was presented to a higher number of participants) and the color reflects the average AS of the participants who were presented with that word (or trial).

Figure 6. A, AS and AB distributions at baseline. Comparison of the distributions of IDoCT measures of A) AS (cognitive ability), B) visuo-motor speed DT and C) original ability measure AB.

Figure 8. Trial difficulty trajectories. Participants were divided into 10 groups based on their AB measures. The trial difficulty trajectory of each group was obtained by measuring the mean difficulty (D, DS and Dold) of the presented trials at each step of the cognitive assessment. For participants with medium-high AB, Dold either reached a plateau or decreased after a few trials, meaning that the original difficulty scale was not optimally calibrated.

Figure 9. A, AS and AB distributions at follow-up. Comparison of the distributions of IDoCT measures of A) AS (cognitive ability), B) visuo-motor speed DT and C) original ability measure AB.

Figure 10. Bland-Altman plots of DT, AS and AB. The correlation coefficients r between baseline and follow-up for AS (A), DT (B) and AB (C) were 0.77, 0.57 and 0.66, respectively.

Figure 11. Analysis of age and education associations. Multiple linear regression models were trained using AS (A), DT (B), and AB (C) as predicted variables and age and education as predictors. The obtained beta coefficients in SD units are represented in the figure. Age and education were always significant predictors, but the effect size of education was negligible (-0.1 < x < 0.1) in the case of DT. The color and asterisk represent the significance level of each feature. Age45 and ed7 were used as reference categories. ed1 corresponds to College/University degree, ed2 to A levels/AS levels or equivalent, ed3 to O levels/GCSEs or equivalent, ed4 to CSEs or equivalent, ed5 to NVQ or HND or HNC or equivalent, ed6 to other professional qualification, and ed7 to none of the above. The red lines indicate the area where the effect size can be considered negligible.

Figure 12. AS feature selection. P models were trained using the imaging data in each individual dataset (A) FA, B) Thickness, C) Volume and D) Intensity) as regressors and AS as the predicted variable, where P corresponds to the number of features in each dataset. In each iteration, the feature (among those still available) with the lowest correlation coefficient with the predicted variable was excluded from the regressors, meaning that Model1 was trained on all features except one (i.e. the feature with the lowest r), and ModelP was trained on only one feature (i.e. the feature with the highest r). The R² was calculated on the train and validation sets. The train/validation R² (and their standard deviation across 5 folds) are reported in the figure. The red line represents the best performing model, namely the model that yielded the highest R² on the validation set.

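Figure 14. AB feature selection. P models were trained using the imaging data in each individual dataset (A) FA, B) Thickness, C) Volume and D) Intensity) as regressors and AB as the predicted variable, where P corresponds to the number of features in each dataset. In each iteration, the feature (among those still available) with the lowest correlation coefficient with the predicted variable was excluded from the regressors, meaning that Model1 was trained on all features except one (i.e. the feature with the lowest r), and Model(P-2) was trained on only one feature (i.e. the feature with the highest r). The R² was calculated on the train and validation sets. The train/validation R² (and their standard deviation across 5 folds) are reported in the figure. The red line represents the best performing model, namely the model that yielded the highest R² on the validation set.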

Figure 16. NLP analysis results. The word clouds represent the words with the highest frequency of occurrence in papers on the best neural correlates of A) AS derived from univariate analysis, B) AS derived from multivariate analysis, C) DT obtained from univariate analysis, and D) DT obtained from multivariate analysis. Bigger words correspond to a higher frequency of occurrence.

Table 1
Sample demographics. Age and education of the participants of the study at baseline and at the follow-up timepoint.

Table 2 Multiple linear regression models with imaging features.
The features identified during the feature selection step were used as regressors in multiple linear regression models to predict AS, AB and DT. The R² (train and test), F statistics and p-values of the models are reported in the table.

Table A3 IDoCT results of D and DS.
List of words used in the PVT Task and the measures of D and DS derived from IDoCT.

Table A4 Feature selection of FA dataset when predicting AS.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with AS was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A5 Feature selection of thickness dataset when predicting AS.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with AS was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A6 Feature selection of intensity dataset when predicting AS.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with AS was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A8 Feature selection of FA dataset when predicting DT.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with DT was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A9 Feature selection of thickness dataset when predicting DT.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with DT was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A10 Feature selection of intensity dataset when predicting DT.
(P-2) linear regression models were trained (with P equal to the number of features in the dataset) using 5-fold cross validation and at each iteration the feature with the lowest correlation with DT was removed from the regressors. The mean R² in the train and test set of each model across the folds, as well as the name of the feature dropped at each iteration, is reported in the table.

Table A12 List of most frequent words in the literature related to the neural correlates of DT extracted following the univariate analysis pipeline.
The frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalized between 0 and 1, with 1 corresponding to the most frequent word. The frequency scores of the 5 words with the highest values were added for all the brain regions related to DT. The neural correlates of DT were extracted following the univariate analysis pipeline.

Table A13 List of most frequent words in the literature related to the neural correlates of DT extracted following the multivariate analysis pipeline.
The frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalized between 0 and 1, with 1 corresponding to the most frequent word. The frequency scores of the 5 words with the highest values were added for all the brain regions related to DT. The neural correlates of DT were extracted following the multivariate analysis pipeline.

Table A14 List of most frequent words in the literature related to the neural correlates of AS extracted following the multivariate analysis pipeline.
The frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalized between 0 and 1, with 1 corresponding to the most frequent word. The frequency scores of the 5 words with the highest values were added for all the brain regions related to AS. The neural correlates of AS were extracted following the multivariate analysis pipeline.

Table A15 List of most frequent words in the literature related to the neural correlates of AS extracted following the multivariate analysis pipeline.
The frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalized between 0 and 1, with 1 corresponding to the most frequent word. The frequency scores of the 5 words with the highest values were added for all the brain regions related to AS. The neural correlates of AS were extracted following the multivariate analysis pipeline.