Automated online and app-based cognitive assessment tasks are becoming increasingly popular in large-scale cohorts and biobanks due to advantages in affordability, scalability, and repeatability. However, the summary scores that such tasks generate typically conflate the cognitive processes that are the intended focus of assessment with basic visuo-motor speeds, testing device latencies, and speed-accuracy tradeoffs. This lack of precision presents a fundamental limitation when studying brain-behaviour associations. Previously, we developed a novel modelling approach that leverages continuous performance recordings from large-cohort studies to achieve an iterative decomposition of cognitive tasks (IDoCT), which outputs data-driven estimates of cognitive abilities, and device and visuo-motor latencies, whilst recalibrating trial-difficulty scales. Here, we further validate the IDoCT approach with UK Biobank imaging data. First, we examine whether IDoCT can improve ability distributions and trial-difficulty scales from an adaptive picture-vocabulary task (PVT). Then, we confirm that the resultant visuo-motor and cognitive estimates associate more robustly with age and education than the original PVT scores. Finally, we conduct a multimodal brain-wide association study with free-text analysis to test whether the brain regions that predict the IDoCT estimates have the expected differential relationships with visuo-motor versus language and memory labels within the broader imaging literature. Our results support the view that the rich performance timecourses recorded during computerised cognitive assessments can be leveraged with modelling frameworks like IDoCT to provide estimates of human cognitive abilities that have superior distributions, test-retest reliabilities, and brain-wide associations.

Automated and app-based assessment technologies provide a scalable, cost-effective, and reliable way to measure different aspects of cognitive abilities (Soreq et al., 2021) and to monitor cognitive changes in clinical populations (Brooker et al., 2020; Hampshire, Chatfield, et al., 2022; Hampshire, Trender, et al., 2022). Consequently, this technology is becoming popular in large-scale citizen science projects (Germine et al., 2012; Hampshire, 2020), cohorts (Treviño et al., 2021) and registers (Fawns-Ritchie & Deary, 2020). Building on the resultant big data, a major research drive has been to map associations between the summary scores that these computerised cognitive tasks output, and features of brain structure and function from large-scale imaging cohort studies (Cox et al., 2019; Ferguson et al., 2020). However, it is common practice to summarise a participant’s performance by estimating or contrasting average accuracy and reaction times (RT) across task conditions (Vandierendonck, 2017). The resultant scores relate not only to individual differences in abilities to process the specific cognitive demands that are the intended target of the task (Kiesel et al., 2010; Kornblum et al., 1990), but also to other confounding factors such as visuo-motor processing speeds and the latency of the devices that people are assessed with. This lack of cognitive precision in summary score estimates is a non-trivial limitation for both the strength and the specificity of the associations that can be achieved.

A commonly overlooked advantage of computerised cognitive tasks is that, unlike classic pen-and-paper assessment scales, they record every stimulus and response in a detailed performance timecourse. These performance timecourses can be modelled in more sophisticated ways to obtain ability estimates with greater reliability and process specificity than simple contrasts or averages. Previously, we reported the development of one such modelling framework, IDoCT (Iterative Decomposition of Cognitive Tasks—see Box 1), which we designed to disentangle individuals’ cognitive abilities from other confounding factors in a manner that is robust, computationally inexpensive, and sufficiently flexible to be adapted for practically any task that manipulates cognitive difficulty across trials (Giunchiglia et al., 2023).

Box 1
IDoCT: Iterative Decomposition of Cognitive Tasks

1. Data-driven assessment of trial difficulty D(t)

The performance P(i,t) of participant i in trial t is calculated based on the RT of participant i in trial t and the difficulty D(t) of trial t, if the answer is correct. In case of incorrect answers, the performance is equal to 0. RTmax corresponds to the maximum RT across all trials and all participants.

$$P(i,t) = \begin{cases} 0, & \text{if wrong} \\ \left(1 - \dfrac{RT(i,t)}{RT_{max}}\right) \cdot D(t), & \text{if right} \end{cases}$$

D(t) is calculated according to the performance P(i,t) of all participants i who completed trial t. N corresponds to the total number of participants and T(t,i) to the number of times trial t was repeated for the same participant i.

$$D(t) = 1 - \frac{1}{\sum_{i=1}^{N} T(t,i)} \sum_{i=1}^{N} \sum_{j=1}^{T(t,i)} P(i,t)$$

This creates a mutually recursive definition. At the first iteration, D(t) is set equal to 1 for all trials t and is then iteratively updated. The iterations are stopped when the model converges to an invariant measure of trial difficulty D(t).
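As a concrete illustration of this step-1 fixed point, the following is a minimal NumPy sketch; the function and variable names, the convergence tolerance, and the iteration cap are illustrative and are not taken from the published IDoCT implementation.

```python
import numpy as np

def estimate_difficulty(rt, correct, trial_idx, n_trials, tol=1e-6, max_iter=500):
    """Step-1 sketch: fixed-point iteration between trial difficulty D(t)
    and performance P(i,t). rt, correct, and trial_idx are flat arrays with
    one entry per presented trial (trial_idx holds integer condition labels)."""
    rt_max = rt.max()
    D = np.ones(n_trials)                          # D(t) initialised to 1 for every trial
    for _ in range(max_iter):
        # P = (1 - RT/RTmax) * D for correct answers, 0 otherwise
        P = np.where(correct.astype(bool), (1.0 - rt / rt_max) * D[trial_idx], 0.0)
        # D(t) = 1 - mean performance over all presentations of trial t
        sums = np.bincount(trial_idx, weights=P, minlength=n_trials)
        counts = np.bincount(trial_idx, minlength=n_trials)
        D_new = 1.0 - np.divide(sums, counts, out=np.zeros(n_trials), where=counts > 0)
        if np.max(np.abs(D_new - D)) < tol:        # converged to an invariant D(t)
            break
        D = D_new
    return D_new
```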

2. Data-driven assessment of answer time AT(i,t)

RT is characterised by two components: the answer time AT, that is, the cognitive time required to provide an answer to the cognitive task, and the delay time DT, that is, the visuo-motor latency. The measure of performance can then be updated as follows:

$$P(i,t) = \begin{cases} 0, & \text{if wrong} \\ \left(1 - \dfrac{AT(i,t) + DT(i,t)}{RT_{max}}\right) \cdot D(t), & \text{if right} \end{cases}$$

The answer time AT(i,t) of participant i in trial t is calculated according to the ability A(i) of the participant, the RT(i,t) and the difficulty of the trial D(t).

$$AT_N(i,t) = \left(1 - A(i)\right) \cdot D(t)$$

$$AT(i,t) = AT_N(i,t) \cdot \left(RT(i,t) - DT_{max}(i)\right)$$

Where

$$DT_{max}(i) = \min_t RT(i,t),$$

that is, the RT(i,t) of the trial t at which RT(i,t) is smallest.

ATN(i,t) corresponds to a normalised measure of the answer time, which is converted into milliseconds as AT(i,t) above. The ability A(i) is measured based on the cumulative performance of participant i across all trials t.

$$B(i,t) = B(i,t-1) + P(i,t)$$

$$B_N(i,N) = \frac{\sum_{t=1}^{N} B(i,t)}{N}$$

$$A(i) = B_N(i,Q)$$

Where B(i,t) corresponds to the cumulative performance of participant i up to trial t, and BN(i,N) to the overall average performance across all N trials completed.

The delay time DT(i,t) is calculated as the difference between the reaction time RT(i,t) and the answer time AT(i,t).

$$DT(i,t) = RT(i,t) - AT(i,t)$$

A mutually recursive definition between A(i) and P(i,t) is generated. At the first iteration, A(i) is assumed to be maximal for all participants i and is therefore set equal to 1; the measure of AT is initialised as ATmin(i), which is measured as the difference between RT and DTmax(i); and DT(i,0) is initialised as DTmax(i), which is set equal to RTmin(i). The iterations are stopped when the model converges.
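The quantities involved in this recursion can be sketched as follows, assuming D(t) from step 1 is already available; for brevity the ability is approximated here by a plain mean of performance per participant rather than the cumulative-average bookkeeping described above, so this is an illustration of the update rather than the published implementation.

```python
import numpy as np

def decompose_rt_once(rt, correct, trial_idx, part_idx, D, n_participants):
    """Single pass over the step-2 quantities: ability A(i), answer time AT(i,t),
    and delay time DT(i,t). The published model iterates this recursion (with the
    cumulative performance B of Box 1) until convergence; a plain mean is used here."""
    rt_max = rt.max()
    # DTmax(i): each participant's smallest RT, used as their visuo-motor floor
    dt_max = np.full(n_participants, np.inf)
    np.minimum.at(dt_max, part_idx, rt)
    # Performance, noting that AT + DT = RT
    P = np.where(correct.astype(bool), (1.0 - rt / rt_max) * D[trial_idx], 0.0)
    # Ability A(i): average performance of participant i across their trials
    counts = np.bincount(part_idx, minlength=n_participants)
    A = np.bincount(part_idx, weights=P, minlength=n_participants) / np.maximum(counts, 1)
    AT = (1.0 - A[part_idx]) * D[trial_idx] * (rt - dt_max[part_idx])   # answer time in ms
    DT = rt - AT                                                        # delay time in ms
    return A, AT, DT
```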

3. Measure of specific ability AS(i) and delay time DT(i)

Specific ability AS(i) is calculated according to the cumulative specific performance PA(i,t) of participant i, which is measured using the RT(i,t) corrected for the DT(i,t).

$$P_A(i,t) = \begin{cases} 0, & \text{if wrong answer} \\ \left(1 - \dfrac{AT(i,t)}{AT_{max}}\right) \cdot D(t), & \text{if right answer} \end{cases}$$

$$B_S(i,t) = B_S(i,t-1) + P_A(i,t)$$

$$B_{NS}(i,N) = \frac{\sum_{t=1}^{N} B_S(i,t)}{N}$$

Where BS(i,0) = 0 and BNS(i,N) corresponds to the average of the cumulative performance PA across all N trials completed. Once the cumulative performance is calculated, AS(i) is measured as this overall average:

$$A_S(i) = B_{NS}(i,Q)$$

Where Q corresponds to the total number of trials completed by each participant.

The delay time DT(i) is measured as the average delay time DT(i,t) across all N trials t.

$$DT(i) = \frac{\sum_{t=1}^{N} DT(i,t)}{N}$$

4. Measure of scaled trial difficulty DS(t)

The trial difficulty D(t) is scaled according to the specific ability AS(i) of all N participants i who completed trial t.

$$D_S(t) = \frac{\sum_{i=1}^{N} A_S(i) \cdot D(t)}{N}$$
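A minimal sketch of steps 3 and 4, again collapsing the cumulative averages into plain means and using illustrative names; AT and DT are the trial-level answer and delay times obtained in step 2.

```python
import numpy as np

def specific_ability_and_scaling(AT, DT, correct, trial_idx, part_idx, D,
                                 n_participants, n_trials):
    """Steps 3-4 sketch: specific ability AS(i), mean delay time DT(i), and
    scaled difficulty DS(t). The running-average bookkeeping of Box 1 is
    collapsed into plain means for brevity."""
    at_max = AT.max()
    # PA(i,t): performance based on the delay-corrected answer time
    PA = np.where(correct.astype(bool), (1.0 - AT / at_max) * D[trial_idx], 0.0)
    n_per_part = np.maximum(np.bincount(part_idx, minlength=n_participants), 1)
    AS = np.bincount(part_idx, weights=PA, minlength=n_participants) / n_per_part
    DT_i = np.bincount(part_idx, weights=DT, minlength=n_participants) / n_per_part
    # DS(t): D(t) scaled by the mean AS of the participants who saw trial t
    n_per_trial = np.maximum(np.bincount(trial_idx, minlength=n_trials), 1)
    DS = D * np.bincount(trial_idx, weights=AS[part_idx], minlength=n_trials) / n_per_trial
    return AS, DT_i, DS
```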

More specifically, IDoCT leverages trial-by-trial performance data from large cohorts to estimate individuals’ abilities to specifically cope with higher cognitive difficulty across trials (AS) and their visuo-motor delay times (DT), while accounting for individual speed-accuracy tradeoffs. Notably, the approach concurrently recalculates the relative difficulty assigned to trials across task conditions using a fixed-point iterative process that handles the circularity of simultaneously defining individual performance from trial difficulty and trial difficulty from individual performance (Giunchiglia et al., 2023). This is achieved in a manner that accounts for any bias towards sampling more difficult conditions in more able individuals, as is the case in some adaptive designs, resulting in a robust data-driven recalibration of trial-difficulty scales.

We initially validated IDoCT by applying it to data from >400,000 participants who undertook 12 cognitive tasks during the Great British Intelligence Test (Hampshire, 2020). The results showed a successful decomposition of cognitive versus device and visuo-motor latencies, as gauged by superior sociodemographic associations and AS estimates that were not dwarfed by an inflated global intelligence factor (Giunchiglia et al., 2023). This combination of sensitivity and decorrelation holds promise for achieving stronger and more process-specific functional-anatomical mappings in behavioural-imaging association studies.

Here, we further validate IDoCT in the context of an adaptive Picture Vocabulary Task (PVT) (Dunn & Dunn, 2007; Weintraub et al., 2013) that was designed to measure “crystallised” comprehension and reading decoding abilities (Fawns-Ritchie & Deary, 2020; Gershon et al., 2014), and that was deployed with 34,927 participants as part of the UK Biobank (UKB) imaging extension (Sudlow et al., 2015). Extracting summary measures for this PVT presents a notable challenge because it applied an adaptive staircase sampling algorithm to efficiently measure each participant’s ability level, but the word-picture difficulty scale that the sampling algorithm traversed proved to be sub-optimally calibrated for the assessed population. This resulted in aberrant sampling trajectories with accuracy ceiling effects and an unexpected bimodal distribution of population performance scores that is not ideal for analysis of associations (Sudlow et al., 2015). In theory, IDoCT should be able to resolve this ill-posed problem by recalibrating the trial-difficulty scale whilst using both speed and accuracy to produce more precise estimates of participant performance, with applications in functional-brain mapping studies.

To test this theory, we first confirm the expected improvements in distributions and test-retest reliability of PVT performance estimates for IDoCT relative to the original summary scores. Next, we evaluate whether the IDoCT PVT performance estimates associate more robustly, and in an interpretable manner, with participant age and education. Then, we use a simple linear machine-learning pipeline to test the hypothesis that the IDoCT PVT estimates can be more reliably predicted from four distinct feature sets of the UKB structural imaging database. Finally, we conduct free text mining across the imaging literature to determine whether the brain regions that predict IDoCT PVT estimates of visuo-motor and cognitive abilities are differentially associated in the neuroscience literature with visual and motor functions versus language and memory functions respectively.

Our full analysis pipeline consisted of nine steps, summarised in Figure 1, namely 1) data curation, 2) IDoCT modelling, 3) evaluation of distributions, 4) estimation of trials’ difficulty trajectories, 5) assessment of test-retest reliability, 6) association with age and education, 7) imaging associations, 8) automated literature extraction and preprocessing, and 9) free text analysis.

Fig. 1.

Summary of analysis. The full analysis consisted of nine steps: 1) data curation, 2) IDoCT modelling, 3) evaluation of distributions, 4) estimation of trials’ difficulty trajectories, 5) assessment of test-retest reliability, 6) association with age and education, 7) imaging associations, 8) automated literature extraction and preprocessing, and 9) free text analysis.


2.1 Study design and participants

The data analysed in this study were collected as part of UKB, a population-based prospective study that recruited >500,000 participants, aged between 40 and 69, in the 2010-2016 timeframe. The aim of UKB is to understand the genetic and non-genetic factors that contribute to the diseases that mainly affect the middle-aged and older population. As part of the study, a subset of individuals underwent imaging, genetic, and health and demographic assessments longitudinally, across 22 assessment centres throughout the United Kingdom (Sudlow et al., 2015). All participants provided written informed consent. Detailed information on the study design is available in the original paper (Ferguson et al., 2020) or online at https://www.ukbiobank.ac.uk/. In the current study, demographic (i.e., age and education), imaging (i.e., fractional anisotropy from diffusion-weighted images, and volume, intensity, and thickness measures from structural MRI), and behavioural data (i.e., performance on the PVT) were used.

In total, 34927 participants completed the PVT at at least one timepoint. Of these, 33890 had complete picture vocabulary task data at baseline, of whom 20593 had data for all imaging measures of interest, and 4962 had cognitive data at follow up. Figure 2 shows the exact numbers of participants used for each step of the analysis and the exclusion criteria.

Fig. 2.

Sample and data flowchart. From the 502536 participants in the UK Biobank, 33890 participants had complete data for the Picture Vocabulary data, among whom 20593 had data for all imaging modalities of interest, which consisted of 26 fractional anisotropy features from diffusion weighted images, 43 intensity measures, and 139 volume and 60 thickness measures from structural MRI. 4962 participants had picture vocabulary data at follow up.


2.2 Picture Vocabulary task (PVT)

The PVT used for the UK Biobank was adapted from the NIH Toolbox Picture Vocabulary Test (Dunn & Dunn, 2007; Kreutzer et al., 2011; Weintraub et al., 2013). This task was designed to assess a person’s semantic/language functions by measuring their ability to match words to pictures. At each trial, participants were presented with a written word along with a set of four images, and they were required to select the picture that matched the word. Every participant was started on the same word, which was at an easy vocabulary level. The difficulty of the trials was then adaptively adjusted, using an Item Response Theory (IRT) model, in response to the participant’s performance, according to a maximum likelihood estimate (MLE) of their vocabulary level, calculated from their answers so far. For example, if a participant selected the correct picture at trial n, then the word-picture combination presented at trial n+1 was at a higher level of difficulty, unless the supply of more difficult words had run out. Each participant’s sequence of trials lasted for at least 20 words, and was terminated when either the maximum likelihood estimate was accurate enough (within a standard error of <0.5) or a maximum of 30 trials had been reached.
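The control flow of this adaptive loop can be summarised in pseudocode form; the helper callables below (`present_trial`, `update_vocabulary_mle`, `words_by_difficulty.next_word`) are hypothetical stand-ins for the IRT machinery and stimulus bank, shown only to illustrate the start word, the staircase adjustment, and the two stopping rules.

```python
def run_pvt_session(words_by_difficulty, start_word, present_trial, update_vocabulary_mle,
                    min_trials=20, max_trials=30, se_threshold=0.5):
    """Illustrative control flow of the adaptive PVT staircase.
    `present_trial` shows a word and returns whether the answer was correct;
    `update_vocabulary_mle` returns the current IRT ability estimate and its SE."""
    responses = []
    word = start_word
    for trial in range(max_trials):
        correct = present_trial(word)
        responses.append((word, correct))
        estimate, se = update_vocabulary_mle(responses)
        # Stop once at least 20 words have been shown and the MLE is precise enough
        if trial + 1 >= min_trials and se < se_threshold:
            break
        # Staircase: move up in difficulty after a correct answer (if harder
        # words remain), otherwise move down
        word = words_by_difficulty.next_word(estimate, go_harder=correct)
    return responses
```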

The similarities and differences from the NIH Toolbox test were as follows. The UK Biobank version used a touchscreen to present the words in written form only, unlike the NIH version, which was administered by staff who also pronounced the words. The set of pictures, words, and their associated difficulty levels was the same as in the English-language version of the NIH PVT, but with modifications to address obvious difficulties: words differing between UK and US English were changed or removed; for example, “intersection” (US) was replaced with “crossroads” (UK), and the word “minute” was removed due to the ambiguity of its written meaning, as the picture of a clock could have detracted from its intended meaning of a small size. This resulted in a dataset of 340 words. The same methods for estimating vocabulary levels and for the algorithmic adaptation of the next question were used in both tests. However, the starting word and the maximum number of trials differed for the UK Biobank test.

2.3 Processing of UKB imaging data

The imaging data consisted of 26 measures of fractional anisotropy (FA) from diffusion-weighted images (DWI), and 139 volume, 43 intensity, and 60 cortical thickness measures from T1-weighted structural magnetic resonance images (MRI). All these imaging features were provided by UKB. Detailed information on the imaging processing is available in Alfaro-Almagro et al. (2018), and UKB feature labels are provided in Supplementary Table A1. In brief, DICOM images were converted to NIFTI, fully anonymised using a defacing mask, and corrected for gradient distortion (GD). In the case of T1 images, the field of view (FOV) was cut down, the images were non-linearly registered to the standard MNI152 space (1 mm resolution), and the brain was extracted. Finally, tissue segmentation was conducted to identify the different tissues and subcortical structures, and final volume measures were extracted after correcting for total brain size. In the case of diffusion-weighted images, the first step of the processing consisted of correction for head motion and eddy currents, followed by gradient distortion correction. Then, measures of fractional anisotropy (FA) and mean diffusivity were extracted.

2.4 IDoCT: extraction of measures of visuo-motor latency and cognitive ability

IDoCT is an iterative method (Giunchiglia et al., 2023) that takes as input trial-by-trial measures of reaction time (RT) and of accuracy for each participant. The measure of accuracy can be either binary (i.e., whether participants selected the right or wrong answer in a given trial) or continuous (i.e., how close participants were to the correct answer). For instance, if the task requires participants to press on a target on the screen and they miss the target, the binary accuracy would be 0, while the continuous accuracy would be a measure of how far they were from the target when they pressed the screen. In addition, the model requires condition labels for each trial in the timecourse, which are defined based on the design of the task. Here, for example, the words presented at each trial are assumed to vary in difficulty; therefore, they are each assigned their own unique condition label. Then, through two separate iterative processes, the model returns measures of trial difficulty (D), of specific ability (AS), and of basic visuo-motor response speed (DT) for each participant. In a further step, it calculates scaled measures of trial difficulty (DS) that account for cases in which the assignment of trials to participants of differing abilities was biased towards different difficulty levels. This occurs, for example, when the most difficult trials are presented exclusively or predominantly to the most able participants, which should be the case here if the PVT sampling algorithm has sampled across an ordered difficulty scale. DS is obtained by scaling D based on the ability of the participants who completed the specific trials. Details on the implementation of IDoCT are available in the original publication (Giunchiglia et al., 2023) and are summarised in Box 1. In the case of the PVT, the accuracy measures are binary, and each trial is defined according to the word that is presented to the participants. Therefore, a given trial could occur just once per participant per session, with different participants completing different combinations of trials.
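As an illustration of the expected input, the PVT timecourse can be arranged as a long-format table with one row per presented word; the column names and the toy rows below are purely illustrative.

```python
import pandas as pd

# One row per presented trial: who saw which word, how fast, and whether correct.
# For the PVT the condition label is simply the word itself (binary accuracy).
trials = pd.DataFrame({
    "participant_id": [1001, 1001, 1001, 1002, 1002],
    "condition":      ["calm", "herd", "plethora", "calm", "monarch"],  # word shown
    "rt_ms":          [1450, 1820, 5230, 1610, 2940],
    "correct":        [1, 1, 0, 1, 1],
})

# Integer-encode participants and conditions before running the iterative model
trials["part_idx"] = trials["participant_id"].astype("category").cat.codes
trials["trial_idx"] = trials["condition"].astype("category").cat.codes
```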

2.5 Test-retest reliability

IDoCT estimates were computed on the follow-up data to assess the test-retest reliability of the model. The results were compared to the reliability of the original ability scale (AB). The model computation was the same as for the baseline data, with the difference that the model parameters D, RTmax, and ATmax obtained from the baseline analysis were used when estimating abilities at the second timepoint. The baseline RTmax and ATmax were used as they represent scaling factors in the model. Therefore, if different scaling factors were used in the follow-up data, then the AS/DT results across timepoints would not be comparable. D was derived from the baseline analysis because one of the assumptions of the model is that D can be reliably derived from a given representative sample and once it is derived then it can be used on different datasets. Pearson correlations and Bland-Altman plots were used to compare and visualise the reliability of DT, AS, and AB estimates across sessions, as well as any trend towards improved or worsened ability scores across time.
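A minimal sketch of this reliability check, assuming `baseline` and `followup` are per-participant score arrays aligned on the same individuals; the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

def test_retest(baseline, followup, label="AS"):
    """Pearson reliability plus a Bland-Altman view of baseline vs follow-up scores."""
    r, p = pearsonr(baseline, followup)
    means = (baseline + followup) / 2.0
    diffs = followup - baseline                      # positive = improvement over time
    bias = diffs.mean()
    loa = 1.96 * diffs.std(ddof=1)                   # 95% limits of agreement
    plt.scatter(means, diffs, s=5, alpha=0.3)
    plt.axhline(bias, color="k")
    plt.axhline(bias + loa, color="k", linestyle="--")
    plt.axhline(bias - loa, color="k", linestyle="--")
    plt.xlabel(f"Mean {label} (baseline, follow-up)")
    plt.ylabel(f"Follow-up minus baseline {label}")
    plt.title(f"{label}: r = {r:.2f} (p = {p:.1e})")
    return r, p
```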

2.6 Age and education association

Multiple linear regression was used to compare the sensitivity of the IDoCT AS and DT estimates, and of the original AB scores, to age and education. Age was converted into 5-year age bins to account for the non-linear relationship between age and cognition, and then one-hot encoded. Education was one-hot encoded into seven categories (College or University Degree, A levels or equivalent, O levels/GCSEs or equivalent, CSEs or equivalent, NVQ or HND or HNC or equivalent, Other professional qualifications, None of the above). Individuals selecting “Prefer not to answer” were treated as having missing information. An alpha threshold of p < 0.01 was used to determine significance.
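A sketch of this regression using statsmodels, assuming a dataframe `df` with columns `AS` (or `DT`/`AB`), `age`, and `education`; the bin edges and the automatically chosen reference categories are illustrative rather than identical to those used in the analysis.

```python
import pandas as pd
import statsmodels.api as sm

def age_education_model(df, outcome="AS"):
    """Multiple linear regression of an IDoCT score on one-hot encoded
    5-year age bins and education categories."""
    work = df.copy()
    work["age_bin"] = pd.cut(work["age"], bins=range(40, 95, 5), right=False)
    work = work[work["education"] != "Prefer not to answer"]          # treated as missing
    X = pd.get_dummies(work[["age_bin", "education"]].astype(str),
                       drop_first=True, dtype=float)                  # reference categories dropped
    X = sm.add_constant(X)
    model = sm.OLS(work[outcome], X, missing="drop").fit()
    return model   # model.rsquared, model.fvalue, model.pvalues, etc.
```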

2.7 Neural correlates of AS and DT: feature selection

A summary of the imaging association pipeline is provided in Figure 3. Models were optimised and fitted separately for each imaging modality (i.e., measures of volume, cortical thickness, fractional anisotropy, and intensity) and for each set of cognitive summary scores (AS, AB, and DT). First, the dataset was split into a training set and a held-out test set with a 75/25 split. Then, age was regressed out of the imaging and cognitive feature vectors. The bivariate Spearman’s correlation between each imaging feature and the target cognitive summary score was computed for the training set only, and the features were ranked according to the magnitude of the obtained correlation coefficients r. Next, the combination of features that yielded the best prediction of the target variable was identified in a simple stepwise process using multiple linear regression with five-fold cross-validation. Specifically, models were trained and evaluated across the five folds, iteratively removing the imaging feature with the lowest-magnitude correlation coefficient r until just one remained within the predictor matrix. The optimal number of features was defined as the one producing models with the highest mean R2 across the validation folds. The model was then refitted to all of the training data using that optimal number of features. The train-test sets and folds were the same for all models, to enable cross-comparison of model performance. The relative predictability of the different cognitive score estimates from the imaging data was evaluated by comparing the R2 values when the optimal trained models for the different imaging modality-cognitive score combinations were applied to the held-out test data.
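The stepwise procedure can be sketched as follows for one imaging modality and one cognitive score, assuming `X`, `y`, and `age` are aligned NumPy arrays; the split proportion and fold count follow the text, while everything else (names, seeds, residualisation shown for the training set only) is illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression

def residualise(v, age):
    """Regress age out of a variable and return the residuals."""
    A = np.column_stack([np.ones_like(age, dtype=float), age])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

def stepwise_selection(X, y, age, seed=0):
    X_tr, X_te, y_tr, y_te, age_tr, age_te = train_test_split(
        X, y, age, test_size=0.25, random_state=seed)
    # Remove age effects from the imaging and cognitive vectors (training set shown)
    X_tr = np.column_stack([residualise(X_tr[:, j], age_tr) for j in range(X_tr.shape[1])])
    y_tr = residualise(y_tr, age_tr)
    # Rank features by the magnitude of their Spearman correlation with the target
    rho = np.array([abs(spearmanr(X_tr[:, j], y_tr).correlation) for j in range(X_tr.shape[1])])
    order = np.argsort(rho)[::-1]                     # strongest feature first
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for k in range(X_tr.shape[1], 0, -1):             # iteratively drop the weakest feature
        cols = order[:k]
        r2 = cross_val_score(LinearRegression(), X_tr[:, cols], y_tr, cv=cv, scoring="r2")
        scores.append((r2.mean(), cols))
    best_r2, best_cols = max(scores, key=lambda s: s[0])
    final = LinearRegression().fit(X_tr[:, best_cols], y_tr)
    return final, best_cols, best_r2
```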

Fig. 3.

Imaging analysis summary. The imaging analysis was computed separately for each dataset (FA, Volume, Intensity, and Thickness) and each potential outcome (AS, AB, and DT). The analysis can be summarised in nine main steps: 1) the data were split into train and test sets, 2) the imaging features were ranked in ascending order based on their correlation coefficient r with the outcome variable (AS, AB, or DT), 3) N groups of features were generated (with N equal to the number of imaging features), where in each group the feature with the next lowest coefficient r was progressively excluded, 4) N models were trained using five-fold cross-validation (with N equal to the number of imaging features), where in each model the feature with the next lowest coefficient r was progressively excluded from the analysis, 5) the models were ranked based on their R2 in the validation set, 6) the model with the highest mean R2 on the validation set was selected as the best model and the features used to train it as the best features, 7) the selected features were used to train a final multiple linear regression model using the full train set and then tested on the held-out test set, which was not used in the selection process, 8) the top 15 features with the highest correlation coefficient r with AS or DT were identified as the first set of best neural correlates of the outcome of interest, and 9) the eta squared of the significant features was calculated and the top 15 features with the highest eta squared were identified as the second set of best neural correlates of the outcome of interest.


2.8 Literature review using natural language processing

The imaging features that contributed the most to the prediction of AS and DT were selected for further investigation using a Natural Language Processing (NLP) (Chen et al., 2021) pipeline, in order to confirm whether they mapped onto the expected cognitive and visuo-motor systems. To select features, both univariate and multivariate approaches were used. First, the Eta2 values of the significant features (alpha threshold of 0.05) were calculated from the best-fit models described in the previous section, and the features that were both significant and among the top 15 features with the highest Eta2 were selected for further analysis (multivariate approach, derived from the fitted multiple regression models). The Eta2 was derived by computing the ANOVA of the best-fit models and calculating Eta2 as SSeffect/SStotal, where SSeffect corresponds to the sum of squares for the effect of interest and SStotal to the total sum of squares across all effects, errors, and interactions in the ANOVA. Separately, for the univariate approach, the top 15 features with the highest magnitude of Pearson correlation with AS and DT were selected. The resultant feature labels were pooled across imaging modalities, producing AS and DT lists. We used an NLP literature-search approach to summarise previously published literature in order to minimise author bias when choosing and interpreting papers related to the different brain regions.
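The Eta2 computation can be reproduced from a fitted statsmodels model; a minimal sketch, assuming the selected feature columns have formula-safe names and sit in the same dataframe as the outcome.

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def eta_squared(df, outcome, features):
    """Fit the best model and compute eta^2 = SS_effect / SS_total for each feature."""
    formula = f"{outcome} ~ " + " + ".join(features)
    fit = smf.ols(formula, data=df).fit()
    table = anova_lm(fit, typ=1)                      # one row per feature, plus Residual
    ss_total = table["sum_sq"].sum()                  # effects + error
    table["eta_sq"] = table["sum_sq"] / ss_total
    return table.drop(index="Residual")["eta_sq"].sort_values(ascending=False)
```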

Research papers were identified via multiple advanced search criteria based on the brain feature labels, provided in detail in Supplementary Table A2. In general, all papers with the name of the brain region in the title and/or abstract, and with the words cognition and/or cognitive function in the body of the main text, were selected. The addition of the “cognitive function” criterion was necessary to avoid analysing papers that were mainly related to the cellular and biological aspects of the different brain areas. All papers that matched the advanced search criteria and that could be downloaded in HTML format from PubMed Central were included in the analysis. Papers that were only available in PDF format, which mostly corresponded to papers published prior to 2000, were excluded, as were papers that were not open access. In total, across all brain regions, 1602 papers were analysed. Detailed numbers of papers analysed for each individual brain region are provided in Supplementary Table A2.
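For illustration, such a search can be issued programmatically against the NCBI E-utilities; the query string below is a simplified stand-in for the advanced search criteria listed in Supplementary Table A2, and the field tags are illustrative rather than the exact ones used.

```python
import requests

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pmc(region, retmax=500):
    """Query PubMed Central for papers mentioning a brain region together with
    cognition/cognitive function; returns a list of PMC record identifiers."""
    term = (f'("{region}"[Title] OR "{region}"[Abstract]) '
            'AND ("cognition" OR "cognitive function")')
    params = {"db": "pmc", "term": term, "retmax": retmax, "retmode": "json"}
    reply = requests.get(ESEARCH, params=params, timeout=30)
    reply.raise_for_status()
    return reply.json()["esearchresult"]["idlist"]
```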

The HTML documents were pre-processed using the Auto-CORPus pipeline (Automated pipeline for Consistent Outputs from Research Publications) (Beck et al., 2022), an NLP tool that converts publications with an HTML structure into the BioC JSON format (Comeau et al., 2013). The BioC JSON format uses a standard structure developed to allow for the interoperability of text-mining outputs across different systems. Concretely, it consists of collections of documents, extracted from a corpus, characterised by different elements that contain the actual text as well as additional information about the original document. The text is automatically divided into common sections (e.g., abstract, results) and paragraphs, and each section is associated with its respective, unique, Information Artifact Ontology (IAO) annotation (Ceusters, 2012). IAO annotations enable identification of the same sections across different publications, even when they appear under different titles (e.g., “Methods” and “Methodology”). For our NLP analysis, only the sections corresponding to the Abstract, Discussion, and Introduction were used, which corresponded respectively to the IAO annotations IAO:0000316, IAO:0000315, and IAO:0000319.
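A minimal sketch of how these sections can be pulled out of an Auto-CORPus BioC JSON file; the exact infon key that stores the IAO identifier depends on the Auto-CORPus version, so the sketch simply scans all infon values for the three IAO ids of interest.

```python
import json

# IAO identifiers for the sections retained in the analysis
KEEP_IAO = {"IAO:0000316",   # abstract
            "IAO:0000315",   # discussion
            "IAO:0000319"}   # introduction

def extract_sections(bioc_json_path):
    """Collect the text of abstract/introduction/discussion passages from a
    BioC JSON file produced by Auto-CORPus."""
    with open(bioc_json_path) as fh:
        collection = json.load(fh)
    paragraphs = []
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            infons = passage.get("infons", {})
            # Keep the passage if any of its infon values is one of the target IAO ids
            if any(value in KEEP_IAO for value in infons.values()):
                paragraphs.append(passage.get("text", ""))
    return paragraphs
```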

The extracted paragraphs were cleaned by converting all words to lower case, removing words that were fewer than 2 or more than 20 characters long, removing all special characters and punctuation, and removing all stop words, which correspond to commonly used words in the English language, such as “you,” “an,” or “in.” In addition, words that relate to the different brain structures (e.g., gyrus, nucleus, cerebellum) or that are commonly associated with the study design (e.g., controls, patients, humans) were excluded. After the data cleaning, the frequency of occurrence of each word was calculated across all papers for each individual brain region and normalised between 0 and 1, with 1 corresponding to the most frequent word.
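A sketch of this cleaning and normalisation step, assuming `paragraphs` is the list of section texts for one brain region; the anatomy and study-design exclusion sets shown are small illustrative subsets of those actually used.

```python
import re
from collections import Counter

from nltk.corpus import stopwords    # requires nltk.download("stopwords")

# Additional exclusions (illustrative subsets): anatomy and study-design words
ANATOMY_WORDS = {"gyrus", "nucleus", "cerebellum", "cortex"}
DESIGN_WORDS = {"controls", "patients", "humans", "participants"}

def word_frequencies(paragraphs):
    """Clean the pooled text for one brain region and return word frequencies
    normalised to [0, 1], with 1 for the most frequent word."""
    stop = set(stopwords.words("english")) | ANATOMY_WORDS | DESIGN_WORDS
    counts = Counter()
    for text in paragraphs:
        text = re.sub(r"[^a-z\s]", " ", text.lower())          # strip punctuation/special chars
        tokens = [w for w in text.split()
                  if 2 <= len(w) <= 20 and w not in stop]
        counts.update(tokens)
    top = max(counts.values()) if counts else 1
    return {word: n / top for word, n in counts.items()}
```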

In order to assess the main brain functions associated with AS and DT, the frequency scores of the five words with the highest values were summed separately across all the brain regions related to AS and to DT. In this way, a word appearing frequently in multiple brain regions was assigned a higher aggregate frequency score than a frequent word appearing in only one individual brain region.
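A minimal sketch of that aggregation, assuming `region_freqs` maps each selected brain-region label to its normalised word-frequency dictionary from the previous step.

```python
def aggregate_top_words(region_freqs, top_n=5):
    """Sum the normalised frequency scores of each region's top-N words across
    all regions linked to a given IDoCT measure (AS or DT)."""
    combined = {}
    for region, freqs in region_freqs.items():
        top_words = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for word, score in top_words:
            combined[word] = combined.get(word, 0.0) + score
    # Words recurring across several regions accumulate higher aggregate scores
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))
```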

2.9 Software

The analysis was conducted in Python (3.7.1). The main modules used were pandas (1.3.5), NumPy (1.21.5), pingouin (0.5.3), statsmodels (0.13.1), nltk (3.7), and gensim (4.2.0). The visualisation was completed with seaborn (0.11.0), matplotlib (3.5.1), and wordcloud (1.8.2.2) in Python, and ggplot2 (3.3.6) in R (4.0.1).

3.1 Samples characteristics

The full sample consisted of 34927 participants at baseline, of whom 33890 fully completed the PVT. The mean age was 64.7 ± 7.8 years, and 53% of the participants had an education level comparable to A/AS levels or higher. Among these 34927 participants, 4962 (mean age: 61.7 ± 7.2 years) completed the PVT at a follow-up timepoint. Full details on the sample demographics are available in Table 1.

Table 1.

Sample demographics.

| Variable | Category | Baseline, N (%) | Follow up, N (%) |
| --- | --- | --- | --- |
| Total count |  | 34927 | 4962 |
| Age group (years) | 40-50 | 346 (1%) | 54 (1%) |
|  | 50-60 | 9525 (27%) | 1557 (39%) |
|  | 60-70 | 14225 (41%) | 1661 (42%) |
|  | 70-80 | 10451 (30%) | 661 (17%) |
|  | >=80 | 380 (1%) | 6 (0%) |
|  | Missing information | 0 (0%) | 0 (0%) |
| Educational level | College or University Degree | 16435 (47%) | 1888 (48%) |
|  | A level or equivalent | 12828 (37%) | 1519 (39%) |
|  | O levels/GCSEs or equivalent | 18567 (53%) | 2174 (55%) |
|  | CSEs or equivalent | 4649 (13%) | 633 (16%) |
|  | NVQ or HND or HNC or equivalent | 6558 (19%) | 802 (20%) |
|  | Other professional qualifications (e.g., nursing, teaching) | 12783 (36%) | 1447 (37%) |
|  | None of the above | 2034 (6%) | 165 (4%) |
|  | Prefer not to answer | 116 (0.3%) | 54 (1%) |
|  | Missing information | 419 (1%) | 0 (0%) |

Age and education of the participants of the study at baseline and at the follow-up timepoint. CSE = Certificate of Secondary Education, GCSE = General Certificate of Secondary Education, NVQ = National Vocational Qualification, HNC = Higher National Certificate, HND = Higher National Diploma, A Level = Advanced Level, O Level = Ordinary Level.

3.2 Measures of individual trial difficulty, ability, and visuo-motor speed from IDoCT

The IDoCT model converged after 250 iterations when estimating the trial difficulty measures (D), reaching a mean percent change in word difficulty that tended to 0. The second iterative process, necessary to determine AS and DT, converged in 10 iterations while defining the measures of ability and performance. DS was calculated in a final step of the model computation.

The distributions of D (unscaled difficulty) and DS (scaled difficulty) are shown in Figure 4A, B, together with the original difficulty scale (Dold) (Fig. 4C) and the association between DS and Dold (Fig. 4D). The mean D, DS, and Dold were respectively 0.60 ± 0.07, 0.72 ± 0.06, and 0.51 ± 0.24. The full list of the words presented, and of their associated measures of D, DS, and Dold, is available in Supplementary Table A3. In brief, according to the IDoCT D scale, the five most difficult words, in ascending order of difficulty, were glower, malefactor, pachyderm, matron, and plethora, while the five easiest words, in ascending order of difficulty, were calm, weld, herd, desolate, and engraved. According to DS, the five hardest words were buffet, trivet, prodigious, bucolic, and truncate, and the five easiest were fabricate, angry, monarch, fly, and run. In general, most of the words that were assigned higher difficulty scores were of Latin, Greek, or French origin.

Fig. 4.

D, DS, and Dold distributions and the association between DS and Dold. Comparison of the distributions of (A) the unscaled IDoCT trial difficulty D, (B) the scaled IDoCT trial difficulty DS, and (C) the original difficulty Dold. (D) Association between DS and Dold.


The change in assigned difficulty level per word-picture pair between D and DS is presented in Figure 5. In brief, the difficulty of words that were presented exclusively to participants with lower abilities tended to decrease after the scaling. On the other hand, words that were assigned only to participants with higher abilities tended to increase in difficulty after scaling. This pattern of results accords with the method working to correct for sampling bias.

Fig. 5.

Change in word difficulty IDoCT scores before and after scaling. The data-driven measures of trial difficulty D were scaled according to the ability of the participants to whom the trials were presented, in order to account for potential biased sampling in the task design, meaning cases in which words were presented exclusively to participants with high/low ability. In the figure, red and blue lines correspond respectively to the increase and decrease in difficulty score for each word after the scaling. Each dot corresponds to one word, the size of the dot indicates the number of participants who were presented that word (i.e., bigger dots mean that a word was presented to a higher number of participants), and the colour relates to the average AS of the participants who were presented a specific word (or trial).


The mean AS predicted by IDoCT across the cohort was 0.39 ± 0.08, while the mean DT was 3160 ± 1081. The mean AB was 0.85 ± 0.09. The distributions of AS, DT, and AB were compared, as well as their associations with the raw median RT and the number of correct answers, as presented in Figure 6. As can be observed, AB was characterised by an atypical bimodal distribution centred around high ability scores (~0.85), which supports the hypothesis of a ceiling effect. On the other hand, both DT and AS had the expected near-Gaussian distributions, with AS being centred around average scores (~0.4).

Fig. 6.

AS, DT, and AB distributions at baseline. Comparison of the distributions of (A) the IDoCT cognitive ability AS, (B) the IDoCT visuo-motor delay time DT, and (C) the original ability measure AB.


Comparing the associations of AB and AS (Fig. 7) with the number of correct replies and the median RT, the expected associations are observed for AS, with more correct answers and faster RTs being related to higher AS. On the other hand, in the case of AB, some participants were assigned a low AB score despite giving correct answers in the majority of cases and despite having low median RTs. Similar associations were observed when comparing AS, AB, and DT to the 25th and 75th quantiles of the RT distributions (Fig. 8).

Fig. 7.

AS, AB, and DT association with median RT and number of correct answers. (A) Association between AS and number of correct answers, (B) Association between AS and median RT, (C) Association between AB and number of correct answers, (D) Association between AB and median RT, (E) Association between DT and number of correct answers, (F) Association between DT and median RT, (G) Association between AS and AB, and (H) Association between DT and AB.

Fig. 8.

AS, AB, and DT association with the 25th and 75th quantile of the RT distribution. (A) Associations between AS and the 25th/75th quantile of RT distribution, (B) Associations between DT and the 25th/75th quantile of RT distribution, and (C) Associations between AB and the 25th/75th quantile of RT distribution.


In the case of DT, no association with the number of correct answers was observed (Fig. 7E), which is expected given that the level of visuo-motor latency should not affect how correctly each participant replies. On the other hand, DT was almost linearly associated with the median RT (Fig. 7F), with higher median RTs for participants with longer DT. This is also expected, since participants with longer visuo-motor latencies should be slower overall. No clear association was observed between AS and AB or between DT and AB. The latter is not surprising, given that DT is not supposed to capture cognitive performance in the way AB does. The association between AS and AB, meanwhile, resembles that between AB and the number of correct answers, which is expected given that AS is almost linearly associated with accuracy (Fig. 7G, H).

3.3 Trial-by-trial difficulty trajectories

To validate that Dold led to ceiling effects, trial difficulty trajectories were plotted, showing how the difficulty of the sampled word-picture combinations changed sequentially through the task. Separate trajectories were computed for D, DS, and Dold, and for participants with different ability levels. Participants were divided into 10 groups based on AB (0-0.1, 0.1-0.2, … 0.9-1.0), and their mean trajectories were obtained by averaging the difficulty scores of the words presented at each step, as shown in Figure 9. If the adaptive staircase approach had worked as intended, then a gradual increase in difficulty should have been observed for all participants with medium-high ability (AB > 0.3). Instead, after as few as five trials, Dold either reached a plateau or started to decrease, which suggests that no words with higher difficulty, as defined by Dold, were available after that point, forcing the algorithm to provide words of equal or lower difficulty. Furthermore, the difficulty level as defined by D and DS did not appear to increase across the trajectories. Together, these results support the hypothesis that the original difficulty scale was not optimally calibrated for the assessed cohort.

Fig. 9.

Trial difficulty trajectories. Participants were divided into 10 groups based on their AB measures. The trial difficulty trajectory of each group was obtained by measuring the mean difficulty (D, DS, and Dold) of the presented trials at each step of the cognitive assessment. For participants with medium-high ability, Dold either reached a plateau or decreased after a few trials, indicating that the original difficulty scale was not optimally calibrated.


3.4 AS has better test-retest reliability compared to AB

IDoCT was applied to the follow-up data to evaluate retest reliability. The model converged after 250 and 10 iterations respectively when predicting D and AS/DT. The distributions of the predicted AS and DT, as well as of AB, are available in Figure 10. The mean AS, DT, and AB at follow up were respectively 0.4 ± 0.08, 3009 ± 1035, and 0.83 ± 0.09. Similar to the baseline, both AS and DT were characterised by Gaussian-shaped distributions, with AS being centred around average values (~0.4), while AB again consisted of a bimodal distribution centred around high ability scores.

Fig. 10.

AS, DT, and AB distributions at follow up. Comparison of the distributions of (A) the IDoCT cognitive ability AS, (B) the IDoCT visuo-motor delay time DT, and (C) the original ability measure AB.


Comparing the IDoCT predictions at baseline and follow up, AS and DT had test-retest reliabilities of r = 0.77 and r = 0.57, respectively (Fig. 11). The test-retest reliability of AB (r = 0.66) was substantially lower than that of AS. Furthermore, the Bland-Altman plots showed a poor spread for AB, with only high AB scores being consistent across timepoints.

Fig. 11.

Bland-Altman plots of DT, AS, and AB. The correlation coefficients r between AS (A), DT (B), and AB (C) at baseline and follow up were respectively 0.77, 0.57, and 0.66.


3.5 Older and better educated participants have higher cognitive abilities scores

When predicting AS from age group and education level, a significant regression equation was found (F(13, 33876) = 710.6, p < 0.001), with an R2 of 0.21. The regression equation for the original AB score was also significant (F(13, 33864) = 561.2, p < 0.001), but with a lower R2 of 0.17. Both age and education were significant predictors of AS and AB. Regarding education, the strongest predictors were having a college/university degree (ed1) and A/AS levels (ed2), which had positive effect sizes in standard deviation (SD) units of 0.66 and 0.35, respectively, for AS, and somewhat less at 0.57 and 0.33 SDs for AB, compared to the reference category. In the case of age, older participants had higher AS compared to participants aged 45. The increase was gradual until age 70, at which point a plateau was reached, with a 0.48 SD increase in AS versus a 0.41 SD increase in AB for participants aged 75 compared to the reference category (Fig. 12).

Fig. 12.

Analysis of age and education associations. Multiple linear regression models were trained using AS (A), DT (B), and AB (C) as predicted variables and age and education as predictors. The obtained beta coefficients in SD units are represented in the figure. Age and education were always significant predictors, but the effect size of education was negligible (-0.1 < x < 0.1) in the case of DT. The colour and asterisks represent the significance level of each feature. Age45 and ed7 were used as the reference categories. ed1 corresponds to a College/University degree, ed2 to A levels/AS levels or equivalent, ed3 to O levels/GCSEs or equivalent, ed4 to CSEs or equivalent, ed5 to NVQ or HND or HNC or equivalent, ed6 to other professional qualifications, and ed7 to none of the above. The red lines indicate the area where the effect size can be considered negligible.


3.6 Older participants have longer visuo-motor latency times

In the case of DT, a significant regression equation was found (F(10, 33876) = 39.41, p < 0.001), with an R2 of 0.015. Both age and education were significant predictors of DT, but the effect size of education was consistently small to negligible (below 0.1 SD units). The relationship between age and DT was comparable to the association with AS, with older participants showing a gradual increase in DT, or visuo-motor latency time. More specifically, participants aged 75 and 80 were associated with a 0.38 SD and a 0.52 SD increase in DT, respectively, compared to participants aged 45 (Fig. 12).

3.7 Imaging analysis feature selection

The results of the fine-tuning are reported in Figure 13 for AS and Figure 14 for DT. In total, P models were trained for each dataset (i.e., FA, Thickness, Volume, and Intensity), where P corresponds to the number of features available. Model0 corresponds to the model trained on all features except for the feature with the lowest correlation coefficient. For each of the following models, the feature with the lowest coefficient r among those remaining was dropped, until Model(P-2), which was trained exclusively on the last feature available (i.e., the feature with the highest r correlation coefficient).

Fig. 13.

AS feature selection. P models were trained using the imaging data in each individual dataset, (A) FA, (B) Thickness, (C) Volume, and (D) Intensity, as regressors and AS as the outcome variable, where P corresponds to the number of features in each dataset. In each iteration of the model, the feature (among those still available) with the lowest correlation coefficient with the outcome was excluded from the regressors, meaning that Model1 was trained on all features except 1 (i.e., the feature with the lowest r), and ModelP was trained on only one feature (i.e., the feature with the highest r). The R2 was calculated on the train and validation sets. The train/validation R2 (and their standard deviations across the five folds) are reported in the figure. The red line represents the best performing model, namely the model that yielded the highest R2 on the validation set.

Fig. 14.

DT feature selection. P models were trained using the imaging data in each individual dataset, (A) FA, (B) Thickness, (C) Volume, and (D) Intensity, as regressors and DT as the outcome variable, where P corresponds to the number of features in each dataset. In each iteration of the model, the feature (among those still available) with the lowest correlation coefficient with the outcome was excluded from the regressors, meaning that Model1 was trained on all features except 1 (i.e., the feature with the lowest r), and Model(P-2) was trained on only one feature (i.e., the feature with the highest r). The R2 was calculated on the train and validation sets. The train/validation R2 (and their standard deviations across the five folds) are reported in the figure. The red line represents the best performing model, namely the model that yielded the highest R2 on the validation set.


When predicting AS, in the case of FA and Thickness, Model0 was the best performing, which included all features except, respectively, the FA in the right cingulate gyrus (part of cingulum) and the thickness of the right pars triangularis. Similarly, Model1 was the best performing for Volume, meaning that only two features (i.e., the Brain Stem and the Crus I Cerebellum vermis) were dropped when predicting AS. Finally, in the case of Intensity, Model11 yielded the highest R2 on the validation set, which resulted in 12 features being dropped (i.e., mean intensity of CSF, White Matter hypointensities, non-White Matter hypointensities, Corpus Callosum Mid-Posterior, 5th Ventricle, left/right Cerebellum White Matter, Corpus Callosum Posterior, right/left Cerebellum Cortex, left Amygdala, and volume of White Matter hypointensities). The R2 obtained for each model and each dataset is available in Supplementary Tables A4-A7.

When predicting DT, Model12 was the best FA model, which resulted from dropping 13 features (i.e., FA in the left/right superior thalamic radiation, left/right corticospinal tract, right posterior thalamic radiation, left inferior longitudinal fasciculus, left superior longitudinal fasciculus, left parahippocampal part of cingulum, middle cerebellar peduncle, right/left inferior fronto-occipital fasciculus, left anterior thalamic radiation, and forceps minor). For Volume and Thickness, Model97 and Model45 were the best performing; as a result, only 41 out of 139 and 13 out of 60 features, respectively, were kept. When fine-tuning the models on the intensity dataset, all the models had R2 ≤ 0. The only model with a positive R2 (R2 = 0.001) was obtained after dropping all the features except one. Due to the low performance of the intensity-based models, no features were extracted from the intensity dataset during the feature selection step. More detailed information on which features were dropped during the fine-tuning, as well as the R2 obtained on the train and validation sets of each model, is available in Supplementary Tables A8-A11.

Results of the feature selection process when applying the same train-validation pipeline to AB are reported in Figure 15.

Fig. 15.

AB feature selection. P-1 models were trained using as regressors the imaging features in each individual dataset, (A) FA, (B) Thickness, (C) Volume, and (D) Intensity, with AB as the outcome, where P corresponds to the number of features in each dataset. At each iteration, the remaining feature with the lowest correlation coefficient with the outcome was excluded from the regressors, so that Model0 was trained on all features except one (i.e., the feature with the lowest r) and Model(P-2) was trained on a single feature (i.e., the feature with the highest r). The R2 was calculated on the train and validation sets. The train/validation R2 (and their standard deviations across five folds) are reported in the figure. The red line marks the best-performing model, namely the model that yielded the highest R2 on the validation set.


3.8 Identification of neural correlates of AS and DT: univariate and multivariate analysis

The selected features were used as regressors in multiple linear regression models trained on the full train set and tested on the held-out test set. Two models were trained for each dataset (i.e., FA, Thickness, Volume, and Intensity), predicting either AS or DT. Overall, significant regression models were found for all analysed datasets, with average train and test R2 across datasets of 0.02 ± 0.01 and 0.016 ± 0.005 for AS, and 0.004 ± 0.002 and 0.002 ± 0.001 for DT, respectively. Full results are available in Table 2. The intensity model for DT was not computed because of the low performance identified at the feature selection step of the analysis. For AB, significant regression models were found for all analysed datasets, with average train and test R2 of 0.018 ± 0.01 and 0.013 ± 0.005, respectively. Full results are available in Table 2. Therefore, while the models predicting AS had modest R2 values, they performed numerically better than the models predicting AB across all four imaging datasets.

Table 2.

Multiple linear regression models with imaging features.

Dataset      | AS (R2 train / R2 test / p-value / F train) | AB (R2 train / R2 test / p-value / F train) | DT (R2 train / R2 test / p-value / F train)
FA           | 0.011 / 0.016 / <0.001 / 9.539              | 0.011 / 0.014 / <0.001 / 8.738              | 0.003 / 0.003 / <0.001 / 4.924
Intensity    | 0.013 / 0.008 / <0.001 / 8.651              | 0.012 / 0.006 / <0.001 / 7.247              | —
Grey volume  | 0.037 / 0.023 / <0.001 / 5.800              | 0.034 / 0.019 / <0.001 / 5.308              | 0.007 / 0.001 / <0.001 / 3.594
Thickness    | 0.016 / 0.015 / <0.001 / 5.588              | 0.015 / 0.012 / <0.001 / 5.321              | 0.002 / 0.001 / <0.001 / 3.056

The features identified during the feature selection step were used as regressors of multiple linear regression models to predict AS, AB, and DT. The R2 (train and test), F statistics, and p-value of the models are reported in the table.

The significant features of the AS and DT models, and their respective Eta2, are presented in Figure 16. The top 15 significant features with the highest Eta2 across all data modalities were identified as the best neural correlates of AS and DT and used in the next steps of the analysis. The identified neural correlates of AS were: Hippocampus, superior and transverse temporal gyrus, medial lemniscus, anterior corpus callosum, medial frontal cortex, uncinate fasciculus, nucleus Accumbens, caudal middle frontal gyrus, inferior temporal gyrus, superior frontal gyrus, and parahippocampal part of cingulum. The identified neural correlates of DT were forceps major, medial lemniscus, posterior thalamic radiation, middle temporal gyrus, lateral orbitofrontal cortex, occipital fusiform gyrus, and lateral occipital cortex.
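As an illustration of how such per-feature effect sizes can be obtained, the sketch below fits an ordinary least squares model with statsmodels and derives eta squared for each regressor from a Type II ANOVA table. The data frame, column names, and the exact effect-size convention are assumptions for illustration and may differ from the published analysis.

```python
import statsmodels.api as sm
from statsmodels.stats.anova import anova_lm

def eta_squared_per_feature(df, features, target):
    """Fit target ~ selected imaging features by OLS and return eta squared
    (SS_effect / SS_total) per regressor, sorted from largest to smallest.
    Feature columns are assumed to have formula-safe names (illustrative sketch)."""
    model = sm.OLS.from_formula(f"{target} ~ " + " + ".join(features), data=df).fit()
    aov = anova_lm(model, typ=2)                 # Type II sums of squares
    eta2 = aov["sum_sq"] / aov["sum_sq"].sum()   # effect SS over total SS
    return eta2.drop("Residual").sort_values(ascending=False)
```

Retaining the 15 features with the largest eta squared across all modalities, as in Figure 16, would then amount to concatenating the returned series across datasets and keeping the top 15.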

Fig. 16.

Multiple linear regression using imaging features. The imaging features identified during the feature selection step were used as regressors in multiple linear regression models to predict AS (A, B, C, D) and DT (E, F, G). The measured eta squared of the features identified as significant in the models is reported in the figure. The colour and asterisk represent the significance level of each feature. The red dotted line separates the features belonging to the top 15, which were selected as neural correlates of AS/DT, from the other features.


In order to assess the robustness of the results, a second set of neural correlates was identified through a univariate analysis. In this case, the selected features were ranked according to the magnitude of their Pearson correlation coefficient with either AS or DT. The top 15 features across all data modalities with a significant p-value (p-value < 0.001) were selected as neural correlates of AS and DT. The identified neural correlates of AS were: Amygdala, Hippocampus, Frontal Pole, Insular Cortex, Cerebellum, Superior temporal gyrus, and Temporal fusiform cortex. The identified neural correlates of DT were: middle temporal area, Cerebellum, Inferior temporal gyrus, Intracalcarine cortex, Occipital Pole, superior temporal, and lateral orbito-frontal area.
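A minimal version of this univariate ranking could look as follows, assuming the imaging features and the behavioural score are columns of a pandas DataFrame; the variable names, threshold, and number of retained features simply mirror the description above and are not the study's code.

```python
import pandas as pd
from scipy import stats

def top_univariate_correlates(df, features, target, n_top=15, alpha=0.001):
    """Rank features by the magnitude of their Pearson correlation with the
    target (AS or DT) and keep the strongest n_top that pass the significance
    threshold. Illustrative sketch only."""
    rows = []
    for feat in features:
        r, p = stats.pearsonr(df[feat], df[target])
        rows.append({"feature": feat, "r": r, "abs_r": abs(r), "p": p})
    ranked = pd.DataFrame(rows).sort_values("abs_r", ascending=False)
    return ranked[ranked["p"] < alpha].head(n_top)
```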

3.9 DT is more strongly associated with brain regions serving visuo-motor functions and AS with regions serving memory and language functions

The full list of the words with the highest frequency of occurrence in the articles that use the anatomical labels associated with AS and DT is available in the Supplementary Tables A12-A15, for both the univariate and multivariate analysis. A summary of the results is presented in Figure 17.

Fig. 17.

NLP analysis results. The word clouds represent the words with the highest frequency of occurrence in papers associated with the best neural correlates of (A) AS derived from the univariate analysis, (B) AS derived from the multivariate analysis, (C) DT derived from the univariate analysis, and (D) DT derived from the multivariate analysis. Bigger words correspond to a higher frequency of occurrence.


In summary, visual (2.57), age (2.0), motor (1.8), lesion (1.32), and stimulus (1.27) were the words with the highest frequency of occurrence for the DT-associated anatomical labels in the multivariate analysis. Similarly, visual (3.63), stimulus (1.53), and lesion (1.29) were the most frequent for the DT-associated anatomical labels derived from the univariate analysis. Conversely, for AS (Fig. 17), the most frequent words were memory (3.75), age (3.24), and auditory (2.0) in the multivariate analysis, and memory (2.0), social (1.68), behaviour (1.59), emotional (1.33), and learning (1.28) in the univariate analysis.
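The frequencies above come from counting functional terms across the open-access articles retrieved for each anatomical label; a simplified sketch of that counting step is shown below. The tokenisation, stop-word list, and absence of any normalisation are assumptions for illustration and do not reproduce the exact pipeline used.

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "of", "in", "to", "a", "with", "for", "was", "were"}

def term_frequencies(article_texts):
    """Count how often each (non stop-word) term appears across the article
    texts retrieved for a set of anatomical labels; illustrative sketch of the
    frequency step behind Figure 17."""
    counts = Counter()
    for text in article_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts

# e.g. term_frequencies(dt_label_articles).most_common(5) would list the five
# most frequent terms for a (hypothetical) DT-associated article set.
```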

IDoCT is a flexible method designed to fractionate the detailed timecourses that are collected during performance of a computerised cognitive task into components that can be explained by inter-subject variability in basic visuo-motor processing speed and device latency on the one hand, and the specific cognitive abilities that the task was intended to manipulate on the other (Giunchiglia et al., 2023). This is achieved in a simple, robust, and data-driven manner that iteratively re-estimates individuals’ abilities and trial-difficulty scales whilst handling the speed-accuracy tradeoff.

This method was initially applied to improve the precision of performance estimates from the Cognitron library of online tasks (Giunchiglia et al., 2023). However, the approach is flexible enough to be adapted for practically any computerised task that varies dimensions of cognitive difficulty across trials and for which performance recordings are available for large numbers of individuals. The results presented here provide further evidence of the utility of IDoCT in the context of data derived from an independently designed UK Biobank task, including improvements in the trial-difficulty scale, participants’ score distributions, retest statistics, demographic correlations, and imaging associations.

More specifically, a critical limitation of the PVT dataset is that the original summary score distributions are malformed due to the dynamic sampling algorithm having operated across a sub-optimal trial-difficulty scale (Dold). Our analysis of sampling trajectories highlights the basis of that limitation. Specifically, the difficulty (Dold) of sampled trials for participants with moderate to good original scores (AB) either reaches a plateau or decreases after just 5 out of 30 available steps. Replotting the sampling trajectories using data-driven trial-difficulty estimates (D and DS) indicates that the intended difficulty increments across time were not achieved. Furthermore, the correlations between original scores and accuracy or response time measures are also hard to interpret, suggesting that ceiling effects were not the only problem, as the ordering of the scale may also be sub-optimal for the UK Biobank population, perhaps because it was developed with a United States (US) population in mind. This resulted in the original summary scores (AB) having a malformed bimodal distribution with a high mean estimate (0.8 where 1 is maximum), which is unlikely to reflect the true underlying population distribution of crystallised intelligence abilities (Flynn, 1987).

IDoCT operated by (i) leveraging the population’s accuracy and response time measures, (ii) factoring in the component of performance variance that is better explained by basic visuo-motor response times, and (iii) taking into account potential (and here intended) bias due to more difficult items being sampled for higher-performing individuals. Taken together, these characteristics enable recalculation of the trial-difficulty scale in a data-driven manner. The resultant DS scale can potentially be used in future studies as the basis for the adaptive sampling algorithm, though it may still be advisable to include more high-difficulty items given the observed ceiling effects.

More importantly, analysis of the data distributions demonstrates that IDoCT was successful in addressing the issues with the original summary scores. For example, AS has the expected Gaussian distribution (Flynn, 1987), supporting the view that IDoCT could overcome the ceiling effect generated by Dold and better capture the underlying crystallised language ability that is the target of the task. Plotting the original AB and re-estimated AS scores against the basic mean reaction time and total correct response measures shows distorted distributions for the former but not the latter. Furthermore, comparing the word difficulty measures before and after the scaling (Fig. 5) shows that IDoCT properly addresses the biased sampling issue, increasing the difficulty of words that were presented exclusively to participants of higher ability and decreasing it for words presented only to low-performing individuals. Moreover, the Bland-Altman retest plots demonstrate a more homogeneous spread, with a lower SD of the differences and a higher cross-session correlation for AS. Although higher than that of AB, the test-retest reliability of AS remains relatively modest, which may reflect multiple factors, such as practice effects from having completed the test previously and ageing effects. In sum, all analyses indicate that the performance score distributions from IDoCT are superior to those output by the original task.

The obtained measures of DT and AS are further validated by their associations with age and education. Specifically, although DT shows the expected increase with age (Habekost et al., 2013), the effect size of its relationship with education level is negligible. This supports the view that DT captures individual differences in fundamental visuo-motor latencies, as opposed to knowledge of the meaning of words. Conversely, AS improves with age, which is the expected pattern as knowledge of words improves and “crystallises” throughout the lifespan (Hayden & Welsh-Bohmer, 2011; Park & Reuter-Lorenz, 2009; Salthouse, 2009; Singh-Manoux et al., 2012), but it also improves with exposure to higher education, where there is a higher likelihood of learning unusual words (Ceci, 1991; Guerra-Carrillo et al., 2017). Notably, both of these associations are numerically stronger for AS than for the original AB score, together resulting in an improved R2 in the linear regression analysis (AB: 0.17, AS: 0.21).
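As a concrete illustration of that comparison, the sketch below regresses each performance measure on age and education with statsmodels and reports the model R2; the DataFrame and its column names are hypothetical placeholders rather than the study's variables.

```python
import statsmodels.formula.api as smf

def compare_demographic_fit(df, scores=("AB", "AS"), covariates="age + education"):
    """Fit score ~ age + education separately for each performance measure and
    return the R2 of each model, mirroring the AB (0.17) versus AS (0.21)
    comparison described in the text. Column names are assumed."""
    return {score: smf.ols(f"{score} ~ {covariates}", data=df).fit().rsquared
            for score in scores}

# e.g. compare_demographic_fit(pvt_df) -> {"AB": ..., "AS": ...}
```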

The above results confirm that IDoCT is successful in recalculating the difficulty scale, and in fractionating performance into distinct cognitive and visuo-motor components that have superior retest properties and improved demographic predictive validity. These findings align with our previous applications of this technique to tasks from the Cognitron library (Giunchiglia et al., 2023). The more novel question is whether these advantages in the precision of the task performance estimates extend to improvements in imaging associations.

The machine-learning pipeline addresses this by measuring how accurately the original AB measure can be predicted from data of different imaging modalities and using this as a baseline for comparing the AS and DT performance estimates. Overall, the behavioural–imaging associations are in the small range, albeit statistically significant. This may not be unexpected given the simplicity of the linear regression machine-learning approach applied and the recent literature on the scale of such associations when estimated within well-powered datasets (Marek et al., 2022). More importantly, the associations with AS are consistently stronger than AB across all analysed imaging modalities. Among the four datasets studied, the volume measures led to models with the highest R2 on the test set for both AS and DT, suggesting that these measures might be more informative when trying to predict cognitive ability and visuo-motor latency.

The NLP analysis provides an unbiased, data-driven way to qualitatively evaluate the functional specificity of the AS and DT measures, by determining the functional terms that their associated brain regions most commonly co-occur with in the literature. The results have face validity, with the brain regions identified as the best predictors of DT mainly relating to visual and motor functions, such as the intracalcarine cortex (Coullon et al., 2015) and the cerebellum (Guell et al., 2018). This is expected, considering that DT is intended to measure visuo-motor latency. Conversely, AS is mainly associated with brain regions involved in memory, language, and auditory functions, such as the hippocampus (Eichenbaum et al., 1999), uncinate fasciculus (Papagno, 2011), and inferior temporal gyrus (Onitsuka et al., 2004). Considering the nature of the task, which requires participants to associate the meaning of spoken/written words with different images, the identified brain functions are as expected.

A strength of this study is the sample size, which allows firmer and more precise conclusions to be drawn. An important aspect to consider, however, is that the UK Biobank sample includes mainly middle-aged to older individuals (<1% below 50 years old), who are not necessarily representative of the general population in terms of health, physical, and lifestyle characteristics (Fry et al., 2017). This does not undermine our validation of the IDoCT approach, but it should be noted that the data-driven measures of trial difficulty might change if younger individuals are included in the analysis, as different words are more likely to be learnt at different stages of the lifespan (e.g., school vs work). A further strength of the study is that four different kinds of features (imaging-derived phenotypes) from two imaging data modalities were analysed, in combination with both behavioural and demographic data, allowing better insight into the best imaging predictors of AS and DT.

Despite these strengths, there are some limitations. First, IDoCT requires a description of each trial as input in order to extract data-driven measures of trial difficulty. In this study, the word presented at each trial was used as that description because it was the only information available. However, in the case of the PVT, the difficulty of a trial is influenced not only by the word presented, but also by the set of pictures from which participants are required to choose. Without this information, it is difficult to interpret why specific words were assigned higher or lower difficulty scores, as the reason could be a combination of the difficulty of the word itself and of the figures presented. A general observation, however, is that words with higher DS values on the difficulty scale mainly have a Latin, French, or Greek etymology. This pattern again has some face validity, because English is not a Romance but a Germanic language (Bech & Walkden, 2016), but it remains a limited interpretation due to the lack of information about the word-figure pairs.

Further limitations relate to the NLP analysis. First, it was conducted exclusively on open-access papers, which limits the literature search to previous studies that are freely accessible. In addition, although the NLP analysis conducted in this study was sufficient to achieve the intended aim, it consisted exclusively of the estimation of word frequency across different papers. Further research could extend this by using more advanced NLP approaches, such as topic modelling to extract common topics (Koltcov et al., 2014), or by implementing directed graphs (digraphs) and dependency parsing to identify related and connected words (Kübler et al., 2009). The latter approach could provide additional information on the meaning of words within the context of the sentence in which they appear and help reduce the noise of the one-to-many mappings between brain structures and cognitive functions.

Finally, there are two limitations of the imaging analysis that could be addressed by further research. First, the study focuses exclusively on associations with structural measures. However, cognitive processes generally result from complex brain networks (Raichle et al., 2001), rather than from the discrete contributions of individual brain regions (Petersen & Sporns, 2015). Analyses using graph-theoretic approaches to quantify the information-processing properties of networks, or focusing on network dynamics from functional MRI, might provide better and more detailed insights into the neural correlates of cognitive task performance. Relatedly, the machine-learning approaches used here are relatively simple shallow linear models, applied independently to each imaging modality. The strength of associations could improve if more advanced methods, such as deep learning or the combination of multi-modal features, were applied. Nonetheless, the fact that association strengths increase for AS versus AB across all imaging modalities confirms our primary hypothesis that IDoCT can provide superior cognitive ability measures for future association studies, including those investigating more advanced imaging analysis methods.

In conclusion, we successfully apply IDoCT to the PVT data collected as part of the UK Biobank imaging extension, and obtain superior estimates of subject-level cognitive ability and visuo-motor latency, as well as an optimised data-driven word difficulty scale calibrated on the UK population. Our results further validate IDoCT by showing improved relationships between the performance metrics and age, education, and brain imaging measures, in terms of both the strength and the functional specificity of the associations.

The imaging and cognitive datasets analysed in this study are available via the UK Biobank data access process (see http://www.ukbiobank.ac.uk/register-apply/). The derived AS and DT measures can also be accessed via the same process (see https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=504). The IDoCT code will be released publicly on GitHub.

A.H. obtained funding. A.H., S.C., S.S., and N.A. designed the study. S.C. and N.A. collected the data and developed the cognitive task. S.S. collected the imaging data. V.G. conducted the analysis, validation and implemented the methodology. A.H. and S.S. supervised the project. V.G. wrote the first draft of the manuscript. All authors reviewed and approved the manuscript.

V.G. was supported by the NIHR Imperial Biomedical Research Centre (BRC) grant to A.H. and by the Medical Research Council, MR/W00710X/1. N.A. was funded by the Medical Research Council and Wellcome Trust.

The authors declare no competing interests.

This research has been conducted using the UK Biobank Resource under application number 100870.

Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00087.

Alfaro-Almagro, F., Jenkinson, M., Bangerter, N. K., Andersson, J. L. R., Griffanti, L., Douaud, G., Sotiropoulos, S. N., Jbabdi, S., Hernandez-Fernandez, M., Vallee, E., Vidaurre, D., Webster, M., McCarthy, P., Rorden, C., Daducci, A., Alexander, D. C., Zhang, H., Dragonu, I., Matthews, P. M., … Smith, S. M. (2018). Image processing and quality control for the first 10,000 brain imaging datasets from UK Biobank. NeuroImage, 166, 400–424. https://doi.org/10.1016/j.neuroimage.2017.10.034

Bech, K., & Walkden, G. (2016). English is (still) a West Germanic language. Nordic Journal of Linguistics, 39(1), 65–100. https://doi.org/10.1017/S0332586515000219

Beck, T., Shorter, T., Hu, Y., Li, Z., Sun, S., Popovici, C. M., McQuibban, N. A. R., Makraduli, F., Yeung, C. S., Rowlands, T., & Posma, J. M. (2022). Auto-CORPus: A natural language processing tool for standardizing and reusing biomedical literature. Frontiers in Digital Health, 4, 788124. https://doi.org/10.3389/fdgth.2022.788124

Brooker, H., Williams, G., Hampshire, A., Corbett, A., Aarsland, D., Cummings, J., Molinuevo, J. L., Atri, A., Ismail, Z., Creese, B., Fladby, T., Thim-Hansen, C., Wesnes, K., & Ballard, C. (2020). FLAME: A computerized neuropsychological composite for trials in early dementia. Alzheimer’s & Dementia: Diagnosis, Assessment & Disease Monitoring, 12(1). https://doi.org/10.1002/dad2.12098

Ceci, S. J. (1991). How much does schooling influence general intelligence and its cognitive components? A reassessment of the evidence. Developmental Psychology, 27(5), 703–722. https://doi.org/10.1037/0012-1649.27.5.703

Ceusters, W. (2012). An information artifact ontology perspective on data collections and associated representational artifacts. In MIE (pp. 68–72). https://doi.org/10.3233/978-1-61499-101-4-68

Chen, Q., Leaman, R., Allot, A., Luo, L., Wei, C.-H., Yan, S., & Lu, Z. (2021). Artificial intelligence in action: Addressing the COVID-19 pandemic with natural language processing. Annual Review of Biomedical Data Science, 4(1), 313–339. https://doi.org/10.1146/annurev-biodatasci-021821-061045

Comeau, D. C., Islamaj Dogan, R., Ciccarese, P., Cohen, K. B., Krallinger, M., Leitner, F., Lu, Z., Peng, Y., Rinaldi, F., Torii, M., Valencia, A., Verspoor, K., Wiegers, T. C., Wu, C. H., & Wilbur, W. J. (2013). BioC: A minimalist approach to interoperability for biomedical text processing. Database, 2013, bat064. https://doi.org/10.1093/database/bat064

Coullon, G. S. L., Emir, U. E., Fine, I., Watkins, K. E., & Bridge, H. (2015). Neurochemical changes in the pericalcarine cortex in congenital blindness attributable to bilateral anophthalmia. Journal of Neurophysiology, 114(3), 1725–1733. https://doi.org/10.1152/jn.00567.2015

Cox, S. R., Ritchie, S. J., Fawns-Ritchie, C., Tucker-Drob, E. M., & Deary, I. J. (2019). Structural brain imaging correlates of general intelligence in UK Biobank. Intelligence, 76, 101376. https://doi.org/10.1016/j.intell.2019.101376

Dunn, L. M., & Dunn, D. M. (2007). Peabody picture vocabulary test—fourth edition [Dataset]. American Psychological Association. https://doi.org/10.1037/t15144-000

Eichenbaum, H., Dudchenko, P., Wood, E., Shapiro, M., & Tanila, H. (1999). The hippocampus, memory, and place cells. Neuron, 23(2), 209–226. https://doi.org/10.1016/S0896-6273(00)80773-4

Fawns-Ritchie, C., & Deary, I. J. (2020). Reliability and validity of the UK Biobank cognitive tests. PLoS One, 15(4), e0231627. https://doi.org/10.1371/journal.pone.0231627

Ferguson, A. C., Tank, R., Lyall, L. M., Ward, J., Welsh, P., Celis-Morales, C., McQueenie, R., Strawbridge, R. J., Mackay, D. F., Pell, J. P., Smith, D. J., Sattar, N., Cavanagh, J., & Lyall, D. M. (2020). Association of SBP and BMI with cognitive and structural brain phenotypes in UK Biobank. Journal of Hypertension, 38(12), 2482–2489. https://doi.org/10.1097/HJH.0000000000002579

Flynn, J. R. (1987). “Massive IQ gains in 14 nations: What IQ tests really measure”: Correction to Flynn. Psychological Bulletin, 101(3), 427. https://doi.org/10.1037/h0090408

Fry, A., Littlejohns, T. J., Sudlow, C., Doherty, N., Adamska, L., Sprosen, T., Collins, R., & Allen, N. E. (2017). Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology, 186(9), 1026–1034. https://doi.org/10.1093/aje/kwx246

Germine, L., Nakayama, K., Duchaine, B. C., Chabris, C. F., Chatterjee, G., & Wilmer, J. B. (2012). Is the web as good as the lab? Comparable performance from web and lab in cognitive/perceptual experiments. Psychonomic Bulletin & Review, 19(5), 847–857. https://doi.org/10.3758/s13423-012-0296-9

Gershon, R. C., Cook, K. F., Mungas, D., Manly, J. J., Slotkin, J., Beaumont, J. L., & Weintraub, S. (2014). Language measures of the NIH toolbox cognition battery. Journal of the International Neuropsychological Society, 20(6), 642–651. https://doi.org/10.1017/S1355617714000411

Giunchiglia, V., Gruia, D., Lerede, A., Trender, W., Hellyer, P., & Hampshire, A. (2023). Iterative decomposition of visuomotor, device and cognitive variance in large scale online cognitive test data [Preprint]. In Review. https://doi.org/10.21203/rs.3.rs-2972434/v1

Guell, X., Schmahmann, J. D., Gabrieli, J. D. E., & Ghosh, S. S. (2018). Functional gradients of the cerebellum. eLife. https://doi.org/10.7554/eLife.36652

Guerra-Carrillo, B., Katovich, K., & Bunge, S. A. (2017). Does higher education hone cognitive functioning and learning efficacy? Findings from a large and diverse sample. PLoS One, 12(8), e0182276. https://doi.org/10.1371/journal.pone.0182276

Habekost, T., Vogel, A., Rostrup, E., Bundesen, C., Kyllingsbaek, S., Garde, E., Ryberg, C., & Waldemar, G. (2013). Visual processing speed in old age. Scandinavian Journal of Psychology, 54(2), 89–94. https://doi.org/10.1111/sjop.12008

Hampshire, A. (2020). Great British intelligence test protocol. Preprint. https://doi.org/10.21203/rs.3.pex-1085/v1

Hampshire, A., Chatfield, D. A., MPhil, A. M., Jolly, A., Trender, W., Hellyer, P. J., Giovane, M. D., Newcombe, V. F. J., Outtrim, J. G., Warne, B., Bhatti, J., Pointon, L., Elmer, A., Sithole, N., Bradley, J., Kingston, N., Sawcer, S. J., Bullmore, E. T., Rowe, J. B., & Menon, D. K. (2022). Multivariate profile and acute-phase correlates of cognitive deficits in a COVID-19 hospitalised cohort. eClinicalMedicine, 47, 101417. https://doi.org/10.1016/j.eclinm.2022.101417

Hampshire, A., Trender, W., Grant, J. E., Mirza, M. B., Moran, R., Hellyer, P. J., & Chamberlain, S. R. (2022). Item-level analysis of mental health symptom trajectories during the COVID-19 pandemic in the UK: Associations with age, sex and pre-existing psychiatric conditions. Comprehensive Psychiatry, 114, 152298. https://doi.org/10.1016/j.comppsych.2022.152298

Hayden, K. M., & Welsh-Bohmer, K. A. (2011). Epidemiology of cognitive aging and Alzheimer’s disease: Contributions of the Cache County Utah Study of memory, health and aging. In M.-C. Pardon & M. W. Bondi (Eds.), Behavioral neurobiology of aging (Vol. 10, pp. 3–31). Springer Berlin Heidelberg. https://doi.org/10.1007/7854_2011_152

Kiesel, A., Steinhauser, M., Wendt, M., Falkenstein, M., Jost, K., Philipp, A. M., & Koch, I. (2010). Control and interference in task switching—A review. Psychological Bulletin, 136(5), 849–874. https://doi.org/10.1037/a0019842

Koltcov, S., Koltsova, O., & Nikolenko, S. (2014). Latent dirichlet allocation: Stability and applications to studies of user-generated content. In Proceedings of the 2014 ACM Conference on Web Science (pp. 161–165). https://doi.org/10.1145/2615569.2615680

Kornblum, S., Hasbroucq, T., & Osman, A. (1990). Dimensional overlap: Cognitive basis for stimulus-response compatibility—A model and taxonomy. Psychological Review, 97(2), 253–270. https://doi.org/10.1037/0033-295X.97.2.253

Kreutzer, J. S., Caplan, B., & DeLuca, J. (Eds.). (2011). Encyclopedia of clinical neuropsychology. Springer. https://doi.org/10.1007/978-3-319-57111-9

Kübler, S., McDonald, R., & Nivre, J. (2009). Dependency parsing. Springer International Publishing. https://doi.org/10.1007/978-3-031-02131-2

Marek, S., Tervo-Clemmens, B., Calabro, F. J., Montez, D. F., Kay, B. P., Hatoum, A. S., Donohue, M. R., Foran, W., Miller, R. L., Hendrickson, T. J., Malone, S. M., Kandala, S., Feczko, E., Miranda-Dominguez, O., Graham, A. M., Earl, E. A., Perrone, A. J., Cordova, M., Doyle, O., … Dosenbach, N. U. F. (2022). Reproducible brain-wide association studies require thousands of individuals. Nature, 603(7902), 654–660. https://doi.org/10.1038/s41586-022-04492-9

Onitsuka, T., Shenton, M. E., Salisbury, D. F., Dickey, C. C., Kasai, K., Toner, S. K., Frumin, M., Kikinis, R., Jolesz, F. A., & McCarley, R. W. (2004). Middle and inferior temporal gyrus gray matter volume abnormalities in chronic schizophrenia: An MRI study. American Journal of Psychiatry, 161(9), 1603–1611. https://doi.org/10.1176/appi.ajp.161.9.1603

Papagno, C. (2011). Naming and the role of the uncinate fasciculus in language function. Current Neurology and Neuroscience Reports, 11(6), 553–559. https://doi.org/10.1007/s11910-011-0219-6

Park, D. C., & Reuter-Lorenz, P. (2009). The adaptive brain: Aging and neurocognitive scaffolding. Annual Review of Psychology, 60(1), 173–196. https://doi.org/10.1146/annurev.psych.59.103006.093656

Petersen, S. E., & Sporns, O. (2015). Brain networks and cognitive architectures. Neuron, 88(1), 207–219. https://doi.org/10.1016/j.neuron.2015.09.027

Raichle, M. E., MacLeod, A. M., Snyder, A. Z., Powers, W. J., Gusnard, D. A., & Shulman, G. L. (2001). A default mode of brain function. Proceedings of the National Academy of Sciences, 98(2), 676–682. https://doi.org/10.1073/pnas.98.2.676

Salthouse, T. A. (2009). Decomposing age correlations on neuropsychological and cognitive variables. Journal of the International Neuropsychological Society, 15(5), 650–661. https://doi.org/10.1017/S1355617709990385

Singh-Manoux, A., Kivimaki, M., Glymour, M. M., Elbaz, A., Berr, C., Ebmeier, K. P., Ferrie, J. E., & Dugravot, A. (2012). Timing of onset of cognitive decline: Results from Whitehall II prospective cohort study. BMJ, 344, d7622. https://doi.org/10.1136/bmj.d7622

Soreq, E., Violante, I. R., Daws, R. E., & Hampshire, A. (2021). Neuroimaging evidence for a network sampling theory of individual differences in human intelligence test performance. Nature Communications, 12(1), 2072. https://doi.org/10.1038/s41467-021-22199-9

Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., & Collins, R. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3), e1001779. https://doi.org/10.1371/journal.pmed.1001779

Treviño, M., Zhu, X., Lu, Y. Y., Scheuer, L. S., Passell, E., Huang, G. C., Germine, L. T., & Horowitz, T. S. (2021). How do we measure attention? Using factor analysis to establish construct validity of neuropsychological tests. Cognitive Research: Principles and Implications, 6(1), 51. https://doi.org/10.1186/s41235-021-00313-1

Vandierendonck, A. (2017). A comparison of methods to combine speed and accuracy measures of performance: A rejoinder on the binning procedure. Behavior Research Methods, 49(2), 653–673. https://doi.org/10.3758/s13428-016-0721-5

Weintraub, S., Dikmen, S. S., Heaton, R. K., Tulsky, D. S., Zelazo, P. D., Bauer, P. J., Carlozzi, N. E., Slotkin, J., Blitz, D., Wallner-Allen, K., Fox, N. A., Beaumont, J. L., Mungas, D., Nowinski, C. J., Richler, J., Deocampo, J. A., Anderson, J. E., Manly, J. J., Borosh, B., … Gershon, R. C. (2013). Cognition assessment using the NIH toolbox. Neurology, 80(11 Suppl. 3), S54–S64. https://doi.org/10.1212/WNL.0b013e3182872ded
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.
