Abstract
Automated online and app-based cognitive assessment tasks are becoming increasingly popular in large-scale cohorts and biobanks due to advantages in affordability, scalability, and repeatability. However, the summary scores that such tasks generate typically conflate the cognitive processes that are the intended focus of assessment with basic visuo-motor speeds, testing device latencies, and speed-accuracy tradeoffs. This lack of precision presents a fundamental limitation when studying brain-behaviour associations. Previously, we developed a novel modelling approach that leverages continuous performance recordings from large-cohort studies to achieve an iterative decomposition of cognitive tasks (IDoCT), which outputs data-driven estimates of cognitive abilities, and device and visuo-motor latencies, whilst recalibrating trial-difficulty scales. Here, we further validate the IDoCT approach with UK Biobank imaging data. First, we examine whether IDoCT can improve ability distributions and trial-difficulty scales from an adaptive picture-vocabulary task (PVT). Then, we confirm that the resultant visuo-motor and cognitive estimates associate more robustly with age and education than the original PVT scores. Finally, we conduct a multimodal brain-wide association study with free-text analysis to test whether the brain regions that predict the IDoCT estimates have the expected differential relationships with visuo-motor versus language and memory labels within the broader imaging literature. Our results support the view that the rich performance timecourses recorded during computerised cognitive assessments can be leveraged with modelling frameworks like IDoCT to provide estimates of human cognitive abilities that have superior distributions, re-test reliabilities, and brain-wide associations.
1 Introduction
Automated and app-based assessment technologies provide a scalable, cost-effective, and reliable way to measure different aspects of cognitive abilities (Soreq et al., 2021) and to monitor cognitive changes in clinical populations (Brooker et al., 2020; Hampshire, Chatfield, et al., 2022; Hampshire, Trender, et al., 2022). Consequently, this technology is becoming popular in large-scale citizen science projects (Germine et al., 2012; Hampshire, 2020), cohorts (Treviño et al., 2021) and registers (Fawns-Ritchie & Deary, 2020). Building on the resultant big data, a major research drive has been to map associations between the summary scores that these computerised cognitive tasks output, and features of brain structure and function from large-scale imaging cohort studies (Cox et al., 2019; Ferguson et al., 2020). However, it is common practice to summarise a participant’s performance by estimating or contrasting average accuracy and reaction times (RT) across task conditions (Vandierendonck, 2017). The resultant scores relate not only to individual differences in abilities to process the specific cognitive demands that are the intended target of the task (Kiesel et al., 2010; Kornblum et al., 1990), but also to other confounding factors such as visuo-motor processing speeds and the latency of the devices that people are assessed with. This lack of cognitive precision in summary score estimates is a non-trivial limitation for both the strength and the specificity of the associations that can be achieved.
A commonly overlooked advantage of computerised cognitive tasks is that, unlike classic pen-and-paper assessment scales, they record every stimulus and response in a detailed performance timecourse. These performance timecourses can be modelled in more sophisticated ways to obtain ability estimates that have superior reliability and process specificity compared to simple contrasts or averaging. Previously, we reported development of one such modelling framework, IDoCT (Iterative Decomposition of Cognitive Tasks—see Box 1), which we designed to disentangle individuals’ cognitive abilities from other confounding factors in a manner that is robust, computationally inexpensive, and sufficiently flexible to be adapted for practically any task that manipulates cognitive difficulty across trials (Giunchiglia et al., 2023).
1. Data-driven assessment of trial difficulty D(t)
The performance P(i,t) of participant i in trial t is calculated based on the RT of participant i in trial t and the difficulty D(t) of trial t, if the answer is correct. In case of incorrect answers, the performance is equal to 0. RTmax corresponds to the maximum RT across all trials and all participants.
D(t) is calculated according to the performance P(i,t) across all participants i in trial t. N corresponds to the total number of trials and T(t,i) to the number of times a trial t was repeated for the same participant i.
A mutual recursive definition is generated. At the first iteration, D(t) is set equal to 1 for all trials t, and is then iteratively modified. The iterations are interrupted when the model converges to an invariant measure of trial difficulty D(t).
2. Data-driven assessment of answer time AT(i,t)
RT is characterised by two components: the answer time AT, that is, the cognitive time required to provide an answer to the cognitive task, and the delay time DT, or visuo-motor latency. The measure of performance is updated accordingly.
The answer time AT(i,t) of participant i in trial t is calculated according to the ability A(i) of the participant, the RT(i,t) and the difficulty of the trial D(t).
Here, ATN(i,t) corresponds to a measure of the answer time in milliseconds. The ability A(i) is measured based on the cumulative performance of participant i across all trials t.
Here, B(i,t) corresponds to the cumulative performance of participant i up to trial t, and BN(i,N) to the overall average performance across all N trials completed.
The delay time DT(i,t) is calculated as the difference between the reaction time RT(i,t) and the answer time AT(i,t).
A mutual recursive definition between A(i) and P(i,t) is generated. At the first iteration, A(i) is assumed to be maximal (equal to 1) for all participants i; AT is initialised as ATmin(i), measured as the difference between RT and DTmax(i); and DT(i,0) is initialised as DTmax(i), which is set equal to RTmin(i). The iterations are interrupted when the model converges.
3. Measure of specific ability AS(i) and delay time DT(i)
Specific ability AS(i) is calculated according to the cumulative specific performance PA(i,t) of participant i, which is measured using the RT(i,t) corrected for the DT(i,t).
Here, BS(i,0) = 0 and BNS(i,N) corresponds to the cumulative performance PA up to trial N. Once the cumulative performance is calculated, AS(i) is measured as its overall average.
Here, Q corresponds to the total number of trials for each participant.
The delay time DT(i) is measured as the average delay time DT(i,t) across all N trials t.
4. Measure of scaled trial difficulty DS(t)
The trial difficulty D(t) is scaled according to the specific ability AS(i) of all the participants i who completed trial t.
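To make the structure of the first iterative stage concrete, the minimal Python sketch below alternates between trial difficulty D(t) and performance P(i,t) until convergence. The functional forms used for P and D here are simplified placeholders chosen purely for illustration; the exact update equations are those specified in Giunchiglia et al. (2023).

```python
import numpy as np

def estimate_trial_difficulty(rt, correct, n_iter=250, tol=1e-6):
    """Illustrative fixed-point iteration between D(t) and P(i,t).

    rt[i, t]     : reaction time of participant i on trial t (NaN if not presented)
    correct[i, t]: 1 for a correct answer, 0 otherwise
    """
    rt_max = np.nanmax(rt)                        # RTmax across all trials and participants
    D = np.ones(rt.shape[1])                      # D(t) initialised to 1 for every trial
    for _ in range(n_iter):
        # Performance is 0 for errors; for correct answers, faster responses on
        # harder trials give higher performance (simplified placeholder form).
        P = correct * D[None, :] * (1.0 - rt / rt_max)
        # Trials on which the population performs poorly are treated as harder
        # (again a placeholder form: 1 minus the mean performance on that trial).
        D_new = 1.0 - np.nanmean(P, axis=0)
        if np.nanmax(np.abs(D_new - D)) < tol:    # stop once D(t) is invariant
            break
        D = D_new
    return D
```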
More specifically, IDoCT leverages trial-by-trial performance data from large cohorts to (a) estimate individuals’ abilities to cope specifically with higher cognitive difficulty across trials (AS) and (b) estimate their visuo-motor delay times (DT), while accounting for individual speed-accuracy tradeoffs. Notably, the approach concurrently recalculates the relative difficulty assigned to trials across task conditions using a fixed-point iterative process that handles the circularity of simultaneously defining individual performance from trial difficulty and trial difficulty from individual performance (Giunchiglia et al., 2023). This is achieved in a manner that accounts for any bias towards sampling more difficult conditions in more able individuals, as is the case in some adaptive designs, resulting in a robust data-driven recalibration of trial-difficulty scales.
We initially validated IDoCT by applying it to data from >400,000 participants who undertook 12 cognitive tasks during the Great British Intelligence Test (Hampshire, 2020). The results showed a successful decomposition of cognitive versus device and visuo-motor latencies, as gauged by superior sociodemographic associations and AS estimates that were not dwarfed by an inflated global intelligence factor (Giunchiglia et al., 2023). This combination of sensitivity and decorrelation holds promise for achieving stronger and more process-specific functional-anatomical mappings in behavioural-imaging association studies.
Here, we further validate IDoCT in the context of an adaptive Picture Vocabulary Task (PVT) (Dunn & Dunn, 2007; Weintraub et al., 2013) that was designed to measure “crystallised” comprehension and reading decoding abilities (Fawns-Ritchie & Deary, 2020; Gershon et al., 2014), and that was deployed with 34,927 participants as part of the UK Biobank (UKB) imaging extension (Sudlow et al., 2015). Extracting summary measures for this PVT presents a notable challenge because it applied an adaptive staircase sampling algorithm to efficiently measure each participant’s ability level, but the word-picture difficulty scale that the sampling algorithm traversed proved to be sub-optimally calibrated for the assessed population. This resulted in aberrant sampling trajectories with accuracy ceiling effects and an unexpected bimodal distribution of population performance scores that is not ideal for analysis of associations (Sudlow et al., 2015). In theory, IDoCT should be able to resolve this ill-posed problem by recalibrating the trial-difficulty scale whilst using both speed and accuracy to produce more precise estimates of participant performance, with applications in functional-brain mapping studies.
To test this theory, we first confirm the expected improvements in distributions and test-retest reliability of PVT performance estimates for IDoCT relative to the original summary scores. Next, we evaluate whether the IDoCT PVT performance estimates associate more robustly, and in an interpretable manner, with participant age and education. Then, we use a simple linear machine-learning pipeline to test the hypothesis that the IDoCT PVT estimates can be more reliably predicted from four distinct feature sets of the UKB structural imaging database. Finally, we conduct free text mining across the imaging literature to determine whether the brain regions that predict IDoCT PVT estimates of visuo-motor and cognitive abilities are differentially associated in the neuroscience literature with visual and motor functions versus language and memory functions respectively.
2 Materials and Methods
Our full analysis pipeline consisted of nine steps, summarised in Figure 1, namely 1) data curation, 2) IDoCT modelling, 3) evaluation of distributions, 4) estimation of trials’ difficulty trajectories, 5) assessment of test-retest reliability, 6) association with age and education, 7) imaging associations, 8) automated literature extraction and preprocessing, and 9) free text analysis.
2.1 Study design and participants
The data analysed in this study were collected as part of UKB, a population-based prospective study that recruited >500,000 participants, aged between 40 and 69, in the 2006-2010 timeframe. The aim of UKB is to understand the genetic and non-genetic factors that contribute to different diseases that affect mainly the middle and older aged population. As part of the study, a subset of individuals undertook imaging and genetic assessments, as well as health and demographic measures, longitudinally, across 22 assessment centres throughout the United Kingdom (Sudlow et al., 2015). All participants provided informed, written consent. Detailed information on the study design is available in the original paper (Ferguson et al., 2020) or online https://www.ukbiobank.ac.uk/. In the current study, demographics (i.e., age and education), imaging (i.e., fractional anisotropy from diffusion weighted images, and volume, intensity, and thickness measures from structural MRI), and behavioural data (i.e., performance in PVT) were used.
In total, 34927 participants completed the PVT during at least one timepoint. 33890 had complete picture vocabulary task data at baseline, of whom 20593 had data for all imaging measures of interest, and 4962 had cognitive data at follow up. Figure 2 shows the exact numbers of participants used for each step of the analysis and the exclusion criteria.
2.2 Picture Vocabulary task (PVT)
The PVT used for the UK Biobank was adapted from the NIH Toolbox Picture Vocabulary Test (Dunn & Dunn, 2007; Kreutzer et al., 2011; Weintraub et al., 2013). This task was designed to assess a person’s semantic/language functions by measuring their ability to match words to pictures. At each trial, participants were presented with a written word along with a set of four images, and they were required to select the picture that matched the word. Every participant was started on the same word, which was at an easy vocabulary level. The difficulty of the trials was then adaptively changed, using an Item Response Theory (IRT) model, in response to the participant’s performance, according to a maximum likelihood estimate (MLE) of their vocabulary level, calculated from their answers so far. For example, if a participant selected the correct picture at trial n, then the word-picture combination presented at trial n+1 was at a higher level of difficulty, unless the supply of more difficult words had run out. Each participant’s sequence of trials lasted for at least 20 words, and was terminated when either the maximum likelihood estimate was accurate enough (within a standard error of <0.5), or a maximum of 30 trials had been reached.
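The exact item-selection and scoring code used by the NIH Toolbox and UK Biobank is not reproduced here, but the sketch below illustrates the general logic of such an adaptive procedure under a simple one-parameter (Rasch) IRT model, with a grid-search MLE of ability and the minimum/maximum trial counts and standard-error stopping rule described above. Function and variable names are illustrative assumptions, and the real item-selection rule may differ.

```python
import numpy as np

def rasch_p(theta, b):
    """P(correct) at ability theta for an item of difficulty b (1-parameter IRT)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def run_adaptive_pvt(item_difficulties, answer_fn, min_trials=20, max_trials=30,
                     se_stop=0.5, start_item=0):
    grid = np.linspace(-4, 4, 161)                  # candidate ability values
    asked, responses = [start_item], [answer_fn(start_item)]
    while True:
        # Maximum-likelihood estimate of ability from the responses given so far
        ll = np.zeros_like(grid)
        for item, resp in zip(asked, responses):
            p = rasch_p(grid, item_difficulties[item])
            ll += np.log(p) if resp else np.log(1.0 - p)
        theta_hat = grid[np.argmax(ll)]
        # Standard error from the Fisher information at the current estimate
        info = sum(rasch_p(theta_hat, item_difficulties[i]) *
                   (1.0 - rasch_p(theta_hat, item_difficulties[i])) for i in asked)
        se = 1.0 / np.sqrt(info)
        if (len(asked) >= min_trials and se < se_stop) or len(asked) >= max_trials:
            return theta_hat, asked, responses
        # Next word: an unused item whose difficulty is closest to the ability estimate
        unused = [i for i in range(len(item_difficulties)) if i not in asked]
        nxt = min(unused, key=lambda i: abs(item_difficulties[i] - theta_hat))
        asked.append(nxt)
        responses.append(answer_fn(nxt))
```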
The similarities and differences from the NIH Toolbox test were as follows. The UK Biobank version used a touchscreen to present the words in written form only, unlike the NIH version, which was administered by staff who also pronounced the words. The dataset of pictures, words, and their associated difficulty levels was the same as the English-language version of the NIH PVT, but with modifications for obvious difficulties: words differing between UK and US English were changed or removed; for example, “intersection” (US) was replaced with “crossroads” (UK), and the word “minute” was removed due to ambiguity of its written meaning, as the picture of a clock could have detracted from its intended meaning of a small size. This resulted in a dataset of 340 words. The same methods for estimating vocabulary levels and algorithmic adaptation for the next question were used in both tests. However, the starting word and the maximum number of trials differed for the UK Biobank test.
2.3 Processing of UKB imaging data
The imaging data consisted of 26 measures of fractional anisotropy (FA) from diffusion weighted images (DWI), and 139 volume, 43 intensity, and 60 cortical thickness measures from T1-weighted structural magnetic resonance images (MRI). All these imaging features were provided by UKB. Detailed information on the imaging processing is available in Alfaro-Almagro et al. (2018), and UKB feature labels are provided in the Supplementary Table A1. In brief, DICOM images were converted to NIFTI, fully anonymised using a defacing mask, and corrected for gradient distortion (GD). In the case of T1 images, the field of view (FOV) was cut down, the images were non-linearly registered to the standard MNI152 space (1 mm resolution), and the brain was extracted. Finally, tissue segmentation was conducted to identify the different tissues and subcortical structures, and final volume measures were extracted after correcting for total brain size. In the case of diffusion-weighted images (dMRI), the first step of the processing consisted of the correction for head motion and eddy currents, followed by gradient distortion correction. Then, measures of fractional anisotropy (FA) and mean diffusivity were extracted.
2.4 IDoCT: extraction of measures of visuo-motor latency and cognitive ability
IDoCT is an iterative method (Giunchiglia et al., 2023) that takes as input trial-by-trial measures of reaction time (RT) and accuracy for each participant. The measure of accuracy can be either binary (i.e., whether participants selected the right or wrong answer in a given trial) or continuous (i.e., how close participants were to the correct answer). For instance, if the task requires participants to press on a target on the screen and they miss the target, the binary accuracy would be 0, while the continuous accuracy would be a distance measure of how far they were from the target when they pressed the screen. In addition, it requires condition labels for each trial in the timecourse, which are defined based on the design of the task. Here, for example, the words presented at each trial are assumed to vary in difficulty; therefore, they are each assigned their own unique condition label. Then, through two separate iterative processes, the model returns measures of trial difficulties (D), of specific ability (AS), and of basic visuo-motor response speed (DT) for each participant. In a further step, it calculates scaled measures of trial difficulties (DS) that account for cases in which the trial assignment across participants of differing abilities was biased towards different difficulty levels. This occurs, for example, when the most difficult trials are presented exclusively or predominantly to the most able participants, which should be the case here if the PVT sampling algorithm has sampled across an ordered difficulty scale. DS is obtained by scaling D based on the ability of the participants who completed the specific trials. Details on the implementation of IDoCT are available in the original publication (Giunchiglia et al., 2023) and are summarised in Box 1. In the case of the PVT, the accuracy measures are binary, and each trial is defined according to the word that is presented to the participants. Therefore, a given trial could occur just once per participant per session, with different participants completing different combinations of trials.
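For concreteness, the input that such a pipeline expects can be pictured as a long-format table with one row per trial; the column names below are illustrative rather than the released IDoCT interface.

```python
import pandas as pd

# Hypothetical long-format input for an IDoCT-style analysis of the PVT:
# one row per trial, with reaction time, binary accuracy, and the condition
# label (here, the word shown), which defines the trial-difficulty units.
trials = pd.DataFrame({
    "participant_id": ["p01", "p01", "p02", "p02"],
    "trial_index":    [0, 1, 0, 1],
    "condition":      ["calm", "plethora", "calm", "glower"],
    "rt_ms":          [812.0, 2410.0, 950.0, 3105.0],
    "accuracy":       [1, 1, 1, 0],
})
print(trials)
```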
2.5 Test-retest reliability
IDoCT estimates were computed on the follow-up data to assess the test-retest reliability of the model. The results were compared to the reliability of the original ability scale (AB). The model computation was the same as for the baseline data, with the difference that the model parameters D, RTmax, and ATmax obtained from the baseline analysis were used when estimating abilities at the second timepoint. The baseline RTmax and ATmax were used because they represent scaling factors in the model; if different scaling factors were used in the follow-up data, the AS/DT results across timepoints would not be comparable. D was derived from the baseline analysis because one of the assumptions of the model is that D can be reliably derived from a given representative sample and, once derived, can then be applied to different datasets. Pearson correlations and Bland-Altman plots were used to compare and visualise the reliability of DT, AS, and AB estimates across sessions, as well as any trend towards improved or worsened ability scores across time.
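The reliability statistics used here can be computed from paired baseline and follow-up score vectors along the lines of the following sketch (array and function names are illustrative), which returns the Pearson correlation together with the mean difference and 95% limits of agreement plotted in a Bland-Altman analysis.

```python
import numpy as np
from scipy import stats

def retest_summary(baseline, followup):
    """Pearson reliability plus Bland-Altman summary for paired score vectors."""
    baseline, followup = np.asarray(baseline), np.asarray(followup)
    r, p = stats.pearsonr(baseline, followup)
    diff = followup - baseline
    mean_diff = diff.mean()
    loa = 1.96 * diff.std(ddof=1)                # half-width of the 95% limits of agreement
    return {"pearson_r": r, "p_value": p,
            "mean_difference": mean_diff,
            "limits_of_agreement": (mean_diff - loa, mean_diff + loa)}
```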
2.6 Age and education association
Multiple linear regression was used to compare the sensitivity of the IDoCT AS and DT estimates, and the original AB scores, to age and education. Age was converted into 5-year age bins to account for the non-linear relationship between age and cognition, and then one-hot encoded. Education was one-hot encoded into seven categories (College or University Degree, A levels or equivalent, O levels/GCSEs or equivalent, CSEs or equivalent, NVQ or HND or HNC or equivalent, Other professional qualifications, None of the above). Individuals selecting “Prefer not to answer” were treated as having missing information. An alpha threshold of p<0.01 was used to determine significance.
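A minimal sketch of this regression, using synthetic data and illustrative column names, is shown below: age is cut into 5-year bins, both predictors are one-hot encoded with a reference level dropped, and an ordinary least squares model is fitted with statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "AS": rng.normal(0.4, 0.08, 500),
    "age": rng.integers(45, 82, 500),
    "education": rng.choice(["Degree", "A level", "O level", "None of the above"], 500),
})

# 5-year age bins, then one-hot encoding with one reference level dropped per factor
age_bin = pd.cut(df["age"], bins=range(40, 90, 5))
X = pd.get_dummies(pd.DataFrame({"age_bin": age_bin, "education": df["education"]}),
                   drop_first=True).astype(float)
fit = sm.OLS(df["AS"], sm.add_constant(X)).fit()
print(fit.rsquared, fit.fvalue)   # R2 and F statistic, as reported in the Results
```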
2.7 Neural correlates of AS and DT: feature selection
A summary of the imaging association pipeline is provided in Figure 3. Models were optimised and fitted separately for each imaging modality (i.e., measures of volume, cortical thickness, fractional anisotropy, and intensity) and for each set of cognitive summary scores (AS, AB, and DT). First, the dataset was split into a train set and a held-out test set with a 75/25 split. Then, age was regressed out of the imaging and cognitive feature vectors. The bivariate Spearman’s correlation between each imaging feature and the target cognitive summary score was computed for the train set only, and the features were ranked according to the magnitude of the obtained correlation coefficients r. Next, the combination of features that yielded the best prediction of the target variable was identified in a simple stepwise process using multiple linear regression with five-fold cross-validation. Specifically, models were trained and evaluated across the five folds, iteratively removing the imaging feature with the lowest magnitude correlation coefficient r in the ranking until just one remained within the predictor matrix. The optimal number of features was defined as the one producing models with the highest mean R2 across the validation folds. The model was then refitted to all of the train data using that optimal number of features. The train-test sets and folds were the same for all models, to enable cross-comparison of model performance. The relative predictability of the different cognitive score estimates from the imaging data was evaluated by comparing the R2 values when the optimal trained models for the different imaging modality–cognitive score combinations were applied to the held-out test data.
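A condensed sketch of this selection procedure for one imaging modality and one target score is given below. The residualisation helper and variable names are illustrative assumptions; the real pipeline additionally fixes the same folds and splits across all models, as described above.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

def residualise(v, age):
    """Remove the linear effect of age from a score vector or feature matrix."""
    A = np.column_stack([np.ones(len(age)), np.asarray(age, dtype=float)])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

def select_features(X, y, age, seed=0):
    X_tr, X_te, y_tr, y_te, age_tr, age_te = train_test_split(
        X, y, age, test_size=0.25, random_state=seed)
    X_tr, y_tr = residualise(X_tr, age_tr), residualise(y_tr, age_tr)
    X_te, y_te = residualise(X_te, age_te), residualise(y_te, age_te)
    # Rank features by the magnitude of their Spearman correlation with the target
    rho = np.array([abs(spearmanr(X_tr[:, j], y_tr)[0]) for j in range(X_tr.shape[1])])
    order = np.argsort(rho)                         # weakest feature first
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    candidates = []
    for k in range(X_tr.shape[1]):                  # drop the k weakest features
        keep = order[k:]
        r2 = cross_val_score(LinearRegression(), X_tr[:, keep], y_tr,
                             cv=cv, scoring="r2").mean()
        candidates.append((r2, keep))
    _, best_keep = max(candidates, key=lambda c: c[0])
    final = LinearRegression().fit(X_tr[:, best_keep], y_tr)
    return best_keep, final.score(X_te[:, best_keep], y_te)   # held-out test R2
```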
2.8 Literature review using natural language processing
The imaging features that contributed the most to the prediction of AS and DT were selected for further investigation using a Natural Language Processing (NLP) (Chen et al., 2021) pipeline, in order to confirm whether they mapped onto the expected cognitive and visuo-motor systems. To select features, both univariate and multivariate approaches were used. First, the Eta2 values of the significant features (alpha threshold at 0.05) were calculated from the best fit models described in the previous section. The features that were both significant and among the top 15 features with the highest Eta2 were selected for further analysis (multivariate approach, derived from the multiple regression beta coefficients). The Eta2 was derived by computing the ANOVA of the best fit models and calculating Eta2 as SSeffect/SStotal, where SSeffect corresponds to the sum of squares for the effect of interest and SStotal to the total sum of squares for all effects, errors, and interactions in the ANOVA. Separately, for the univariate approach, the top 15 features with the highest magnitude of Pearson correlation with AS and DT were selected. The resultant feature labels were pooled across imaging modalities, producing AS and DT lists. We used an NLP literature search approach to summarise previously published literature in order to minimise author bias when choosing and interpreting papers related to the different brain regions.
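The Eta2 computation can be sketched as follows with statsmodels, using synthetic data and placeholder feature names: an ANOVA table is derived from the fitted regression and Eta2 for each term is its sum of squares divided by the total sum of squares (effects plus residual).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 3)),
                  columns=["hippocampus_vol", "uncinate_fa", "AS"])

fit = ols("AS ~ hippocampus_vol + uncinate_fa", data=df).fit()
anova = sm.stats.anova_lm(fit, typ=2)                     # sums of squares per effect
anova["eta2"] = anova["sum_sq"] / anova["sum_sq"].sum()   # SS_effect / SS_total
print(anova.sort_values("eta2", ascending=False))
```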
Research papers were identified via multiple advanced search criteria based on the brain feature labels, provided in detail in the Supplementary Table A2. In general, all papers with the name of the brain region in the title and/or in the abstract, and with the words cognition and/or cognitive function in the body of the main text, were selected. The addition of the “cognitive function” criterion was necessary to avoid analysing papers that were mainly related to the cellular and biological aspects of the different brain areas. All papers that matched the advanced search criteria and that could be downloaded in HTML format from PubMed Central were included in the analysis. Papers that were only provided in a PDF version, which mostly corresponded to papers published prior to 2000, were excluded, as were papers that were not open access. In total, across all brain regions, 1602 papers were analysed. Detailed numbers of how many papers were analysed for each individual brain region are provided in the Supplementary Table A2.
The HTML documents were pre-processed using the Auto-CORPus pipeline (Automated pipeline for Consistent Outputs from Research Publications) (Beck et al., 2022), an NLP tool that converts publications with an HTML structure into the BioC JSON format (Comeau et al., 2013). The BioC JSON format uses a standard structure developed to allow for the interoperability of text mining outputs across different systems. Concretely, it consists of collections of documents, extracted from a corpus and characterised by different elements, that contain the actual text as well as additional information about the original document. The text is automatically divided into common sections (e.g., abstracts, results …) and paragraphs, and each section is associated with a respective, and unique, Information Artifact Ontology (IAO) annotation (Ceusters, 2012). IAO annotations enable identification of the same sections across different publications, even when they are given different titles (e.g., Methods and Methodology). For our NLP analysis, only the sections corresponding to the Abstract, Discussion, and Introduction were used, which corresponded respectively to the IAO annotations IAO:0000316, IAO:0000315, and IAO:0000319.
The extracted paragraphs were cleaned by converting all words to lower case, removing words that were less than 2 characters or more than 20 characters long, removing all special characters and punctuation, and removing all stop words, which correspond to commonly used words in the English language, such as “you,” “an,” or “in.” In addition, words that are related to the different brain structures (e.g., gyrus, nucleus, cerebellum…), or that are commonly associated with the study design (e.g., controls, patients, humans …), were excluded. After the data cleaning, the frequency of occurrence of each word was calculated across all papers for each individual brain region, and normalised between 0 and 1, with 1 corresponding to the most frequent word.
In order to assess the main brain functions associated with AS and DT, the frequency scores of the five words with the highest values were summed separately across all the brain regions related to AS and to DT. In this way, if a word appeared frequently in multiple brain regions, it was assigned a higher frequency score than a frequent word appearing in only one brain region.
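A compact sketch of the cleaning and frequency-scoring steps is given below. The domain exclusion list and function names are illustrative, and the real pipeline operates on the Abstract, Introduction, and Discussion paragraphs extracted by Auto-CORPus.

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
DOMAIN_EXCLUDE = {"gyrus", "nucleus", "cerebellum", "controls", "patients", "humans"}

def word_frequencies(paragraphs):
    """Lower-case, strip punctuation/stop words, and return frequencies scaled to [0, 1]."""
    stops = set(stopwords.words("english")) | DOMAIN_EXCLUDE
    counts = Counter()
    for text in paragraphs:
        words = re.findall(r"[a-z]+", text.lower())     # drops special characters/punctuation
        counts.update(w for w in words if 2 <= len(w) <= 20 and w not in stops)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {w: c / max_count for w, c in counts.items()}   # 1 = most frequent word
```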
2.9 Software
The analysis was conducted in Python (3.7.1). The main modules used were pandas (1.3.5), numpy (1.21.5), pingouin (0.5.3), statsmodels (0.13.1), nltk (3.7), and gensim (4.2.0). The visualisation was completed with seaborn (0.11.0), matplotlib (3.5.1), and wordcloud (1.8.2.2) in Python and ggplot2 (3.3.6) in R (4.0.1).
3 Results
3.1 Samples characteristics
The full sample consisted of 34927 participants at baseline, of whom only 33890 fully completed the PVT. The mean age was 64.7 ± 7.8 years, and 53% of the participants had an education level comparable to A/AS levels or higher. Among these 34927 participants, 4962 (mean age: 61.7 ± 7.2 years) completed the PVT at a follow-up timepoint. Full details on the sample demographics are available in Table 1.
| Variable | Category | Baseline, n (%) | Follow up, n (%) |
| --- | --- | --- | --- |
| TOTAL COUNT | | 34927 | 4962 |
| Age group (years) | 40-50 | 346 (1%) | 54 (1%) |
| | 50-60 | 9525 (27%) | 1557 (39%) |
| | 60-70 | 14225 (41%) | 1661 (42%) |
| | 70-80 | 10451 (30%) | 661 (17%) |
| | >=80 | 380 (1%) | 6 (0%) |
| | Missing information | 0 (0%) | 0 (0%) |
| Educational level | College or University Degree | 16435 (47%) | 1888 (48%) |
| | A level or equivalent | 12828 (37%) | 1519 (39%) |
| | O levels/GCSEs or equivalent | 18567 (53%) | 2174 (55%) |
| | CSEs or equivalent | 4649 (13%) | 633 (16%) |
| | NVQ or HND or HNC or equivalent | 6558 (19%) | 802 (20%) |
| | Other professional qualifications (e.g., nursing, teaching) | 12783 (36%) | 1447 (37%) |
| | None of the above | 2034 (6%) | 165 (4%) |
| | Prefer not to answer | 116 (0.3%) | 54 (1%) |
| | Missing information | 419 (1%) | 0 (0%) |
Table 1. Age and education of the participants of the study at baseline and at the follow-up timepoint. CSE = Certificate of Secondary Education, GCSE = General Certificate of Secondary Education, NVQ = National Vocational Qualification, HNC = Higher National Certificate, HND = Higher National Diploma, A Level = Advanced Level, O Level = Ordinary Level.
3.2 Measures of individual trial difficulty, ability, and visuo-motor speed from IDoCT
The IDoCT model converged after 250 iterations when estimating the trial difficulty measures (D), with the mean percent change in word difficulty tending to 0. The second iterative process, required to determine AS and DT, converged after 10 iterations while defining the measures of ability and performance. DS was calculated in a final step of the model computation.
The distributions of D (unscaled difficulty) and DS (scaled difficulty) are available in Figure 4A, B, together with the original difficulty scale (Dold) (Fig. 4C), and the association between DS and Dold (Fig. 4D). The mean D, DS, and Dold were respectively 0.60 ± 0.07, 0.72 ± 0.06, and 0.51 ± 0.24. The full list of the words presented and their associated measures of D, DS, and Dold is available in the Supplementary Table A3. In brief, according to the IDoCT D scale, the five most difficult words, in ascending order of difficulty, were glower, malefactor, pachyderm, matron, and plethora, while the five easiest words, in ascending order of difficulty, were calm, weld, herd, desolate, and engraved. According to DS, the five hardest words were buffet, trivet, prodigious, bucolic, and truncate, and the five easiest were fabricate, angry, monarch, fly, and run. In general, most of the words that were assigned higher difficulty scores were of Latin, Greek, or French origin.
The change in assigned difficulty level per word-picture pair between D and DS is presented in Figure 5. In brief, the difficulty of words that were presented exclusively to participants with lower abilities tended to decrease after the scaling. On the other hand, words that were assigned only to participants with higher abilities tended to increase in difficulty after scaling. This pattern of results accords with the method working to correct for sampling bias.
The mean AS predicted by IDoCT across the cohort was 0.39 ± 0.08, while the mean DT was 3160 ± 1081. The mean AB was 0.85 ± 0.09. The distributions of AS, DT, and AB were compared, as well as their associations with the raw median RT and the number of correct answers, as presented in Figure 6. As can be observed, AB was characterised by an atypical bimodal distribution centred around high ability scores (~0.85), which supports the hypothesis of a ceiling effect. By contrast, both DT and AS had the expected near-Gaussian distribution, which, in the case of AS, was centred around average scores (~0.4).
Comparing the associations of AB and AS (Fig. 7) with the number of correct replies and the median RT, the expected associations are observed for AS, with more correct answers and faster RTs being related to higher AS. By contrast, for AB, some participants were assigned a low AB score despite giving correct answers in the majority of cases and despite having low median RTs. Similar associations were observed when comparing AS, AB, and DT to the 25th and 75th quantiles of the RT distributions (Fig. 8).
For DT, no association with the number of correct answers was observed (Fig. 7E), which is expected considering that the level of visuo-motor latency should not affect how correctly each participant replies. On the other hand, DT was almost linearly associated with the median RT (Fig. 7F), with higher median RTs for participants with longer DT. This is also expected, since participants with longer visuo-motor latency are expected to be slower overall. No clear association was observed between AS and DT, or between DT and AB. The latter is not surprising considering that DT is not supposed to capture cognitive performance like AB. On the other hand, the association between AS and AB resembles that between AB and the number of correct answers, which is expected considering that AS is almost linearly associated with accuracy (Fig. 7G, H).
3.3 Trial-by-trial difficulty trajectories
To validate that Dold led to ceiling effects, trial difficulty trajectories were plotted, showing how the difficulty of the sampled word-picture combinations changed sequentially through the task. Separate trajectories were computed for D, DS, and Dold, and for participants with different ability levels. Participants were divided into 10 groups based on AB (0-0.1, 0.1-0.2, … 0.9-1.0), and their mean trajectories were obtained by averaging the difficulty scores of the words presented at each trial position, as shown in Figure 9. If the adaptive staircase approach had worked as intended, a gradual increase in difficulty should have been observed for all participants with medium-high ability (AB > 0.3). Instead, after as few as five trials, Dold either reached a plateau or started to decrease, which suggests that no words with higher difficulty as defined by Dold were available after that point, forcing the algorithm to provide words of either equal or lower difficulty. Furthermore, the difficulty level as defined by D and DS appeared not to increase across time. Together, these results support the hypothesis that the original difficulty scale was not optimally calibrated for the assessed cohort.
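For reference, this trajectory computation can be sketched as below, assuming a long-format trial table and per-participant AB scores with illustrative column names: participants are binned into AB deciles and the mean difficulty of the word shown at each trial position is averaged within each bin.

```python
import numpy as np
import pandas as pd

def difficulty_trajectories(trials, scores, difficulty_col="D"):
    """Mean difficulty of the presented word at each trial position, per AB decile."""
    merged = trials.merge(scores[["participant_id", "AB"]], on="participant_id")
    merged["ab_group"] = pd.cut(merged["AB"], bins=np.arange(0.0, 1.01, 0.1))
    return (merged.groupby(["ab_group", "trial_index"], observed=False)[difficulty_col]
                  .mean()
                  .unstack("trial_index"))          # one row (trajectory) per AB decile
```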
3.4 AS has better test-retest reliability compared to AB
IDoCT was applied to the follow-up data to evaluate retest reliability. The model converged after 250 and 10 iterations respectively when predicting D and AS/DT. The distributions of the predicted AS and DT, as well as of AB, are available in Figure 10. The mean AS, DT, and AB at follow up were respectively 0.4 ± 0.08, 3009 ± 1035, and 0.83 ± 0.09. Similar to the baseline, both AS and DT were characterised by Gaussian-shaped distributions, with AS being centred around average values (~0.4), while AB again consisted of a bimodal distribution centred around high ability scores.
Comparing the IDoCT predictions at baseline and follow up, AS and DT had test-retest reliabilities of r = 0.77 and r = 0.57, respectively (Fig. 11). The test-retest reliability of AB (r = 0.66) was substantially lower than that of AS. Furthermore, the Bland-Altman plots showed a poor spread for AB, with only high AB scores being consistent across timepoints.
3.5 Older and better educated participants have higher cognitive abilities scores
When predicting AS from age group and education level, a significant regression equation was found (F(13, 33876) = 710.6, p < 0.001) with an R2 of 0.21. The regression equation for the original AB score was also significant (F(13, 33864) = 561.2, p < 0.001) but with a lower R2 of 0.17. Both age and education were significant predictors of AS and AB. Regarding education, the strongest predictors were having a college/university degree (ed1) and A/AS levels (ed2), which had positive effect sizes in standard deviation (SD) units of respectively 0.66 and 0.35 for AS, and somewhat less at 0.57 and 0.33 SDs for AB, compared to the reference category. In the case of age, older participants had higher AS compared to participants aged 45. The increase was gradual until age 70, at which point a plateau was reached, with participants aged 75 reaching a 0.48 SD increase in AS versus a 0.41 SD increase in AB, compared to the reference category (Fig. 12).
3.6 Older participants have longer visuo-motor latency times
In the case of DT, a significant regression equation was found (F(10, 33876) = 39.41, p < 0.001) with an R2 of 0.015. Both age and education were significant predictors of DT, but the effect size of education was consistently small to negligible (below 0.1 SD units). The relationship between age and DT was comparable to the association with AS, with older participants showing a gradual increase in DT, or visuo-motor latency time. More specifically, participants aged 75 and 80 showed respectively a 0.38 SD and a 0.52 SD increase in DT compared to participants aged 45 (Fig. 12).
3.7 Imaging analysis feature selection
The results of the fine-tuning are reported in Figure 13 for AS and Figure 14 for DT. In total, P models were trained for each dataset (i.e., FA, Thickness, Volume, and Intensity), where P corresponds to the number of features available. Model0 corresponds to the model trained on all features except for the feature with the lowest correlation coefficient. For each of the following models, the feature with the lowest coefficient r among those remaining was dropped, until Model(P-2), which was trained exclusively on the last feature available (i.e., the feature with the highest r correlation coefficient).
When predicting AS, in the case of FA and Thickness, Model0 was the best performing, which included all features except for, respectively, the FA in the right cingulate gyrus (part of cingulum) and the thickness in the right pars triangularis. Similarly, Model1 was the best performing for Volume, meaning that only two features (i.e., the Brain Stem and Crus I Cerebellum vermis) were dropped when predicting AS. Finally, in the case of Intensity, Model11 yielded the highest R2 on the validation set, resulting in 12 features being dropped (i.e., mean intensity of CSF, White Matter hypointensities, non-White matter hypointensities, Corpus Callosum Mid-Posterior, 5th Ventricle, left/right Cerebellum White Matter, Corpus Callosum Posterior, right/left Cerebellum Cortex, left Amygdala, and volume of White Matter hypointensities). The R2 obtained for each model, and each dataset, is available in the Supplementary Tables A4-A7.
When predicting DT, Model12 was the best FA model, which resulted from dropping 13 features (i.e., FA in the left/right superior thalamic radiation, left/right corticospinal tract, right posterior thalamic radiation, left inferior longitudinal fasciculus, left superior longitudinal fasciculus, left parahippocampal part of cingulum, middle cerebellar peduncle, right/left inferior fronto occipital fasciculus, left anterior thalamic radiation, and forceps minor). For Volume and Thickness, Model97 and Model45 were the best performing; as a result, respectively only 41 out of 139 and 13 out of 60 features were kept. When fine-tuning the models on the intensity dataset, all the models had R2 <= 0. The only model with a positive R2 (R2 = 0.001) was obtained after dropping all the features except one. Due to the low performance of the intensity-based models, no features were extracted from the intensity dataset during the feature selection step. More detailed information on which features were dropped during the fine-tuning is available in the Supplementary Tables A8-A11, as well as the obtained R2 on the train and validation set of each model.
Results of the feature selection process when applying the same train-validation pipeline to AB are reported in Figure 15.
3.8 Identification of neural correlates of AS and DT: univariate and multivariate analysis
The selected features were used as regressors of multiple linear regression models trained on the full train set and tested on the held-out test set. Two models were trained for each dataset (i.e., FA, Thickness, Volume, and Intensity), predicting either AS or DT. Overall, significant regression models were found for all analysed datasets, with an average train and test R2 that was respectively 0.02 ± 0.01 and 0.016 ± 0.005 across all datasets for AS, and 0.004 ± 0.002 and 0.002 ± 0.001 across all datasets for DT. Full results are available in Table 2. The model for the intensity dataset and DT was not computed due to the identified low performance at the feature selection step of the analysis. For AB, significant regression models were found for all analysed datasets, with an average train and test R2 that was respectively 0.018 ± 0.01 and 0.013 ± 0.005. Full results are available in Table 2. Therefore, while the models predicting AS had modest R2 values, they performed numerically better than the models predicting AB across all four imaging datasets.
| Dataset | AS R2 train | AS R2 test | AS p-value | AS F train | AB R2 train | AB R2 test | AB p-value | AB F train | DT R2 train | DT R2 test | DT p-value | DT F train |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA | 0.011 | 0.016 | <0.001 | 9.539 | 0.011 | 0.014 | <0.001 | 8.738 | 0.003 | 0.003 | <0.001 | 4.924 |
| Intensity | 0.013 | 0.008 | <0.001 | 8.651 | 0.012 | 0.006 | <0.001 | 7.247 | — | — | — | — |
| Grey volume | 0.037 | 0.023 | <0.001 | 5.800 | 0.034 | 0.019 | <0.001 | 5.308 | 0.007 | 0.001 | <0.001 | 3.594 |
| Thickness | 0.016 | 0.015 | <0.001 | 5.588 | 0.015 | 0.012 | <0.001 | 5.321 | 0.002 | 0.001 | <0.001 | 3.056 |
Table 2. The features identified during the feature selection step were used as regressors in multiple linear regression models to predict AS, AB, and DT. The R2 (train and test), F statistic, and p-value of each model are reported in the table.
The significant features of the AS and DT models, and their respective Eta2, are presented in Figure 16. The top 15 significant features with the highest Eta2 across all data modalities were identified as the best neural correlates of AS and DT and used in the next steps of the analysis. The identified neural correlates of AS were: Hippocampus, superior and transverse temporal gyrus, medial lemniscus, anterior corpus callosum, medial frontal cortex, uncinate fasciculus, nucleus Accumbens, caudal middle frontal gyrus, inferior temporal gyrus, superior frontal gyrus, and parahippocampal part of cingulum. The identified neural correlates of DT were: forceps major, medial lemniscus, posterior thalamic radiation, middle temporal gyrus, lateral orbitofrontal cortex, occipital fusiform gyrus, and lateral occipital cortex.
In order to assess the robustness of the results, a second set of neural correlates were identified by completing a univariate analysis. In this case, the selected features were ranked according to the magnitude of their Pearson correlation coefficient obtained after correlating each feature with either AS or DT. The top 15 features across all data modalities with a significant p-value (p-value < 0.001) were selected as neural correlates of AS and DT. The identified neural correlates of AS were: Amygdala, Hippocampus, Frontal Pole, Insular Cortex, Cerebellum, Superior temporal gyrus, and Temporal fusiform cortex. On the other hand, the identified neural correlates of DT were: middle temporal area, Cerebellum, Inferior temporal gyrus, Intracalcarine cortex, Occipital Pole, superior temporal, and lateral orbito-frontal area.
3.9 DT is associated more strongly with brain regions with visuo-motor functions and AS with regions with memory and language functions
The full list of the words with the highest frequency of occurrence in the articles that use the anatomical labels associated with AS and DT is available in the Supplementary Tables A12-A15, for both the univariate and multivariate analysis. A summary of the results is presented in Figure 17.
In summary, for the DT-associated anatomical labels, visual (2.57), age (2.0), motor (1.8), lesion (1.32), and stimulus (1.27) were the words with the highest frequency of occurrence in the multivariate analysis, while visual (3.63), stimulus (1.53), and lesion (1.29) had the highest frequencies in the univariate analysis. Conversely, for the AS-associated labels, the most frequent words were memory (3.75), age (3.24), and auditory (2.0) in the multivariate analysis, and memory (2.0), social (1.68), behaviour (1.59), emotional (1.33), and learning (1.28) in the univariate analysis (Fig. 17).
4 Discussion
IDoCT is a flexible method designed to fractionate the detailed timecourses that are collected during performance of a computerised cognitive task into components that can be explained by inter-subject variability in basic visuo-motor processing speed and device latency on the one hand, and the specific cognitive abilities that the task was intended to manipulate on the other (Giunchiglia et al., 2023). This is achieved in a simple, robust, and data-driven manner that iteratively re-estimates individuals’ abilities and trial-difficulty scales whilst handling the speed-accuracy tradeoff.
This method was initially applied to improve the precision of performance estimates from the Cognitron library of online tasks (Giunchiglia et al., 2023). However, the approach is sufficiently flexible to be adapted for practically any computerised task that varies dimensions of cognitive difficulty across trials and for which performance recordings are available for large numbers of individuals. The results presented here provide further evidence of the utility of IDoCT in the context of data derived from an independently designed UK Biobank task, including improvements in the trial-difficulty scale, participants’ score distributions, retest statistics, demographic correlations, and imaging associations.
More specifically, a critical limitation of the PVT dataset is that the original summary score distributions are malformed due to the dynamic sampling algorithm having operated across a sub-optimal trial-difficulty scale (Dold). Our analysis of sampling trajectories highlights the basis of that limitation. Specifically, the difficulty (Dold) of sampled trials for participants with moderate to good original scores (AB) either reaches a plateau or decreases after just 5 of the 30 available steps. Replotting the sampling trajectories using data-driven trial-difficulty estimates (D and DS) indicates that the intended difficulty increments across time were not achieved. Furthermore, the correlations between the original scores and the accuracy or response time measures are also hard to interpret, suggesting that ceiling effects were not the only problem; the ordering of the scale may also be sub-optimal for the UK Biobank population, perhaps due to it having been developed with a United States (US) population in mind. This resulted in the original summary scores (AB) having a malformed bimodal distribution with a high mean estimate (0.8, where 1 is the maximum), which is unlikely to reflect the true underlying population distribution of crystallised intelligence abilities (Flynn, 1987).
IDoCT operated by (i) leveraging the population’s accuracy and response time measures while (ii) factoring in the component of performance variance that is better explained by basic visuo-motor response times and (iii) taking into account potential (and here intended) bias due to more difficult items being sampled for higher performing individuals. Taken together, these characteristics enable recalculation of the trial-difficulty scale in a data-driven manner. The resultant DS scale can potentially be used in future studies as the basis for the adaptive sampling algorithm, though it may still be advisable to include more high-difficulty items given the observed ceiling effects.
More importantly, analysis of the data distributions demonstrates that IDoCT was successful in addressing the issues with the original summary scores. For example, AS has the expected Gaussian distribution (Flynn, 1987), supporting the view that IDoCT could overcome the ceiling effect limitation generated by Dold and better capture the underlying crystallised language ability that is the target of the task. Plotting the original AB and re-estimated AS scores against the basic mean reaction time and total correct response measures shows distorted distributions for the former but not the latter. Furthermore, comparing the word difficulty measures before and after the scaling (Fig. 5) makes it evident that IDoCT properly addresses the biased sampling issue, increasing the difficulty measure of words that were presented exclusively to participants of higher ability and decreasing it for words presented only to low-performing individuals. Moreover, the retest plots demonstrate a more homogeneous spread on the Bland-Altman plots, with a lower SD of the differences and a higher cross-session correlation for AS. Although higher than that of AB, the relatively modest test-retest reliability of AS could be due to multiple factors, such as practice effects from having already completed the test previously and ageing effects. In sum, all analyses indicate that the performance score distributions from IDoCT are superior to those output by the original task.
The obtained measures of DT and AS are further validated by the results on the associations with age and education. Specifically, although DT shows the expected increase with age (Habekost et al., 2013), the effect size of the relationship with education level is negligible. This supports the view that DT captures individual differences in fundamental visuo-motor latencies, as opposed to the knowledge of the meaning of words. Conversely, AS improves with age—the expected pattern of results as knowledge of words improves and “crystallises” throughout the lifespan (Hayden & Welsh-Bohmer, 2011; Park & Reuter-Lorenz, 2009; Salthouse, 2009; Singh-Manoux et al., 2012)—but also improves with exposure to higher education, where there is a higher likelihood of learning unusual words (Ceci, 1991; Guerra-Carrillo et al., 2017). Notably, the scale of both of these associations is numerically stronger for AS than for the original AB score, together resulting in an improved R2 in the linear regression analysis (AB: 0.17, AS: 0.21).
The above results confirm that IDoCT is successful in recalculating the difficulty scale, and in fractionating performance into distinct cognitive and visuo-motor components that have superior retest properties and improved demographic predictive validity. These findings align with our previous applications of this technique to tasks from the Cognitron library (Giunchiglia et al., 2023). The more novel question pertains to whether these advantages in the precision of the task performance estimates extend to improvements in imaging associations.
The machine-learning pipeline addresses this by measuring how accurately the original AB measure can be predicted from data of different imaging modalities and using this as a baseline for comparing the AS and DT performance estimates. Overall, the behavioural–imaging associations are in the small range, albeit statistically significant. This may not be unexpected given the simplicity of the linear regression machine-learning approach applied and the recent literature on the scale of such associations when estimated within well-powered datasets (Marek et al., 2022). More importantly, the associations with AS are consistently stronger than AB across all analysed imaging modalities. Among the four datasets studied, the volume measures led to models with the highest R2 on the test set for both AS and DT, suggesting that these measures might be more informative when trying to predict cognitive ability and visuo-motor latency.
The NLP analysis provides an unbiased, data-driven way to qualitatively evaluate the functional specificity of the AS and DT measures, by determining the functional terms that most commonly co-occur with their associated brain regions in the literature. The results have face validity, with brain regions identified as the best predictors of DT mainly relating to visual and motor functions, such as the intracalcarine cortex (Coullon et al., 2015) and the cerebellum (Guell et al., 2018). This is expected considering that DT is supposed to measure visuo-motor latency times. Conversely, AS is mainly associated with brain regions involved in memory, language, and auditory functions, such as the hippocampus (Eichenbaum et al., 1999), uncinate fasciculus (Papagno, 2011), and inferior temporal gyrus (Onitsuka et al., 2004). Considering the nature of the task, which requires participants to associate the meaning of spoken/written words with different images, the identified brain functions are as expected.
A strength of this study is the sample size, which allows firmer and more precise conclusions to be drawn. An important aspect to consider, however, is that the UK Biobank sample includes mainly middle to older aged individuals (<1% below 50 years old), who are not necessarily representative of the general population in terms of health, physical, and lifestyle aspects (Fry et al., 2017). This does not undermine our validation of the IDoCT approach, but it should be noted that the data-driven measures of trial difficulty might change if younger individuals are included in the analysis, as different words are more likely to be learnt at different stages of the lifespan (e.g., school vs work). A further strength of the study is that four different kinds of features (imaging-derived phenotypes) from two imaging data modalities were analysed, in combination with both behavioural and demographic data, which provides better insight into the best imaging predictors of AS and DT.
Despite these strengths, there are some limitations. First, IDoCT requires as input a description of each trial in order to extract data-driven measures of trial difficulty. In this study, the word presented at each trial is used as that definition because it was the only information available. However, in the case of the PVT, the difficulty of a trial is influenced not only by the word presented, but also by the set of pictures from which participants are required to choose. Without this information, it is difficult to interpret why specific words were assigned higher or lower difficulty scores, as the reason could be a combination of the difficulty of the word itself and of the figures presented. However, a general observation is that higher values of DS on the difficulty scale are mainly characterised by a Latin, French, or Greek etymology. This pattern of results again has some face validity, because English is not a Romance but a Germanic language (Bech & Walkden, 2016), but it offers only a limited interpretation due to the lack of information about the word-figure pairs.
Further limitations are related to the NLP analysis. First, it was conducted exclusively on open-access papers, which limits the literature search to previous studies that are freely accessible. In addition, although the NLP analysis conducted in this study was sufficient to achieve the intended aim, it consisted exclusively of the estimation of word frequencies across different papers. Further research could extend this by using more advanced NLP approaches, such as topic modelling to extract common topics (Koltcov et al., 2014), or by implementing digraphs and dependency parsing to identify related and connected words (Kübler et al., 2009). The latter approach could provide additional information on the meaning of words within the context of the sentence in which they appear, and help reduce the noise of the one-to-many mappings between brain structures and cognitive functions.
Finally, there are two limitations of the imaging analysis that could be addressed by further research. First, the study is exclusively focused on associations with structural measures. However, cognitive processes generally result from complex brain networks (Raichle et al., 2001), rather than the individual discrete contributions of brain regions (Petersen & Sporns, 2015). Analyses using graph-theoretic approaches to quantify the information-processing properties of networks, or focusing on network dynamics from functional MRI, might provide better and more detailed insights into the neural correlates of cognitive task performance. Relatedly, the machine-learning approaches used here are mainly shallow linear models, and they are applied independently to each imaging modality. The strength of associations could improve if more advanced methods, such as deep learning and the combination of multi-modal features, were applied. Nonetheless, the fact that association strengths increase for AS versus AB across all imaging modalities confirms our primary hypothesis that IDoCT can provide superior cognitive ability measures for future association studies, including those investigating more advanced imaging analysis methods.
In conclusion, we successfully apply IDoCT to the PVT data collected as part of the UK Biobank imaging extension, and obtain superior prediction of subject-level cognitive ability and visuo-motor latency, as well as an optimised data-driven word difficulty scale calibrated on the UK population. Our results further validate IDoCT by showing the improved relationship between the performance metrics and age and education, as well as brain imaging metrics in terms of the strength and functional specificity of associations.
Data and Code Availability
The imaging and cognitive datasets analysed in this study are available via the UK Biobank data access process (see http://www.ukbiobank.ac.uk/register-apply/). The derived AS and DT measures can also be accessed via the same process (see https://biobank.ndph.ox.ac.uk/ukb/label.cgi?id=504). The code for IDoCT will be released publicly on GitHub.
Author Contributions
A.H. obtained funding. A.H., S.C., S.S., and N.A. designed the study. S.C. and N.A. collected the data and developed the cognitive task. S.S. collected the imaging data. V.G. conducted the analysis, validation and implemented the methodology. A.H. and S.S. supervised the project. V.G. wrote the first draft of the manuscript. All authors reviewed and approved the manuscript.
Funding
V.G. was supported by the NIHR Imperial Biomedical Research Centre (BRC) grant to A.H. and by the Medical Research Council, MR/W00710X/1. N.A. was funded by the Medical Research Council and Wellcome Trust.
Declaration of Competing Interest
The authors declare no competing interests.
Acknowledgements
This research has been conducted using the UK Biobank Resource under application number 100870.
Supplementary Materials
Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00087.