Abstract
When encountering letter strings, we rapidly determine whether they are words. The speed of such lexical decisions (LDs) is affected by word frequency. Apart from influencing late, decision-related, processing stages, frequency has also been shown to affect very early stages, and even the processing of nonwords. We developed a detailed account of the different frequency effects involved in LDs by (1) dividing LDs into processing stages using a combination of hidden semi-Markov models and multivariate pattern analysis applied to EEG data and (2) using generalized additive mixed models to investigate how the effect of continuous word and nonword frequency differs between these stages. We discovered six stages shared between word types, with the fifth stage consisting of two substages for pseudowords only. In the earliest stages, visual processing was completed faster for frequent words, but took longer for word-like nonwords. Later stages involved an orthographic familiarity assessment followed by an elaborate decision process, both affected differently by frequency. We therefore conclude that frequency indeed affects all processes involved in LDs and that the magnitude and direction of these effects differ both by process and word type.
INTRODUCTION
We constantly try to interpret our environment. For example, when driving and seeing a sequence of letters on a road sign, we intuitively try to read it to not miss out on important information. In these situations, a process to determine whether such a sequence of letters actually corresponds to a word might precede our reading attempts. This lexical decision (LD) step has been studied extensively and is well known to be sensitive to word type and frequency manipulations (e.g., Baayen, Milin, & Ramscar, 2016; Brysbaert, Stevens, Mandera, & Keuleers, 2016; Yap, Sibley, Balota, Ratcliff, & Rueckl, 2015; Keuleers, Diependaele, & Brysbaert, 2010; New, Brysbaert, Veronis, & Pallier, 2007; Hauk, Davis, Ford, Pulvermüller, & Marslen-Wilson, 2006; Ratcliff, Gomez, & McKoon, 2004; Balota & Spieler, 1999; Sereno, Rayner, & Posner, 1998; Andrews, 1989, 1996; Balota & Chumbley, 1984; Forster & Chambers, 1973).
Usually, the more frequent a word, the faster and more accurate the LD (e.g., Yap et al., 2015; Baayen, Wurm, & Aycock, 2007; Baayen, Feldman, & Schreuder, 2006; Balota & Spieler, 1999; Sereno et al., 1998; Balota & Chumbley, 1984; Forster & Chambers, 1973). In addition, the type of nonwords featured in an experiment also impacts the accuracy and speed of LDs (e.g., Berberyan, van Rijn, & Borst, 2021; Ratcliff et al., 2004; Stone & Van Orden, 1993; Shulman & Davidson, 1977; James, 1975). For example, Stone and Van Orden (1993) and Ratcliff and colleagues (2004) both revealed faster and more accurate LDs for words when random strings were used as nonwords instead of pseudowords (i.e., stimuli that could be a word, but do not exist in the language). When both random strings and pseudowords are present, the former are responded to faster than words whereas the latter are responded to slower than words (e.g., Berberyan, van Rijn, et al., 2021; Ratcliff et al., 2004; Stone & Van Orden, 1993). RT and accuracy, however, cannot directly reveal which individual LD processes are affected by stimulus properties nor how these properties impact processing.
Previous research identified three candidate processes that are likely to be affected by word type and frequency manipulations: the orthographic processing of the stimulus, the identification and access of lexical representations, and the actual decision process. To study orthographic processing, researchers commonly focus on the effects of frequency and word type on early ERP components (e.g., N100 and P100; Araújo, Faísca, Bramão, Reis, & Petersson, 2015; Segalowitz & Zheng, 2009; Hauk, Davis, et al., 2006; Hauk, Patterson, et al., 2006; Sereno & Rayner, 2003; Sereno et al., 1998), which might be driven by improbable character pairings of random character strings, or by similarity in the make-up of infrequent words and pseudowords (Araújo et al., 2015; Hauk, Patterson, et al., 2006; Balota & Spieler, 1999; Seidenberg & McClelland, 1989). Related to this, the slightly later N170 component has been shown to discriminate, among other things, between words and sequences of geometric shapes (e.g., Mahé, Bonnefond, Gavens, Dufour, & Doignon-Camus, 2012; Maurer, Brandeis, & McCandliss, 2005; Maurer, Brem, Bucher, & Brandeis, 2005). The topography of the N170 elicited by pseudowords, although slightly different, appears to be much closer to the topography elicited by words than other symbol sequences (Maurer, Brandeis, et al., 2005). Facilitated lexical retrieval of frequent words, a common prediction made by models of language processing (e.g., Coltheart, Rastle, Perry, Langdon, & Ziegler, 2001; Seidenberg & McClelland, 1989; McClelland & Rumelhart, 1985; Paap, Newsome, McDonald, & Schvaneveldt, 1982; McClelland & Rumelhart, 1981; Morton, 1969), and prolonged or more effortful pseudoword identification attempts have been linked to the much later N400 component (Meade, Grainger, & Holcomb, 2019; Barber, Vergara, & Carreiras, 2004; Holcomb, Grainger, & O'Rourke, 2002). 
Third, it has been suggested that frequency effects in LD tasks are the result of the discriminatory nature of the task, favoring a familiarity or decision process based on a composite criterion encoding stimulus properties like frequency (Berberyan, van Rijn, et al., 2021; Wagenmakers, Ratcliff, Gomez, & McKoon, 2008; Ratcliff et al., 2004; Balota & Spieler, 1999; Balota & Chumbley, 1984).
Although the aforementioned EEG studies linked word type and frequency effects to specific cognitive processes, averaging the EEG over trials to calculate ERPs neglects trial-level variation in the duration of these processes. This prevents a precise identification of the onset and duration of cognitive processes, complicating the attribution of effects to individual processes (Borst & Anderson, 2024; Luck, 2014). The Hidden semi-Markov Model Multivariate Pattern Analysis (HsMM-MVPA) method introduced by Anderson, Zhang, Borst, and Walsh (2016) accounts for this variability. Instead of averaging trial-level EEG recordings, the method decomposes them into discrete cognitive stages.
To investigate the precise temporal locus and nature of frequency and word type effects, we therefore performed a two-stage analysis on EEG data. We first identified cognitive processing stages using the HsMM-MVPA method (Anderson et al., 2016). Subsequently, we investigated the combined effect of frequency and word type on the duration of the discovered stages, utilizing a novel frequency measure applicable to both words and nonwords (i.e., Google frequency scores; Hendrix & Sun, 2021).
Discovering Cognitive Stages
The HsMM-MVPA method models cognitive stages as states in a left-to-right hidden semi-Markov model (Borst & Anderson, 2024; Anderson et al., 2016). As depicted in Figure 1 and explained in more detail in the Methods section, the model searches for stage-specific multivariate events in the EEG—termed “bumps”—that signal the start of a stage. These multivariate events are assumed to be the same across trials and participants (similar to ERPs); the method discovers the event topologies that explain most variance in the data across all trials. Consequently, the current stage ends once the model finds sufficient evidence for the onset of the next one, again in the form of a match between the EEG signal and the EEG topology of this next stage. To model the variability in stage duration between trials, conditions, and participants, the method uses gamma distributions (Anderson et al., 2016).
Importantly, although the cognitive stage (i.e., what is represented by a Markov state) ends with the onset of the subsequent one, the method does not preclude the possibility that the cognitive process attributed to a stage (see Figure 1 and van Maanen, Portoles, & Borst, 2021) continues (as expected, e.g., in cascading models of cognition, e.g., Coltheart et al., 2001; McClelland, 1979). The duration of a stage on a particular trial thus reflects the estimated time until the onset of the next stage rather than the time until the current process ends. Similarly, the onset of a new processing stage does not necessarily imply the onset of an entirely new cognitive process. Instead, the onset of a new processing stage could also indicate that a cognitive process has progressed substantially, resulting in an additional perturbation in the EEG (i.e., a new bump topology). In the context of visual word processing, this could, for example, indicate that the orthographic representation activated by the presented stimulus has changed relative to an earlier stage (e.g., Seidenberg & McClelland, 1989; McClelland & Rumelhart, 1985). Correspondingly, it is not uncommon that more than one stage recovered by the HsMM-MVPA method is attributed to visual processing (e.g., Berberyan, van Rijn, et al., 2021; Portoles, Borst, & van Vugt, 2018; Borst & Anderson, 2015).
The HsMM-MVPA method has been applied successfully to a variety of different tasks (e.g., Portoles, Blesa, van Vugt, Cao, & Borst, 2022; Berberyan, van Maanen, van Rijn, & Borst, 2021; van Maanen et al., 2021; Portoles et al., 2018; Zhang, Walsh, & Anderson, 2017, 2018; Anderson et al., 2016), including LDs (Berberyan, van Rijn, et al., 2021). Their LD experiment featured four stimulus categories: frequent and infrequent words, pseudowords, and random strings. They identified six processing stages, independent of stimulus category (Berberyan, van Rijn, et al., 2021). This aligns with previous work by Mahé, Zesiger, and Laganaro (2015), who utilized “microstate analysis” to discover five distinct EEG topographies1 in ERPs recorded during an LD task involving words and pseudowords. Importantly, the same set of topographies accounted for the data from both words and pseudowords. In the study by Berberyan, van Rijn, et al. (2021), only the duration of the fifth stage, beginning, on average, 400 msec after stimulus presentation, differed between conditions (consistent with decision accounts; e.g., Ratcliff et al., 2004; Balota & Spieler, 1999; Balota & Chumbley, 1984). Hence, Berberyan and colleagues argued that this stage reflects a decision process, with the duration of the stage reflecting the difficulty of the LD (Berberyan, van Rijn, et al., 2021).
Importantly, their findings also suggested the decision process to be but one of six processes involved in reaching LDs. Yet, Berberyan and colleagues found no evidence in favor of word type and frequency effects on the other stages (Berberyan, van Rijn, et al., 2021), in contrast with the earlier ERP studies, which reported effects already 100–300 msec after stimulus onset (Meade et al., 2019; Araújo et al., 2015; Segalowitz & Zheng, 2009; Hauk, Davis, et al., 2006; Hauk, Patterson, et al., 2006; Sereno et al., 1998). This absence could have resulted from the lack of a continuous frequency manipulation, which might highlight more subtle processing differences that are lost when comparing only frequent and infrequent words, and which has been argued to be a necessary condition for observing the earliest frequency effects (Segalowitz & Zheng, 2009). Similarly, manipulating the frequency of both words and nonwords might yet again reveal other effects of frequency. In the current study, we will therefore vary frequency continuously for both words and nonwords, to provide a unified account of how frequency impacts the encoding, retrieval, and decision processes involved in LDs.
Choosing a “Word”-frequency Measure
Although the classical concept of frequency naturally does not apply to nonwords, other frequency-related predictors are commonly observed to influence nonword processing as well (e.g., number of orthographic neighbors, n-gram frequency or typicality, base-word frequency; Hendrix & Sun, 2021; Yap et al., 2015; Hauk, Patterson, et al., 2006; Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Barber et al., 2004; Andrews, 1989, 1996; Grainger & Jacobs, 1996; Coltheart, Davelaar, Jonasson, & Besner, 1977). Unfortunately, untangling the specific effects of these predictors and what they imply for the processing of nonwords continues to prove difficult (for recent attempts, see Hendrix & Sun, 2021; Yap et al., 2015).
One reason for this is that the same predictor can be observed to have strikingly different effects. For example, Andrews (1996) measured longer RTs for nonwords derived from very frequent words. Conversely, Yap and colleagues (2015) observed the opposite trend: faster RTs for nonwords derived from very frequent words. This could potentially result from predictors interacting in their influence on nonword processing: Nonwords that have more real word neighbors have been associated with longer RTs reflecting more effortful LDs (e.g., Balota et al., 2004; Andrews, 1989; Coltheart et al., 1977), but this trend appears to weaken or even reverse as the frequency of neighboring words increases (Yap et al., 2015; Grainger & Jacobs, 1996). In addition, the direction and shape of frequency effects on nonword processing is likely sensitive to the choice of base-word frequency measure (i.e., subtitle-based, corpus-based, rating-based, spoken frequency), because different measures will all reflect different experiences with the base words (e.g., Baayen et al., 2016; Brysbaert et al., 2016; Brysbaert & New, 2009; New et al., 2007).
Even when accounting for the combined effects of most or all of these predictors statistically (e.g., Yap et al., 2015), only parts of the puzzle might be revealed: Although nonwords with more neighbors have been described as more “word-like” (Yap et al., 2015), neighbor count and related predictors essentially carry only substring information (e.g., n-gram frequency) or information about related words (instead of the nonword directly; e.g., base-word frequency, number of neighbors) and thus might still fail to capture the familiarity or word-likeness of the whole nonword string (for similar proposals, see Hendrix & Sun, 2021; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984).
Thus, it might be more desirable to rely on a single unified measure applicable to both words and nonwords, capturing both classical frequency and the aforementioned familiarity effects. Hendrix and Sun (2021) approximated the frequency of all their stimuli with Google result counts and observed that the measure correlates with other features, such as orthographic neighbor count, while still performing well in regression models of LD RTs (Hendrix & Sun, 2021). This suggests that the Google frequency measure might carry more than just traditional frequency information, in which case it might be a suitable and meaningful proxy of the holistic word-likeness of our stimuli.
Current Study
In the current study, we therefore followed the approach taken by Hendrix and Sun (2021) and manipulated the continuous Google frequency of words and nonwords alike. To better understand the link between this Google frequency measure, presumably reflecting word-likeness, and the human language processing system, it was included as a predictor in the aforementioned two-stage analysis of the EEG data: After recovering the cognitive stages involved in LDs using the HsMM-MVPA method (Anderson et al., 2016), generalized additive mixed models (GAMMs; Wood, 2017; Hastie & Tibshirani, 1990) were used to investigate the effects of word type and Google frequency on the duration of the recovered stages. The Google frequency measure was also compared with alternative frequency measures (i.e., CELEX and SUBTLEX; Keuleers, Brysbaert, & New, 2010; Baayen, Gulikers, & Piepenbrock, 1995) and a measure of orthographic typicality (i.e., OLD20; Yarkoni, Balota, & Yap, 2008), to investigate whether it indeed captures the word-likeness of words and nonwords.
METHODS
Participants
Twenty-seven participants performed a Dutch LD task. One participant dropped out of the study and was thus excluded from analysis. Two additional participants were excluded because 45% and 32% of their trials were rejected because of unacceptably high EEG noise levels. Therefore, data from 24 adult native Dutch speakers (Nfemale = 12, Nmale = 10, Nother = 2), on average 21.2 years old (SD = 1.9 years), were included for analysis. The number of participants aligns with previous studies that utilized the HsMM-MVPA method (e.g., Portoles et al., 2022; Berberyan, van Rijn, et al., 2021; Zhang et al., 2018; Anderson et al., 2016).
Stimuli
Words and pseudowords were taken from the Dutch Lexicon Project (DLP) corpus (Keuleers, Diependaele, & Brysbaert, 2010). From all nouns in the DLP that consisted of five to nine characters and did not contain any special characters, 250 words were selected for the experiment. To obtain an equal number of nonwords, we selected, for 125 of these words, the closest (according to edit distance) pseudoword in the DLP corpus as a pseudoword variant. Random strings were obtained by scrambling the remaining 125 word strings, ensuring that the resulting string was not included in the entire DLP corpus. All participants saw the same 500 stimuli. Because each nonword was derived from one of the words, the stimulus set contained two variants of each of these stimuli (the word itself and its derived nonword). To control for this, the stimulus schedule for any participant was created semirandomly, preventing the presentation of two variants of the same stimulus on subsequent trials.
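The scrambling constraint described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the function name and the toy lexicon are ours, and the real check was run against the full DLP word list.

```python
import random

def scramble(word: str, lexicon: set, rng: random.Random) -> str:
    """Shuffle the letters of `word` until the result is neither the word
    itself nor contained in `lexicon` (assumes the word has at least two
    distinct letters, so such an anagram exists)."""
    letters = list(word)
    while True:
        rng.shuffle(letters)
        candidate = "".join(letters)
        if candidate != word and candidate not in lexicon:
            return candidate
```

For example, scrambling "brommer" against a lexicon containing only "brommer" yields an anagram such as "rmrembo" that is guaranteed to be absent from the lexicon.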
The Google frequency measure (Hendrix & Sun, 2021) was used as unified frequency measure for words and nonwords in this study. To obtain frequency scores, a custom Google search engine (CSE) was created, with the Netherlands being specified as a region and advertisement disabled. The CSE was configured to fall back to results relevant for other regions, to enrich results. The result count returned by the CSE for every stimulus, acting as the final frequency score, was recorded (Hendrix & Sun, 2021).
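As a rough illustration of how such result counts can be retrieved programmatically, the sketch below uses Google's public Custom Search JSON API. The helper names and the key/engine-id placeholders are ours; the study's CSE configuration (region, advertisement, and fallback settings) lives in the search engine itself rather than in code.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/customsearch/v1"

def result_count(response_json: dict) -> int:
    # The API reports the total result count as a string inside
    # the "searchInformation" object of the JSON payload.
    return int(response_json["searchInformation"]["totalResults"])

def fetch_count(query: str, key: str, cx: str) -> int:
    """Query a custom search engine (identified by `cx`) and return
    the result count for `query` (illustrative helper)."""
    params = urllib.parse.urlencode({"key": key, "cx": cx, "q": query})
    with urllib.request.urlopen(f"{API_URL}?{params}") as resp:
        return result_count(json.load(resp))
```

The returned count then acts as the (raw) Google frequency score for the stimulus.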
Google Frequency Analysis
To understand what is captured by this novel frequency measure, pairwise correlations between this measure, the SUBTLEX (Keuleers, Brysbaert, & New, 2010) frequency measure, and the CELEX (Baayen et al., 1995) frequency measure were computed. The SUBTLEX frequency measure estimates word frequency based on movie subtitles, whereas the CELEX measure, used in the study by Berberyan, van Rijn, et al. (2021), estimates word frequency based on corpora. For nonwords, SUBTLEX and CELEX frequency scores of the base words were used (i.e., the frequency score of the word from which each nonword was derived).
Hendrix and Sun (2021) also observed the Google frequency measure to correlate with the orthographic Levenshtein distance (OLD20; Yarkoni et al., 2008) measure, which relates to the stimulus' number of orthographic neighbors. Because the OLD20 measure is also applicable to words and nonwords alike, it was compared with the Google frequency scores as well. For a word or nonword, the OLD20 score measures the number of character edits necessary to reach another real word, averaged over the closest 20 words (Hendrix & Sun, 2021; Yarkoni et al., 2008). For this study, the closest 20 neighboring words for every stimulus were identified from the set of all words included in the DLP (Keuleers, Diependaele, & Brysbaert, 2010). For example, the DLP includes two neighbors for the random string “tajel” (derived from “latje”) with an edit distance of one (“tafel” and “tabel”) and more than 18 neighbors with an edit distance of two, so that the 20 closest neighbors yield an OLD20 score of (2 × 1 + 18 × 2) / 20 = 1.9. In contrast, the DLP only includes neighbors with an edit distance of at least four for the random string “rmrembo” (derived from “brommer”). A low OLD20 score thus indicates that few character-level changes are necessary to reach many alternative real words (in the DLP), suggesting that the corresponding stimulus is orthographically quite “word-like” (Hendrix & Sun, 2021; Yap & Balota, 2007).
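The OLD20 computation can be sketched as follows, assuming a standard dynamic-programming Levenshtein distance. The function names and the toy lexicon in the usage note are ours; the study computed distances against all DLP words.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between strings `a` and `b` (insertions,
    deletions, and substitutions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def old20(stimulus: str, lexicon, n=20) -> float:
    """Mean edit distance from `stimulus` to its `n` closest
    words in `lexicon`."""
    dists = sorted(levenshtein(stimulus, w) for w in lexicon)
    return sum(dists[:n]) / n
```

With a toy lexicon, `old20("tajel", ["tafel", "tabel", "qqqqq"], n=2)` averages the two distance-1 neighbors, mirroring the "tajel" example above.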
LD Task
On every trial, participants first fixated a dot for a period lasting between 500 and 750 msec before stimulus presentation. Participants then had 2000 msec to provide a “Dutch word” (green button, index finger) or “not a Dutch word” (red button, middle finger) response using an RB 740 button box from CEDRUS. After 2000 msec or response initiation, the stimulus was replaced with a hash mask of the same length as the stimulus. This mask was presented for a duration between 1500 and 2500 msec.
Participants completed six blocks, of which two were practice blocks. In each of the practice blocks, participants saw 12 stimuli not contained in the experimental blocks. Each of the four experimental blocks consisted of 125 experimental trials, resulting in 500 experimental trials, of which 250 were word trials and 250 were nonword trials (125 pseudowords and 125 random strings). Performance feedback (accuracy and mean RT) was given after each block.
Data Recording and Preprocessing
EEG data were recorded at 512 Hz with an Active Two system from BioSemi with 32 electrodes placed according to the 10–20 layout. Six skin electrodes were used to record from the mastoids and to detect horizontal as well as vertical eye movements. For preprocessing, we followed the steps taken by Berberyan, van Rijn, et al. (2021) in EEGLAB (Delorme & Makeig, 2004). First, the EEG signal recorded during the experimental blocks was re-referenced with respect to the mastoids, low-pass filtered (40 Hz) and high-pass filtered (0.5 Hz) using EEGLAB's Finite Impulse Response filter. Subsequently, the signal was inspected visually, and obvious artifacts not related to eye movements were removed. Channels that were consistently noisy were excluded (at most two channels were excluded) and later interpolated from surrounding channels using spherical interpolation. Before the interpolation, an infomax independent component analysis was performed on the cleaned data. Components reflecting blinks and horizontal eye-movements were excluded.
During subsequent epoching, trials from which an artifact had been removed in the interval from 400 msec before stimulus onset to 200 msec after the response were excluded from further analysis. The average signal over the 100 msec before stimulus onset was subtracted from every epoch as baseline correction. In addition, trials on which participants responded incorrectly, not at all, or later than 1950 msec post stimulus onset, were excluded (on average, 8.25%). After preprocessing, 5402 word trials and 5546 nonword trials (2668 pseudoword and 2878 random string trials) were available for analysis. For the HsMM-MVPA analysis, the EEG signal was subsequently downsampled to 100 Hz and the variance of each participant scaled to the average variance across all participants. The 100-Hz rate was employed by previous studies (e.g., Berberyan, van Rijn, et al., 2021; Anderson et al., 2016) out of computational considerations (as discussed below, the method takes all trials and all time-points of all participants into account simultaneously). The pattern the method searches for to identify the onset of a new cognitive stage, the 50-msec bump across electrodes, is not of a particularly high frequency and can thus be picked up even in a signal downsampled to 100 Hz. The data were then decomposed using a principal component analysis. The first 10 principal components (accounting for more than 90% of the variance in the signal) were z-scored per participant and retained as input for the HsMM-MVPA analysis.
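The dimensionality-reduction step can be sketched as follows: PCA on the stacked (samples × channels) EEG via a singular value decomposition, keeping the first 10 component scores and z-scoring them. This is an illustration only; random data stands in for the real recordings, and in the actual analysis the z-scoring was applied per participant.

```python
import numpy as np

rng = np.random.default_rng(0)
eeg = rng.standard_normal((5000, 32))   # samples x channels (toy data)

# PCA via SVD of the mean-centered signal
centered = eeg - eeg.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# project onto the first 10 principal components
scores = centered @ vt[:10].T           # samples x 10

# z-score each retained component (per participant in the real analysis)
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)
```

The z-scored component scores then serve as the multivariate input to the HsMM-MVPA analysis.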
HsMM-MVPA Analysis
The HsMM-MVPA method by Anderson and colleagues (2016) detects 50-msec multivariate peaks in single trials, referred to as “bumps.” Following leading ERP theories (classic & synchronized; Yeung, Bogacz, Holroyd, Nieuwenhuis, & Cohen, 2007; Yeung, Botvinick, & Cohen, 2004; Makeig et al., 2002), these bumps are assumed to reflect the onset of a cognitive processing stage (see Figure 1). The choice of 50 msec for the bump width was based on the original work of Anderson and colleagues (2016), who conducted extensive simulation studies, revealing that even when the true bump width deviates (30–130 msec) from the assumed 50 msec, the method remains accurate in identifying stage onsets. Hence, the bump width of 50 msec was used in all studies utilizing the HsMM-MVPA method up to this point, including the current one.
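The assumed bump shape can be illustrated with a short sketch. Following the half-sine parameterization used by Anderson and colleagues (2016), a 50-msec bump spans five samples at the 100-Hz rate, and each sample is scaled per component by the estimated topology weights; the weight vector below is purely illustrative.

```python
import numpy as np

def bump_template(n_samples=5):
    """Half-sine sampled at the centers of `n_samples` bins,
    covering 50 msec at a 100-Hz sampling rate."""
    t = (np.arange(n_samples) + 0.5) / n_samples
    return np.sin(np.pi * t)

w = np.array([1.2, -0.4, 0.7])          # illustrative topology weights
bump = np.outer(bump_template(), w)     # samples x components
```

The resulting (samples × components) pattern is what the model matches against the trial-level signal when searching for stage onsets.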
Different stages can be associated with different “bump topologies,” yet each topology is assumed to be stable between participants and trials, as it reflects the same underlying cognitive process. The duration of cognitive stages demarcated by these bumps is assumed to vary on the trial level, and gamma distributions (with the shape parameter fixed at 2; Anderson et al., 2016) are used to capture this trial-level variability in stage duration. To discover the bumps and gamma distributions that explain most of the variance in the EEG data given a number of bumps, HsMM algorithms are used (Yu, 2010).
More specifically, fitting is performed on the trial-level EEG data and proceeds iteratively by means of an expectation maximization algorithm (Yu, 2010; Hastie, Tibshirani, & Friedman, 2009b): At every iteration, the first step is to compute, for every stage at every time-point and for every trial, the probability that a stage begins, given the current estimates for bump topologies and gamma distribution scale parameters (expectation). Given the (expected) onsets of every stage, the bump topologies and gamma distribution scale parameters can then be updated to offer the best match2 for the EEG across trials in the 50 msec after event onset and the duration between the stage onsets (maximization).
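Because the gamma shape is fixed at 2, the maximization step for a stage's scale parameter reduces to simple moment matching: the gamma mean equals shape × scale, so the updated scale is the mean expected duration divided by 2. A sketch under that assumption (the helper name is ours; the expected durations would come from the expectation step):

```python
import numpy as np

def update_scale(expected_durations, shape=2.0):
    """Maximum-likelihood gamma scale given expected stage
    durations (in samples) and a fixed shape parameter."""
    return float(np.mean(expected_durations)) / shape
```

For instance, expected durations of 4 and 6 samples yield an updated scale of (4 + 6) / 2 / 2 = 2.5.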
This iterative approach avoids having to identify bump topologies from averaged ERPs, which, as discussed in the Introduction section, could lead to distorted estimates (Borst & Anderson, 2015; Luck, 2014). Instead, the trial-level EEG is essentially first “aligned” to the estimated stage onsets before estimating the bump topology (i.e., the weights for the individual bumps) that best generalizes across all trials. Estimating bump topologies like this rests on the assumption that the topologies are stable and generalize across trials and participants. Notably, this does not imply that the same topologies are perfectly visible on all trials. Rather, after accounting for the bump topologies, no systematic information (e.g., stable between participant differences) should be contained in the EEG during that time frame.
To identify the best HsMM-MVPA model, Berberyan, van Rijn, et al. (2021) started with identifying an optimal “shared model,” which assumes that the number of states and the parameters in the model do not depend on word type. Therefore, we also first identified an optimal shared model. Because the bump width was assumed to be 50 msec, bumps cannot overlap, and because the fastest trials took slightly less than 350 msec, at most six bumps (6 × 50 msec = 300 msec) could be fitted for the shared model. Given that the first stage starts at stimulus onset and the last stage ends at the response, this results in seven stages. We started by fitting six models, ranging from one to six possible bumps, each based on the combined data from all word type conditions and differing only in the number of bumps.
A leave-one-subject-out cross-validation (LOOCV) procedure was used to identify the optimal number of bumps for the shared model. Specifically, for all possible numbers of bumps, 24 models were fitted to the data from all but one participant (i.e., the LOOCV “folds”; Hastie, Tibshirani, & Friedman, 2009a), so that every participant was left out once. Subsequently, 24 held-out log-likelihood scores could be obtained by evaluating the fitted models on the left-out participants. These scores offer insights into how well a given number of bumps would generalize beyond our sample of participants and are thus a suitable criterion to find the most complex model without “over-fitting” to the data collected (Hastie et al., 2009a).
For each of the 24 folds, we first fit a model with the maximum number of bumps, six in this case. This was followed by a back-fitting procedure: For a given number n of bumps, every bump was dropped separately, providing n potential starting points for the simpler n − 1 bump model, which were all fitted to convergence. The best of these n models was then selected and used in the same way described above to find starting estimates for the n − 2 bump model and so forth. We then used sign tests to judge whether a significant number of held-out participants improved in log-likelihood scores when using the n bump model instead of the n − 1 bump model (Berberyan, van Rijn, et al., 2021; Anderson et al., 2016), allowing us to find the maximum number of bumps that would still generalize (e.g., van Maanen et al., 2021).
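The model comparison can be sketched as a one-sided sign test over the held-out log-likelihood pairs: count how many of the 24 participants improve under the n-bump model, and compute the binomial probability of at least that many improvements under chance. The scores in the usage note and the function name are illustrative.

```python
from math import comb

def sign_test(ll_n, ll_n_minus_1):
    """One-sided sign test: does the n-bump model improve held-out
    log-likelihood for significantly many participants?"""
    wins = sum(a > b for a, b in zip(ll_n, ll_n_minus_1))
    n = len(ll_n)
    # P(X >= wins) for X ~ Binomial(n, .5)
    p = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, p
```

For example, if all 4 of 4 toy folds improve, the one-sided p-value is 1/16 = .0625; with 24 folds, correspondingly more wins are needed for significance.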
Next, we explored more complex models by (1) letting the gamma distribution's scale parameter—and therefore the stage duration—corresponding to an individual stage vary by word type, and (2) estimating both an individual bump topology and scale parameter per word type for a particular stage (cf. Anderson et al., 2016). The first case accounts for differences between word types in stage duration, whereas the second case accounts for cases in which the same processing stage is associated with a different activation pattern depending on word type. Substantial variation in both bump topology and scale parameter between word types could also indicate that an additional processing stage needs to be inserted for one or more word types. This third case could arise because different processes are involved in LDs for different word types, or, as discussed in the Introduction section, because a process causes multiple brief perturbations of the EEG signal for some but not all word types.
To test for these three cases, the complexity of the shared model was increased in a step-wise fashion. Specifically, varying scale parameters for a stage were considered before testing whether the corresponding bump topology should vary as well. If the bump topology preceding a stage was observed to differ between word types, the next set of comparisons involved models that featured an additional stage for a single word type only. The same LOOCV procedure outlined above was used to judge the necessity of these complexity increases.
GAMM Analysis of Stage Duration
From the final model, trial-level duration estimates were obtained for every processing stage, by calculating for every trial and every bump the weighted mean of the bump onset distribution (Anderson et al., 2016). Gaussian GAMMs (Wood, 2017; Hastie & Tibshirani, 1990), as available in mgcv (v. 1.9.0; Wood, 2011), were used to analyze how the log-transformed duration of each processing stage was affected by changes to the log-transformed Google frequency separately per word type. Stage duration and frequency variables were log-transformed to ensure that the resulting models would comply with the assumptions inherent to a normal model. Separate GAMMs were estimated per processing stage because we anticipated that the variance in (log-transformed) stage duration would vary drastically between stages,3 conflicting with the normal model's assumption of constant variance (Wood, 2017). This complicates an assessment of the support for the three-way interaction of processing stage, word type, and frequency. Instead, only the two-way interaction of word type and frequency was tested formally, separately for every processing stage.
To this end, the model for each processing stage was equipped with fixed by word-type smooths of frequency, essentially permitting a different effect of frequency on stage duration for every word type. Suitable nonlinear random effects were added to model per-participant deviations from the aforementioned fixed effects. An additional nonlinear random effect term was included to capture remaining per-trial fluctuations in stage duration.
Estimating the effect of frequency by means of a smooth implies that the effect itself can usually no longer be explained by a single coefficient (e.g., the slope of a line). Therefore, the estimated effects need to be visualized and interpreted to understand how frequency affects the duration of each processing stage and whether it does so differently between word types. However, a visual inspection alone cannot establish whether any differences between word types in the estimated effects of frequency are robust and likely to generalize.
To more formally test whether, for a given stage, the effect of frequency would be different between word types, we computed confidence intervals (CIs) around the estimated difference in the effect of frequency between any two word types (using the “get_difference” function from the itsadug R package; van Rij, Wieling, Baayen, & van Rijn, 2022). Specifically, separate model matrices (e.g., XW for words and XPW for pseudowords) were created for every word type. Every row in these matrices corresponds to a different frequency score. Each column corresponds either to a fixed or random effect in the stage-specific GAMM. Only columns corresponding to fixed effects for the word type of interest (i.e., the word-type-specific smooth of frequency) and the general intercept contained nonzero values. Columns associated with random effects or fixed effects for a different word type were zeroed. The difference in the estimated effect of frequency between two word types (e.g., μW−PW; the difference curve between words and pseudowords) was then obtained by evaluating μW−PW = (XW − XPW)β, where β contains all coefficients estimated by the model for a specific stage.
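The difference-curve computation μW−PW = (XW − XPW)β reduces to plain linear algebra once the model matrices are in hand, and can be sketched as below. Everything here is a toy stand-in: the real model uses penalized spline bases fitted by mgcv, so the basis, coefficients, and covariance matrix are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the stage-specific GAMM: one smooth of log
# frequency per word type, here a small polynomial basis.
freq = np.linspace(10, 20, 50)                 # log Google frequency grid
basis = np.column_stack([freq ** d for d in (1, 2, 3)])

n_coef = 1 + 2 * basis.shape[1]                # intercept + two smooths
XW = np.zeros((len(freq), n_coef))             # model matrix, words
XPW = np.zeros((len(freq), n_coef))            # model matrix, pseudowords
XW[:, 0] = XPW[:, 0] = 1.0                     # shared intercept
XW[:, 1:4] = basis                             # word-smooth columns only
XPW[:, 4:7] = basis                            # pseudoword-smooth columns only

beta = rng.normal(size=n_coef)                 # stand-in fitted coefficients
Vb = np.eye(n_coef) * 1e-4                     # stand-in covariance of beta

Xdiff = XW - XPW                               # shared intercept cancels
mu_diff = Xdiff @ beta                         # difference curve mu_{W-PW}
se = np.sqrt(np.einsum('ij,jk,ik->i', Xdiff, Vb, Xdiff))
zcrit = 1.96                                   # Bonferroni-adjusted in the paper
lower, upper = mu_diff - zcrit * se, mu_diff + zcrit * se

# Frequency range over which the CI excludes zero: evidence that the
# frequency effect differs between the two word types there.
excludes_zero = (lower > 0) | (upper < 0)
```

Because columns tied to the other word type (and to random effects) are zeroed in both matrices, shared terms such as the intercept drop out of Xdiff, leaving only the word-type-specific smooth contributions.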
Wood (2017, Section 6.10) shows how to compute CIs for a difference curve like μW−PW. Consulting, for example, the CI around μW−PW enables checking for evidence that the difference between words and nonwords in predicted stage duration is statistically nonzero for at least one frequency score. In that case, there would be significant evidence for the two-way interaction, because the effect of frequency would have to be different for both word types to result in a significant difference in the predicted stage duration over some frequency range.
An advantage CIs have over a model selection procedure based, for example, on Akaike's information criterion (AIC; for details on how the criterion behaves for GAMMs, see Wood, Pya, & Säfken, 2016) is that they indicate where the effect of frequency differs between any two types of words and not just whether it does so (e.g., Wood, 2017). However, instead of a single comparison per processing stage (two-way interaction model vs. model including only main effects), a CI needs to be computed for every pairwise combination of word types for every stage. Thus, the alpha level αi of every CI was Bonferroni corrected (αi = 0.05/k) for the total number k of intervals computed across stages (e.g., Section 2.6.8 in Wood, 2017).
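The correction is small but worth making concrete: with four word types (six pairwise difference curves) and six stages, k = 36 intervals, consistent with the αi ≈ 0.00138 reported in the Results. The corrected per-interval alpha translates into a noticeably wider critical value than the nominal 1.96:

```python
from scipy.stats import norm

n_stages, n_word_types = 6, 4
# Six pairwise word-type comparisons per stage, across six stages.
k = n_stages * (n_word_types * (n_word_types - 1) // 2)  # 36 intervals
alpha_i = 0.05 / k                                       # = 0.05/36, as in the Results
zcrit = norm.ppf(1 - alpha_i / 2)                        # CI half-width multiplier

# zcrit is roughly 3.2 rather than 1.96, so each corrected CI is
# considerably wider than an uncorrected 95% CI.
```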
RESULTS
Cognitive Processing Stages
The HsMM-MVPA analysis revealed that a shared six-stage model accounted best for the EEG data, in line with the results of Berberyan, van Rijn, and colleagues (2021; a visualization of the change in average held-out log-likelihood with an increasing number of processing stages is shown in Figure A1 in Appendix A). It performed better than a model with five stages for 19 out of 24 participants (p = .003), whereas a seven-stage model performed better only for seven participants (p = .988). Because the more complex shared seven-stage model improved the held-out likelihood only for seven participants, a nonsignificant number, the shared six-stage model was selected for further analysis.
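These participant-level comparisons follow a sign-test logic: the more complex model is adopted only if it improves the held-out likelihood for significantly more participants than expected by chance. A one-sided binomial test reproduces the reported p-values (a sketch; that this is the exact test implementation used is an assumption).

```python
from scipy.stats import binomtest

def sign_test(n_better, n_total=24):
    """One-sided binomial test: do more participants favor the more
    complex model than expected under chance (p = .5)?"""
    return binomtest(n_better, n_total, p=0.5, alternative='greater').pvalue

# Reproduces the reported values, e.g., 19/24 -> p = .003 (six vs.
# five stages) and 18/24 -> p = .011 (extra pseudoword stage).
print(round(sign_test(19), 3), round(sign_test(18), 3))
```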
The first follow-up model comparison revealed significant differences between word types in the duration of the fifth stage: Estimating different gamma parameters per word type for Stage 5 significantly improved the model (better for 23 participants, p < .001). The second comparison revealed significant differences between word types in the duration of the fourth stage as well: Additionally varying gamma parameters for Stage 4 would further improve the model (better for 19 participants, p = .003). The third comparison revealed that the bump topologies preceding the fifth stage also differed significantly between word types: Additionally varying the bump preceding Stage 5 would again significantly improve the model (better for 20 participants, p < .001). As discussed in the Methods section, the next set of comparisons included models with an extra stage inserted before Stage 5. This fourth comparison revealed a significant improvement in fit for a majority of participants when inserting an extra stage for pseudowords (better for 18 participants, p = .011) but not for words (better for five participants, p = .999) or random strings (better for 12 participants, p = .581). The fifth comparison revealed no evidence that additional gamma parameters should be allowed to vary. However, a sixth comparison revealed that the bump topologies preceding the fourth stage now also differed significantly between word types (better for 17 participants, p = .032). No further increases in complexity were supported by the data.
The average stage duration estimates and bump topologies of the final model are shown in Figure 2. In this final model, six stages were shared between word types. For pseudowords, an extra stage was visible between the shared fourth and fifth stages. An inspection of the estimated bump topologies (see the top in Figure 2) revealed that the topology preceding this extra stage closely matched the topologies preceding the shared fifth stage but not those preceding the shared fourth stage. This could indicate that the split of the shared fifth stage for pseudowords, into one part that is exclusive to pseudowords (Substage 5a) and one part that is shared with the remaining word types (Substage 5b), results from the EEG topography changing during the fifth stage. The only way for a HsMM-MVPA model to account for this would be to insert an extra stage. Rather than indicating the necessity of an additional LD process for pseudowords, this would suggest that the same process driving the shared fifth stage was associated with two brief perturbations of the EEG signal for pseudowords only.
Hence, the trial-level duration estimates from the two Substages 5a and 5b were summed for pseudowords for the subsequent analysis, so that the same set of six stages could be analyzed across word types. The bottom in Figure 2 shows how much each substage contributes, on average, to the duration of the (combined) fifth stage for pseudowords.
The effect of word type was greatest on Stage 5, reflecting typical RT effects (i.e., longest stage duration for pseudowords and shortest for random strings; see Figure 2). In addition, Stage 4 was completed faster for words and pseudowords than for random strings. Although similar differences in the duration of the fifth stage were observed before, differences in the duration of the fourth stage were not observed previously (Berberyan, van Rijn, et al., 2021).
Frequency and Word-type Effects
In addition to these average durations, the HsMM-MVPA analysis also provides trial-by-trial stage duration estimates, allowing us to investigate the effect of frequency and word type on stage duration in detail. To this end, we used GAMMs (Wood, 2017) because they can reveal both linear and nonlinear effects. As the fixed predictor structure for log-transformed stage durations, we included a potentially nonlinear interaction between log-transformed frequency and word type. A separate GAMM was estimated for each processing stage. After inspecting the estimated effects of word frequency on stage duration, we discovered qualitatively different frequency effects for words exclusive to the Dutch language and foreign words (n = 69), which include different types of valid yet atypical Dutch words (e.g., cognates, loan words, false friends such as “pausen” or “genre”). Hence, for the trial-level analysis, we decided to distinguish between these two sets of words (the original GAMM analysis results are visualized in Figure B1 of Appendix B). The GAMM analysis results are visualized in Figure 3.
Specifically, Figure 3 depicts the estimated effect of frequency on the duration of a processing stage for words, foreign words, pseudowords, and random strings, separately for each processing stage. The solid colored lines reflect the estimated effect of frequency as obtained from the stage-specific GAMM. The shaded areas correspond to the 95% CIs around the estimated effect.4
The alpha level for the Bonferroni-corrected difference CIs, used for inference about the two-way interactions, was set to 0.00138 ≈ 0.05/36. The labeled horizontal black bars indicate the frequency range for which the difference in estimated stage duration between two word types is significantly different from zero according to these Bonferroni-corrected difference CIs. Because at least one difference CI does not contain zero over some frequency range in every processing stage, all stage-specific two-way interactions of frequency and word type are significant at an overall confidence level of 0.95.5 This does not imply that word type and frequency impact the duration of every processing stage in the same way. Therefore, we will now discuss the stage and word-type-specific effects of frequency.
In Stage 1 (0–70 msec), a subtle yet reliable increase in stage duration with frequency can be observed for pseudowords and random strings. In contrast, for both words and foreign words, stage duration generally decreased with frequency. However, for the most frequent words, this trend appeared to reverse, resulting in nonlinear estimates for both words and foreign words. In addition, across stages, the difference CIs for words and foreign words do not contain zero over almost the entire frequency range applicable to both word types, indicating that processing generally took longer for foreign words than words. The difference CIs further suggest the effect of frequency on stage duration to be significantly different for words compared with both nonwords over a small range of frequency scores. These results imply an interaction effect of word type and frequency already on the earliest processes in LD.
In Stages 2 (70–150 msec) and 3 (150–240 msec), a similar pattern is visible. However, contrary to the positive effect of frequency for random strings and the negative effect for words, which both remain visible in these two stages, the positive effect of frequency for pseudowords appears to be diminished. For words, the negative effect of frequency becomes almost linear. For random strings, the effect of frequency remains positive but becomes more nonlinear. In both stages, the difference CIs suggest an interaction effect between word type and frequency over a slightly larger range of frequency scores compared with the first stage.
Completing Stage 4 (240 to ∼360 msec) took significantly less time for all word types compared with random strings, independent of frequency scores. However, there still is an interaction effect present with positive effects of frequency for random strings but negative effects for words, foreign words, and pseudowords. Intriguingly, the clear separation in durations between random strings and the remaining word types indicates that, at the trial-level, the former can be distinguished reliably from the latter purely based on the information available from these early stages.
In Stage 5 (∼360 to 525/700/590 msec, Random Strings/Pseudowords/Words), a complex and nonlinear interaction between frequency and word type can be observed. Completing this stage for random strings took significantly less time than for pseudowords, independent of frequency scores. Generally, completing the stage took significantly less time for random strings compared with words as well. However, because of the robust positive effect of frequency for random strings, the most frequent ones become comparable to words (corresponding difference CI contains zero for log-frequency scores slightly higher than 15). Compared with random strings, the effect of frequency is more nonlinear for words, foreign words, and pseudowords. For foreign words, the strong nonlinearity might partially result from the comparatively small selection of foreign words, particularly in the lower frequency range. In addition, there is overlap in stage durations between infrequent words (both Dutch and foreign) and similarly frequent pseudowords. However, the difference CI for words and pseudowords continues to exclude zero in this lower frequency range, indicating that words are processed faster than pseudowords in this stage independent of frequency scores.
The GAMM for Stage 5 was also fitted to the individual duration of the two substages for pseudowords,6 and the resulting effects are visualized in Figure 4. The right column in Figure 4 reveals that the shape of the estimated effect of frequency for pseudowords obtained for the combined stage durations (i.e., the effect shown in Figure 3) can largely be attributed to Substage 5b, because the shape of the frequency effect obtained for just this substage is qualitatively very similar (see the dotted line). Substage 5a was shorter than 5b (the average duration similar to Stage 4; see Figure 2), and the trial-level analysis reveals that there was virtually no variability in the duration of this substage for different frequency scores (see dashed line in the right column in Figure 4). As a result, 5a mainly contributes an offset shift to the effect visible for the combined stage durations. This lack of variability in the duration of Substage 5a (see also the small standard error in Figure 2) further suggests that the support for Substage 5a mainly stems from the extra bump topology it adds to the model and that it is thus unlikely to reflect an extra LD process exclusive to pseudowords.
The pattern in Stage 6 is qualitatively similar to the pattern visible in Stage 5. However, contrary to Stage 5, there is considerably more overlap between random strings and pseudowords in stage completion times across frequency scores. Generally, the overlap between word types is more considerable in this stage.
Follow-up Analyses
RT Analysis
A complex nonlinear interaction of frequency and word type was visible in the fifth stage (see Figure 3). In addition, the pattern visible in the average duration of this fifth stage (see Figure 2) matches the pattern visible in the fifth stage recovered by Berberyan, van Rijn, et al. (2021). The authors previously interpreted this stage as the main decision stage, partially because the pattern visible in the average duration of the stage matched the typical effects of word type and frequency on RT data. This supports the interpretation of the fifth stage as a decision stage because it has previously been argued extensively that RT data almost exclusively carry information about the timing of the decision process underlying LDs (e.g., Wagenmakers et al., 2008; Ratcliff et al., 2004).
Whereas Berberyan, van Rijn, et al. (2021) treated frequency as a categorical variable, our study enables comparing the continuous and potentially nonlinear interaction effect of word type and frequency on RTs and the duration of Stage 5. Hence, the same model fitted to the log-transformed stage duration values was fitted to the log-transformed RT values. The estimated effects are visualized in Figure 4.
Evidently, the effects of frequency on RTs (see left column in Figure 4) are qualitatively very similar to the effects on the duration of Stage 5 (see central column in Figure 4). The predicted RT and duration scores for the fifth stage are also highly correlated (r = .965). The only stage that showed frequency effects similar to those visible in Stage 5 was Stage 6 (see Figure 3). However, there was considerably more overlap between word types in the effect of frequency in Stage 6 compared with Stage 5 or the RT data. In addition, the correlation between the predicted RT and duration scores for the sixth stage is lower (r = .836). These results align with the observations made by Berberyan, van Rijn, et al. (2021), suggesting that the nonlinear interaction effect of frequency and word type on RT data can largely be attributed to the fifth processing stage involved in LDs.
Importantly, for pseudowords, the similarity between the estimated effects of frequency on RTs and the duration of Stage 5 is reduced when considering Substages 5a and 5b alone. Although the frequency effect for pseudowords in Substage 5b is qualitatively similar to the effect visible in the combined stage (cf. the right column of Figure 4), again resulting in a high correlation with the predicted RT scores (r = .941), it is precisely the aforementioned offset shift contributed by Substage 5a that ensures that the overlap in the frequency effects on the duration of the combined fifth stage matches the overlap visible in the frequency effects on the RTs. Combined with the aforementioned evidence suggesting that RT effects can be attributed to a single process, this again supports the earlier interpretation that the presence of Substage 5a reflects an extra EEG perturbation caused by the same process driving the shared fifth stage, rather than an extra process involved in LDs for pseudowords.
Google Frequency Analysis
The previous analysis revealed strikingly different effects of the Google frequency measure not just per processing stage, but also per word type. To better understand which information this measure captures for the different types of stimuli, the Google frequency scores were compared with other common linguistic predictors. Figure 5 depicts density estimates for the Google frequency and OLD20 scores for the words and nonwords used in this experiment. Evidently, the Google frequency densities for the different word types overlap, but according to the measure, words are generally more frequent than pseudowords, which are generally more frequent than random strings (see the top in Figure 5).
This pattern supports the claim that the measure might capture changes on a shared criterion, applicable to both words and nonwords. In contrast, the OLD20 measure seems to mainly discriminate between random strings and the other word types. Specifically, the distributions for words and pseudowords overlap considerably, suggesting that the pseudowords used in the experiment are orthographically very word-like (see the bottom in Figure 5). In comparison, random strings are generally associated with higher OLD20 scores. Notably, compared with words, foreign words also appear to have higher OLD20 scores, suggesting that their make-up is less typical for Dutch words (in the DLP).
For the words included in the experiment, the Google frequency measure also shows strong positive correlations (see Table 1) with the SUBTLEX (Keuleers, Brysbaert, & New, 2010) and CELEX (Baayen et al., 1995) frequency measures. This suggests that, at least for words, the Google frequency scores still capture classical frequency information. For foreign words, the correlations with these classical frequency measures are lower and closer in magnitude to the correlations visible for pseudowords. For random strings, the Google frequency scores do not correlate strongly with any of the classical frequency measures (i.e., the SUBTLEX or CELEX frequency scores of their base words). However, random strings alone show a strong negative correlation between the Google frequency measure and the OLD20 scores (i.e., random strings that are orthographically more similar to real words are generally associated with a higher Google frequency score).
Table 1. Correlations between the Google frequency measure, the classical frequency measures (SUBTLEX, CELEX), and the OLD20 scores, per word type.

|         | Random Strings |         |        | Pseudowords |         |        |
|---------|----------------|---------|--------|-------------|---------|--------|
|         | Google         | Subtlex | Celex  | Google      | Subtlex | Celex  |
| Subtlex | 0.071          |         |        | 0.253       |         |        |
| Celex   | 0.074          | 0.829   |        | 0.235       | 0.827   |        |
| OLD20   | −0.533         | −0.111  | −0.183 | 0.010       | −0.245  | −0.239 |

|         | Foreign Words |         |        | Words  |         |        |
|---------|---------------|---------|--------|--------|---------|--------|
|         | Google        | Subtlex | Celex  | Google | Subtlex | Celex  |
| Subtlex | 0.307         |         |        | 0.660  |         |        |
| Celex   | 0.068         | 0.777   |        | 0.663  | 0.853   |        |
| OLD20   | 0.065         | −0.317  | −0.278 | −0.096 | −0.133  | −0.126 |
The OLD20 measure (Yarkoni et al., 2008) was computed for each stimulus based on the closest 20 words out of all words in the DLP corpus (Keuleers, Diependaele, & Brysbaert, 2010). For pseudowords and random strings, the SUBTLEX (Keuleers, Brysbaert, & New, 2010) and CELEX (Baayen et al., 1995) frequencies of the real word (base word) from which they were derived were used to compute the correlations.
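OLD20 is defined as the mean Levenshtein (edit) distance from a string to its 20 closest orthographic neighbors in a reference lexicon (Yarkoni et al., 2008). A minimal sketch, using a plain dynamic-programming edit distance and a toy lexicon rather than the DLP corpus:

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def old20(target, lexicon, n=20):
    """Mean edit distance to the n closest words in the lexicon."""
    dists = sorted(levenshtein(target, w) for w in lexicon if w != target)
    return sum(dists[:n]) / min(n, len(dists))
```

Lower scores thus indicate a denser orthographic neighborhood, which is why word-like pseudowords pattern with words on this measure while random strings receive higher scores.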
These results indicate that Google frequency scores reflect different information for different word types, which is necessary for them to be a suitable and meaningful proxy of the holistic word-likeness of our stimuli. Specifically, for random strings, word-likeness appears to be captured mainly by information about their orthographic plausibility (e.g., as measured by the OLD20) rather than by base-word frequency information. A more word-like random string might thus look like a typographical error (e.g., “tajel”). For words and pseudowords, the distribution of the OLD20 measure (see Figure 5) suggests minor differences in their orthographic plausibility. Therefore, differences in word-likeness for words and pseudowords might be better captured by (base) word exposure or frequency information, which is further supported by the pattern visible in the correlations for words and pseudowords. The correlations between these “defining” predictors and the Google frequency measure are visualized in Figure 6.
SUBTLEX Stage Duration Analysis
Across stages, nonlinear frequency effects can be observed. To test whether the nonlinearity for words (cf. Stages 1, 5, and 6 in Figure 3) is contingent on the Google frequency measure, the analysis was completed again using the more conventional SUBTLEX frequency measure (Keuleers, Brysbaert, & New, 2010). When relying on the SUBTLEX frequency measure, the effect of frequency on stage duration becomes negative and approximately linear across all stages (see Figure 7; generally matching the previous results in Figure 3).
This could indicate that the nonlinear Google frequency effect for words still captures frequency effects of separate word clusters. As mentioned before, the Google frequency scores appear to discriminate between at least two clusters of words: foreign and Dutch-exclusive words. The Google result counts of foreign words (e.g., “coupon”) are likely inflated, because these words also appear in resources written in other languages. This is also reflected in their SUBTLEX frequency scores, which shift toward the lower end of the frequency spectrum (see Figure 7). Because the nonlinearity in the Google frequency effect for words is mainly localized in the frequency range marking this transition to foreign words, it might result from the Dutch words in that range being more similar to foreign words (i.e., they should be allocated to the foreign word set for analysis purposes).
ERP Analysis
The follow-up analysis using the SUBTLEX frequency measure still revealed word frequency effects on the earliest processing stages, indicating that these are unlikely to be an artifact of the Google frequency measure. To ensure that the earliest frequency effects are not a result of using the HsMM-MVPA method either, the early ERP waveforms were inspected as well. ERP waveforms over the first 300 msec after stimulus onset were computed for each of four Google frequency quantiles per word type using the preprocessed EEG data before the down-sampling was applied for the HsMM-MVPA analysis. Only the first 300 msec of the ERPs were visualized because this time range coincides with the average duration of the first stages showing the early frequency effects of interest here. In addition, because the early stage topologies recovered by the HsMM-MVPA method suggest mainly posterior activation patterns (see Figure 2), the results are presented for two parietal/occipital channels (Pz and Oz; see Figure 8).
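The quantile-based ERP computation amounts to binning trials by frequency and averaging their epochs per bin. A hedged sketch for a single channel (array shapes and names are assumptions; in the study this would be applied separately per word type):

```python
import numpy as np
import pandas as pd

def erp_by_quantile(epochs, log_freq, n_quantiles=4):
    """Average waveform per frequency quantile for one channel.

    epochs:   (n_trials, n_samples) array of baseline-corrected epochs
              for a single channel (e.g., Pz); shapes are hypothetical.
    log_freq: (n_trials,) log Google frequency scores.
    Returns a dict mapping quantile index -> mean ERP waveform.
    """
    labels = pd.qcut(log_freq, n_quantiles, labels=False)
    return {q: epochs[labels == q].mean(axis=0) for q in range(n_quantiles)}
```

Equal-count quantile bins (rather than equal-width frequency bins) ensure each ERP averages over a comparable number of trials, keeping the noise level similar across the four waveforms.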
The early ERPs reveal subtle amplitude differences for different frequency quantiles already in the first 0–100 msec after stimulus onset. From the waveforms, it is not immediately clear whether these differences are a result of the frequency manipulation or of the ongoing oscillations in the signal, which are present throughout the baseline period and visible until approximately 100 msec after stimulus onset (see Figure 8). This problem does not arise with the HsMM-MVPA analysis. Compared with the stage duration estimates obtained from the latter, any early effect of frequency appears to be more attenuated in the waveforms. However, the amplitude differences appear to be slightly more pronounced for nonwords, which would support the early interaction effects between word type and frequency visible in the first three processing stages.
DISCUSSION
Our results paint a much more detailed and intricate picture of the processes involved in LDs and the roles played by stimulus properties throughout them than previous EEG studies (e.g., Meade et al., 2019; Segalowitz & Zheng, 2009; Hauk, Patterson, et al., 2006; Barber et al., 2004; Holcomb et al., 2002). The model revealed that six processing stages need to be completed for both words and nonwords (in line with the findings by Berberyan, van Rijn, et al., 2021), adding further evidence in favor of a unified system enabling word and nonword processing (e.g., Yap et al., 2015). Contrary to previous work by Berberyan, van Rijn, et al. (2021), our results suggested that the fifth processing stage was associated with two brief perturbations of the EEG signal for pseudowords only, reflected by the presence of two substages (5a and 5b). In addition, both Google frequency and word type affected the duration of every processing stage, with the nature of the effect varying between stages. The qualitatively different frequency effects for words, pseudowords, and random strings in particular suggest that all of the processes enabling LDs are sensitive to differences between words and nonwords.
Early Visual Processing
Across the first three stages, more frequent words (Dutch and foreign) were generally processed faster (cf. the word frequency effect in the first stage; Figure 3). Their early onset and posterior activation patterns, visible in the recovered topologies (see Figure 2), suggest that these first three stages reflect visual processing and orthographic encoding (e.g., Berberyan, van Rijn, et al., 2021; Araújo et al., 2015; Segalowitz & Zheng, 2009; Hauk, Patterson, et al., 2006). The frequency effects on these early processes highlight a relation to different levels of stimulus representation: Compared with infrequent stimuli, a frequent word is not just more prevalent but also has a much more recognizable and familiar character string (e.g., Seidenberg & McClelland, 1989). Early distributed orthographic word representations can be expected to become activated more rapidly for these more recognizable character strings (Andrews, 1989; Seidenberg & McClelland, 1989; McClelland & Rumelhart, 1985), which could explain the aforementioned word frequency effect. The prolonged processing of foreign words in these stages could similarly result from their atypical (for Dutch words) character strings (see Figure 5 and Hauk, Patterson, et al., 2006).
For nonwords, the frequency effect in the first three stages was reversed: Frequent nonwords were associated with longer stage durations. The effect was generally more pronounced for random strings than pseudowords (see Figure 3). As mentioned in the Introduction section, nonword frequency effects might similarly be driven by string familiarity or “word-likeness” effects (Yap et al., 2015; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984) and we suggested that the Google frequency measure might indeed be a good proxy of the latter. However, if the function of these first three stages is visual and orthographic processing, this reversal might appear to contradict this interpretation: The character strings of more frequent nonwords (pseudowords in particular) should be more recognizable and could thus be expected to facilitate processing, comparable to words. Indeed, Hauk, Patterson, and colleagues (2006) interpreted lower EEG amplitudes for more “typical” pseudowords (measured by higher syllable frequency) in a similarly early time window (around 100 msec post stimulus onset) as evidence for such facilitated processing. However, the direction of these early amplitude differences appears to vary across studies. For example, Segalowitz and Zheng (2009) suggested that higher amplitudes in the same time frame of 100–150 msec in posterior regions for more “typical stimuli” could be attributed to facilitated orthographic processing as suggested by Hauk, Patterson, and colleagues (2006). Our ERPs hinted at lower EEG amplitudes from 100 to 150 msec post stimulus onset for more frequent pseudowords in the parietal and occipital electrodes (see Figure 8), which could reflect a similarly facilitatory effect of frequency.
Because it cannot be ruled out that these amplitude differences are a result of ongoing oscillations in the EEG signal and because ERPs do not account for the trial-level variation in the onset of the processing stages, these amplitude differences might, however, not be related to the prolonged stage durations (Borst & Anderson, 2024; Luck, 2014). Alternatively, this mismatch between the effects of frequency on stage duration and EEG amplitude could indicate that both measures capture different aspects of processing. Specifically, the early reduced EEG amplitudes observed for more frequent (i.e., word-like) or “typical” (Hauk, Patterson, et al., 2006) nonwords might only reflect that these strings are more in line with what the language system considers orthographically probable input. In this specific sense, the corresponding strings would still be easier to process as put forward by Hauk, Patterson, and colleagues (2006).
As for the positive effect of frequency on stage duration for nonwords, we hypothesize that more word-like nonwords might still activate orthographic representations that are similar to multiple words (e.g., Hendrix & Sun, 2021; Seidenberg & McClelland, 1989; McClelland & Rumelhart, 1985). Both the dual route cascade and multiple read out models would predict prolonged orthographic processing as a result of such “nonspecific” orthographic activation (Coltheart et al., 2001; Grainger & Jacobs, 1996): Although more frequent nonwords might initially be considered to be plausible word input, possibly reflected in lowered EEG amplitudes (Hauk, Patterson, et al., 2006), their activation pattern might remain nonspecific enough to prolong processing.
Compared with the very similar effects of word type and frequency across the first three stages, the effects visible in Stages 4 and 5 are notably different. Berberyan, van Rijn, et al. (2021) attributed a familiarity process to the fourth and a decision process (e.g., Wagenmakers et al., 2008; Ratcliff et al., 2004; Balota & Spieler, 1999; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984) to the fifth processing stage, based mainly on the evidence that frequency and word type effects were limited to the later stage. However, in our study, the effects of frequency and word type are no longer limited to the fifth stage. Instead, at least random strings are clearly distinguishable from the remaining word types based on the duration of the fourth stage already (Figure 3).
Orthographic Familiarity Assessment
This clear distinction between random strings and the remaining word types suggests that at the end of Stage 4, all word types except random strings have activated orthographically word-like representations (Plaut, 1997; Seidenberg & McClelland, 1989). This would suggest that although initial visual processing is prolonged for both more frequent pseudowords and random strings (see frequency effects in Stages 1–3), random strings ultimately fail to result in a distributed activation pattern matching the expectation for a word-like character string (Seidenberg & McClelland, 1989; McClelland & Rumelhart, 1985).
Intriguingly, the topologies preceding the fourth stage match the “mid frontal old/new” ERP pattern, previously linked to familiarity processes (e.g., Rugg & Curran, 2007), which further indicates that during Stage 4, the representations activated by a presented stimulus become either orthographically “familiar” or not. The N170 ERP component, which similarly has been argued to reflect early processes trained specifically on character strings, also has a topography comparable to the topologies preceding Stage 4 (e.g., Mahé et al., 2012; Maurer, Brandeis, et al., 2005; Maurer, Brem, et al., 2005). However, although the model selection procedure revealed the topology preceding the fourth stage to differ between word types, with stronger positive amplitudes over a larger frontal region for pseudowords and random strings (see Figure 2), the differences between word types are more gradual compared with the clear pattern visible in the duration of the stage.
At first glance, the topologies and pattern visible in durations might thus appear to be in conflict. However, the same early processing stages were completed for words and nonwords, which suggests that the same processes have been involved in word and nonword processing so far. This could explain why the bump topologies preceding Stage 4 are so similar across word types. Intriguingly, it then appears to be the timing of these early and specialized processes (e.g., those attributed to the N170; Maurer, Brandeis, et al., 2005; Maurer, Brem, et al., 2005), as revealed by the HsMM-MVPA method, that starts to differ more drastically between word types, thereby providing more insights into the state of word and nonword processing during Stage 4 than an analysis of topographies alone.
Decision Process
Although a discrimination between random strings and word-like stimuli appears to be possible after Stage 4, pseudowords remain nearly indistinguishable from words at this point (except for infrequent pseudowords; see Figure 3). Given that this experiment featured both random strings and pseudowords, all presumably differing in their word-likeness, it would also be unlikely that orthographic information alone, presumably available at this point, would result in representations sufficiently distinct for discrimination between words and nonwords (Ratcliff et al., 2004; Plaut, 1997; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984). This necessitates an elaborate decision process, likely taking into account additional sources of information, to reach the final LD (Berberyan, van Rijn, et al., 2021; Wagenmakers et al., 2008; Ratcliff et al., 2004; Balota & Spieler, 1999; Plaut, 1997; Balota & Chumbley, 1984).
In our study, the fifth stage was completed much faster for random strings than for words and pseudowords. In addition, completing this stage took the longest for pseudowords (see Figure 2), and there was overlap in stage completion times for infrequent foreign words and similarly frequent pseudowords (see Figure 3). The complex pattern visible in this stage matches an elaborate decision process as outlined above, one in which the information available from the representations activated by the different stimuli is integrated and evaluated (the latter is likely a noisy process; Ratcliff et al., 2004). A decision-process account of Stage 5 is further supported by the fact that the effects visible in this stage mirrored the effects of frequency and word type on the RT data (see Figure 4). This aligns with previous proposals that patterns in the RT data can be traced back almost exclusively to a single decision process (e.g., Wagenmakers et al., 2008; Ratcliff et al., 2004).
To see why different information has to be integrated (potentially into a single decision criterion; Ratcliff et al., 2004; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984), consider the example of infrequent words: The information available from the first four stages about the orthographic plausibility of the word alone does not indicate a word response (stage duration generally similar to pseudowords), so additional information (phonological or potentially even semantic; Plaut, 1997; Seidenberg & McClelland, 1989) has to be taken into account as well (Seidenberg & McClelland, 1989; Balota & Chumbley, 1984). This additional information might already start to become available during Stage 4, when the orthographic representations activated for different stimuli in turn start activating higher level representations (e.g., “semantic patterns”; Plaut, 1997; Seidenberg & McClelland, 1989).
Such an activation of higher level representations could also offer an account for the intriguing nonword frequency effect visible in this stage: Stage 5 took longer to complete for more frequent random strings but was completed slightly faster for more frequent pseudowords. This pattern is similar to the base-word frequency effects reported by Yap and colleagues (2015), who observed shorter RTs for nonwords with a higher average base-word frequency. Importantly, they found the effect was even more pronounced for the set of difficult nonwords (i.e., those eliciting higher RTs in general). They suggested that more difficult nonwords might be subjected to a “verification process” (Paap et al., 1982) in which they end up being compared with the representations of their more frequent base words. Because these base words are more frequent, their representations are assumed to be more robust so that the comparisons result in greater mismatch, which could enable faster rejections (Yap et al., 2015; McClelland & Rumelhart, 1985; Paap et al., 1982).
Similarly, Mahé, Grisetto, Macchi, Javourey-Drevet, and Roger (2024) recently suggested an “internal monitoring” system, which fulfills a comparable role, and explained how such a system might develop. Specifically, they suggest that “performance or error monitoring” might play an important role during reading acquisition: Early and external reading performance feedback will act like training data for an internal monitoring system, resulting in an increasingly refined internal representation of what constitutes the “correct reading” for different words. Once this system is sufficiently trained, they argue, the “error score” obtained from monitoring the difference between the representation activated by reading a word and the refined internal representation could act as additional internal feedback and prompt alternative strategies to correct the mismatch. The error score discussed by Mahé and colleagues (2024) takes on a role similar to the mismatch value proposed by Paap and colleagues (1982).
We hypothesize that such a mismatch value or error score, potentially computed from representations that start to become activated by the end of Stage 4, might be available as an additional source of information for the proposed decision process in this fifth stage. Because of their more plausible make-up, pseudowords can be expected to generally be more word-like than random character strings (see Figure 5 and the effects visible in Stage 4). Hence, it can be expected that evaluating the information available for them might take more time, on average, which could be reflected in the longer durations of Stage 5 observed here. At the same time, more frequent pseudowords, but not random strings, might be so word-like that they produce greater mismatch values resulting in increased evidence for a nonword response and thus ultimately faster decisions (cf. Yap et al., 2015; Ratcliff et al., 2004; Paap et al., 1982).
The relevance of this error score for the pseudoword decision process in particular could also be the reason for the observed split of their fifth stage. As discussed briefly in the Results section, the similarity of the topology preceding Substage 5a (see Figure 2), the absence of strong trial-level variability or a frequency effect on the duration of this substage (see Figures 2 and 4), and the observation that the effect of frequency on RTs best matches the effect of frequency on the combined duration of Substages 5a and 5b (see Figure 4) all suggest that the extra stage for pseudowords was inserted only because the process attributed to Stage 5 (i.e., the decision process) caused two brief perturbations of the EEG signal for pseudowords. The greater magnitude of the error score for pseudowords could be the source of such a perturbation of the EEG signal. Specifically, although such an error score would presumably be computed for all word types, we hypothesized that pseudowords would result in particularly large errors, which would likely be reflected in the EEG (e.g., Gehring, Goss, Coles, Meyer, & Donchin, 2018). The onset of this extra stage for pseudowords (i.e., Substage 5a), right after the shared fourth stage, would also match the hypothesized moment where the representations involved in the computation of this mismatch value become activated.
The presence of both an orthographic familiarity and a decision stage also aligns with earlier dual-stage decision models of LDs that proposed an initial judgment based on familiarity information followed by a more elaborate assessment (e.g., Balota & Spieler, 1999; Balota & Chumbley, 1984). However, although early models assumed the subsequent assessment to be contingent on a failure of the familiarity assessment, our results suggest that both processes need to be completed to reach conclusive LDs. A similar combination of familiarity and subsequent decision processes has been observed in associative recognition (i.e., deciding whether a pair of words has previously been presented together or not; Borst, Ghuman, & Anderson, 2016; Borst & Anderson, 2015), suggesting that the same processes might be shared across related tasks. Although the familiarity process in associative recognition starts earlier (180–280 msec; Borst & Anderson, 2015), the EEG activation patterns recovered for both processes largely match the topologies we observed here, adding further support to the claim that these might be similar processes.
Response-related Processing
The last stage recovered by the HsMM-MVPA method has been interpreted as response related, involving the issuing of motor commands to press the response button (Berberyan, van Rijn, et al., 2021). The effects in this stage are qualitatively similar to the pattern visible in the fifth stage, and the duration estimates for Stages 5 and 6 correlate moderately (r = .68, .57, .57 for Random Strings/Pseudowords/Words), suggesting that the time until the motor stage commences is predictive of the time it takes to provide the LD response. Similar effects have again been observed in associative recognition tasks (Borst & Anderson, 2015) and have been argued to reflect a relationship between decision confidence and the time it takes to indicate the response (Wixted & Stretch, 2004; Ratcliff & Murdock, 1976). In line with this, the confidence in the LD, supposedly established during the fifth stage, would then partially determine the speed of the final response.
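Correlations of this kind are straightforward to compute from the trial-level duration estimates produced by the HsMM-MVPA method. The following is a minimal sketch with synthetic durations standing in for the actual estimates (all variable names are hypothetical; the actual values do not reproduce the correlations reported above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic trial-level duration estimates (msec) standing in for the
# HsMM-MVPA output; stage 6 durations are generated to covary with stage 5.
n_trials = 500
stage5 = rng.gamma(shape=4.0, scale=50.0, size=n_trials)
stage6 = 0.5 * stage5 + rng.normal(0.0, 40.0, size=n_trials)

# Pearson correlation between the two stages' durations; in the actual
# analysis, this is computed separately per word type.
r = np.corrcoef(stage5, stage6)[0, 1]
print(round(r, 2))
```

Because such correlations are available per trial rather than per averaged ERP, they can be inspected separately for each word type, as done in the article.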
Testing the Three-way Interaction
The previous discussion highlighted different effects of frequency across processing stages. Although the visualization of the estimated frequency effects (Figure 3) indeed suggests differences in the effect of frequency not just between word types but also across processing stages, no formal test of this possible three-way interaction was conducted. In principle, a shared three-way interaction model, fitted to the data from all processing stages, could have been compared with a nested simpler model to check whether the three-way interaction offers a better account of the data than a model containing only the three two-way interactions (e.g., Wood, 2017). Indeed, an AIC comparison of these two models would suggest that the model including the three-way interaction offers superior fit for the data (AIC difference = 25.470). Inspecting the residuals of such a shared model, however, confirms that their variance fluctuates substantially between processing stages, indicating a violation of the constant variance assumption applicable to normal GAMMs (Wood, 2017). Thus, the AIC comparison involves obviously misspecified models, which is clearly undesirable. Here, we instead opted to fit and present stage-specific GAMMs, but future research might want to consider estimating a Gaussian location and scale model as discussed by Wood and colleagues (2016): a shared model that allows not just the mean but also the variance of the observation-specific normal distribution to vary as a function of predictor variables like processing stage or frequency. Such a model is arguably more complex, both conceptually and computationally, but would pool all data into a single model and might be better suited to test three-way interactions like the one between processing stage, word type, and frequency.
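The logic of the AIC comparison described above can be illustrated with a simplified stand-in: ordinary linear models fitted to synthetic data in place of the GAMMs used in the article (the data, effect sizes, and coding below are all invented for illustration, not taken from the actual analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the stage-duration data (one row per trial x stage).
n = 2000
freq = rng.normal(size=n)
is_s5 = rng.integers(0, 2, size=n).astype(float)   # stage indicator
wtype = rng.integers(0, 3, size=n)                 # three word types
is_pseudo = (wtype == 1).astype(float)
is_random = (wtype == 2).astype(float)

# Durations generated WITH a three-way interaction (freq x stage x word type).
dur = 100 + 10 * freq * is_s5 * is_pseudo + rng.normal(0, 20, size=n)

def aic(X, y):
    """Gaussian AIC of an ordinary least-squares fit: n*log(RSS/n) + 2k."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    k = X.shape[1] + 1  # coefficients plus the error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

ones = np.ones(n)
main = [ones, freq, is_s5, is_pseudo, is_random]
two_way = [freq * is_s5, freq * is_pseudo, freq * is_random,
           is_s5 * is_pseudo, is_s5 * is_random]
three_way = [freq * is_s5 * is_pseudo, freq * is_s5 * is_random]

X_nested = np.column_stack(main + two_way)
X_full = np.column_stack(main + two_way + three_way)

# The model that matches the generating process should achieve the lower AIC.
print(aic(X_full, dur) < aic(X_nested, dur))  # True
```

Note that this sketch ignores exactly the problem raised in the text: both models assume a constant residual variance, which is violated when data from stages with very different duration variances are pooled.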
Robustness and Meaning of Frequency Effects
Although we expected early effects of frequency based on previous ERP studies (e.g., Segalowitz & Zheng, 2009; Hauk, Davis, et al., 2006; Hauk, Patterson, et al., 2006; Sereno et al., 1998), the reported effects (end of first stage around 70 msec) might be considered surprisingly early (Lamme & Roelfsema, 2000), despite the many proposals that encoding of early internal representations might commence earlier than traditionally assumed (e.g., Araújo et al., 2015; Hauk, Patterson, et al., 2006; Sereno & Rayner, 2003). To make sure that the effects were not an artifact of the HsMM-MVPA method, we also performed a traditional ERP analysis, and visual inspection suggested subtle amplitude differences for different frequency quantiles in the corresponding time window (see Figure 8). In addition, early effects on stage duration appeared independent of the frequency measure used (Google vs. SUBTLEX; see Figure 7). We thus believe the effects of Google frequency on stage duration to be robust and remain convinced that the measure is likely more reflective of the different roles “frequency” can play in different processes and for different word types. The extra information “mixed into” the Google frequency scores (reflected in the correlations with classical frequency measures and other predictors; see Figure 6, Table 1, and Hendrix & Sun, 2021) might serve to further highlight these different roles. At the same time, the correlations with other predictors, as well as the way the Google frequency scores are obtained, also complicate the interpretation of the information captured by them.
As pointed out in the Introduction section, different frequency measures will all reflect different experiences with words and it is therefore not always obvious what information exactly is captured by conventional measures either (e.g., Baayen et al., 2016). For example, according to a subtitle-based (e.g., SUBTLEX; Keuleers, Brysbaert, & New, 2010) frequency measure, only slightly more than 2% of the words in the DLP are more frequent than the Dutch word for “murdered” (“vermoord”). In contrast, a corpus-based frequency measure (e.g., CELEX; Baayen et al., 1995) suggests that more than 10% of words in the DLP are more frequent.
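The divergence between frequency measures described above amounts to a word occupying very different percentile ranks under different measures. A small sketch of how such ranks can be compared (the frequency lists below are synthetic and do not reproduce the actual SUBTLEX or CELEX counts):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic log-frequency scores for a lexicon under two hypothetical,
# moderately correlated measures (NOT the actual SUBTLEX/CELEX counts).
n_words = 10_000
subtitle_freq = rng.normal(size=n_words)
corpus_freq = 0.6 * subtitle_freq + rng.normal(size=n_words)

def pct_more_frequent(scores, target):
    """Percentage of lexicon entries scoring above the target word."""
    return 100.0 * np.mean(scores > target)

# Pick an item that ranks high by the subtitle measure but much lower by
# the corpus measure, mirroring the "vermoord" example in the text.
i = int(np.argmax(subtitle_freq - corpus_freq))
p_sub = pct_more_frequent(subtitle_freq, subtitle_freq[i])
p_corp = pct_more_frequent(corpus_freq, corpus_freq[i])
print(p_sub, p_corp)
```

With only moderate correlation between the measures, items like this are common, which is why conclusions about "frequency" effects can depend strongly on the measure chosen.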
These differences likely arise at least in part because both measures capture a different context in which words are experienced and because subtitle-based frequency measures will likely carry very different information about the emotional valence of stimuli compared with corpus-based measures (e.g., Baayen et al., 2016). Hence, what contributes to a word's “frequency” rating generally differs between measures. Here, we thus opted for a more multifaceted interpretation of “frequency” as holistic “word-likeness” (e.g., Hendrix & Sun, 2021; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984). For one, this interpretation offers a more parsimonious account of the effects in the first three stages, likely reflecting visual processing, which might be difficult to reconcile with the classical or lexical interpretation of frequency. Considering the relation of frequency to word-likeness, the early Google frequency effects become more intuitive and comparable across word types. In addition, we believe that this more multifaceted interpretation makes it easier to address the mixed nonword “frequency” effects described in the Introduction section. We suggested that these mixed effects could result from different nonword predictors interacting with each other (e.g., Yap et al., 2015) but also because base-word and substring predictors (e.g., base-word and n-gram frequency) might fail to sufficiently account for the overall notion of stimulus word-likeness (e.g., Hendrix & Sun, 2021; Seidenberg & McClelland, 1989; Balota & Chumbley, 1984). As mentioned in the Introduction section, a unified and more direct measure of holistic word-likeness would go a long way toward addressing these problems and would also permit a unified account of both word and nonword frequency effects.
Evidence indicating that the Google frequency scores act as a sufficient proxy of this “word-likeness” interpretation was provided by the Google frequency score distributions for our set of stimuli (see Figure 5). These clearly indicate that the Google frequency measure captured continuous differences between all our stimuli on a single dimension. More importantly, the measure maintained the expected “word-likeness” ordering: On average, words scored higher than pseudowords, which scored higher than random strings. Finally, the measure clearly captured information relevant for all processes involved in LDs: The two-way interaction of frequency and word type contributed strongly to the model of stage duration across all processing stages. Although these results do not constitute proof that the Google frequency measure indeed reflects word-likeness, they at least provide evidence that the measure captures information related to this concept.
HsMM-MVPA Limitations and Related EEG Analysis Methods
A limitation of the HsMM-MVPA method used here is that it currently cannot capture effects of continuous predictors on the EEG topologies marking the onset of the estimated stages. Rather, these topologies only reflect approximately the average (over trials and thus frequency differences) EEG pattern in the first 50 msec of a stage (cf. Anderson et al., 2016). In contrast to ERPs, however, the trial-level stage duration estimates account for the inevitable variability in the duration of language processes (Borst & Anderson, 2024), enabling a direct investigation of the effects of continuous predictors like frequency.
The goal of the HsMM-MVPA method, to generate insights into the cognitive stages enabling task performance, is also shared with other methods like the topographic segmentation of EEG into distinctive “microstates” (e.g., Murray, Brunet, & Michel, 2008; Lehmann, Strik, Henggeler, Koenig, & Koukkou, 1998). This segmentation approach assumes that the EEG signal can be broken down into a set of briefly stationary EEG topographies, which are usually recovered from average ERPs by means of clustering algorithms and then fitted to individual or trial-level data in an optional second step (e.g., Mahé et al., 2015; Murray et al., 2008). As mentioned before, any pattern visible in the average ERP will be a distorted version of the actual trial-level patterns in case of trial-level variability in the onset of the patterns (Borst & Anderson, 2015; Luck, 2014). As a consequence, multiple distinctive patterns on the trial level with variable onset might no longer be distinguishable in the averaged ERP waveform, which could impair the precise recovery of both the number and onset of processing stages. This is precisely what the HsMM-MVPA method avoids, by fitting models to the trial-level EEG data (see the Methods section for details and Anderson et al., 2016).
Another approach is to untangle the brain responses related to different types of processing by means of estimating “temporal response functions” (TRFs) from EEG data (Brodbeck et al., 2023; Gaston, Brodbeck, Phillips, & Lau, 2023; Lalor, Pearlmutter, Reilly, McDarby, & Foxe, 2006). As discussed in more detail by Brodbeck and colleagues (2023), TRF analysis involves estimating the individual time-lagged EEG responses to different predictor variables. The latter can reflect isolated events in time (e.g., word onset) but also continuous changes over time (e.g., features derived from the sound wave in auditory stimulus presentation). Like the HsMM-MVPA method, TRFs are usually estimated directly from trial-level data (e.g., Brodbeck et al., 2023). Unlike the HsMM-MVPA method, TRF analysis assumes that the onsets of EEG-response-eliciting events are known in advance and specified as predictor variables. Given events and EEG data, TRF analysis then obtains an estimate of the EEG response elicited by the events (e.g., Brodbeck et al., 2023). In contrast, the HsMM-MVPA method only starts with a partial definition of what the response to a specific type of “event” (i.e., the onset of a cognitive stage) should look like in the EEG, a certain bump topology corresponding to 50-msec bumps across electrodes (see Figure 1), and then estimates the onset of these events as well as the optimal weights for the bumps making up the topology.
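At its core, the TRF estimation described by Brodbeck and colleagues (2023) amounts to a time-lagged, regularized regression of the EEG on the predictor time series. The following is a minimal single-channel sketch on synthetic data, using plain ridge regression rather than the boosting variants common in practice (all parameters and data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic single-channel example: an impulse predictor (e.g., word onsets)
# convolved with a known response kernel, plus noise.
n_samples, n_lags = 5000, 40
stimulus = (rng.random(n_samples) < 0.02).astype(float)
true_trf = np.hanning(n_lags)                  # "ground-truth" response kernel
eeg = np.convolve(stimulus, true_trf)[:n_samples]
eeg += rng.normal(0, 0.1, size=n_samples)

# Build the time-lagged design matrix: column j holds the stimulus shifted
# by j samples, so the regression weight for column j is the response at lag j.
X = np.zeros((n_samples, n_lags))
for j in range(n_lags):
    X[j:, j] = stimulus[:n_samples - j]

# Ridge solution (X'X + aI)^-1 X'y; the estimated weights form the TRF.
alpha = 1.0
trf_hat = np.linalg.solve(X.T @ X + alpha * np.eye(n_lags), X.T @ eeg)

# With enough events, the estimate closely tracks the true kernel.
err = np.max(np.abs(trf_hat - true_trf))
print(err < 0.2)
```

The key contrast with the HsMM-MVPA method is visible here: the event onsets (the `stimulus` impulses) must be known in advance, whereas the HsMM-MVPA method estimates the onsets of latent stage transitions from the EEG itself.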
As pointed out by Brodbeck and colleagues (2023), the advantage of a TRF analysis is most clearly realized in experiments where different predictors or their EEG responses overlap temporally: Although a TRF analysis can distinguish between these different responses, a conventional ERP analysis cannot (e.g., Brodbeck et al., 2023). However, for tasks like the one presented here, in which the onset of at most two events is known (i.e., stimulus and response onset), a TRF analysis would add little value over a traditional ERP analysis, provided that sufficient time passes between two trials to prevent temporal response overlap. In these tasks, the HsMM-MVPA method, by detecting the onset of latent cognitive stages, can provide a much more detailed account of the processing involved.
We thus believe that the trial-level stage duration analysis presented here is a valuable addition to the existing set of tools available for the study of LDs and cognitive processing in general.
Conclusion
In conclusion, the combination of the HsMM-MVPA method (Anderson et al., 2016), to recover trial-level cognitive stages, and GAMMs (Wood, 2017; Hastie & Tibshirani, 1990), to directly analyze the effects of continuous predictors on these stages, provided unprecedented insights into how a shared frequency measure, capturing word-likeness, impacts the processing stages involved in LDs for words and nonwords. This combination of methods complements existing tools available for the analysis of the time-course of cognitive processing, allowing researchers to (1) directly investigate all of the processes enabling a specific task and (2) study how continuous and categorical predictors alike shape the duration of these processes.
APPENDIX A: CHANGES IN HELD-OUT LOG-LIKELIHOOD
APPENDIX B: ORIGINAL TRIAL-LEVEL ANALYSIS
Corresponding author: Joshua Krause, Bernoulli Institute for Mathematics, Computer Science and Artificial Intelligence, University of Groningen, Nijenborgh 9, 9747 AG Groningen, The Netherlands, or via e-mail: [email protected].
Data Availability Statement
The preprocessed trial-level EEG and RT data described in the article have been made available on GitHub (https://github.com/JoKra1/HSMM_LD_EEG/). In addition, the repository contains a list of all stimuli used in the experiment, their corresponding Google frequency estimates, and scripts to replicate the results presented in the article.
Author Contributions
Joshua Krause: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Software; Validation; Visualization; Writing—Original draft; Writing—Review & editing. Jacolien van Rij: Conceptualization; Methodology; Project administration; Resources; Software; Supervision; Validation; Writing—Review & editing. Jelmer P. Borst: Conceptualization; Methodology; Project administration; Resources; Software; Supervision; Validation; Writing—Review & editing.
Diversity in Citation Practices
Retrospective analysis of the citations in every article published in this journal from 2010 to 2021 reveals a persistent pattern of gender imbalance: Although the proportions of authorship teams (categorized by estimated gender identification of first author/last author) publishing in the Journal of Cognitive Neuroscience (JoCN) during this period were M(an)/M = .407, W(oman)/M = .32, M/W = .115, and W/W = .159, the comparable proportions for the articles that these authorship teams cited were M/M = .549, W/M = .257, M/W = .109, and W/W = .085 (Postle and Fulvio, JoCN, 34:1, pp. 1–3). Consequently, JoCN encourages all authors to consider gender balance explicitly when selecting which articles to cite and gives them the opportunity to report their article's gender citation balance.
Notes
Because no topology is estimated for the first stage in the HsMM-MVPA method (Anderson et al., 2016), evidence for six stages supports five distinct bump topologies.
According to the likelihood function given by Anderson et al. (2016).
Because an HsMM-MVPA model estimates a separate scale parameter for the sojourn or duration distribution of each processing stage.
These are not the Bonferroni-corrected difference CIs discussed in the Methods section. They provide an indication of uncertainty in the estimates only and were not used for inference directly.
AIC comparisons for the stage-specific models would similarly indicate that the two-way interaction offers a much better account of the data from each stage: For every stage, the model including the two-way interaction has a substantially lower AIC than the model including only the main effects of frequency and word type (the smallest difference in AIC was 18.662, for the second stage; for the remaining stages, the AIC differences ranged between 22.290 and 153.454).
These substage GAMMs were still fitted to the data from all word types. For pseudowords, the duration of either Substage 5a or 5b was used as a dependent variable. For the remaining word types, the duration of the shared fifth stage was used instead.