Regression is a principal tool for relating brain responses to stimuli or tasks in computational neuroscience. This often involves fitting linear models with predictors that can be divided into groups, such as distinct stimulus feature subsets in encoding models or features of different neural response channels in decoding models. When fitting such models, it can be relevant to allow differential shrinkage of the different groups of regression weights. Here, we explore a framework that allows for straightforward definition and estimation of such models. We present an expectation-maximization algorithm for tuning hyperparameters that control shrinkage of groups of weights. We highlight properties, limitations, and potential use-cases of the model using simulated data. Next, we explore the model in the context of a BOLD fMRI encoding analysis and an EEG decoding analysis. Finally, we discuss cases where the model can be useful and scenarios where regularization procedures complicate model interpretation.

Understanding how the brain responds to different stimuli or tasks is a key objective in systems neuroscience. In recent years, significant progress has been made in developing models to analyze neural responses to complex, naturalistic stimuli (Hamilton & Huth, 2020). One potent approach is encoding models that combine stimulus features to predict channel-wise neural responses from fMRI or electrophysiology through linear regression (Broderick et al., 2018; de Heer et al., 2017; Di Liberto et al., 2015; Dumoulin & Wandell, 2008; Kay et al., 2013; Kell et al., 2018). Regularization is often needed in such models to reduce the risk of over-fitting and improve model prediction on held-out data. However, multiple - potentially colinear - stimulus feature sets can complicate the assessment of the relative importance of the individual features. In this case, it can be relevant to allow for differential shrinkage of different groups of regression weights. For instance, neural responses to natural speech depend on both low-level acoustic features and high-level phonetic or semantic features (Broderick et al., 2018; Di Liberto et al., 2015; Huth et al., 2016; Jain & Huth, 2018). Such feature groups vary on different time scales, have different dimensionality, and some features may not even be reflected in the neural response.

Several regularization strategies could have relevance for this situation (David et al., 2007; Golub et al., 1999; Hoerl & Kennard, 1970; Massy, 1965; Park & Pillow, 2011; Sahani & Linden, 2002; Tibshirani, 1996; Tibshirani et al., 2005; Tikhonov & Arsenin, 1977; Zou & Hastie, 2005). Recently, Nunez-Elizalde et al. (2019) and la Tour et al. (2022) proposed to divide predictors in encoding analyses into separate feature “bands” (groups of predictors) and allow differential shrinkage of weights associated with these different bands. The authors proposed banded Ridge regression - also referred to as grouped Ridge regression (van de Wiel et al., 2021) - and showed that banded Ridge regression can yield higher out-of-sample predictive accuracy compared to standard Ridge regression in fMRI encoding analyses (la Tour et al., 2022; Nunez-Elizalde et al., 2019). Other methods that can incorporate grouping structure constraints and other priors (Boss et al., 2023; Friedman et al., 2010; Simon et al., 2013; van de Wiel et al., 2016; Xu & Ghosh, 2015; Yuan & Lin, 2006; Zhao et al., 2009) could also be relevant in encoding analyses with grouped features.

Similar regression problems arise in decoding models, that is, models that combine neural channels (voxels, electrodes) to predict a continuous stimulus feature or task feature (Naselaris et al., 2011; Wu et al., 2006). One pertinent example is decoding models for EEG or MEG responses to continuous stimuli, like speech or music (Di Liberto et al., 2021; Ding & Simon, 2012; O’Sullivan et al., 2014). Here, regularized regression can be used to identify multidimensional finite impulse response (FIR) filters that predict stimulus features by combining different electrode signals at different time lags and/or different frequency bands (de Cheveigné et al., 2018; Mesgarani & Chang, 2012; Mirkovic et al., 2015). As with encoding models, it can be useful to allow for differential shrinkage of groups of weights corresponding to groups of predictors.

Here, we explore an empirical Bayes framework (Efron & Morris, 1973, 1975; Morris, 1983; Robbins, 1964) for estimating “banded”-type regression models. The framework is closely related to automatic relevance determination (Bishop, 2006; MacKay, 1994; Park & Pillow, 2011; Sahani & Linden, 2002; Tipping, 2001; Wipf & Nagarajan, 2007) and can be seen as an extension of work by Tipping (2001). The framework can be used to formulate regularized regression estimators with different regularization properties. We use an expectation-maximization (EM) algorithm to tune hyperparameters that control variances of prior distributions. The model can be used to formulate priors that allow for differential shrinkage of groups of regression weights. The model can additionally be used to promote smoothness on groups of regression weights by encouraging correlation between related regression weights.

In this paper, we describe the modeling framework and provide simulations to illustrate the properties, limitations, and potential use-cases of the described algorithm. Additionally, we illustrate properties of the model by using it to fit stimulus-response models on two publicly available datasets (Broderick et al., 2018; Nakai et al., 2022) and compare these model fits with the widely used Ridge estimator (Hoerl & Kennard, 1970). In the first dataset, we consider encoding models that use a set of audio features to predict BOLD fMRI responses to music (Nakai et al., 2022). In the second dataset, we fit decoding models that use features of multichannel time-lagged EEG data to predict an audio stimulus feature during a speech listening task (Broderick et al., 2018). Finally, we describe how simplifying model assumptions can allow for an efficient estimation in regression problems with multiple target variables. Software implementations of the model in both MATLAB and Python are available at https://github.com/safugl/embanded.

2.1 Regression model

Encoding and decoding analyses often involve linear regression models in the following form:

y=Xw+ϵ,
(1)

where ϵ is a zero-mean Gaussian variable.

Here, encoding models would treat stimulus features or task features as predictors, X, and a (preprocessed) neural response as target variable, y. Decoding models treat features extracted from neural activity measurements as predictors, X, and a stimulus feature or task feature as target variable, y. The differences in direction of inference have large implications for model interpretation and model properties (Haufe et al., 2014; Naselaris et al., 2011) as well as how potential stimulus confounders or task confounders can be dealt with.

Despite their differences, such encoding and decoding models lead to regression problems that can be encompassed in a general formulation. Let yRM×1 denote a target variable where M indicates the number of observations. Moreover, let X denote a matrix with D predictors such that XRM×D. Suppose that X can be divided into a set of meaningful groups such that X=[F1,F2,...,FJ], j=1,2,...,j,...J. Here, Fj denotes a group of Dj predictors, that is, FjRM×Dj. This group of predictors could, for example, be a stimulus feature subset in an encoding model, or response features of different voxels or electrodes in a decoding model. We thus consider a regression model in the following form (la Tour et al., 2022; Nunez-Elizalde et al., 2019):

y=F1β1+F2β2+...+FJβJ+ϵ,
(2)

where βj denotes regression weights for a given group of predictors, Fj. Further, let w denote a column vector collecting all the corresponding weights w=[β1,β2,...,βJ], with wRD×1. We will assume that input variables and target variables have been centered and that no intercept term is needed in the regression model (since the intercept typically carries no meaningful information and hence shrinkage is typically not desired). The conditional distribution of the target variable y given a set of observed predictors X and given the model with parameters w and ν will be assumed to follow a Gaussian distribution in the following form:

p(y|X,w,ν)=N(y|Xw,νIM),
(3)

where N(y|Xw,νIM) is used to indicate a multivariate Gaussian with mean Xw and a diagonal covariance matrix νIM of size M×M. A zero-mean Gaussian prior distribution will be placed over the weights:

p(w|Λ,η,τ)=N(w|0,Λ),
(4)

where Λ is a D×D block-diagonal covariance matrix. Such a prior distribution over the weights can be used to “shrink” the estimated regression weights. Here, we specifically define that Λ has a block structure defined as follows:

Λ[λ1ID100λJIDj][Ω100ΩJ].
(5)

By introducing such a prior distribution over the weights, it is now possible to allow for differential shrinkage of different groups of weights. We define that the j-th block in Λ relates to a single group of predictors, Fj, and that Ωj has size Dj×Dj. The column indices in X unique to predictor group Fj will thus index rows and columns in Λ for that block. The terms Ωj, j=1,...,J, denote matrices that are assumed to be fixed a priori. For now, we will assume that they are identity matrices. The following parametrization (Matérn, 1960; Rasmussen & Williams, 2005) of these terms will later be used for encouraging local smoothness via a Matérn covariance function with order 3/2 and length scale hj:

Ωj=(1+3|ki|hj)exp(3|ki|hj),k=1,...,Dj,i=1,...,Dj.
(6)

Note that rather than attempting to estimate hj and impose hyperpriors on the hj terms, we will fix these parameters a priori. We complete the model by placing Inverse-Gamma prior distributions over the λj terms and over the ν term:

p(ν|ϕ,κ)=Inv-Gamma(ν|ϕ,κ),
(7)
p(λj|η,τ)=Inv-Gamma(λj|η,τ),
(8)

where Inv-Gamma (x|a,   b) denotes an Inverse-Gamma distribution with shape a, scale b, and mode b/(a+1):

Inv-Gamma(x|a,b)=baΓ(a)1xa+1exp(bx),a>0,b>0.
(9)

To simplify notation, we will henceforth let λ denote a set of {λj}j=1J parameters and not highlight dependence on terms that remain fixed. We now need to regard λ and ν as random variables and by doing so the posterior distribution over all variables takes the form:

p(w,λ,ν|y,X)=p(y|X,w,ν)p(w|λ)p(λ)p(ν)p(y|X,w,ν)p(w|λ)p(λ)p(ν)dwdλdν.
(10)

This expression cannot be evaluated analytically. In what follows, we will resort to an empirical Bayesian estimation procedure and assume that p(w|y,X)=p(w,ν,λ|X,y)dνdλ can be approximated by p(w|y,X,ν˜,λ˜), where ν˜ and λ˜ are fixed values obtained by maximizing the marginal posterior distribution:

p(λ,ν|y,X)=p(w,λ,ν|y,X)dw.
(11)

We will return to assumptions entailed by this procedure later. Our goal is to use an expectation-maximization (EM) framework (Dempster et al., 1977; Minka, 1998; Neal & Hinton, 1998) to maximize this marginal distribution. In brief, this involves lower bounding lnp(λ,ν|y,X) with a function (λ,ν,q) that is a function of a valid probability distribution q(w) over w, and a function of the parameters λ and ν (Bishop, 2006):

(λ,ν,q)=q(w)lnp(w,λ,ν|y,X)q(w) dw.
(12)

The framework involves alternating between an “E-step” (finding a good lower bound) and an “M-step” (subsequently maximizing this bound) (Minka, 1998). Suppose we initialize the algorithm with λ(k) and ν(k) where k=0. In the expectation step (E-step), we fix λ and ν to these values and maximize (λ(k),ν(k),q) with respect to q(w). It can be shown that the lower bound will touch the objective when q(w)=p(w|y,X,λ(k),ν(k))q(k), where the posterior distribution of w given a fixed set of parameters λ and ν takes the form:

p(w|y,X,λ,ν)=p(y|X,w,ν)p(w|λ)p(y|X,ν,λ)
(13)
                                   =N(w|μ,Σ),
                               Σ=(Λ1+ν1XX)1,
(14)
                               μ=ν1ΣXy.
(15)

We will henceforth let μ˜ and Σ˜ denote mean and covariance of p(w|y,X,λ(k),ν(k)). Defining q(w) given λ(k) and ν(k) is the E-step. In the maximization step (M-step), we keep this distribution fixed and now maximize (λ,ν,q(k)) with respect to λ and ν. Ignoring irrelevant terms, one can show that we need to find a set of parameters λ(k+1) and ν(k+1) that maximizes Eq(k)[ln (p(y|X,w,ν)p(w|λ)p(λ)p(ν))] where Eq(k)[] denotes expectation under q(k) and where

ln p(y,w,λ,ν|X)=12ν(yXw)(yXw)M2Inν12wTΛ1w12In|Λ|+j=1JIn1λj1+ηj=1Jτλj+In1ν1+ϕκν+constant.
(16)

Recall that w is a column vector that contains all βj weights and that Λ is a block-diagonal matrix. We know that p(w|y,X,λ(k),ν(k)) is Gaussian with mean μ˜ and covariance Σ˜. Let μ˜j denote blocks in μ˜ associated with predictor group j. Further, let Σ˜j denote a block of size Dj×Dj along the diagonal in Σ˜ associated with predictor group j. We again discard irrelevant terms, and now see that the M-step amounts to maximizing the following expression with respect to ν and λ:

=12ν((yXμ˜)(yXμ˜)+Tr(XXΣ˜))M2lnνj=1J12λj(μ˜jTΩj1μ˜j+Tr(Ωj1Σ˜j))12j=1JDjInλj+j=1JIn1λj1+ηj=1Jτλj+In1ν1+φκν.
(17)

Finding a set of λ and ν parameters that fulfill /λj=0 and /ν=0 yields the following closed-form update rules:

λj(k+1)=μ˜jΩj1μ˜j+Tr(Ωj1Σ˜j)+2τDj+2η+2,
(18)
ν(k+1)=(yXμ˜)(yXμ˜)+Tr(XXΣ˜)+2κM+2+2ϕ.
(19)

The procedure of fitting the model thus involves (I) specifying hyperprior parameters ϕ, κ, η and τ, (II) starting with an initial guess for λ1,λ2,...λJ and ν, (III) estimating Σ˜ and μ˜, (IV) updating λj and ν according to Eqs. 18 and 19, and subsequently iterating between step (III) and (IV) until convergence.

2.1.1 Predictions

The above procedure yields a set of hyperparameters ν˜ and λ˜ from the training data that may be sensible in several circumstances, and we will consequently focus on p(w|y,X,ν˜,λ˜). For predictive purposes, we assume that the following approximation holds:

p(y^|y,X,X^)p(y^|y,X,X^,ν˜,λ˜)=p(y^|w,X^,ν˜)p(w|y,X,ν˜,λ˜)dw.
(20)

That is, to make predictions about a novel response variable, y^, given a new set of predictors X^ and inferred model parameters, we focus on p(y^|y,X,X^) and let p(λ,ν|X,y) be represented by a delta function δ(νν˜,λλ˜) (Tipping, 2001). The resulting distribution is Gaussian with mean X^μ˜ and covariance ν˜I+X^Σ˜X^. The terms μ˜ and Σ˜ here refer to the estimates in Eqs. 14 and 15, respectively. To make point predictions, we may now focus on the estimated model weights w=μ˜ and simply compute X^w. Our focus will be on such point predictions.

For the interested reader and illustrative purposes, we visualize variability in the estimated weights in Figures 1-3 based on Eq. 13. It is important to stress that p(w|y,X,λ˜,ν˜) is conditional on the point estimates of λ˜ and ν˜ and that it may fail to be even close to a reasonable approximation of p(w,ν,λ|X,y)dνdλ (Wolpert & Strauss, 1996).

3.1 Simulations

In addition to evaluating the predictive accuracy of encoding models or decoding models on held-out data, neuroimaging studies often inspect and interpret point estimates of model parameters. In linearized encoding models, model weights are often assumed to be directly interpretable in terms of the underlying brain activity generating a measured response (Haufe et al., 2014). Weights of general linear models (GLMs) in fMRI analysis (Friston et al., 1994) or temporal response functions in M/EEG analysis (Ding & Simon, 2012; Lalor et al., 2009) are prominent examples. However, the choice of regularization can significantly impact the relative magnitude of the estimated weights, in particular in problems with many, potentially correlated, sets of predictors. It can therefore be relevant to simulate data with known target model weights in order to better understand the properties of regularized regression estimators and identify situations where they can be useful or can complicate interpretation.

Below, we consider different simulations to illustrate the properties of the described model. We compare the proposed EM-based banded regression estimator (henceforth EM-banded) with a standard Ridge regression estimator: w=(XTX+αI)1XTy (Hoerl & Kennard, 1970). To simplify readability, we consider simulations where we fix the parameters related to the Inverse-Gamma priors to the same value, γ, such that τ=η=κ=ϕ=γ. Notice that the hyperpriors become broad when γ approaches zero.

3.1.1 Simulation 1: Suppressing contributions from a group of predictors

In models with groups of predictors representing different stimulus feature spaces, it is sometimes needed to identify entire feature spaces that can be excluded from an analysis without compromising predictive accuracy. To simulate such a situation, we here define three stimulus feature spaces, F1, F2, and F3, and one response variable y which contains mixed versions of F1 and F3, but not F2 (F2 does not contribute to the response). We thus simulate the response as y=F1w1+F3w3+ϵ, where ϵ is a Gaussian noise term. In this situation, we thus need the estimator to identify (close to) zero weights for F2. Each feature space has 64 predictors and a total of M=1024 observations are simulated. The weights, w1 and w3, as well as the feature space F1 and F3, are drawn from Gaussian distributions. The mixed target signal and the noise term have equal power. We compare weights estimated by the EM-banded model with weights estimated by Ridge regression models. To exemplify how hyperparameters controlling Inverse-Gamma hyperpriors can impact the estimated weights, we fit models for γ fixed to 104, 103, 102, and 101. The Ridge α hyperparameter is set to 104, 103, 102, and 101. The terms Ω1, Ω2, and Ω3 are identity matrices.

Figure 1 shows results from Simulation 1. The Ridge regression model yields “dense” solutions for all considered Ridge hyperparameters, α, with no dramatic overall scale differences between weights associated with F1, F2, or F3. In contrast, the EM-banded model excessively shrinks weights associated with feature set F2 for smaller values of γ. As expected, the EM-banded model tends to show signs of under-shrinkage when the Inverse-Gamma prior distributions have little mass near zero and in these cases, the overall scale differences between the different sets of weights associated with F1, F2, and F3 are smaller.

Fig. 1.

Simulation 1 illustrating regularization properties in a regression problem with three groups of predictors. Left middle panel: simulated target weights (black). The predictors are divided into three distinct predictor groups, F1, F2, and F3. Target weights associated with F2 are equal to zero. Top row: weights estimated with EM-banded model for different values of η=ϕ=τ=κ=γ. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Fig. 1.

Simulation 1 illustrating regularization properties in a regression problem with three groups of predictors. Left middle panel: simulated target weights (black). The predictors are divided into three distinct predictor groups, F1, F2, and F3. Target weights associated with F2 are equal to zero. Top row: weights estimated with EM-banded model for different values of η=ϕ=τ=κ=γ. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Close modal

3.1.2 Simulation 2: Encouraging smoothness for a feature subset

It is sometimes relevant to encourage smoothness on sets of weights, for instance, in analyses that approximate smooth FIR filters. An example hereof is fMRI encoding analyses that model the relation between a task-regressor augmented with multiple time lags and the BOLD voxel response. In this case, it can a priori be expected that the approximated FIR filter should be smooth due to hemodynamic properties of the BOLD signal (Goutte et al., 2000). It can similarly be relevant to assume smoothness across spectral channels in analyses that model relations between average stimulus power spectra and BOLD fMRI responses.

Here, we explore one way of encouraging smoothness using the definition of Ωj in Eq. 6. To illustrate the effect of Ωj, we consider a simulated regression problem with two feature spaces, F1 and F2. Each feature space has 64 predictors and is drawn from a Gaussian distribution. Data are simulated as y=F1w1+F2w2+ϵ, where ϵ is drawn from a Gaussian. A total of M=1024 observations are simulated. The weights associated with the first feature space w1 are simulated to be sinusoidal (i.e., smooth), while w2 weights are Gaussian. The signal-to-noise ratio (SNR) is -5 decibels (dB) and we fit the EM-banded models with γ fixed to a low value of 104 (i.e., corresponding to a broad hyperprior). The Ω1 term associated with F1 is parameterized according to Eq. 6 with h1 set to 1, 5, and 10 to illustrate its effect.

Figure 2 shows weights estimated with Ridge estimators for different α values (below) and weights estimated by the EM-banded estimators (above). At higher levels of h1, the EM-banded estimator approximates the weight distribution of the first feature space F1 with high accuracy. The Ridge estimator, on the other hand, yields weights that tend to show more rapid (high frequency) fluctuations across neighboring weights associated with F1 compared to the target weights.

Fig. 2.

Simulation 2 illustrating effects of smoothness encouragement. Left middle panel: simulated target weights (black) divided into two distinct predictor groups, F1 and F2. Weights associated with F1 are sinusoidal, while weights associated with F2 are drawn from a Gaussian. Top row: weights estimated with the EM-banded model for γ=104 but with different parameterizations for Ω1 (defined in Eq. 6). Specifically, we change the h1 term in this equation in order to encourage smoothness. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Fig. 2.

Simulation 2 illustrating effects of smoothness encouragement. Left middle panel: simulated target weights (black) divided into two distinct predictor groups, F1 and F2. Weights associated with F1 are sinusoidal, while weights associated with F2 are drawn from a Gaussian. Top row: weights estimated with the EM-banded model for γ=104 but with different parameterizations for Ω1 (defined in Eq. 6). Specifically, we change the h1 term in this equation in order to encourage smoothness. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Close modal

3.1.3 Simulation 3: Encouraging differential shrinkage of individual weights

In Simulation 1, the EM-banded model was used to shrink weights of an entire feature space (i.e., groups of predictors). The model can similarly encourage differential shrinkage of individual weights. To illustrate this, we consider a simulation where we assume that the number of groups equals the number of predictors (i.e., each group contains a single predictor). We simulate 512 predictors, F1, F2,…, F512, and a response variable y which contains mixed versions of some of these predictors. A total of M=1024 observations are simulated. It is assumed that y=jFjwj+ϵ where ϵ is Gaussian noise. We assume that a high proportion of the target weights wj is equal to zero. Each row in [F1,F2,...,F512] is drawn from a multivariate Gaussian with zero mean and a covariance defined as C=exp(0.3|ki|2),k=1,...,512,i=1,...,512. The SNR is equal to 0 dB. We again compare Ridge regression model fits with EM-banded model fits using the same set of fixed hyperparameters as in Simulation 1.

The estimated weights for the Ridge and EM-banded model are shown in Figure 3. It is evident that the EM-banded estimator with γ=104 excessively shrinks several of the individual weights. In effect, the EM-banded model recovers the target weights with relatively high accuracy. In comparison, the Ridge estimator tends to dilute effects across neighboring (correlated) predictors and pull weights toward zero when α attains high values.

Fig. 3.

Simulation 3 illustrating differential shrinkage of individual weights. Left middle panel: simulated target weights (black). Top row: weights estimated with EM-banded model for different values of η=ϕ=τ=κ=γ. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Fig. 3.

Simulation 3 illustrating differential shrinkage of individual weights. Left middle panel: simulated target weights (black). Top row: weights estimated with EM-banded model for different values of η=ϕ=τ=κ=γ. Shaded gray errorbars depict 1%, respectively, 99% percentile of samples from a multivariate Gaussian with mean and covariance defined according to Eq. 13. Bottom: weights estimated with Ridge regression model for different Ridge regularization parameters α. The target weights are shown in black.

Close modal

3.1.4 Simulation 4: Interpretation of weights with correlated predictors

Until now, we have not explored how strong correlations among groups of predictors can impact results. Models involving correlated groups of predictors occur often in EEG or fMRI encoding models that model responses to naturalistic stimuli (Hamilton & Huth, 2020). To exemplify this situation, this simulation considers the example of predicting responses to natural speech. Speech signals can be characterized by their lower level acoustic features as well as higher level phonetic or semantic features (Broderick et al., 2019; Huth et al., 2016). The different hierarchical levels can each be modeled by multidimensional features, but the statistics of natural speech imply a degree of correlation between such feature groups (see, e.g., panel B in Fig. 4). This can complicate inferences about neural responses unique to a given group.

Fig. 4.

Simulation 4 illustrating behavior of the Ridge estimator and the EM-banded estimators when there are strong correlations among groups of predictors, F1 and F2. The target variable is simulated as y=F1w1+ϵ. (A) Target weights associated with each of the two feature sets. (B) Matrix of Pearson correlation coefficients between predictors. Dashed lines are used to highlight predictor groups, F1 and F2. (C) Weights estimated with Ridge estimator for each of the two groups. (D) Weights estimated with the EM-banded estimator for each of the two groups. Ridge estimators with α set to 106, 105, and 104 are shown. EM-banded estimators were considered with γ set to 106, 105, and 104.

Fig. 4.

Simulation 4 illustrating behavior of the Ridge estimator and the EM-banded estimators when there are strong correlations among groups of predictors, F1 and F2. The target variable is simulated as y=F1w1+ϵ. (A) Target weights associated with each of the two feature sets. (B) Matrix of Pearson correlation coefficients between predictors. Dashed lines are used to highlight predictor groups, F1 and F2. (C) Weights estimated with Ridge estimator for each of the two groups. (D) Weights estimated with the EM-banded estimator for each of the two groups. Ridge estimators with α set to 106, 105, and 104 are shown. EM-banded estimators were considered with γ set to 106, 105, and 104.

Close modal

To illustrate this, we simulate responses to speech acoustic and phonetic features extracted from natural speech audio (Garofolo et al., 1993) (see Section 4 in the Supplementary Material for details on feature extraction). A first feature set F1 contains time-lagged spectrogram features and F2 contains time-lagged regressors created from phonetic annotations. We simulate a response as y=F1w1+ϵ, that is, only the feature set associated with the spectrogram features F1 affects the response variable, y. The SNR is equal to 0 dB and the target filter weights, w1, are shown in Figure 4A. We again compare Ridge and EM-banded regression model fits using a set of fixed hyperparameters.

Figure 4 shows the results of this simulation. The EM-banded estimator accurately recovers the weights associated with the speech spectrogram F1 while excessively shrinking the weights associated with the phonetic features F2, as desired. The Ridge estimator, on the other hand, pulls weights associated with correlated predictors in the two feature groups toward each other at higher levels of regularization. This has the undesired consequence that a temporally located response function associated with the phonetic feature group F2 emerges. This can complicate the interpretation of the weights associated with F2, for example, if such point estimates of weights were interpreted or further analyzed in second-level inferences. Simulating the opposite case, that is, simulating only responses to phonetic features (and no contribution from spectrogram features), leads to the equivalent result: weights associated with the spectrogram feature group show temporally located response functions (see Section 4 in the Supplementary Material). The described effect depends on the degree of correlation among predictors and on the amount of data available for fitting the models. We note that high levels of regularization often occur in practice for Ridge estimators, for instance, when tuning the regularization term to maximize the Pearson correlation coefficient between model predictions and target variables (due to its insensitivity to scaling and shift mismatches). For a related, but more stylized simulation, see Section 5 in the Supplementary Material.

This simulation highlights a potential source of pitfall for the interpretation of model weights in situations with co-varying features affecting the response. To illustrate this further, consider again the simulation where the response is driven by the first feature set F1, but now only the correlated set F2 is included in the model. Figure 5 depicts the estimated weights by Ridge models in this case. The model with only F2 predictors now misleadingly indicates a temporally located response for all of the considered α parameters, despite the fact that the target regression weights associated with F2 are zero. Interpretation of the reduced model is clearly different from the model that includes both groups of predictors.

Fig. 5.

Overlooking important features in a Ridge regularized encoding model. The outcome variable is again simulated as y=F1w1+ϵ, that is, the simulated target is simulated to be affected by F1 and not F2, but only F2 is included in the prediction model. Panels reflect estimates with different Ridge regularization strengths.

Fig. 5.

Overlooking important features in a Ridge regularized encoding model. The outcome variable is again simulated as y=F1w1+ϵ, that is, the simulated target is simulated to be affected by F1 and not F2, but only F2 is included in the prediction model. Panels reflect estimates with different Ridge regularization strengths.

Close modal

3.1.5 Simulation 5: Visualizing behavior under unfavorable signal-to-noise ratios

As illustrated in Simulation 3, the EM-banded estimator can be used to encourage differential shrinkage of weights associated with individual predictors. The behavior of such an estimator will depend on several factors, including the amount of data available to support the identification of λ, correlations among predictors, and the SNR. Here, we illustrate the behavior of the EM-banded estimator in a simulation with poor SNR. We simulate two predictors F1 and F2 and a response variable y=F1w1+F2w2+ϵ. The predictors as well as the noise term ϵ is drawn from Gaussian distributions. The target weights are fixed to non-zero values. The SNR is approximately 20 dB. We consider simulations where the number of observations is 512, 1024, and 8192, and we fit EM-banded models to the data from each simulation. We focus on estimators with γ=104, respectively. γ=102 to illustrate how differences in these prior choices affect the estimated parameters. This procedure is repeated many times to elucidate the effect.

Figure 6 shows 2D histogram visualizations of the estimated weights w1 and w2 from these simulations. Notably, the estimator with the lower γ=104 tends to exhibit excessive shrinkage to either of the two weights when the number of observations is low (M=512). Such over-shrinkage is undesirable. The estimator even collapses to w1w20 occasionally, which again is undesirable. Increasing γ to 102 here yields dramatically different estimates and less differential shrinkage of the two weights (increasing γ encourages solutions that allow for less excessive shrinkage). These simulations underline subtleties that can yield dramatic effects on the estimated weights, which is important to keep in mind when focusing on “default” priors for λj.

Fig. 6.

Simulation 5 illustrating regularization properties with only two predictors and a poor SNR. 2D histogram visualizations of estimated regression weights, w1 and w2. Columns show simulations with increasing number of observations. Top row: weights estimated with the EM-banded estimator with γ=104. Bottom row: weights estimated with the EM-banded estimator with γ=102.

Fig. 6.

Simulation 5 illustrating regularization properties with only two predictors and a poor SNR. 2D histogram visualizations of estimated regression weights, w1 and w2. Columns show simulations with increasing number of observations. Top row: weights estimated with the EM-banded estimator with γ=104. Bottom row: weights estimated with the EM-banded estimator with γ=102.

Close modal

3.2 fMRI encoding analysis with “stimulus-irrelevant” predictors

Building encoding models often involves deciding on which of multiple potentially relevant feature sets to include as predictors. Choosing features both implies a risk of overlooking feature sets that are potentially relevant, and including feature sets that are irrelevant for predicting the target variables. In this section, we investigate how the inclusion of stimulus-irrelevant predictors - here defined as simulated noise predictors that are unrelated to the task or stimuli - can impact results in an fMRI encoding analysis and how the EM-banded estimator may help suppress their contribution.

This example uses a publicly available BOLD fMRI dataset that has been described in Nakai et al. (2021, 2022). The dataset contains BOLD fMRI data acquired from five participants listening to excerpts from music in 12 training runs and 6 test runs. Informed consent had been obtained from all participants prior to their participation, and procedures were approved by local ethics and safety committees, see Nakai et al. (2021, 2022). Each functional run lasted 10 min and consisted of 40 musical clips that each was 15 s long. We focused on encoding models with three feature sets, F1, F2, and F3. The first two feature sets, F1 and F2, were related to the music stimuli: a 10-dimensional genre-label feature given by Tzanetakis and Cook (2002) indicating which of 10 musical genres a given stimulus is assigned to (F1), and a time-averaged 32-dimensional spectrogram of the audio stimulus (F2). As the third feature set (F3) unrelated to the stimuli, we included a 400-dimensional multivariate Gaussian noise. F1, F2, and F3 were considered separate groups for the EM-banded model. For simplicity, we always initialize the EM algorithm with λj and ν parameters set to 1 and define that no more than 200 iterations should be considered. It may sometimes be advisable to use different initialization strategies. More details related to fMRI preprocessing, audio feature extraction, and model fitting are described in Section 1 in the Supplementary Material.

Figure 7 shows the predictive accuracies obtained with the Ridge estimator and the EM-banded estimator. Predictive accuracy is here evaluated both as Pearson correlation coefficient between the predicted and measured BOLD fMRI response data, and as explained variance ratio. Compared to the Ridge estimator, the EM-banded estimator yielded higher predictive accuracy for all subjects in both metrics (all p<0.05, Wilcoxon signed-rank tests across all voxels used for fitting models). The improvement in terms of Pearson correlation coefficient among the top 10000 voxels was (mean ± standard deviation): 0.0335±0.0515 (sub-001), 0.0411±0.0629 (sub-002), 0.0568±0.0699 (sub-003), 0.0391±0.0637 (sub-004), and 0.0298±0.0522 (sub-005).

Fig. 7.

A BOLD fMRI encoding analysis where stimulus-irrelevant predictors are included in the analysis. (A) Group-mean prediction accuracies for models trained with the EM-banded framework. (B) Group-mean prediction accuracies for cross-validated Ridge regression models. Maps in (A) and (B) have been arbitrarily thresholded for visualization purposes. (C) 2D histogram visualization of prediction accuracies (Pearson correlation coefficient, ρ) for the EM-banded and Ridge estimators for each subject. The dashed line indicates similar performance across models. (D) Similar visualizations as in (C) but using R2 as performance metric.

Fig. 7.

A BOLD fMRI encoding analysis where stimulus-irrelevant predictors are included in the analysis. (A) Group-mean prediction accuracies for models trained with the EM-banded framework. (B) Group-mean prediction accuracies for cross-validated Ridge regression models. Maps in (A) and (B) have been arbitrarily thresholded for visualization purposes. (C) 2D histogram visualization of prediction accuracies (Pearson correlation coefficient, ρ) for the EM-banded and Ridge estimators for each subject. The dashed line indicates similar performance across models. (D) Similar visualizations as in (C) but using R2 as performance metric.

Close modal

The regularization term α of the Ridge estimator varies substantially across voxels. This can occur, for instance, if voxels with higher signal-to-noise ratio require less overall regularization. Changes in α impacts the overall scale of all weights. This can lead to misleading spatial weight maps indicating effects of predictors that are stimulus irrelevant. This is illustrated in Figure 8 showing the group-mean standard deviation of predictions obtained with stimulus-irrelevant predictors. The predictions were obtained using only the task-irrelevant predictors to predict voxel responses (discarding the spectrogram and genre features) with the Ridge estimator. As can be seen, the standard deviation of these predictions is high specifically in auditory cortical regions. This may reflect that the Ridge estimator declares less overall shrinkage in voxels in regions with higher SNR. In effect, the map can misleadingly be interpreted to suggest regional auditory-evoked activation to a random feature because of the regularization. Encouraging differential shrinkage of each feature space with the EM-banded estimator avoids this problem, showing no clear spatial patterns in these maps.

Fig. 8.

Group-mean standard deviations of predictions with a stimulus-unrelated noise feature estimated with Ridge regression (A) and the EM-banded framework (B).

Fig. 8.

Group-mean standard deviations of predictions with a stimulus-unrelated noise feature estimated with Ridge regression (A) and the EM-banded framework (B).

Close modal

3.3 EEG-based decoding analysis

In this example, we consider a decoding analysis of single-trial EEG responses to continuous natural speech. The example follows a popular approach of predicting the continuous speech amplitude envelope using linear combinations of time-lagged EEG scalp electrode responses at lower frequencies (Broderick et al., 2018; Crosse et al., 2016; Ding & Simon, 2012; O’Sullivan et al., 2014; Wong et al., 2018). In this approach, the goal is to find a multidimensional filter w that linearly maps from (features in) the multichannel EEG response to the envelope of the speech stimulus y. The decoding analysis can be formulated as y(t)=ctx(tτ,c)w(c,τ)+ϵ, where x(tτ,c) denotes preprocessed electrode response in channel c at time point tτ and where τ indicates the time delay between stimulus and response. The model can be specified by concatenating multiple time-lagged versions of each EEG electrode response in a design matrix X in Eq. 1. The regression problem now amounts to estimating coefficients of a multi-dimensional FIR filter (de Cheveigné et al., 2018). In this context, it is natural to consider each time-lagged electrode response as a feature group Fj in Eq. 2 and allow for differential shrinkage of weights associated with each group. For instance, it can be relevant to allow for excessive shrinkage of a noisy or seemingly irrelevant electrode for all its time-lag copies.

While EEG speech decoding studies often focus on low-frequency activity (<15 Hz), it is known from intracranial recordings that the power of higher frequency neural activity in the gamma range also tracks features of speech stimuli (Crone et al., 2001; Mesgarani & Chang, 2012; Pasley et al., 2012). These frequencies are typically highly attenuated in scalp EEG. Nonetheless, including high gamma power features has been shown to improve predictive accuracy of EEG decoding models for some subjects (Synigal et al., 2020). However, including more EEG features, some of which may have very low SNR, is also be associated with a risk of adding predictors to the decoding model that introduce noise to the predictions. It may thus be desirable to allow for differential shrinkage of different EEG features. In this example, we therefore explore predictive accuracy of the EM-banded model when both lower and higher frequency EEG features are included in such a decoding analysis.

This example uses publicly available EEG data (https://doi.org/10.5061/dryad.070jc) described in Broderick et al. (2018) and Di Liberto et al. (2015). In brief, 19 subjects listened to excerpts from an audiobook while continuous EEG data from 128 scalp electrodes were recorded. All procedures were undertaken in accordance with the Declaration of Helsinki and were approved by local ethics committees, see Broderick et al. (2019) and Di Liberto et al. (2015). Data were acquired in 20 trials each approximately 180 s in length. We fit decoding models again using the EM-banded framework compared to standard Ridge regression. For both models, we used time-lagged low-frequency (LF) EEG features (1-8 Hz) as well as time-lagged high-frequency (HF) EEG power features (25-35 Hz) (Giraud & Poeppel, 2012) to predict a speech envelope feature. Further details related to EEG preprocessing, audio feature extraction, and model fitting are described in Section 2 in the Supplementary Material. Note that cross-validation was only used for hyperparameter tuning in the Ridge model, but that both models always were tested on held-out data.

The results from the analysis are shown in Figure 9. The prediction accuracies obtained with the EM-banded estimator are modestly but consistently higher than the Ridge estimates (p<0.01, Wilcoxon signed rank test). The average correlation between the predicted speech envelope and the target envelope is 0.1368±0.0383 for the EM-banded model and 0.1267±0.0.0377 for the Ridge model. The EM-banded approach excessively shrinks groups of weights, whereas the Ridge estimator yields a denser distribution of weights. Figure 9 (right) shows an example of weights estimated by the two models for one subject. This illustrates how the overall scale of weights associated with the HF features and LF features is comparable for the Ridge estimator whereas the EM-banded estimator differentially shrinks both weights associated with HF features and groups of LF features associated with single electrodes. Given that high-frequency EEG features are weak and in some subjects not even measurable by scalp electrodes, including HF features can negatively model impact predictions, in particular with a Ridge estimator. Differential shrinkage of groups of weights may in such cases allow for better exploitation of potentially informative HF features.

Fig. 9.

Results from the EEG decoding analysis. Left: boxplot of prediction accuracies obtained with each model. Each point reflects data from a given subject. Prediction accuracy was defined as the average correlation coefficient between the target envelope and decoded envelope in held-out trials. Topographies depict weights of each model at a single time lag for illustrative purposes. The weights are here first normalized by their standard deviation and averaged across participants. Topographies are shown both for weights associated with low-frequency (LF) features and high-frequency (HF) features. Right: Examples of regression weights for one participant.

Fig. 9.

Results from the EEG decoding analysis. Left: boxplot of prediction accuracies obtained with each model. Each point reflects data from a given subject. Prediction accuracy was defined as the average correlation coefficient between the target envelope and decoded envelope in held-out trials. Topographies depict weights of each model at a single time lag for illustrative purposes. The weights are here first normalized by their standard deviation and averaged across participants. Topographies are shown both for weights associated with low-frequency (LF) features and high-frequency (HF) features. Right: Examples of regression weights for one participant.

Close modal

We explored an empirical Bayes framework for estimating “banded”-type encoding and decoding models. The framework can allow differential shrinkage of groups of weights associated with different groups of predictors (e.g., feature “bands”). The model can further be used to encourage smoothness on such groups of weights. In the following, we discuss modeling considerations and model limitations.

4.1 Encoding and decoding models with correlated predictors

Regression models with highly correlated predictors frequently occur in encoding and decoding analyses. Features of continuous naturalistic stimuli (e.g., video or speech) used as predictors in encoding models are often highly correlated. Continuous scalp EEG channels used as predictors in decoding models are typically also highly correlated, and pairs of channels can commonly exceed ρ=0.8 with high-density EEG recordings.

The described estimator may encourage solutions where subsets of correlated predictors are excessively shrunken. However, the true posterior of weights in analyses with many correlated predictors may show complex multimodal distributions, and point estimates of model parameters (or, approximate conditional distributions) should be interpreted with caution (see, e.g., Simulation 5). Spurious excessive shrinkage of a given feature set in an encoding analysis with correlated feature sets could, for example, be misinterpreted as indicating that the feature is not encoded in a response where it is contributing weakly. Such issues can be particularly difficult to identify when there are many outcome variables (e.g., different voxels with different noise profiles), and little training data to support the identification of model parameters.

Shrinkage procedures can complicate interpretation of weights in encoding analyses with correlated predictors, as illustrated in our simulations. In decoding models, model prediction accuracy on held-out data is often a primary measure. If the main goal is to optimize model performance for some application (e.g., a decoding model optimized for BCI performance), then the choice of shrinkage procedure may be less critical for interpretation as long as it does not compromise performance or lead to exclusion of important features. However, another goal may be to correlate prediction accuracies with behavior or clinical outcome measures. Here, the choice of shrinkage procedure can affect group-level inferences. For instance, multi-channel EEG data may show distinct noise characteristics in different individuals (e.g., noisy electrodes or attenuation of high-frequency activity due to head anatomy). A given shrinkage procedure might differentially impact decoding accuracies across individuals, potentially complicating the interpretation of group-level correlations with behavior.

With highly correlated predictors, it can sometimes be relevant to transform (groups of) the predictors using data dimensionality reduction techniques (de Cheveigné, 2021; de Cheveigné & Parra, 2014; de Cheveigné & Simon, 2008; de Cheveigné et al., 2019; Hyvärinen & Oja, 2000; Zou et al., 2006) and then fit models using the transformed data. This can potentially reduce computational burden without compromising predictive accuracy. Dimensionality reduction can be combined with banded regularization procedures to allow differential shrinkage of regression weights associated with each component. The described framework can also be used in the context of feature selection (Bair et al., 2006; Fan & Lv, 2008; Guyon & Elisseeff, 2006; Neal & Zhang, 2006), for instance, to inform about which stimulus-response lags to consider. Whether the assumptions entailed by such procedures are reasonable again depends on the specific research goals.

Several of our simulations assumed that there is a set of “true” regression weights as defined by data-generating equations. For example, in Simulation 4 we assumed that only one of two correlated feature groups drives a response. Such data-generating processes are deliberately oversimplistic in order to illustrate properties of the model, but are unlikely to reflect realistic scenarios. In naturalistic scenarios, correlated data are likely to be generated by a chain of latent causes. For example, in encoding analyses with different sets of highly correlated features (e.g., different high- and low-level features derived from natural speech), it is conceivable that the different stimulus features as well as the target regression variables each is generated by one of more underlying variables unknown to the modeler. Therefore, regression weights do not allow straightforward causal interpretation in such scenarios, and banded regression procedures do not solve such issues.

4.2 Computational burden

Encoding and decoding analyses often involve high-dimensional regression problems. In encoding models, the stimulus or task feature selection is often motivated by some pre-defined hypotheses. Yet, even for well-defined hypotheses, feature sets often translate to a high number of predictors. This, in turn, implies a high computational burden. The proposed EM-banded model involves matrix inversions (often with complexity O(D3) unless specific structure of the matrix can be exploited) which may limit the relevance if the number of predictors is very high. This problem is further amplified in situations with many response variables. Algorithms described in la Tour et al. (2022); Nunez-Elizalde et al. (2019) are well optimized for fMRI encoding analyses with many predictors and a large number of response variables (voxels). For this situation, Section 3 in the Supplementary Material presents an alternative formulation of the EM-banded model where it is assumed that the λj and ν parameters are shared across multiple outcome variables. This formulation allows for a more efficient implementation in scenarios with many outcome variables. Runtime test examples can be found at https://github.com/safugl/embanded.

4.3 Risks of under-shrinkage and over-shrinkage

The described model applies independent Inverse-Gamma priors on λj terms associated with different groups of predictors. The hyperpriors could also be omitted and replaced by a constant which will lead to typical type-II maximum likelihood estimates. Several efficient algorithms have specifically addressed such scenarios (Cai et al., 2021; Tipping & Faul, 2003; Wipf & Nagarajan, 2007). The choice of hyperpriors, or lack hereof, can have a large impact on the results and misspecified hyperpriors can be associated with over-shrinkage or under-shrinkage, as exemplified in our simulations. In both cases, this may compromise both predictive accuracy and interpretability of model weights.

In the encoding and decoding analyses examples, we focused on Inverse-Gamma hyperpriors with hyperparameters fixed to τ=η=κ=ϕ=104, which corresponds to broad hyperpriors. Choosing broad hyperpriors is convenient in several practical analysis situations, allowing for excessive shrinkage of (groups of) predictors while also putting prior mass on larger values for λj and ν. Moreover, fixing hyperparameters allows for determination of λj and ν terms from training data alone without having to tune hyperparameters using cross-validation. This can be particularly convenient in banded regression problems with many feature groups. Section 7 in the Supplementary Material shows out-of-sample predictive accuracies for EM-banded models with various different hyperpriors fit to data from three of our simulations. Here, the predictive accuracies tend to plateau as γ approaches zero, suggesting that broad hyperpriors are reasonable choices in these simulations.

In many practical applications, a broad hyperprior can be chosen to provide an initial “default” reference model that can be used as a starting point for subsequent model comparisons (Gelman, 2006). In the context of encoding and decoding analyses (where focus mostly is on out-of-sample predictive accuracy), such reference models can be highly useful for identifying cases where certain regularization properties are inappropriate. This procedure could be taken, for example, when exploring whether grouping structures or smoothness constraints are relevant. The EM-banded model offers one estimator among several others that can be considered in such a process.

4.4 Empirical Bayes and related work

We explored a parametric empirical Bayes framework (Efron & Morris, 1973; 1975; Robbins, 1964; MacKay, 1992, 1996; Morris, 1983) in the context of encoding and decoding analyses. We used this framework to tune regularization hyperparameters from the data and subsequently focus on moments in a conditional distribution, an idea that is widely adopted in hierarchical analyses of neuroimaging data as outlined by Friston, Penny, et al. (2002). Parametric empirical Bayes has been considered in the context of MRI decoding analyses (Sabuncu & Van Leemput, 2011; Wen et al., 2019), in M/EEG source localization problems (Cai et al., 2018, 2021; Friston et al., 2008; Owen et al., 2008; Owen et al., 2012; Wipf et al., 2010), in multi-subject hierarchical fMRI analyses (Friston, et al., 2002), in MRI segmentation (Iglesias et al., 2015; Puonti et al., 2016, 2020), and in receptive field models of cell recordings (Park & Pillow, 2011). Likewise, dividing variables into groups and encouraging group-wise sparsity via the group-lasso (Yuan & Lin, 2006) have also been used in M/EEG source localization problems (Haufe et al., 2008; Lim et al., 2017; Ou et al., 2009) and in causal discovery (Bolstad et al., 2011; Haufe et al., 2010; Tank et al., 2017). The group-lasso is convenient in the latter situations because it promotes truly (group-wise) sparse solutions unlike the EM-banded model and banded Ridge regression (la Tour et al., 2022; Nunez-Elizalde et al., 2019), where weight pruning (such as setting excessively shrunken regression weights to zero) would otherwise be necessary.

It is important to once again stress that the considered modeling framework may provide poor approximations to the true posterior distribution over the model weights. The hope is instead that the model yields reasonable out-of-sample predictions (Tipping, 2001), which we indeed found to be the case in several analyses. In encoding or decoding analyses, the emphasis is typically on out-of-sample predictive accuracy and point estimates of model parameters. Our motivation behind highlighting model weights in several of our simulations was to better illustrate model properties and highlight regression problems where group shrinkage can be relevant.

4.5 Over-fitting and cross-validation

Tuning hyperparameters using cross-validated estimates of expected prediction accuracy may reduce the risk of over-fitting. One benefit of tuning hyperparameters using cross-validation compared to empirical Bayes (without cross-validation) is that it optimizes an estimate of expected prediction accuracy which can lead to good generalization, especially in cases with many observations, high SNR, and independent training and validation splits. Moreover, such procedures can be highly useful for identifying model misspecifications. However, with limited and noisy data, the variance of the expected prediction accuracy estimate can be high. Tuning many free hyperparameters to optimize such an estimate can be associated with a considerable risk of over-fitting in these circumstances.

Cross-validation can similarly be used to tune hyperparameters of the EM-banded model. This may prove beneficial in some cases (compared to fixing model parameters a priori) and could potentially reduce the risk of over-fitting, but the same concerns apply here. In practice, rather than tuning all free hyperparameters of the EM-banded jointly, one may choose to introduce the constraint that only γ is tuned, with γ=τ=η=κ=ϕ. This will practically make it more straightforward to fit EM-banded models across a range of γ parameters and tune γ to minimize expected prediction error on held-out data. Section 7 in the Supplementary Material shows results from such procedures.

4.6 Other perspectives

We used the described modeling framework to fit encoding and decoding models trained on data from single subjects (i.e., single-subject analyses). The described framework could also be incorporated in “searchlight” decoding analyses (Kriegeskorte et al., 2006) where different subsets of features from multi-dimensional neural data recordings are used as predictors in single-subject or group-level decoding models. The framework could for example be relevant in studies that attempt to use one or more features extracted from multiple imaging modalities to predict clinical scores (Woo et al., 2017).

One appealing aspect of group shrinkage procedures is that they can allow for ranking importance of feature groups. This may aid interpretation of encoding analyses (la Tour et al., 2022; Nunez-Elizalde et al., 2019). In recent years, encoding models have been used increasingly to model neural responses with features extracted from task-optimized neural networks (Kell et al., 2018; Tuckute et al., 2023; Yamins et al., 2014; Yamins & DiCarlo, 2016). This typically involves very high-dimensional feature sets and hence encoding models where the number of predictors is much larger than the number of observations, that is, DM. Surprisingly, over-parameterized models may show good generalization performance even with vanishing explicit regularization in minimum-norm least squares estimators (Belkin et al., 2019; Hastie et al., 2022; Kobak et al., 2020). This behavior is clearly different from cases where explicit regularization is critical for performance. Yet, the ability to establish importance of sets of features can similarly be useful in these settings, and group shrinkage procedures may be one way of achieving this.

In this paper, we explored a framework for estimating banded-type regression models in encoding and decoding analyses. It offers a tool that can have relevance for specific modeling problems in computational neuroscience, complementing other related estimators. We used simulations and data examples to illustrate properties and limitations of the described model in comparison with Ridge estimators.

Code implementations in both Matlab (The MathWorks, Inc., Natick, MA, United States) and Python are available on https://github.com/safugl/embanded. The Python code utilizes libraries such as Numpy (Harris et al., 2020), Scipy (Virtanen et al., 2020), and Pytorch (Paszke et al., 2019). The public repository contains examples and runtime tests. EEG data and BOLD fMRI data used in this study have been presented previously (Broderick et al., 2018; Di Liberto et al., 2015; Nakai et al., 2021, 2022) and are available on https://doi.org/10.5061/dryad.070jc and https://openneuro.org/datasets/ds003720, respectively.

S.A.F.: Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing, Interpretation of results, Visualization. K.H.M.: Software, Methodology, Writing – review, Interpretation of results. O.P.: Software, Methodology, Writing – review, Interpretation of results. H.R.S.: Writing – review & editing, Project administration, Funding acquisition, Interpretation of results. J.H.: Software, Methodology, Supervision, Writing – original draft, Writing – review & editing, Project administration, Funding acquisition, Interpretation of results.

This work was supported by the Center for Auditory Neuroscience grant from the William Demant foundation and by a synergy grant from Novo Nordisk foundation to the UHEAL project Uncovering Hidden Hearing Loss (grant number NNF17OC0027872). H.R.S. was supported by a grand solutions grant “Precision Brain-Circuit Therapy - Precision-BCT)” from Innovation Fund Denmark (grant number 9068-00025B) and a collaborative project grant “ADAptive and Precise Targeting of cortex-basal ganglia circuits in Parkinson’s Disease - ADAPT-PD” from Lundbeckfonden (grant number R336-2020-1035). O.P. was supported by a grant from Lundbeckfonden (grant number R360–2021–395). K.H.M. was supported by the Pioneer Centre for AI, DNRF grant number P1.

H.R.S. has received honoraria as speaker and consultant from Lundbeck AS, Denmark, and as editor (Neuroimage Clinical) from Elsevier Publishers, Amsterdam, The Netherlands. He has received royalties as book editor from Springer Publishers, Stuttgart, Germany, Oxford University Press, Oxford, UK, and from Gyldendal Publishers, Copenhagen, Denmark.

The authors would like to thank Edmund C. Lalor (University of Rochester) for helpful discussions.

Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00155.

Bair
,
E.
,
Hastie
,
T.
,
Paul
,
D.
, &
Tibshirani
,
R.
(
2006
).
Prediction by supervised principal components
.
Journal of the American Statistical Association
,
101
(
473
),
119
137
. https://doi.org/10.1198/016214505000000628
Belkin
,
M.
,
Hsu
,
D.
,
Ma
,
S.
, &
Mandal
,
S.
(
2019
).
Reconciling modern machine-learning practice and the classical bias–variance trade-off
.
Proceedings of the National Academy of Sciences of the United States of America
,
116
(
32
),
15849
15854
. https://doi.org/10.1073/pnas.1903070116
Bishop
,
C. M.
(
2006
).
Pattern recognition and machine learning. Information science and statistics
.
Springer
. https://link.springer.com/book/10.1007/978-0-387-45528-0
Bolstad
,
A.
,
Van Veen
,
B. D.
, &
Nowak
,
R.
(
2011
).
Causal network inference via group sparse regularization
.
IEEE Transactions on Signal Processing
,
59
(
6
),
2628
2641
. https://doi.org/10.1109/tsp.2011.2129515
Boss
,
J.
,
Datta
,
J.
,
Wang
,
X.
,
Park
,
S. K.
,
Kang
,
J.
, &
Mukherjee
,
B.
(
2023
).
Group inverse-gamma gamma shrinkage for sparse linear models with block-correlated regressors
.
Bayesian Analysis
,
1
(
1
),
1
30
. https://doi.org/10.1214/23-ba1371
Broderick
,
M. P.
,
Anderson
,
A. J.
,
Di Liberto
,
G. M.
,
Crosse
,
M. J.
, &
Lalor
,
E. C.
(
2018
).
Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech
.
Current Biology
,
28
(
5
),
803.e3
809.e3
. https://doi.org/10.1016/j.cub.2018.01.080
Broderick
,
M. P.
,
Anderson
,
A. J.
, &
Lalor
,
E. C.
(
2019
).
Semantic context enhances the early auditory encoding of natural speech
.
Journal of Neuroscience
,
39
(
38
),
7564
7575
. https://doi.org/10.1523/jneurosci.0584-19.2019
Cai
,
C.
,
Hashemi
,
A.
,
Diwakar
,
M.
,
Haufe
,
S.
,
Sekihara
,
K.
, &
Nagarajan
,
S. S.
(
2021
).
Robust estimation of noise for electromagnetic brain imaging with the champagne algorithm
.
NeuroImage
,
225
,
117411
. https://doi.org/10.1016/j.neuroimage.2020.117411
Cai
,
C.
,
Sekihara
,
K.
, &
Nagarajan
,
S. S.
(
2018
).
Hierarchical multiscale Bayesian algorithm for robust MEG/EEG source reconstruction
.
NeuroImage
,
183
,
698
715
. https://doi.org/10.1016/j.neuroimage.2018.07.056
Crone
,
N. E.
,
Boatman
,
D.
,
Gordon
,
B.
, &
Hao
,
L.
(
2001
).
Induced electrocorticographic gamma activity during auditory perception
.
Clinical Neurophysiology
,
112
(
4
),
565
582
. https://doi.org/10.1016/s1388-2457(00)00545-9
Crosse
,
M. J.
,
Di Liberto
,
G. M.
, &
Lalor
,
E. C.
(
2016
).
Eye can hear clearly now: Inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration
.
Journal of Neuroscience
,
36
(
38
),
9888
9895
. https://doi.org/10.1523/jneurosci.1396-16.2016
David
,
S. V.
,
Mesgarani
,
N.
, &
Shamma
,
S. A.
(
2007
).
Estimating sparse spectro-temporal receptive fields with natural stimuli
.
Network: Computation in Neural Systems
,
18
(
3
),
191
212
. https://doi.org/10.1080/09548980701609235
de Cheveigné
,
A.
(
2021
).
Shared component analysis
.
NeuroImage
,
226
,
117614
. https://doi.org/10.1016/j.neuroimage.2020.117614
de Cheveigné
,
A.
,
Di Liberto
,
G. M.
,
Arzounian
,
D.
,
Wong
,
D. D.
,
Hjortkjær
,
J.
,
Fuglsang
,
S.
, &
Parra
,
L. C.
(
2019
).
Multiway canonical correlation analysis of brain data
.
NeuroImage
,
186
,
728
740
. https://doi.org/10.1016/j.neuroimage.2018.11.026
de Cheveigné
,
A.
, &
Parra
,
L. C.
(
2014
).
Joint decorrelation, a versatile tool for multichannel data analysis
.
NeuroImage
,
98
,
487
505
. https://doi.org/10.1016/j.neuroimage.2014.05.068
de Cheveigné
,
A.
, &
Simon
,
J. Z.
(
2008
).
Denoising based on spatial filtering
.
Journal of Neuroscience Methods
,
171
(
2
),
331
339
. https://doi.org/10.1016/j.jneumeth.2008.03.015
de Cheveigné
,
A.
,
Wong
,
D. D.
,
Di Liberto
,
G. M.
,
Hjortkjær
,
J.
,
Slaney
,
M.
, &
Lalor
,
E.
(
2018
).
Decoding the auditory brain with canonical component analysis
.
NeuroImage
,
172
,
206
216
. https://doi.org/10.1016/j.neuroimage.2018.01.033
de Heer
,
W. A.
,
Huth
,
A. G.
,
Griffiths
,
T. L.
,
Gallant
,
J. L.
, &
Theunissen
,
F. E.
(
2017
).
The hierarchical cortical organization of human speech processing
.
Journal of Neuroscience
,
37
(
27
),
6539
6557
. https://doi.org/10.1523/jneurosci.3267-16.2017
Dempster
,
A. P.
,
Laird
,
N. M.
, &
Rubin
,
D. B.
(
1977
).
Maximum likelihood from incomplete data via the EM algorithm
.
Journal of the Royal Statistical Society: Series B (Methodological)
,
39
(
1
),
1
22
. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
Di Liberto
,
G. M.
,
Marion
,
G.
, &
Shamma
,
S. A.
(
2021
).
Accurate decoding of imagined and heard melodies
.
Frontiers in Neuroscience
,
15
,
673401
. https://doi.org/10.3389/fnins.2021.673401
Di Liberto
,
G. M.
,
O’Sullivan
,
J. A.
, &
Lalor
,
E. C.
(
2015
).
Low-frequency cortical entrainment to speech reflects phoneme-level processing
.
Current Biology
,
25
(
19
),
2457
2465
. https://doi.org/10.1016/j.cub.2015.08.030
Ding
,
N.
, &
Simon
,
J. Z.
(
2012
).
Neural coding of continuous speech in auditory cortex during monaural and dichotic listening
.
Journal of Neurophysiology
,
107
(
1
),
78
89
. https://doi.org/10.1152/jn.00297.2011
Dumoulin
,
S. O.
, &
Wandell
,
B. A.
(
2008
).
Population receptive field estimates in human visual cortex
.
NeuroImage
,
39
(
2
),
647
660
. https://doi.org/10.1016/j.neuroimage.2007.09.034
Efron
,
B.
, &
Morris
,
C.
(
1973
).
Stein’s estimation rule and its competitors—An empirical Bayes approach
.
Journal of the American Statistical Association
,
68
(
341
),
117
130
. https://doi.org/10.1080/01621459.1973.10481350
Efron
,
B.
, &
Morris
,
C.
(
1975
).
Data analysis using Stein’s estimator and its generalizations
.
Journal of the American Statistical Association
,
70
(
350
),
311
319
. https://doi.org/10.1080/01621459.1975.10479864
Fan
,
J.
, &
Lv
,
J.
(
2008
).
Sure independence screening for ultrahigh dimensional feature space
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
70
(
5
),
849
911
. https://doi.org/10.1111/j.1467-9868.2008.00674.x
Friedman
,
J.
,
Hastie
,
T.
, &
Tibshirani
,
R.
(
2010
).
A note on the group lasso and a sparse group lasso
.
arXiv
, arXiv:1001.0736. https://doi.org/10.48550/arXiv.1001.0736
Friston
,
K.
,
Harrison
,
L.
,
Daunizeau
,
J.
,
Kiebel
,
S.
,
Phillips
,
C.
,
Trujillo-Barreto
,
N.
,
Henson
,
R.
,
Flandin
,
G.
, &
Mattout
,
J.
(
2008
).
Multiple sparse priors for the M/EEG inverse problem
.
NeuroImage
,
39
(
3
),
1104
1120
. https://doi.org/10.1016/j.neuroimage.2007.09.048
Friston
,
K. J.
,
Glaser
,
D. E.
,
Henson
,
R. N.
,
Kiebel
,
S.
,
Phillips
,
C.
, &
Ashburner
,
J.
(
2002
).
Classical and Bayesian inference in neuroimaging: Applications
.
NeuroImage
,
16
(
2
),
484
512
. https://doi.org/10.1006/nimg.2002.1091
Friston
,
K. J.
,
Holmes
,
A. P.
,
Worsley
,
K. J.
,
Poline
,
J.-P.
,
Frith
,
C. D.
, &
Frackowiak
,
R. S.
(
1994
).
Statistical parametric maps in functional imaging: A general linear approach
.
Human Brain Mapping
,
2
(
4
),
189
210
. https://doi.org/10.1002/hbm.460020402
Friston
,
K. J.
,
Penny
,
W.
,
Phillips
,
C.
,
Kiebel
,
S.
,
Hinton
,
G.
, &
Ashburner
,
J.
(
2002
).
Classical and Bayesian inference in neuroimaging: Theory
.
NeuroImage
,
16
(
2
),
465
483
. https://doi.org/10.1006/nimg.2002.1090
Garofolo
,
J. S.
,
Lamel
,
L. F.
,
Fisher
,
W. M.
,
Fiscus
,
J. G.
, &
Pallett
,
D. S.
(
1993
).
DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1
.
NASA STI/Recon Technical Report n
,
93
,
27403
. https://doi.org/10.6028/nist.ir.4930
Gelman
,
A.
(
2006
).
Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)
.
Bayesian Analysis
,
1
(
3
),
515
534
. https://doi.org/10.1214/06-ba117a
Giraud
,
A.-L.
, &
Poeppel
,
D.
(
2012
).
Cortical oscillations and speech processing: Emerging computational principles and operations
.
Nature Neuroscience
,
15
(
4
),
511
517
. https://doi.org/10.1038/nn.3063
Golub
,
G. H.
,
Hansen
,
P. C.
, &
O’Leary
,
D. P.
(
1999
).
Tikhonov regularization and total least squares
.
SIAM Journal on Matrix Analysis and Applications
,
21
(
1
),
185
194
. https://doi.org/10.1137/s0895479897326432
Goutte
,
C.
,
Nielsen
,
F. A.
, &
Hansen
,
K.
(
2000
).
Modeling the hemodynamic response in fMRI using smooth fir filters
.
IEEE Transactions on Medical Imaging
,
19
(
12
),
1188
1201
. https://doi.org/10.1109/42.897811
Guyon
,
I.
, &
Elisseeff
,
A.
(
2006
).
An introduction to feature extraction
. In
I.
Guyon
,
M.
Nikravesh
,
S.
Gunn
, &
L. A.
Zadeh
(Eds.)
Feature extraction: Foundations and applications
(pp.
1
25
).
Springer
. https://doi.org/10.1007/978-3-540-35488-8_1
Hamilton
,
L. S.
, &
Huth
,
A. G.
(
2020
).
The revolution will not be controlled: Natural stimuli in speech neuroscience
.
Language, Cognition and Neuroscience
,
35
(
5
),
573
582
. https://doi.org/10.1080/23273798.2018.1499946
Harris
,
C. R.
,
Millman
,
K. J.
,
van der Walt
,
S. J.
,
Gommers
,
R.
,
Virtanen
,
P.
,
Cournapeau
,
D.
,
Wieser
,
E.
,
Taylor
,
J.
,
Berg
,
S.
,
Smith
,
N. J.
,
Kern
,
R.
,
Picus
,
M.
,
Hoyer
,
S.
,
van Kerkwijk
,
M. H.
,
Brett
,
M.
,
Haldane
,
A.
,
del Río
,
J. F.
,
Wiebe
,
M.
,
Peterson
,
P.
,…
Oliphant
,
T. E.
(
2020
).
Array programming with NumPy
.
Nature
,
585
(
7825
),
357
362
. https://doi.org/10.1038/s41586-020-2649-2
Hastie
,
T.
,
Montanari
,
A.
,
Rosset
,
S.
, &
Tibshirani
,
R. J.
(
2022
).
Surprises in high-dimensional ridgeless least squares interpolation
.
Annals of Statistics
,
50
(
2
),
949
. https://doi.org/10.1214/21-aos2133
Haufe
,
S.
,
Meinecke
,
F.
,
Görgen
,
K.
,
Dähne
,
S.
,
Haynes
,
J.-D.
,
Blankertz
,
B.
, &
Bießmann
,
F.
(
2014
).
On the interpretation of weight vectors of linear models in multivariate neuroimaging
.
NeuroImage
,
87
,
96
110
. https://doi.org/10.1016/j.neuroimage.2013.10.067
Haufe
,
S.
,
Müller
,
K.-R.
,
Nolte
,
G.
, &
Krämer
,
N.
(
2010
).
Sparse causal discovery in multivariate time series
. In
Causality: Objectives and assessment
(pp.
97
106
).
PMLR
. https://proceedings.mlr.press/v6/haufe10a.html
Haufe
,
S.
,
Nikulin
,
V. V.
,
Ziehe
,
A.
,
Müller
,
K.-R.
, &
Nolte
,
G.
(
2008
).
Combining sparsity and rotational invariance in EEG/MEG source reconstruction
.
NeuroImage
,
42
(
2
),
726
738
. https://doi.org/10.1016/j.neuroimage.2008.04.246
Hoerl
,
A. E.
, &
Kennard
,
R. W.
(
1970
).
Ridge regression: Biased estimation for nonorthogonal problems
.
Technometrics
,
12
(
1
),
55
67
. https://doi.org/10.1080/00401706.1970.10488634
Huth
,
A. G.
,
De Heer
,
W. A.
,
Griffiths
,
T. L.
,
Theunissen
,
F. E.
, &
Gallant
,
J. L.
(
2016
).
Natural speech reveals the semantic maps that tile human cerebral cortex
.
Nature
,
532
(
7600
),
453
458
. https://doi.org/10.1038/nature17637
Hyvärinen
,
A.
, &
Oja
,
E.
(
2000
).
Independent component analysis: Algorithms and applications
.
Neural Networks
,
13
(
4–5
),
411
430
. https://doi.org/10.1016/s0893-6080(00)00026-5
Iglesias
,
J. E.
,
Van Leemput
,
K.
,
Bhatt
,
P.
,
Casillas
,
C.
,
Dutt
,
S.
,
Schuff
,
N.
,
Truran-Sacrey
,
D.
,
Boxer
,
A.
,
Fischl
,
B.
, & for the Alzheimer's Disease Neuroimaging Initiative
. (
2015
).
Bayesian segmentation of brainstem structures in MRI
.
NeuroImage
,
113
,
184
195
. https://doi.org/10.1016/j.neuroimage.2015.02.065
Jain
,
S.
, &
Huth
,
A.
(
2018
).
Incorporating context into language encoding models for fMRI
.
Advances in Neural Information Processing Systems
,
31
,
327601
. https://doi.org/10.1101/327601
Kay
,
K. N.
,
Winawer
,
J.
,
Rokem
,
A.
,
Mezer
,
A.
, &
Wandell
,
B. A.
(
2013
).
A two-stage cascade model of BOLD responses in human visual cortex
.
PLoS Computational Biology
,
9
(
5
),
e1003079
. https://doi.org/10.1371/journal.pcbi.1003079
Kell
,
A. J.
,
Yamins
,
D. L.
,
Shook
,
E. N.
,
Norman-Haignere
,
S. V.
, &
McDermott
,
J. H.
(
2018
).
A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy
.
Neuron
,
98
(
3
),
630
644
. https://doi.org/10.1016/j.neuron.2018.03.044
Kobak
,
D.
,
Lomond
,
J.
, &
Sanchez
,
B.
(
2020
).
The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization
.
The Journal of Machine Learning Research
,
21
(
1
),
6863
6878
. https://jmlr.org/papers/v21/19-844.html
Kriegeskorte
,
N.
,
Goebel
,
R.
, &
Bandettini
,
P.
(
2006
).
Information-based functional brain mapping
.
Proceedings of the National Academy of Sciences of the United States of America
,
103
(
10
),
3863
3868
. https://doi.org/10.1073/pnas.0600244103
la Tour
,
T. D.
,
Eickenberg
,
M.
,
Nunez-Elizalde
,
A. O.
, &
Gallant
,
J. L.
(
2022
).
Feature-space selection with banded ridge regression
.
NeuroImage
,
264
,
119728
. https://doi.org/10.1016/j.neuroimage.2022.119728
Lalor
,
E. C.
,
Power
,
A. J.
,
Reilly
,
R. B.
, &
Foxe
,
J. J.
(
2009
).
Resolving precise temporal processing properties of the auditory system using continuous stimuli
.
Journal of Neurophysiology
,
102
(
1
),
349
359
. https://doi.org/10.1152/jn.90896.2008
Lim
,
M.
,
Ales
,
J. M.
,
Cottereau
,
B. R.
,
Hastie
,
T.
, &
Norcia
,
A. M.
(
2017
).
Sparse EEG/MEG source estimation via a group lasso
.
PLoS One
,
12
(
6
),
e0176835
. https://doi.org/10.1371/journal.pone.0176835
MacKay
,
D. J.
(
1992
).
The evidence framework applied to classification networks
.
Neural Computation
,
4
(
5
),
720
736
. https://doi.org/10.1162/neco.1992.4.5.720
MacKay
,
D. J.
(
1994
).
Bayesian nonlinear modeling for the prediction competition
.
ASHRAE Transactions
,
100
(
2
),
1053
1062
.
MacKay
,
D. J.
(
1996
).
Hyperparameters: Optimize, or integrate out?
In
Maximum entropy and Bayesian methods: Santa Barbara, California, USA, 1993
(pp.
43
59
).
Springer
. https://doi.org/10.1007/978-94-015-8729-7_2
Massy
,
W. F.
(
1965
).
Principal components regression in exploratory statistical research
.
Journal of the American Statistical Association
,
60
(
309
),
234
256
. https://doi.org/10.1080/01621459.1965.10480787
Matérn
,
B.
(
1960
).
Spatial variation: Stochastic models and their application to some problems in forst survey and other sampling investigations
.
Esselte
.
Mesgarani
,
N.
, &
Chang
,
E. F.
(
2012
).
Selective cortical representation of attended speaker in multi-talker speech perception
.
Nature
,
485
(
7397
),
233
236
. https://doi.org/10.1038/nature11020
Minka
,
T.
(
1998
).
Expectation-maximization as lower bound maximization
. https://tminka.github.io/papers/minka-em-tut.pdf
Mirkovic
,
B.
,
Debener
,
S.
,
Jaeger
,
M.
, &
De Vos
,
M.
(
2015
).
Decoding the attended speech stream with multi-channel EEG: Implications for online, daily-life applications
.
Journal of Neural Engineering
,
12
(
4
),
46007
. https://doi.org/10.1088/1741-2560/12/4/046007
Morris
,
C. N.
(
1983
).
Parametric empirical Bayes inference: Theory and applications
.
Journal of the American Statistical Association
,
78
(
381
),
47
55
. https://doi.org/10.1080/01621459.1983.10477920
Nakai
,
T.
,
Koide-Majima
,
N.
, &
Nishimoto
,
S.
(
2021
).
Correspondence of categorical and feature-based representations of music in the human brain
.
Brain and Behavior
,
11
(
1
),
e01936
. https://doi.org/10.1002/brb3.1936
Nakai
,
T.
,
Koide-Majima
,
N.
, &
Nishimoto
,
S.
(
2022
).
Music genre neuroimaging dataset
.
Data in Brief
,
40
,
107675
. https://doi.org/10.1016/j.dib.2021.107675
Naselaris
,
T.
,
Kay
,
K. N.
,
Nishimoto
,
S.
, &
Gallant
,
J. L.
(
2011
).
Encoding and decoding in fMRI
.
NeuroImage
,
56
(
2
),
400
410
. https://doi.org/10.1016/j.neuroimage.2010.07.073
Neal
,
R. M.
, &
Hinton
,
G. E.
(
1998
).
A view of the EM algorithm that justifies incremental, sparse, and other variants
. In
M. I.
Jordan
(Ed.),
Learning in graphical models
(pp.
355
368
).
Springer
. https://doi.org/10.1007/978-94-011-5014-9_12
Neal
,
R. M.
, &
Zhang
,
J.
(
2006
).
High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees
. In
Guyon
I.
,
Nikravesh
M.
,
Gunn
S.
, &
Zadeh
L. A.
(Eds.),
Feature extraction: Foundations and applications
(pp.
265
296
).
Springer
. https://doi.org/10.1007/978-3-540-35488-8_11
Nunez-Elizalde
,
A. O.
,
Huth
,
A. G.
, &
Gallant
,
J. L.
(
2019
).
Voxelwise encoding models with non-spherical multivariate normal priors
.
NeuroImage
,
197
,
482
492
. https://doi.org/10.1016/j.neuroimage.2019.04.012
O’Sullivan
,
J. A.
,
Power
,
A. J.
,
Mesgarani
,
N.
,
Rajaram
,
S.
,
Foxe
,
J. J.
,
Shinn-Cunningham
,
B. G.
,
Slaney
,
M.
,
Shamma
,
S. A.
, &
Lalor
,
E. C.
(
2014
).
Attentional selection in a cocktail party environment can be decoded from single-trial EEG
.
Cerebral Cortex
,
25
(
7
),
1697
1706
. https://doi.org/10.1093/cercor/bht355
Ou
,
W.
,
Hämäläinen
,
M. S.
, &
Golland
,
P.
(
2009
).
A distributed spatio-temporal EEG/MEG inverse solver
.
NeuroImage
,
44
(
3
),
932
946
. https://doi.org/10.1016/j.neuroimage.2008.05.063
Owen
,
J.
,
Attias
,
H.
,
Sekihara
,
K.
,
Nagarajan
,
S.
, &
Wipf
,
D.
(
2008
).
Estimating the location and orientation of complex, correlated neural activity using MEG
.
Advances in Neural Information Processing Systems
,
21
.
Owen
,
J. P.
,
Wipf
,
D. P.
,
Attias
,
H. T.
,
Sekihara
,
K.
, &
Nagarajan
,
S. S.
(
2012
).
Performance evaluation of the champagne source reconstruction algorithm on simulated and real M/EEG data
.
NeuroImage
,
60
(
1
),
305
323
. https://doi.org/10.1016/j.neuroimage.2011.12.027
Park
,
M.
, &
Pillow
,
J. W.
(
2011
).
Receptive field inference with localized priors
.
PLoS Computational Biology
,
7
(
10
),
e1002219
. https://doi.org/10.1371/journal.pcbi.1002219
Pasley
,
B. N.
,
David
,
S. V.
,
Mesgarani
,
N.
,
Flinker
,
A.
,
Shamma
,
S. A.
,
Crone
,
N. E.
,
Knight
,
R. T.
, &
Chang
,
E. F.
(
2012
).
Reconstructing speech from human auditory cortex
.
PLoS Biology
,
10
(
1
),
e1001251
. https://doi.org/10.1371/journal.pbio.1001251
Paszke
,
A.
,
Gross
,
S.
,
Massa
,
F.
,
Lerer
,
A.
,
Bradbury
,
J.
,
Chanan
,
G.
,
Killeen
,
T.
,
Lin
,
Z.
,
Gimelshein
,
N.
,
Antiga
,
L.
,
Desmaison
,
A.
,
Köpf
,
A.
,
Yang
,
E.
,
DeVito
,
Z.
,
Raison
,
M.
,
Tejani
,
A.
,
Chilamkurthy
,
S.
,
Steiner
,
B.
,
Fang
,
L.
,…
Chintala
,
S.
(
2019
).
Pytorch: An imperative style, high-performance deep learning library
.
Advances in Neural Information Processing Systems
,
32
.
Puonti
,
O.
,
Iglesias
,
J. E.
, &
Van Leemput
,
K.
(
2016
).
Fast and sequence-adaptive whole-brain segmentation using parametric Bayesian modeling
.
NeuroImage
,
143
,
235
249
. https://doi.org/10.1016/j.neuroimage.2016.09.011
Puonti
,
O.
,
Van Leemput
,
K.
,
Saturnino
,
G. B.
,
Siebner
,
H. R.
,
Madsen
,
K. H.
, &
Thielscher
,
A.
(
2020
).
Accurate and robust whole-head segmentation from magnetic resonance images for individualized head modeling
.
NeuroImage
,
219
,
117044
. https://doi.org/10.1016/j.neuroimage.2020.117044
Rasmussen
,
C. E.
, &
Williams
,
C. K. I.
(
2005
).
Gaussian processes for machine learning. Adaptive computation and machine learning series
.
MIT Press
. https://doi.org/10.7551/mitpress/3206.001.0001
Robbins
,
H.
(
1964
).
The empirical bayes approach to statistical decision problems
.
The Annals of Mathematical Statistics
,
35
(
1
),
1
20
. https://doi.org/10.1214/aoms/1177703729
Sabuncu
,
M. R.
, &
Van Leemput
,
K.
(
2011
).
The relevance voxel machine (RVoxM): A Bayesian method for image-based prediction
. In
G.
Fichtinger
,
A.
Martel
, &
T.
Peters
(Eds.),
Medical Image Computing and Computer-Assisted Intervention—MICCAI 2011: 14th International Conference, Toronto, Canada, September 18–22, 2011, Proceedings, Part III 14
(pp.
99
106
).
Springer
. https://doi.org/10.1007/978-3-642-23626-6_13
Sahani
,
M.
, &
Linden
,
J.
(
2002
).
Evidence optimization techniques for estimating stimulus-response functions
.
Advances in Neural Information Processing Systems
,
15
.
Simon
,
N.
,
Friedman
,
J.
,
Hastie
,
T.
, &
Tibshirani
,
R.
(
2013
).
A sparse-group lasso
.
Journal of Computational and Graphical Statistics
,
22
(
2
),
231
245
. https://doi.org/10.1080/10618600.2012.681250
Synigal
,
S. R.
,
Teoh
,
E. S.
, &
Lalor
,
E. C.
(
2020
).
Including measures of high gamma power can improve the decoding of natural speech from EEG
.
Frontiers in Human Neuroscience
,
14
,
130
. https://doi.org/10.3389/fnhum.2020.00130
Tank
,
A.
,
Cover
,
I.
,
Foti
,
N. J.
,
Shojaie
,
A.
, &
Fox
,
E. B.
(
2017
).
An interpretable and sparse neural network model for nonlinear granger causality discovery
.
arXiv
, arXiv:1711.08160. https://doi.org/10.1109/tpami.2021.3065601
Tibshirani
,
R.
(
1996
).
Regression shrinkage and selection via the lasso
.
Journal of the Royal Statistical Society: Series B (Methodological)
,
58
(
1
),
267
288
. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Tibshirani
,
R.
,
Saunders
,
M.
,
Rosset
,
S.
,
Zhu
,
J.
, &
Knight
,
K.
(
2005
).
Sparsity and smoothness via the fused lasso
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
67
(
1
),
91
108
. https://doi.org/10.1111/j.1467-9868.2005.00490.x
Tikhonov
,
A. N.
, &
Arsenin
,
V. Y.
(
1977
).
Solutions of ill-posed problems
. Translation Editor Fritz John.
Wiley
. https://doi.org/10.2307/2006360
Tipping
,
M. E.
(
2001
).
Sparse Bayesian learning and the relevance vector machine
.
Journal of Machine Learning Research
,
1
,
211
244
. https://doi.org/10.1109/icdm.2012.58
Tipping
,
M. E.
, &
Faul
,
A. C.
(
2003
).
Fast marginal likelihood maximisation for sparse Bayesian models
. In
International workshop on artificial intelligence and statistics
(pp.
276
283
).
PMLR
.
Tuckute
,
G.
,
Feather
,
J.
,
Boebinger
,
D.
, &
McDermott
,
J. H.
(
2023
).
Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions
.
PLoS Biology
,
21
(
12
),
1
70
. https://doi.org/10.1371/journal.pbio.3002366
Tzanetakis
,
G.
, &
Cook
,
P.
(
2002
).
Musical genre classification of audio signals
.
IEEE Transactions on Speech and Audio Processing
,
10
(
5
),
293
302
. https://doi.org/10.1109/tsa.2002.800560
van de Wiel
,
M. A.
,
Lien
,
T. G.
,
Verlaat
,
W.
,
van Wieringen
,
W. N.
, &
Wilting
,
S. M.
(
2016
).
Better prediction by use of co-data: Adaptive group-regularized ridge regression
.
Statistics in Medicine
,
35
(
3
),
368
381
. https://doi.org/10.1002/sim.6732
van de Wiel
,
M. A.
,
van Nee
,
M. M.
, &
Rauschenberger
,
A.
(
2021
).
Fast cross-validation for multi-penalty high-dimensional ridge regression
.
Journal of Computational and Graphical Statistics
,
30
(
4
),
835
847
. https://doi.org/10.1080/10618600.2021.1904962
Virtanen
,
P.
,
Gommers
,
R.
,
Oliphant
,
T. E.
,
Haberland
,
M.
,
Reddy
,
T.
,
Cournapeau
,
D.
,
Burovski
,
E.
,
Peterson
,
P.
,
Weckesser
,
W.
,
Bright
,
J.
,
van der Walt
,
S. J.
,
Brett
,
M.
,
Wilson
,
J.
,
Millman
,
K. J.
,
Mayorov
,
N.
,
Nelson
,
A. R. J.
,
Jones
,
E.
,
Kern
,
R.
,
Larson
,
E.
,… SciPy 1.0 Contributors
. (
2020
).
SciPy 1.0: Fundamental algorithms for scientific computing in python
.
Nature Methods
,
17
,
261
272
. https://doi.org/10.1038/s41592-020-0772-5
Wen
,
Z.
,
Yu
,
T.
,
Yu
,
Z.
, &
Li
,
Y.
(
2019
).
Grouped sparse Bayesian learning for voxel selection in multivoxel pattern analysis of fMRI data
.
NeuroImage
,
184
,
417
430
. https://doi.org/10.1016/j.neuroimage.2018.09.031
Wipf
,
D.
, &
Nagarajan
,
S.
(
2007
).
A new view of automatic relevance determination
.
Advances in Neural Information Processing Systems
,
20
.
Wipf
,
D. P.
,
Owen
,
J. P.
,
Attias
,
H. T.
,
Sekihara
,
K.
, &
Nagarajan
,
S. S.
(
2010
).
Robust bayesian estimation of the location, orientation, and time course of multiple correlated neural sources using MEG
.
NeuroImage
,
49
(
1
),
641
655
. https://doi.org/10.1016/j.neuroimage.2009.06.083
Wolpert
,
D. H.
, &
Strauss
,
C. E.
(
1996
).
What bayes has to say about the evidence procedure
. In
Maximum entropy and Bayesian methods: Santa Barbara, California, USA, 1993
(pp.
61
78
).
Springer
. https://doi.org/10.1007/978-94-015-8729-7_3
Wong
,
D. D. E.
,
Fuglsang
,
S. A.
,
Hjortkjær
,
J.
,
Ceolini
,
E.
,
Slaney
,
M.
, &
de Cheveigné
,
A.
(
2018
).
A comparison of regularization methods in forward and backward models for auditory attention decoding
.
Frontiers in Neuroscience
,
12
,
531
. https://doi.org/10.3389/fnins.2018.00531
Woo
,
C.-W.
,
Chang
,
L. J.
,
Lindquist
,
M. A.
, &
Wager
,
T. D.
(
2017
).
Building better biomarkers: Brain models in translational neuroimaging
.
Nature Neuroscience
,
20
(
3
),
365
377
. https://doi.org/10.1038/nn.4478
Wu
,
M. C.
,
David
,
S. V.
, &
Gallant
,
J. L.
(
2006
).
Complete functional characterization of sensory neurons by system identification
.
Annual Review of Neuroscience
,
29
,
477
505
. https://doi.org/10.1146/annurev.neuro.29.051605.113024
Xu
,
X.
, &
Ghosh
,
M.
(
2015
).
Bayesian variable selection and estimation for group lasso
.
Bayesian Analysis
,
10
(
4
),
909
936
. https://doi.org/10.1214/14-ba929
Yamins
,
D. L.
,
Hong
,
H.
,
Cadieu
,
C. F.
,
Solomon
,
E. A.
,
Seibert
,
D.
, &
DiCarlo
,
J. J.
(
2014
).
Performance-optimized hierarchical models predict neural responses in higher visual cortex
.
Proceedings of the National Academy of Sciences of the United States of America
,
111
(
23
),
8619
8624
. https://doi.org/10.1073/pnas.1403112111
Yamins
,
D. L. K.
, &
DiCarlo
,
J. J.
(
2016
).
Using goal-driven deep learning models to understand sensory cortex
.
Nature Neuroscience
,
19
(
3
),
356
. https://doi.org/10.1038/nn.4244
Yuan
,
M.
, &
Lin
,
Y.
(
2006
).
Model selection and estimation in regression with grouped variables
.
Journal of the Royal Statistical Society Series B: Statistical Methodology
,
68
(
1
),
49
67
. https://doi.org/10.1111/j.1467-9868.2005.00532.x
Zhao
,
P.
,
Rocha
,
G.
, &
Yu
,
B.
(
2009
).
The composite absolute penalties family for grouped and hierarchical variable selection
.
The Annals of Statistics
,
37
(
6A
),
3468
3497
. https://doi.org/10.1214/07-aos584
Zou
,
H.
, &
Hastie
,
T.
(
2005
).
Regularization and variable selection via the elastic net
.
Journal of the Royal Statistical Society: Series B (Statistical Methodology)
,
67
(
2
),
301
320
. https://doi.org/10.1111/j.1467-9868.2005.00503.x
Zou
,
H.
,
Hastie
,
T.
, &
Tibshirani
,
R.
(
2006
).
Sparse principal component analysis
.
Journal of Computational and Graphical Statistics
,
15
(
2
),
265
286
. https://doi.org/10.1198/106186006x113430
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.

Supplementary data