## Abstract

Regression is a principal tool for relating brain responses to stimuli or tasks in computational neuroscience. This often involves fitting linear models with predictors that can be divided into groups, such as distinct stimulus feature subsets in encoding models or features of different neural response channels in decoding models. When fitting such models, it can be relevant to allow differential shrinkage of the different groups of regression weights. Here, we explore a framework that allows for straightforward definition and estimation of such models. We present an expectation-maximization algorithm for tuning hyperparameters that control shrinkage of groups of weights. We highlight properties, limitations, and potential use-cases of the model using simulated data. Next, we explore the model in the context of a BOLD fMRI encoding analysis and an EEG decoding analysis. Finally, we discuss cases where the model can be useful and scenarios where regularization procedures complicate model interpretation.

## 1 Introduction

Understanding how the brain responds to different stimuli or tasks is a key objective in systems neuroscience. In recent years, significant progress has been made in developing models to analyze neural responses to complex, naturalistic stimuli (Hamilton & Huth, 2020). One potent approach is encoding models that combine stimulus features to predict channel-wise neural responses from fMRI or electrophysiology through linear regression (Broderick et al., 2018; de Heer et al., 2017; Di Liberto et al., 2015; Dumoulin & Wandell, 2008; Kay et al., 2013; Kell et al., 2018). Regularization is often needed in such models to reduce the risk of over-fitting and improve model prediction on held-out data. However, multiple - potentially collinear - stimulus feature sets can complicate the assessment of the relative importance of the individual features. In this case, it can be relevant to allow for differential shrinkage of different groups of regression weights. For instance, neural responses to natural speech depend on both low-level acoustic features and high-level phonetic or semantic features (Broderick et al., 2018; Di Liberto et al., 2015; Huth et al., 2016; Jain & Huth, 2018). Such feature groups vary on different time scales, have different dimensionality, and some features may not even be reflected in the neural response.

Several regularization strategies could have relevance for this situation (David et al., 2007; Golub et al., 1999; Hoerl & Kennard, 1970; Massy, 1965; Park & Pillow, 2011; Sahani & Linden, 2002; Tibshirani, 1996; Tibshirani et al., 2005; Tikhonov & Arsenin, 1977; Zou & Hastie, 2005). Recently, Nunez-Elizalde et al. (2019) and la Tour et al. (2022) proposed to divide predictors in encoding analyses into separate feature “bands” (groups of predictors) and allow differential shrinkage of weights associated with these different bands. The authors proposed banded Ridge regression - also referred to as grouped Ridge regression (van de Wiel et al., 2021) - and showed that banded Ridge regression can yield higher out-of-sample predictive accuracy compared to standard Ridge regression in fMRI encoding analyses (la Tour et al., 2022; Nunez-Elizalde et al., 2019). Other methods that can incorporate grouping structure constraints and other priors (Boss et al., 2023; Friedman et al., 2010; Simon et al., 2013; van de Wiel et al., 2016; Xu & Ghosh, 2015; Yuan & Lin, 2006; Zhao et al., 2009) could also be relevant in encoding analyses with grouped features.

Similar regression problems arise in decoding models, that is, models that combine neural channels (voxels, electrodes) to predict a continuous stimulus feature or task feature (Naselaris et al., 2011; Wu et al., 2006). One pertinent example is decoding models for EEG or MEG responses to continuous stimuli, like speech or music (Di Liberto et al., 2021; Ding & Simon, 2012; O’Sullivan et al., 2014). Here, regularized regression can be used to identify multidimensional finite impulse response (FIR) filters that predict stimulus features by combining different electrode signals at different time lags and/or different frequency bands (de Cheveigné et al., 2018; Mesgarani & Chang, 2012; Mirkovic et al., 2015). As with encoding models, it can be useful to allow for differential shrinkage of groups of weights corresponding to groups of predictors.

Here, we explore an empirical Bayes framework (Efron & Morris, 1973, 1975; Morris, 1983; Robbins, 1964) for estimating “banded”-type regression models. The framework is closely related to automatic relevance determination (Bishop, 2006; MacKay, 1994; Park & Pillow, 2011; Sahani & Linden, 2002; Tipping, 2001; Wipf & Nagarajan, 2007) and can be seen as an extension of work by Tipping (2001). The framework can be used to formulate regularized regression estimators with different regularization properties. We use an expectation-maximization (EM) algorithm to tune hyperparameters that control variances of prior distributions. The model can be used to formulate priors that allow for differential shrinkage of groups of regression weights. The model can additionally be used to promote smoothness on groups of regression weights by encouraging correlation between related regression weights.

In this paper, we describe the modeling framework and provide simulations to illustrate the properties, limitations, and potential use-cases of the described algorithm. Additionally, we illustrate properties of the model by using it to fit stimulus-response models on two publicly available datasets (Broderick et al., 2018; Nakai et al., 2022) and compare these model fits with the widely used Ridge estimator (Hoerl & Kennard, 1970). In the first dataset, we consider encoding models that use a set of audio features to predict BOLD fMRI responses to music (Nakai et al., 2022). In the second dataset, we fit decoding models that use features of multichannel time-lagged EEG data to predict an audio stimulus feature during a speech listening task (Broderick et al., 2018). Finally, we describe how simplifying model assumptions can allow for an efficient estimation in regression problems with multiple target variables. Software implementations of the model in both MATLAB and Python are available at https://github.com/safugl/embanded.

## 2 Methods

### 2.1 Regression model

Encoding and decoding analyses often involve linear regression models in the following form:

$$y = Xw + \epsilon$$

where $\epsilon$ is a zero-mean Gaussian variable.

Here, encoding models would treat stimulus features or task features as predictors, $X$, and a (preprocessed) neural response as target variable, $y$. Decoding models treat features extracted from neural activity measurements as predictors, $X$, and a stimulus feature or task feature as target variable, $y$. The differences in direction of inference have large implications for model interpretation and model properties (Haufe et al., 2014; Naselaris et al., 2011) as well as how potential stimulus confounders or task confounders can be dealt with.

Despite their differences, such encoding and decoding models lead to regression problems that can be encompassed in a general formulation. Let $y \in \mathbb{R}^{M \times 1}$ denote a target variable, where $M$ indicates the number of observations. Moreover, let $X$ denote a matrix with $D$ predictors such that $X \in \mathbb{R}^{M \times D}$. Suppose that $X$ can be divided into a set of meaningful groups such that $X = [F_1, F_2, \ldots, F_J]$, where $F_j$, $j = 1, 2, \ldots, J$, denotes a group of $D_j$ predictors, that is, $F_j \in \mathbb{R}^{M \times D_j}$. This group of predictors could, for example, be a stimulus feature subset in an encoding model, or response features of different voxels or electrodes in a decoding model. We thus consider a regression model in the following form (la Tour et al., 2022; Nunez-Elizalde et al., 2019):

$$y = F_1 \beta_1 + F_2 \beta_2 + \cdots + F_J \beta_J + \epsilon$$

where $\beta_j$ denotes regression weights for a given group of predictors, $F_j$. Further, let $w$ denote a column vector collecting all the corresponding weights, $w = [\beta_1^\top, \beta_2^\top, \ldots, \beta_J^\top]^\top$, with $w \in \mathbb{R}^{D \times 1}$. We will assume that input variables and target variables have been centered and that no intercept term is needed in the regression model (since the intercept typically carries no meaningful information and shrinkage of it is typically not desired). The conditional distribution of the target variable $y$ given a set of observed predictors $X$ and given the model with parameters $w$ and $\nu$ will be assumed to follow a Gaussian distribution in the following form:

$$p(y \,|\, X, w, \nu) = \mathcal{N}(y \,|\, Xw, \nu I_M)$$

where $\mathcal{N}(y \,|\, Xw, \nu I_M)$ is used to indicate a multivariate Gaussian with mean $Xw$ and a diagonal covariance matrix $\nu I_M$ of size $M \times M$. A zero-mean Gaussian prior distribution will be placed over the weights:

$$p(w \,|\, \lambda) = \mathcal{N}(w \,|\, 0, \Lambda)$$

where $\Lambda$ is a $D \times D$ block-diagonal covariance matrix. Such a prior distribution over the weights can be used to “shrink” the estimated regression weights. Here, we specifically define that $\Lambda$ has a block structure defined as follows:

$$\Lambda = \begin{bmatrix} \lambda_1 \Omega_1 & & \\ & \ddots & \\ & & \lambda_J \Omega_J \end{bmatrix}$$

By introducing such a prior distribution over the weights, it is now possible to allow for differential shrinkage of different groups of weights. We define that the $j$-th block in $\Lambda$ relates to a single group of predictors, $F_j$, and that $\Omega_j$ has size $D_j \times D_j$. The column indices in $X$ unique to predictor group $F_j$ will thus index rows and columns in $\Lambda$ for that block. The terms $\Omega_j$, $j = 1, \ldots, J$, denote matrices that are assumed to be fixed *a priori*. For now, we will assume that they are identity matrices. The following parametrization (Matérn, 1960; Rasmussen & Williams, 2005) of these terms will later be used for encouraging local smoothness via a Matérn covariance function with order $3/2$ and length scale $h_j$:

$$[\Omega_j]_{k,l} = \left(1 + \frac{\sqrt{3}\,|k - l|}{h_j}\right) \exp\!\left(-\frac{\sqrt{3}\,|k - l|}{h_j}\right)$$
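For concreteness, such a Matérn-type $\Omega_j$ can be constructed numerically as in the sketch below (a minimal sketch; the function name and the convention of measuring distances between integer weight indices are our own):

```python
import numpy as np

def matern32_cov(d: int, length_scale: float) -> np.ndarray:
    """Matern covariance (order 3/2) over weight indices 0..d-1.

    Encourages correlation (i.e., smoothness) between nearby weights
    within a predictor group; length_scale plays the role of h_j.
    """
    idx = np.arange(d)
    r = np.abs(idx[:, None] - idx[None, :])  # pairwise index distances
    s = np.sqrt(3.0) * r / length_scale
    return (1.0 + s) * np.exp(-s)

# Example: a 5x5 Omega_j with length scale h_j = 2
omega = matern32_cov(5, 2.0)
```

Larger length scales produce slower decay of the off-diagonal correlations and thus smoother weight profiles.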

Note that rather than attempting to estimate $h_j$ and impose hyperpriors on the $h_j$ terms, we will fix these parameters *a priori*. We complete the model by placing Inverse-Gamma prior distributions over the $\lambda_j$ terms and over the $\nu$ term:

where $\text{Inv-Gamma}(x \,|\, a, b)$ denotes an Inverse-Gamma distribution with shape $a$, scale $b$, and mode $b/(a+1)$:

$$\text{Inv-Gamma}(x \,|\, a, b) = \frac{b^a}{\Gamma(a)}\, x^{-a-1} \exp\!\left(-\frac{b}{x}\right)$$

To simplify notation, we will henceforth let $\lambda$ denote the set of parameters $\{\lambda_j\}_{j=1}^{J}$ and not highlight dependence on terms that remain fixed. We now regard $\lambda$ and $\nu$ as random variables, and by doing so the posterior distribution over all variables takes the form:

$$p(w, \nu, \lambda \,|\, X, y) = \frac{p(y \,|\, X, w, \nu)\, p(w \,|\, \lambda)\, p(\lambda)\, p(\nu)}{p(y \,|\, X)}$$

This expression cannot be evaluated analytically. In what follows, we will resort to an empirical Bayesian estimation procedure and assume that $p(w \,|\, y, X) = \iint p(w, \nu, \lambda \,|\, X, y)\, d\nu\, d\lambda$ can be approximated by $p(w \,|\, y, X, \tilde{\nu}, \tilde{\lambda})$, where $\tilde{\nu}$ and $\tilde{\lambda}$ are fixed values obtained by maximizing the marginal posterior distribution:

We will return to the assumptions entailed by this procedure later. Our goal is to use an expectation-maximization (EM) framework (Dempster et al., 1977; Minka, 1998; Neal & Hinton, 1998) to maximize this marginal distribution. In brief, this involves lower bounding $\ln p(\lambda, \nu \,|\, y, X)$ with a functional $\mathcal{F}(\lambda, \nu, q)$ that depends on a valid probability distribution $q(w)$ over $w$ and on the parameters $\lambda$ and $\nu$ (Bishop, 2006):

The framework involves alternating between an “E-step” (finding a good lower bound) and an “M-step” (subsequently maximizing this bound) (Minka, 1998). Suppose we initialize the algorithm with $\lambda^{(k)}$ and $\nu^{(k)}$, where $k = 0$. In the expectation step (E-step), we fix $\lambda$ and $\nu$ to these values and maximize $\mathcal{F}(\lambda^{(k)}, \nu^{(k)}, q)$ with respect to $q(w)$. It can be shown that the lower bound $\mathcal{F}$ will touch the objective when $q(w) = p(w \,|\, y, X, \lambda^{(k)}, \nu^{(k)}) \equiv q^{(k)}$, where the posterior distribution of $w$ given a fixed set of parameters $\lambda$ and $\nu$ takes the form:

$$p(w \,|\, y, X, \lambda, \nu) = \mathcal{N}(w \,|\, \tilde{\mu}, \tilde{\Sigma}), \qquad \tilde{\Sigma} = \left(\frac{1}{\nu} X^\top X + \Lambda^{-1}\right)^{-1}, \qquad \tilde{\mu} = \frac{1}{\nu}\, \tilde{\Sigma} X^\top y$$

We will henceforth let $\tilde{\mu}$ and $\tilde{\Sigma}$ denote the mean and covariance of $p(w \,|\, y, X, \lambda^{(k)}, \nu^{(k)})$. Defining $q(w)$ given $\lambda^{(k)}$ and $\nu^{(k)}$ is the E-step. In the maximization step (M-step), we keep this distribution fixed and now maximize $\mathcal{F}(\lambda, \nu, q^{(k)})$ with respect to $\lambda$ and $\nu$. Ignoring irrelevant terms, one can show that we need to find a set of parameters $\lambda^{(k+1)}$ and $\nu^{(k+1)}$ that maximizes $\mathbb{E}_{q^{(k)}}\big[\ln\big(p(y \,|\, X, w, \nu)\, p(w \,|\, \lambda)\, p(\lambda)\, p(\nu)\big)\big]$, where $\mathbb{E}_{q^{(k)}}[\cdot]$ denotes expectation under $q^{(k)}$ and where

Recall that $w$ is a column vector that contains all $\beta_j$ weights and that $\Lambda$ is a block-diagonal matrix. We know that $p(w \,|\, y, X, \lambda^{(k)}, \nu^{(k)})$ is Gaussian with mean $\tilde{\mu}$ and covariance $\tilde{\Sigma}$. Let $\tilde{\mu}_j$ denote the block in $\tilde{\mu}$ associated with predictor group $j$. Further, let $\tilde{\Sigma}_j$ denote the block of size $D_j \times D_j$ along the diagonal of $\tilde{\Sigma}$ associated with predictor group $j$. We again discard irrelevant terms, and now see that the M-step amounts to maximizing the following expression with respect to $\nu$ and $\lambda$:

Finding a set of $\lambda$ and $\nu$ parameters that fulfill $\partial\mathcal{L}/\partial\lambda_j = 0$ and $\partial\mathcal{L}/\partial\nu = 0$ yields the following closed-form update rules:

The procedure for fitting the model thus involves (I) specifying hyperprior parameters $\varphi$, $\kappa$, $\eta$, and $\tau$; (II) starting with an initial guess for $\lambda_1, \lambda_2, \ldots, \lambda_J$ and $\nu$; (III) estimating $\tilde{\Sigma}$ and $\tilde{\mu}$; and (IV) updating $\lambda_j$ and $\nu$ according to Eqs. 18 and 19, subsequently iterating between steps (III) and (IV) until convergence.
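A minimal numpy sketch of this fitting loop is given below. It assumes identity $\Omega_j$ matrices and assumes that the closed-form updates take the standard MAP-EM form for Gaussian likelihoods with Inverse-Gamma hyperpriors, with $\eta, \tau$ entering the $\lambda_j$ update and $\kappa, \varphi$ entering the $\nu$ update; this pairing is our reading and should be checked against Eqs. 18 and 19. The reference implementations at https://github.com/safugl/embanded should be preferred in practice.

```python
import numpy as np

def fit_em_banded(X, y, groups, eta=1e-4, tau=1e-4, kappa=1e-4, phi=1e-4,
                  n_iter=200):
    """Sketch of EM-banded fitting with identity Omega_j matrices.

    groups: list of integer index arrays, one per predictor group F_j.
    Returns the posterior mean/covariance of w and the final
    hyperparameters (lambda_j per group, and nu).
    """
    M, D = X.shape
    lam = np.ones(len(groups))   # lambda_j, initialized to 1
    nu = 1.0                     # noise variance, initialized to 1
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # E-step: Gaussian posterior over w given current (lambda, nu)
        Lam_inv = np.zeros(D)
        for j, idx in enumerate(groups):
            Lam_inv[idx] = 1.0 / lam[j]
        Sigma = np.linalg.inv(XtX / nu + np.diag(Lam_inv))
        mu = Sigma @ Xty / nu
        # M-step: closed-form updates (assumed MAP-EM form)
        for j, idx in enumerate(groups):
            e2 = mu[idx] @ mu[idx] + np.trace(Sigma[np.ix_(idx, idx)])
            lam[j] = (e2 + 2.0 * tau) / (len(idx) + 2.0 * (eta + 1.0))
        resid = y - X @ mu
        e_res = resid @ resid + np.trace(XtX @ Sigma)
        nu = (e_res + 2.0 * phi) / (M + 2.0 * (kappa + 1.0))
    return mu, Sigma, lam, nu
```

On data where one predictor group does not contribute to the response, the corresponding $\lambda_j$ collapses toward its hyperprior mode and the associated weights are shrunk toward zero.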

#### 2.1.1 Predictions

The above procedure yields a set of hyperparameters $\tilde{\nu}$ and $\tilde{\lambda}$ from the training data that may be sensible in several circumstances, and we will consequently focus on $p(w \,|\, y, X, \tilde{\nu}, \tilde{\lambda})$. For predictive purposes, we assume that the following approximation holds:

$$p(\hat{y} \,|\, y, X, \hat{X}) \approx \int p(\hat{y} \,|\, \hat{X}, w, \tilde{\nu})\, p(w \,|\, y, X, \tilde{\nu}, \tilde{\lambda})\, dw$$

That is, to make predictions about a novel response variable, $\hat{y}$, given a new set of predictors $\hat{X}$ and inferred model parameters, we focus on $p(\hat{y} \,|\, y, X, \hat{X})$ and let $p(\lambda, \nu \,|\, X, y)$ be represented by a delta function $\delta(\nu - \tilde{\nu}, \lambda - \tilde{\lambda})$ (Tipping, 2001). The resulting distribution is Gaussian with mean $\hat{X}\tilde{\mu}$ and covariance $\tilde{\nu} I + \hat{X}\tilde{\Sigma}\hat{X}^\top$. The terms $\tilde{\mu}$ and $\tilde{\Sigma}$ here refer to the estimates in Eqs. 14 and 15, respectively. To make point predictions, we may now focus on the estimated model weights $w = \tilde{\mu}$ and simply compute $\hat{X}w$. Our focus will be on such point predictions.
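A short sketch of such point predictions and the associated per-observation predictive variances (conditional on the point estimates of $\lambda$ and $\nu$) might look as follows; the function name is our own:

```python
import numpy as np

def predict(X_new, mu, Sigma, nu):
    """Predictive mean and variances for new data.

    The predictive distribution is Gaussian with mean X_new @ mu and
    covariance nu * I + X_new @ Sigma @ X_new.T; we return its mean
    and the diagonal of that covariance (one variance per observation).
    """
    mean = X_new @ mu
    var = nu + np.einsum('ij,jk,ik->i', X_new, Sigma, X_new)
    return mean, var
```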

For the interested reader and for illustrative purposes, we visualize variability in the estimated weights in Figures 1-3 based on Eq. 13. It is important to stress that $p(w \,|\, y, X, \tilde{\lambda}, \tilde{\nu})$ is conditional on the point estimates $\tilde{\lambda}$ and $\tilde{\nu}$ and that it may fail to be even close to a reasonable approximation of $\iint p(w, \nu, \lambda \,|\, X, y)\, d\nu\, d\lambda$ (Wolpert & Strauss, 1996).

## 3 Results

### 3.1 Simulations

In addition to evaluating the predictive accuracy of encoding models or decoding models on held-out data, neuroimaging studies often inspect and interpret point estimates of model parameters. In linearized encoding models, model weights are often assumed to be directly interpretable in terms of the underlying brain activity generating a measured response (Haufe et al., 2014). Weights of general linear models (GLMs) in fMRI analysis (Friston et al., 1994) or temporal response functions in M/EEG analysis (Ding & Simon, 2012; Lalor et al., 2009) are prominent examples. However, the choice of regularization can significantly impact the relative magnitude of the estimated weights, in particular in problems with many, potentially correlated, sets of predictors. It can therefore be relevant to simulate data with known target model weights in order to better understand the properties of regularized regression estimators and identify situations where they can be useful or can complicate interpretation.

Below, we consider different simulations to illustrate the properties of the described model. We compare the proposed EM-based banded regression estimator (henceforth EM-banded) with a standard Ridge regression estimator, $w = (X^\top X + \alpha I)^{-1} X^\top y$ (Hoerl & Kennard, 1970). To simplify readability, we consider simulations where we fix the parameters related to the Inverse-Gamma priors to the same value, $\gamma$, such that $\tau = \eta = \kappa = \varphi = \gamma$. Notice that the hyperpriors become broad as $\gamma$ approaches zero.
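For reference, the Ridge estimator used in the comparisons below admits a one-line closed-form implementation (a sketch; the function name is our own):

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form Ridge estimator: w = (X^T X + alpha * I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(D), X.T @ y)
```

Larger values of `alpha` pull all weights uniformly toward zero, which is the key contrast with the group-wise shrinkage of the EM-banded estimator.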

#### 3.1.1 Simulation 1: Suppressing contributions from a group of predictors

In models with groups of predictors representing different stimulus feature spaces, it is sometimes necessary to identify entire feature spaces that can be excluded from an analysis without compromising predictive accuracy. To simulate such a situation, we here define three stimulus feature spaces, $F_1$, $F_2$, and $F_3$, and one response variable $y$ which contains mixed versions of $F_1$ and $F_3$, but not $F_2$ ($F_2$ does not contribute to the response). We thus simulate the response as $y = F_1 w_1 + F_3 w_3 + \epsilon$, where $\epsilon$ is a Gaussian noise term. In this situation, we thus need the estimator to identify (close to) zero weights for $F_2$. Each feature space has 64 predictors, and a total of $M = 1024$ observations are simulated. The weights, $w_1$ and $w_3$, as well as the feature spaces $F_1$ and $F_3$, are drawn from Gaussian distributions. The mixed target signal and the noise term have equal power. We compare weights estimated by the EM-banded model with weights estimated by Ridge regression models. To exemplify how the hyperparameters controlling the Inverse-Gamma hyperpriors can impact the estimated weights, we fit models with $\gamma$ fixed to $10^{-4}$, $10^{-3}$, $10^{-2}$, and $10^{-1}$. The Ridge hyperparameter $\alpha$ is set to $10^4$, $10^3$, $10^2$, and $10^1$. The terms $\Omega_1$, $\Omega_2$, and $\Omega_3$ are identity matrices.
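A data-generation sketch for this simulation could look as follows (dimensions taken from the text; the exact noise scaling used to equate signal and noise power is our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 1024, 64  # observations, and predictors per feature space

# Three feature spaces; F2 does not contribute to the response
F1, F2, F3 = (rng.standard_normal((M, D)) for _ in range(3))
w1, w3 = rng.standard_normal(D), rng.standard_normal(D)

signal = F1 @ w1 + F3 @ w3
noise = rng.standard_normal(M)
# Scale noise so that signal and noise have equal power
noise *= np.sqrt(signal.var() / noise.var())
y = signal + noise

X = np.hstack([F1, F2, F3])  # design matrix with grouped predictors
```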

Figure 1 shows results from Simulation 1. The Ridge regression model yields “dense” solutions for all considered Ridge hyperparameters, $\alpha$, with no dramatic overall scale differences between weights associated with $F_1$, $F_2$, or $F_3$. In contrast, the EM-banded model excessively shrinks weights associated with feature set $F_2$ for smaller values of $\gamma$. As expected, the EM-banded model tends to show signs of under-shrinkage when the Inverse-Gamma prior distributions have little mass near zero, and in these cases the overall scale differences between the different sets of weights associated with $F_1$, $F_2$, and $F_3$ are smaller.

#### 3.1.2 Simulation 2: Encouraging smoothness for a feature subset

It is sometimes relevant to encourage smoothness on sets of weights, for instance, in analyses that approximate smooth FIR filters. An example hereof is fMRI encoding analyses that model the relation between a task-regressor augmented with multiple time lags and the BOLD voxel response. In this case, it can *a priori* be expected that the approximated FIR filter should be smooth due to hemodynamic properties of the BOLD signal (Goutte et al., 2000). It can similarly be relevant to assume smoothness across spectral channels in analyses that model relations between average stimulus power spectra and BOLD fMRI responses.

Here, we explore one way of encouraging smoothness using the definition of $\Omega_j$ in Eq. 6. To illustrate the effect of $\Omega_j$, we consider a simulated regression problem with two feature spaces, $F_1$ and $F_2$. Each feature space has 64 predictors and is drawn from a Gaussian distribution. Data are simulated as $y = F_1 w_1 + F_2 w_2 + \epsilon$, where $\epsilon$ is drawn from a Gaussian. A total of $M = 1024$ observations are simulated. The weights associated with the first feature space, $w_1$, are simulated to be sinusoidal (i.e., smooth), while the $w_2$ weights are Gaussian. The signal-to-noise ratio (SNR) is -5 decibels (dB), and we fit the EM-banded models with $\gamma$ fixed to a low value of $10^{-4}$ (i.e., corresponding to a broad hyperprior). The $\Omega_1$ term associated with $F_1$ is parameterized according to Eq. 6 with $h_1$ set to $1$, $5$, and $10$ to illustrate its effect.

Figure 2 shows weights estimated with Ridge estimators for different $\alpha$ values (below) and weights estimated by the EM-banded estimators (above). At higher values of $h_1$, the EM-banded estimator approximates the weight distribution of the first feature space $F_1$ with high accuracy. The Ridge estimator, on the other hand, yields weights that tend to show more rapid (high-frequency) fluctuations across neighboring weights associated with $F_1$ compared to the target weights.

#### 3.1.3 Simulation 3: Encouraging differential shrinkage of individual weights

In Simulation 1, the EM-banded model was used to shrink weights of an entire feature space (i.e., a group of predictors). The model can similarly encourage differential shrinkage of individual weights. To illustrate this, we consider a simulation where we assume that the number of groups equals the number of predictors (i.e., each group contains a single predictor). We simulate 512 predictors, $F_1, F_2, \ldots, F_{512}$, and a response variable $y$ which contains mixed versions of some of these predictors. A total of $M = 1024$ observations are simulated. It is assumed that $y = \sum_j F_j w_j + \epsilon$, where $\epsilon$ is Gaussian noise. We assume that a high proportion of the target weights $w_j$ are equal to zero. Each row in $[F_1, F_2, \ldots, F_{512}]$ is drawn from a multivariate Gaussian with zero mean and a covariance defined as $C_{k,i} = \exp(-0.3\,|k - i|^2)$, $k = 1, \ldots, 512$, $i = 1, \ldots, 512$. The SNR is equal to 0 dB. We again compare Ridge regression model fits with EM-banded model fits using the same set of fixed hyperparameters as in Simulation 1.
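One way to draw predictors with this covariance structure is sketched below (the small diagonal jitter added for numerical stability of the Cholesky factorization is our own addition):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 512, 1024

# Covariance C[k, i] = exp(-0.3 * |k - i|^2): strong correlation
# only between immediately neighboring predictors
idx = np.arange(D)
C = np.exp(-0.3 * np.abs(idx[:, None] - idx[None, :]) ** 2)

# Draw each row of the design matrix from N(0, C)
L = np.linalg.cholesky(C + 1e-6 * np.eye(D))  # jitter for stability
X = rng.standard_normal((M, D)) @ L.T
```

Adjacent columns of `X` then have a population correlation of roughly $\exp(-0.3) \approx 0.74$, which is what allows the Ridge estimator to dilute effects across neighboring predictors.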

The estimated weights for the Ridge and EM-banded models are shown in Figure 3. It is evident that the EM-banded estimator with $\gamma = 10^{-4}$ excessively shrinks several of the individual weights. In effect, the EM-banded model recovers the target weights with relatively high accuracy. In comparison, the Ridge estimator tends to dilute effects across neighboring (correlated) predictors and to pull weights toward zero when $\alpha$ attains high values.

#### 3.1.4 Simulation 4: Interpretation of weights with correlated predictors

Until now, we have not explored how strong correlations among groups of predictors can impact results. Models involving correlated groups of predictors often occur in EEG or fMRI encoding models of responses to naturalistic stimuli (Hamilton & Huth, 2020). To exemplify this situation, this simulation considers the example of predicting responses to natural speech. Speech signals can be characterized by their lower-level acoustic features as well as higher-level phonetic or semantic features (Broderick et al., 2019; Huth et al., 2016). The different hierarchical levels can each be modeled by multidimensional features, but the statistics of natural speech imply a degree of correlation between such feature groups (see, e.g., panel B in Fig. 4). This can complicate inferences about neural responses unique to a given group.

To illustrate this, we simulate responses to acoustic and phonetic features extracted from natural speech audio (Garofolo et al., 1993) (see Section 4 in the Supplementary Material for details on feature extraction). A first feature set, $F_1$, contains time-lagged spectrogram features, and $F_2$ contains time-lagged regressors created from phonetic annotations. We simulate a response as $y = F_1 w_1 + \epsilon$; that is, only the feature set associated with the spectrogram features, $F_1$, affects the response variable, $y$. The SNR is equal to 0 dB, and the target filter weights, $w_1$, are shown in Figure 4A. We again compare Ridge and EM-banded regression model fits using a set of fixed hyperparameters.

Figure 4 shows the results of this simulation. The EM-banded estimator accurately recovers the weights associated with the speech spectrogram $F_1$ while excessively shrinking the weights associated with the phonetic features $F_2$, as desired. The Ridge estimator, on the other hand, pulls weights associated with correlated predictors in the two feature groups toward each other at higher levels of regularization. This has the undesired consequence that a temporally localized response function associated with the phonetic feature group $F_2$ emerges. This can complicate the interpretation of the weights associated with $F_2$, for example, if such point estimates of weights were interpreted or further analyzed in second-level inferences. Simulating the opposite case, that is, simulating only responses to phonetic features (and no contribution from spectrogram features), leads to an analogous result: weights associated with the spectrogram feature group show temporally localized response functions (see Section 4 in the Supplementary Material). The described effect depends on the degree of correlation among predictors and on the amount of data available for fitting the models. We note that high levels of regularization often occur in practice with Ridge estimators, for instance, when tuning the regularization term to maximize the Pearson correlation coefficient between model predictions and target variables (due to its insensitivity to scaling and offset mismatches). For a related, but more stylized, simulation, see Section 5 in the Supplementary Material.

This simulation highlights a potential pitfall for the interpretation of model weights in situations with co-varying features affecting the response. To illustrate this further, consider again the simulation where the response is driven by the first feature set $F_1$, but now only the correlated set $F_2$ is included in the model. Figure 5 depicts the weights estimated by Ridge models in this case. The model with only $F_2$ predictors now misleadingly indicates a temporally localized response for all of the considered $\alpha$ parameters, despite the fact that the target regression weights associated with $F_2$ are zero. Interpretation of the reduced model is clearly different from that of the model that includes both groups of predictors.

#### 3.1.5 Simulation 5: Visualizing behavior under unfavorable signal-to-noise ratios

As illustrated in Simulation 3, the EM-banded estimator can be used to encourage differential shrinkage of weights associated with individual predictors. The behavior of such an estimator will depend on several factors, including the amount of data available to support the identification of $\lambda$, correlations among predictors, and the SNR. Here, we illustrate the behavior of the EM-banded estimator in a simulation with poor SNR. We simulate two predictors, $F_1$ and $F_2$, and a response variable $y = F_1 w_1 + F_2 w_2 + \epsilon$. The predictors as well as the noise term $\epsilon$ are drawn from Gaussian distributions. The target weights are fixed to non-zero values. The SNR is approximately $-20$ dB. We consider simulations where the number of observations is $512$, $1024$, and $8192$, and we fit EM-banded models to the data from each simulation. We focus on estimators with $\gamma = 10^{-4}$ and $\gamma = 10^{-2}$, respectively, to illustrate how differences in these prior choices affect the estimated parameters. This procedure is repeated many times to elucidate the effect.

Figure 6 shows 2D histogram visualizations of the estimated weights $w_1$ and $w_2$ from these simulations. Notably, the estimator with the lower $\gamma = 10^{-4}$ tends to excessively shrink either of the two weights when the number of observations is low ($M = 512$). Such over-shrinkage is undesirable. The estimator even occasionally collapses to $w_1 \approx w_2 \approx 0$, which again is undesirable. Increasing $\gamma$ to $10^{-2}$ here yields dramatically different estimates and less differential shrinkage of the two weights (increasing $\gamma$ encourages solutions with less excessive shrinkage). These simulations underline subtleties that can have dramatic effects on the estimated weights, which is important to keep in mind when relying on “default” priors for $\lambda_j$.

### 3.2 fMRI encoding analysis with “stimulus-irrelevant” predictors

Building encoding models often involves deciding which of multiple potentially relevant feature sets to include as predictors. Choosing features entails both the risk of overlooking feature sets that are relevant and the risk of including feature sets that are irrelevant for predicting the target variables. In this section, we investigate how the inclusion of stimulus-irrelevant predictors - here defined as simulated noise predictors that are unrelated to the task or stimuli - can impact results in an fMRI encoding analysis, and how the EM-banded estimator may help suppress their contribution.

This example uses a publicly available BOLD fMRI dataset described in Nakai et al. (2021, 2022). The dataset contains BOLD fMRI data acquired from five participants listening to music excerpts in 12 training runs and 6 test runs. Informed consent had been obtained from all participants prior to their participation, and procedures were approved by local ethics and safety committees; see Nakai et al. (2021, 2022). Each functional run lasted 10 min and consisted of 40 musical clips, each 15 s long. We focused on encoding models with three feature sets, $F_1$, $F_2$, and $F_3$. The first two feature sets, $F_1$ and $F_2$, were related to the music stimuli: a 10-dimensional genre-label feature given by Tzanetakis and Cook (2002) indicating which of 10 musical genres a given stimulus is assigned to ($F_1$), and a time-averaged 32-dimensional spectrogram of the audio stimulus ($F_2$). As the third feature set ($F_3$), unrelated to the stimuli, we included 400-dimensional multivariate Gaussian noise. $F_1$, $F_2$, and $F_3$ were considered separate groups for the EM-banded model. For simplicity, we always initialize the EM algorithm with the $\lambda_j$ and $\nu$ parameters set to $1$ and allow no more than $200$ iterations. It may sometimes be advisable to use different initialization strategies. More details related to fMRI preprocessing, audio feature extraction, and model fitting are described in Section 1 in the Supplementary Material.

Figure 7 shows the predictive accuracies obtained with the Ridge estimator and the EM-banded estimator. Predictive accuracy is here evaluated both as the Pearson correlation coefficient between the predicted and measured BOLD fMRI responses and as the explained variance ratio. Compared to the Ridge estimator, the EM-banded estimator yielded higher predictive accuracy for all subjects in both metrics (all $p < 0.05$, Wilcoxon signed-rank tests across all voxels used for fitting models). The improvement in terms of the Pearson correlation coefficient among the top 10,000 voxels was (mean $\pm$ standard deviation): $0.0335 \pm 0.0515$ (sub-001), $0.0411 \pm 0.0629$ (sub-002), $0.0568 \pm 0.0699$ (sub-003), $0.0391 \pm 0.0637$ (sub-004), and $0.0298 \pm 0.0522$ (sub-005).
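The two accuracy metrics can be computed as sketched below (a generic sketch, not the authors' exact evaluation pipeline). Note that the explained variance ratio, unlike the Pearson correlation, penalizes scale and offset mismatches between predictions and targets:

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted responses."""
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return (yt @ yp) / np.sqrt((yt @ yt) * (yp @ yp))

def explained_variance_ratio(y_true, y_pred):
    """1 - var(residual) / var(target): sensitive to scale and offset
    mismatches, unlike the Pearson correlation coefficient."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)
```

For example, a prediction that is a perfectly correlated but doubled copy of the target has a Pearson correlation of 1 but an explained variance ratio of 0.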

The regularization term $\alpha$ of the Ridge estimator varies substantially across voxels. This can occur, for instance, if voxels with higher signal-to-noise ratio require less overall regularization. Changes in $\alpha$ impact the overall scale of all weights. This can lead to misleading spatial weight maps indicating effects of predictors that are stimulus irrelevant. This is illustrated in Figure 8, which shows the group-mean standard deviation of predictions obtained with stimulus-irrelevant predictors. The predictions were obtained using only the stimulus-irrelevant predictors to predict voxel responses (discarding the spectrogram and genre features) with the Ridge estimator. As can be seen, the standard deviation of these predictions is high specifically in auditory cortical regions. This may reflect that the Ridge estimator applies less overall shrinkage to voxels in regions with higher SNR. In effect, the map can misleadingly be interpreted to suggest regional auditory-evoked activation to a random feature *because of* the regularization. Encouraging differential shrinkage of each feature space with the EM-banded estimator avoids this problem, showing no clear spatial patterns in these maps.

### 3.3 EEG-based decoding analysis

In this example, we consider a decoding analysis of single-trial EEG responses to continuous natural speech. The example follows a popular approach of predicting the continuous speech amplitude envelope using linear combinations of time-lagged EEG scalp electrode responses at lower frequencies (Broderick et al., 2018; Crosse et al., 2016; Ding & Simon, 2012; O’Sullivan et al., 2014; Wong et al., 2018). In this approach, the goal is to find a multidimensional filter $w$ that linearly maps from (features in) the multichannel EEG response to the envelope of the speech stimulus $y$. The decoding analysis can be formulated as $y(t)=\sum_{c}\sum_{\tau} x(t-\tau,c)\,w(c,\tau)+\epsilon$, where $x(t-\tau,c)$ denotes the preprocessed electrode response in channel $c$ at time point $t-\tau$ and where $\tau$ indicates the time delay between stimulus and response. The model can be specified by concatenating multiple time-lagged versions of each EEG electrode response in the design matrix $X$ in Eq. 1. The regression problem then amounts to estimating the coefficients of a multi-dimensional FIR filter (de Cheveigné et al., 2018). In this context, it is natural to consider the set of time-lagged copies of each electrode response as a feature group $F_j$ in Eq. 2 and allow for differential shrinkage of the weights associated with each group. For instance, it can be relevant to allow for excessive shrinkage of all time-lagged copies of a noisy or seemingly irrelevant electrode.

While EEG speech decoding studies often focus on low-frequency activity ($<15$ Hz), it is known from intracranial recordings that the power of higher frequency neural activity in the gamma range also tracks features of speech stimuli (Crone et al., 2001; Mesgarani & Chang, 2012; Pasley et al., 2012). These frequencies are typically highly attenuated in scalp EEG. Nonetheless, including high gamma power features has been shown to improve predictive accuracy of EEG decoding models for some subjects (Synigal et al., 2020). However, including more EEG features, some of which may have very low SNR, is also associated with a risk of adding predictors to the decoding model that introduce noise to the predictions. It may thus be desirable to allow for differential shrinkage of different EEG features. In this example, we therefore explore predictive accuracy of the EM-banded model when both lower and higher frequency EEG features are included in such a decoding analysis.
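One common way to obtain a high-frequency power feature of the kind discussed above is to band-pass filter each channel and take the Hilbert envelope. The sketch below uses illustrative filter settings and a synthetic signal; it is not the paper's exact preprocessing pipeline.

```python
# Hedged sketch of a high-frequency (25-35 Hz) power feature for one EEG
# channel: band-pass filter, then take the Hilbert amplitude envelope.
# Filter order, cutoffs, and sampling rate are illustrative assumptions.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 128.0                                  # assumed sampling rate (Hz)
rng = np.random.default_rng(3)
eeg = rng.standard_normal(int(10 * fs))     # 10 s of synthetic EEG

b, a = butter(4, [25 / (fs / 2), 35 / (fs / 2)], btype="band")
hf = filtfilt(b, a, eeg)                    # 25-35 Hz band-limited signal
hf_power = np.abs(hilbert(hf))              # instantaneous amplitude envelope

assert hf_power.shape == eeg.shape
assert np.all(hf_power >= 0)
```

The resulting envelope can then be time-lagged and appended to the low-frequency features as an additional feature group.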

This example uses publicly available EEG data (https://doi.org/10.5061/dryad.070jc) described in Broderick et al. (2018) and Di Liberto et al. (2015). In brief, 19 subjects listened to excerpts from an audiobook while continuous EEG data from 128 scalp electrodes were recorded. All procedures were undertaken in accordance with the Declaration of Helsinki and were approved by local ethics committees, see Broderick et al. (2019) and Di Liberto et al. (2015). Data were acquired in 20 trials, each approximately 180 s in length. We again fit decoding models using both the EM-banded framework and standard Ridge regression. For both models, we used time-lagged low-frequency (LF) EEG features (1-8 Hz) as well as time-lagged high-frequency (HF) EEG power features (25-35 Hz) (Giraud & Poeppel, 2012) to predict a speech envelope feature. Further details related to EEG preprocessing, audio feature extraction, and model fitting are described in Section 2 in the Supplementary Material. Note that cross-validation was used for hyperparameter tuning only in the Ridge model, but both models were always tested on held-out data.

The results from the analysis are shown in Figure 9. The prediction accuracies obtained with the EM-banded estimator are modestly but consistently higher than the Ridge estimates ($p<0.01$, Wilcoxon signed-rank test). The average correlation between the predicted speech envelope and the target envelope is $0.1368 \pm 0.0383$ for the EM-banded model and $0.1267 \pm 0.0377$ for the Ridge model. The EM-banded approach excessively shrinks groups of weights, whereas the Ridge estimator yields a denser distribution of weights. Figure 9 (right) shows an example of weights estimated by the two models for one subject. This illustrates how the overall scale of weights associated with the HF and LF features is comparable for the Ridge estimator, whereas the EM-banded estimator differentially shrinks both weights associated with HF features and groups of LF weights associated with single electrodes. Given that high-frequency EEG features are weak and in some subjects not even measurable by scalp electrodes, including HF features can negatively impact model predictions, in particular with a Ridge estimator. Differential shrinkage of groups of weights may in such cases allow for better exploitation of potentially informative HF features.

## 4 Discussion

We explored an empirical Bayes framework for estimating “banded”-type encoding and decoding models. The framework can allow differential shrinkage of groups of weights associated with different groups of predictors (e.g., feature “bands”). The model can further be used to encourage smoothness on such groups of weights. In the following, we discuss modeling considerations and model limitations.

### 4.1 Encoding and decoding models with correlated predictors

Regression models with highly correlated predictors frequently occur in encoding and decoding analyses. Features of continuous naturalistic stimuli (e.g., video or speech) used as predictors in encoding models are often highly correlated. Continuous scalp EEG channels used as predictors in decoding models are typically also highly correlated; in high-density EEG recordings, correlations between pairs of channels can commonly exceed $\rho = 0.8$.

The described estimator may encourage solutions where subsets of correlated predictors are excessively shrunken. However, the true posterior of weights in analyses with many correlated predictors may show complex multimodal distributions, and point estimates of model parameters (or, approximate conditional distributions) should be interpreted with caution (see, e.g., Simulation 5). Spurious excessive shrinkage of a given feature set in an encoding analysis with correlated feature sets could, for example, be misinterpreted as indicating that the feature is not encoded in a response where it is contributing weakly. Such issues can be particularly difficult to identify when there are many outcome variables (e.g., different voxels with different noise profiles), and little training data to support the identification of model parameters.
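The caveat above can be made concrete with a two-predictor toy example: when two predictors are nearly collinear but only the first truly drives the response, a point estimate may spread weight across both, so the estimated weights must not be read as evidence about which predictor is encoded. The Ridge solution below is a minimal illustration with synthetic data.

```python
# Toy example: two nearly collinear predictors, of which only x1 drives y.
# A Ridge point estimate splits weight across both, illustrating why
# (excessive) shrinkage of one correlated group is not proof of absence.
import numpy as np

rng = np.random.default_rng(4)
n = 500
z = rng.standard_normal(n)
x1 = z + 0.05 * rng.standard_normal(n)       # highly correlated pair
x2 = z + 0.05 * rng.standard_normal(n)
X = np.c_[x1, x2]
y = 1.0 * x1 + 0.1 * rng.standard_normal(n)  # only x1 truly drives y

alpha = 10.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

# Weight leaks onto the irrelevant-but-correlated predictor x2, and the
# weight on x1 is far from its true value of 1.
assert w[1] > 0.1
assert abs(w[0] - 1.0) > 0.1
```

An estimator that instead excessively shrinks one of the two would be equally consistent with the data, which is exactly why such point estimates warrant caution.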

Shrinkage procedures can complicate interpretation of weights in encoding analyses with correlated predictors, as illustrated in our simulations. In decoding models, model prediction accuracy on held-out data is often a primary measure. If the main goal is to optimize model performance for some application (e.g., a decoding model optimized for BCI performance), then the choice of shrinkage procedure may be less critical for interpretation as long as it does not compromise performance or lead to exclusion of important features. However, another goal may be to correlate prediction accuracies with behavior or clinical outcome measures. Here, the choice of shrinkage procedure can affect group-level inferences. For instance, multi-channel EEG data may show distinct noise characteristics in different individuals (e.g., noisy electrodes or attenuation of high-frequency activity due to head anatomy). A given shrinkage procedure might differentially impact decoding accuracies across individuals, potentially complicating the interpretation of group-level correlations with behavior.

With highly correlated predictors, it can sometimes be relevant to transform (groups of) the predictors using data dimensionality reduction techniques (de Cheveigné, 2021; de Cheveigné & Parra, 2014; de Cheveigné & Simon, 2008; de Cheveigné et al., 2019; Hyvärinen & Oja, 2000; Zou et al., 2006) and then fit models using the transformed data. This can potentially reduce computational burden without compromising predictive accuracy. Dimensionality reduction can be combined with banded regularization procedures to allow differential shrinkage of regression weights associated with each component. The described framework can also be used in the context of feature selection (Bair et al., 2006; Fan & Lv, 2008; Guyon & Elisseeff, 2006; Neal & Zhang, 2006), for instance, to inform about which stimulus-response lags to consider. Whether the assumptions entailed by such procedures are reasonable again depends on the specific research goals.

Several of our simulations assumed that there is a set of “true” regression weights as defined by data-generating equations. For example, in Simulation 4 we assumed that only one of two correlated feature groups drives a response. Such data-generating processes are deliberately oversimplified in order to illustrate properties of the model and are unlikely to reflect realistic scenarios. In naturalistic settings, correlated data are likely to be generated by a chain of latent causes. For example, in encoding analyses with different sets of highly correlated features (e.g., different high- and low-level features derived from natural speech), it is conceivable that the stimulus features as well as the target regression variables are each generated by one or more underlying variables unknown to the modeler. Regression weights therefore do not allow straightforward causal interpretation in such scenarios, and banded regression procedures do not solve such issues.

### 4.2 Computational burden

Encoding and decoding analyses often involve high-dimensional regression problems. In encoding models, the stimulus or task feature selection is often motivated by some pre-defined hypotheses. Yet, even for well-defined hypotheses, feature sets often translate to a high number of predictors. This, in turn, implies a high computational burden. The proposed EM-banded model involves matrix inversions (with complexity $O(D^3)$ unless specific structure of the matrix can be exploited), which may limit its relevance when the number of predictors is very high. This problem is further amplified in situations with many response variables. Algorithms described in la Tour et al. (2022) and Nunez-Elizalde et al. (2019) are well optimized for fMRI encoding analyses with many predictors and a large number of response variables (voxels). For this situation, Section 3 in the Supplementary Material presents an alternative formulation of the EM-banded model in which the $\lambda_j$ and $\nu$ parameters are assumed to be shared across multiple outcome variables. This formulation allows for a more efficient implementation in scenarios with many outcome variables. Runtime test examples can be found at https://github.com/safugl/embanded.
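The cost point can be illustrated with a generic identity: when the number of samples $M$ is much smaller than the number of predictors $D$, an equivalent $M \times M$ ("dual" or kernel) formulation avoids the $O(D^3)$ inverse. The sketch below shows this for a fixed Ridge-type penalty; it is a standard linear-algebra identity, not the EM-banded implementation itself.

```python
# With D predictors, the posterior-mean-style update needs a D x D inverse
# (O(D^3)); when M << D, the Woodbury/dual form inverts only an M x M
# matrix. Shown for a fixed Ridge penalty on synthetic data.
import numpy as np

rng = np.random.default_rng(5)
M, D = 50, 2000
X = rng.standard_normal((M, D))
y = rng.standard_normal(M)
lam = 1.0

# Primal form: invert a D x D matrix
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual form: invert an M x M matrix instead, using
# (X^T X + lam I)^{-1} X^T = X^T (X X^T + lam I)^{-1}
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(M), y)

assert np.allclose(w_primal, w_dual)
```

Exploiting structure of this kind is one of the reasons the algorithms cited above scale well to fMRI analyses with many predictors.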

### 4.3 Risks of under-shrinkage and over-shrinkage

The described model applies independent Inverse-Gamma priors on the $\lambda_j$ terms associated with different groups of predictors. The hyperpriors could also be omitted and replaced by a constant, which leads to typical type-II maximum likelihood estimates. Several efficient algorithms have specifically addressed such scenarios (Cai et al., 2021; Tipping & Faul, 2003; Wipf & Nagarajan, 2007). The choice of hyperpriors, or lack thereof, can have a large impact on the results, and misspecified hyperpriors can be associated with over-shrinkage or under-shrinkage, as exemplified in our simulations. In both cases, this may compromise both predictive accuracy and interpretability of model weights.

In the encoding and decoding analysis examples, we focused on Inverse-Gamma hyperpriors with hyperparameters fixed to $\tau=\eta=\kappa=\varphi=10^{-4}$, which corresponds to broad hyperpriors. Choosing broad hyperpriors is convenient in several practical analysis situations, allowing for excessive shrinkage of (groups of) predictors while also putting prior mass on larger values of $\lambda_j$ and $\nu$. Moreover, fixing the hyperparameters allows the $\lambda_j$ and $\nu$ terms to be determined from training data alone, without having to tune hyperparameters using cross-validation. This can be particularly convenient in banded regression problems with many feature groups. Section 7 in the Supplementary Material shows out-of-sample predictive accuracies for EM-banded models with various hyperpriors fit to data from three of our simulations. Here, the predictive accuracies tend to plateau as $\gamma$ approaches zero, suggesting that broad hyperpriors are reasonable choices in these simulations.
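What "broad" means here can be checked directly: as the Inverse-Gamma shape and scale parameters shrink toward zero, the hyperprior log-density becomes flatter across many orders of magnitude of $\lambda_j$. The snippet below uses `scipy.stats.invgamma` with an illustrative grid of candidate values.

```python
# Sketch: a small shared hyperparameter gamma (shape = scale = gamma) makes
# the Inverse-Gamma hyperprior on lambda_j nearly flat over many orders of
# magnitude; a larger gamma concentrates prior mass much more.
import numpy as np
from scipy.stats import invgamma

grid = np.logspace(-3, 3, 7)                 # candidate lambda_j values

spreads = {}
for gamma in (1e-1, 1e-4):
    log_density = invgamma.logpdf(grid, a=gamma, scale=gamma)
    # range of the log-density over the grid: smaller = flatter prior
    spreads[gamma] = float(log_density.max() - log_density.min())

# The small-gamma hyperprior is markedly flatter across the grid.
assert spreads[1e-4] < spreads[1e-1]
```

This is one way to sanity-check a hyperprior choice before fitting, rather than relying on the hyperparameter values alone.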

In many practical applications, a broad hyperprior can be chosen to provide an initial “default” reference model that can be used as a starting point for subsequent model comparisons (Gelman, 2006). In the context of encoding and decoding analyses (where focus mostly is on out-of-sample predictive accuracy), such reference models can be highly useful for identifying cases where certain regularization properties are inappropriate. This procedure could be taken, for example, when exploring whether grouping structures or smoothness constraints are relevant. The EM-banded model offers one estimator among several others that can be considered in such a process.

### 4.4 Empirical Bayes and related work

We explored a parametric empirical Bayes framework (Efron & Morris, 1973; 1975; Robbins, 1964; MacKay, 1992, 1996; Morris, 1983) in the context of encoding and decoding analyses. We used this framework to tune regularization hyperparameters from the data and subsequently focus on moments in a conditional distribution, an idea that is widely adopted in hierarchical analyses of neuroimaging data as outlined by Friston, Penny, et al. (2002). Parametric empirical Bayes has been considered in the context of MRI decoding analyses (Sabuncu & Van Leemput, 2011; Wen et al., 2019), in M/EEG source localization problems (Cai et al., 2018, 2021; Friston et al., 2008; Owen et al., 2008; Owen et al., 2012; Wipf et al., 2010), in multi-subject hierarchical fMRI analyses (Friston, et al., 2002), in MRI segmentation (Iglesias et al., 2015; Puonti et al., 2016, 2020), and in receptive field models of cell recordings (Park & Pillow, 2011). Likewise, dividing variables into groups and encouraging group-wise sparsity via the group-lasso (Yuan & Lin, 2006) have also been used in M/EEG source localization problems (Haufe et al., 2008; Lim et al., 2017; Ou et al., 2009) and in causal discovery (Bolstad et al., 2011; Haufe et al., 2010; Tank et al., 2017). The group-lasso is convenient in the latter situations because it promotes truly (group-wise) sparse solutions unlike the EM-banded model and banded Ridge regression (la Tour et al., 2022; Nunez-Elizalde et al., 2019), where weight pruning (such as setting excessively shrunken regression weights to zero) would otherwise be necessary.

It is important to once again stress that the considered modeling framework may provide poor approximations to the true posterior distribution over the model weights. The hope is instead that the model yields reasonable out-of-sample predictions (Tipping, 2001), which we indeed found to be the case in several analyses. In encoding or decoding analyses, the emphasis is typically on out-of-sample predictive accuracy and point estimates of model parameters. Our motivation behind highlighting model weights in several of our simulations was to better illustrate model properties and highlight regression problems where group shrinkage can be relevant.

### 4.5 Over-fitting and cross-validation

Tuning hyperparameters using cross-validated estimates of expected prediction accuracy may reduce the risk of over-fitting. One benefit of tuning hyperparameters using cross-validation compared to empirical Bayes (without cross-validation) is that it optimizes an estimate of expected prediction accuracy, which can lead to good generalization, especially in cases with many observations, high SNR, and independent training and validation splits. Moreover, such procedures can be highly useful for identifying model misspecifications. However, with limited and noisy data, the variance of the expected prediction accuracy estimate can be high. Tuning many free hyperparameters to optimize such an estimate can be associated with a considerable risk of over-fitting in these circumstances.

Cross-validation can similarly be used to tune hyperparameters of the EM-banded model. This may prove beneficial in some cases (compared to fixing model parameters *a priori*) and could potentially reduce the risk of over-fitting, but the same concerns apply here. In practice, rather than tuning all free hyperparameters of the EM-banded model jointly, one may choose to introduce the constraint that only $\gamma$ is tuned, with $\gamma=\tau=\eta=\kappa=\varphi$. This makes it more straightforward to fit EM-banded models across a range of $\gamma$ values and tune $\gamma$ to minimize expected prediction error on held-out data. Section 7 in the Supplementary Material shows results from such procedures.
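The constrained tuning scheme amounts to a one-dimensional grid search over $\gamma$. The sketch below illustrates the loop structure; for the sake of a runnable example, `fit_model` is a plain Ridge stand-in with penalty $\gamma$, whereas in practice it would be the EM-banded fit with $\tau=\eta=\kappa=\varphi=\gamma$.

```python
# Sketch of tuning a single shared hyperparameter gamma on held-out data.
# `fit_model` is a Ridge stand-in so the example runs; substituting the
# EM-banded fit would follow the same loop. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 30))
y = X @ rng.standard_normal(30) + rng.standard_normal(200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def fit_model(X, y, gamma):
    # stand-in for an EM-banded fit with tau = eta = kappa = varphi = gamma
    return np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

grid = np.logspace(-4, 2, 7)
errors = [float(np.mean((y_val - X_val @ fit_model(X_tr, y_tr, g)) ** 2))
          for g in grid]
best_gamma = grid[int(np.argmin(errors))]

assert len(errors) == len(grid)
assert best_gamma in grid
```

Restricting the search to one shared $\gamma$ keeps the number of tuned quantities small, which matters when the validation estimate itself is noisy.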

### 4.6 Other perspectives

We used the described modeling framework to fit encoding and decoding models trained on data from single subjects (i.e., single-subject analyses). The described framework could also be incorporated in “searchlight” decoding analyses (Kriegeskorte et al., 2006) where different subsets of features from multi-dimensional neural data recordings are used as predictors in single-subject or group-level decoding models. The framework could for example be relevant in studies that attempt to use one or more features extracted from multiple imaging modalities to predict clinical scores (Woo et al., 2017).

One appealing aspect of group shrinkage procedures is that they can allow for ranking the importance of feature groups. This may aid interpretation of encoding analyses (la Tour et al., 2022; Nunez-Elizalde et al., 2019). In recent years, encoding models have increasingly been used to model neural responses with features extracted from task-optimized neural networks (Kell et al., 2018; Tuckute et al., 2023; Yamins et al., 2014; Yamins & DiCarlo, 2016). This typically involves very high-dimensional feature sets and hence encoding models where the number of predictors is much larger than the number of observations, that is, $D \gg M$. Surprisingly, over-parameterized models may show good generalization performance even with vanishing explicit regularization in minimum-norm least squares estimators (Belkin et al., 2019; Hastie et al., 2022; Kobak et al., 2020). This behavior is clearly different from cases where explicit regularization is critical for performance. Yet, the ability to establish the importance of sets of features can similarly be useful in these settings, and group shrinkage procedures may be one way of achieving this.

## 5 Conclusion

In this paper, we explored a framework for estimating banded-type regression models in encoding and decoding analyses. It offers a tool that can have relevance for specific modeling problems in computational neuroscience, complementing other related estimators. We used simulations and data examples to illustrate properties and limitations of the described model in comparison with Ridge estimators.

## Data and Code Availability

Code implementations in both Matlab (The MathWorks, Inc., Natick, MA, United States) and Python are available on https://github.com/safugl/embanded. The Python code utilizes libraries such as *Numpy* (Harris et al., 2020), *Scipy* (Virtanen et al., 2020), and *Pytorch* (Paszke et al., 2019). The public repository contains examples and runtime tests. EEG data and BOLD fMRI data used in this study have been presented previously (Broderick et al., 2018; Di Liberto et al., 2015; Nakai et al., 2021, 2022) and are available on https://doi.org/10.5061/dryad.070jc and https://openneuro.org/datasets/ds003720, respectively.

## Author Contributions

S.A.F.: Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing, Interpretation of results, Visualization. K.H.M.: Software, Methodology, Writing – review, Interpretation of results. O.P.: Software, Methodology, Writing – review, Interpretation of results. H.R.S.: Writing – review & editing, Project administration, Funding acquisition, Interpretation of results. J.H.: Software, Methodology, Supervision, Writing – original draft, Writing – review & editing, Project administration, Funding acquisition, Interpretation of results.

## Funding

This work was supported by the Center for Auditory Neuroscience grant from the William Demant foundation and by a synergy grant from Novo Nordisk foundation to the UHEAL project Uncovering Hidden Hearing Loss (grant number NNF17OC0027872). H.R.S. was supported by a Grand Solutions grant “Precision Brain-Circuit Therapy (Precision-BCT)” from Innovation Fund Denmark (grant number 9068-00025B) and a collaborative project grant “ADAptive and Precise Targeting of cortex-basal ganglia circuits in Parkinson’s Disease - ADAPT-PD” from Lundbeckfonden (grant number R336-2020-1035). O.P. was supported by a grant from Lundbeckfonden (grant number R360-2021-395). K.H.M. was supported by the Pioneer Centre for AI, DNRF grant number P1.

## Declaration of Competing Interest

H.R.S. has received honoraria as speaker and consultant from Lundbeck AS, Denmark, and as editor (Neuroimage Clinical) from Elsevier Publishers, Amsterdam, The Netherlands. He has received royalties as book editor from Springer Publishers, Stuttgart, Germany, Oxford University Press, Oxford, UK, and from Gyldendal Publishers, Copenhagen, Denmark.

## Acknowledgements

The authors would like to thank Edmund C. Lalor (University of Rochester) for helpful discussions.

## Supplementary Materials

Supplementary material for this article is available with the online version here: https://doi.org/10.1162/imag_a_00155.