## Abstract

Whereas probabilistic models describe the dependence structure between observed variables, causal models go one step further: They predict, for example, how cognitive functions are affected by external interventions that perturb neuronal activity. In this review and perspective article, we introduce the concept of causality in the context of cognitive neuroscience and review existing methods for inferring causal relationships from data. Causal inference is an ambitious task that is particularly challenging in cognitive neuroscience. We discuss two difficulties in more detail: the scarcity of interventional data and the challenge of finding the right variables. We argue for distributional robustness as a guiding principle to tackle these problems. Robustness (or invariance) is a fundamental principle underlying causal methodology. A (correctly specified) causal model of a target variable generalizes across environments or subjects as long as these environments leave the causal mechanisms of the target intact. Consequently, if a candidate model does not generalize, then either it does not consist of the target variable's causes or the underlying variables do not represent the correct granularity of the problem. In this sense, assessing generalizability may be useful when defining relevant variables and can be used to partially compensate for the lack of interventional data.

## INTRODUCTION

Cognitive neuroscience aims to describe and understand the neuronal underpinnings of cognitive functions such as perception, attention, or learning. The objective is to characterize brain activity and cognitive functions and to relate one to the other. The submission guidelines for the *Journal of Cognitive Neuroscience*, for example, state: “The Journal will not publish research reports that bear solely on descriptions of function without addressing the underlying brain events, or that deal solely with descriptions of neurophysiology or neuroanatomy without regard to function.” We think that understanding this relation requires us to relate brain events and cognitive function in terms of the cause–effect relationships that govern their interplay. A causal model could, for example, describe how cognitive functions are affected by external interventions that perturb neuronal activity (cf. When Are Causal Models Important? section). Reid et al. (2019) argue that “the ultimate phenomenon of theoretical interest in all FC (functional connectivity) research is understanding the causal interaction among neural entities.”

Causal inference in cognitive neuroscience is of great importance and perplexity. This motivates our discussion of two pivotal challenges. First, the scarcity of interventional data is problematic as several causal models may be equally compatible with the observed data while making conflicting predictions only about the effects of interventions (cf. Challenge 1: The Scarcity of Targeted Interventional Data section). Second, the ability to understand how neuronal activity gives rise to cognition depends on finding the right variables to represent the neuronal activity (cf. Challenge 2: Finding the Right Variables section). Our starting point is the well-known observation that causal models of a target (or response) variable are distributionally robust and thus generalize across environments, subjects, and interventional shifts (Pearl, 2009; Aldrich, 1989; Haavelmo, 1944). Models that do not generalize either are based on the wrong variables that do not represent causal entities or include variables that are not causes of the target variable. We thus propose to pursue robust (or invariant) models. That way, distributional robustness may serve as a guiding principle toward a causal understanding of cognitive function and may help us tackle both challenges mentioned above.

### Running Examples

We consider the following contrived examples. Assume that the consumption of alcohol affects RTs in a cognitive task. In a randomized controlled trial, we find that drinking alcoholic (vs. nonalcoholic) beer results in slowed RTs hereinafter. Therefore, we may write “alcohol → RT” and call alcohol a cause of RT and RT an effect of alcohol. Intervening on the cause results in a change in the distribution of the effect. In our example, prohibiting the consumption of any alcoholic beers results in faster RTs.

In cognitive neuroscience, one may wish to describe how the neuronal activity is altered upon beer consumption and how this change in turn affects the RT. For this, we additionally require a measurement of neuronal activity, say an fMRI scan and voxel-wise BOLD signals, that can serve as explanans in a description of the phenomenon “alcohol → neuronal activity → RT.” We distinguish the following two scenarios.

#### Running Example A

A so-called treatment or stimulus variable *T* (say, consumption of alcohol) affects neuronal activity as measured by a *d*-dimensional feature vector **X** = [*X*_{1}, …, *X*_{d}]^{⊤}, and the target variable *Y* reflects a cognitive function (say, RT). We may concisely write *T* → **X** → *Y* for a treatment that affects neuronal activity, which in turn maintains a cognitive function (see Figure 1A; this is analogous to the “stimulus → brain activity → response” setup considered in Weichwald et al., 2015; Weichwald, Schölkopf, Ball, & Grosse-Wentrup, 2014).

#### Running Example B

We may wish to describe how neuronal entities cause one another and designate one such entity as the target variable *Y*. In this example, we consider a target variable corresponding to a specific brain signal or region instead of a behavioral or cognitive response (see Figure 1B).

### Existing Work on Causality in the Context of Cognitive Neuroscience

Several methods such as Granger causality or constraint-based methods have been applied to the problem of inferring causality from cognitive neuroscience data. We describe these methods in the Causal Discovery section. In addition, there are ongoing conceptual debates that revolve around the principle of causality in cognitive neuroscience, some of which we now mention. Mehler and Kording (2018) raise concerns about the “lure of causal statements” and expound the problem of confounding when interpreting functional connectivity. Confounders are similarly problematic for multivoxel pattern analyses (Woolgar, Golland, & Bode, 2014; Todd, Nystrom, & Cohen, 2013). The causal interpretation of encoding and decoding (forward and backward, univariate and multivariate) models has received much attention as they are common in the analysis of neuroimaging data: Davis et al. (2014) examine the differences between the model types, Haufe et al. (2014) point out that the weights of linear backward models may be misleading, and Weichwald et al. (2015) extend the latter argument to nonlinear models and clarify which causal interpretations are warranted from either model type. Feature relevance in mass-univariate and multivariate models can be linked to marginal and conditional dependence statements that yield an enriched causal interpretation when both are combined (Weichwald et al., 2015); this consideration yields refined results in neuroimaging analyses (Varoquaux et al., 2018; Bach, Symmonds, Barnes, & Dolan, 2017; Huth et al., 2016) and explains improved functional connectivity results when combining bivariate and partial linear dependence measures (Sanchez-Romero & Cole, this issue). Problems such as indirect measurements and varying temporal delays complicate causal Bayesian network approaches for fMRI (Mumford & Ramsey, 2014; Ramsey et al., 2010). Smith et al. (2011) present a simulation study evaluating several methods for estimating brain networks from fMRI data and demonstrate that identifying the direction of network links is difficult. The discourse on how to leverage connectivity analyses to understand mechanisms in brain networks is ongoing (Mill, Ito, & Cole, 2017; Smith, 2012; Valdes-Sosa, Roebroeck, Daunizeau, & Friston, 2011; Waldorp, Christoffels, & van de Ven, 2011). Many of the above problems and findings are related to the two key challenges that we discuss in the Two Challenges for Causality in Cognitive Neuroscience section.

### Structure of This Work

We begin the Causal Models and Causal Discovery section by formally introducing causal concepts. In the When Are Causal Models Important? section, we outline why we believe there is a need for causal models in cognitive neuroscience by considering what types of questions could be answered by an OraCle Modelling (OCM) approach. We discuss the problem of models that are observationally equivalent yet make conflicting predictions about the effects of interventions in the Equivalences of Models section. In the Causal Discovery section, we review different causal discovery methods and their underlying assumptions. We focus on two challenges for causality in cognitive neuroscience that are expounded in the Two Challenges for Causality in Cognitive Neuroscience section: (1) the scarcity of interventional data and (2) the challenge of finding the right variables. In the Causality and Leveraging Robustness section, we argue that one should seek distributionally robust variable representations and models to tackle these challenges. Most of our arguments in this work are presented in an i.i.d. setting, and we briefly discuss the implications for time-dependent data in the Time Series Data section. We conclude in the Conclusion and Future Work section and outline ideas that we regard as promising for future research.

## CAUSAL MODELS AND CAUSAL DISCOVERY

In contrast to classical probabilistic models, causal models induce not only an observational distribution but also a set of so-called interventional distributions. That is, they predict how a system behaves under interventions. We present an introduction to causal models that is based on pioneer work by Pearl (2009) and Spirtes, Glymour, and Scheines (2000). Our exposition is inspired by Weichwald (2019, Chap. 2), which provides more introductory intuition into causal models viewed as structured sets of interventional distributions. For both simplicity and focus of exposition, we omit a discussion of counterfactual reasoning and other akin causality frameworks such as the potential outcomes formulation of causality (Imbens & Rubin, 2015). We phrase this article within the framework and terminology of structural causal models (SCMs; Pearl, 2009; Bollen, 1989).

An SCM over variables **Z** = [*Z*_{1}, …, *Z*_{d}]^{⊤} consists of structural equations that relate each variable *Z*_{k} to its parents **PA**(*Z*_{k}) ⊆ {*Z*_{1}, …, *Z*_{d}} and a noise variable *N*_{k} via a function *f*_{k} such that *Z*_{k} = *f*_{k}(**PA**(*Z*_{k}), *N*_{k}), and a noise distribution *P*_{N} of the noise variables **N** = [*N*_{1}, …, *N*_{d}]^{⊤}.

We associate each SCM with a directed causal graph where the nodes correspond to the variables *Z*_{1}, …, *Z*_{d}, and we draw an edge from *Z*_{i} to *Z*_{j} whenever *Z*_{i} appears on the right-hand side of the equation *Z*_{j} = *f*_{j}(**PA**(*Z*_{j}), *N*_{j}). That is, if *Z*_{i} ∈ **PA**(*Z*_{j}), the graph contains the edge *Z*_{i} → *Z*_{j}. Here, we assume that this graph is acyclic. The structural equations and noise distributions together induce the observational distribution **P**_{Z} of *Z*_{1}, …, *Z*_{d} as simultaneous solution to the equations. (Bongers, Forré, Peters, Schölkopf, and Mooij [2016] formally define SCMs when the graph includes cycles.)

*N*

_{1},

*N*

_{2}, and

*N*

_{3}. The corresponding graph is and the SCM induces the observational distribution

**P**

_{Z}, which is the multivariate Gaussian distribution

*Z*

_{2}= 0,

*Z*

_{3}= 5) denotes the scenario where we force

*Z*

_{2}and

*Z*

_{3}to take on the values 0 and 5, respectively. The interventional distributions are obtained by (a) replacing the structural equations of the intervened-upon variables by the new assignment and (b) considering the distribution induced by the thus obtained new set of structural equations. For example, the distribution under intervention do(

*Z*

_{1}=

*a*) for

*a*∈ ℝ, denoted by $PZdoZ1\u2254a$, is obtained by changing the equation

*Z*

_{1}=

*f*

_{1}(

**PA**(

*Z*

_{1}),

*N*

_{1}) to

*Z*

_{1}=

*a*. In the above example, we find

*X*∼ 𝒩(

*a*, 0) if and only if

**P**(

*X*=

*a*) = 1. Analogously, for

*b*∈ ℝ and intervention on

*Z*

_{2}, we have

The distribution of *Z*_{1} differs between the observational distribution and the interventional distribution, that is, **P**_{Z1} ≠ $PZ1doZ2\u2254b$. We call a variable *X* an (indirect) cause of a variable *Y* if there exists an intervention on *X* under which the distribution of *Y* is different from its distribution in the observational setting. Thus, *Z*_{2} is a cause of *Z*_{1}. The edge *Z*_{2} → *Z*_{1} in the above causal graph reflects this cause–effect relationship. In contrast, *Z*_{2} remains standard-normally distributed under all interventions do(*Z*_{1} = *a*) on *Z*_{1}. Because the distribution of *Z*_{2} remains unchanged under any intervention on *Z*_{1}, *Z*_{1} is not a cause of *Z*_{2}.

In general, interventional distributions do not coincide with the corresponding conditional distributions. In our example, we have **P**_{Z∣Z1=a} ≠ **P**_{Z}^{do(Z1≔a)} whereas **P**_{Z∣Z2=b} = **P**_{Z}^{do(Z2≔b)}. We further have that the conditional distribution **P**_{Z3∣Z2,Z1} of *Z*_{3} given its parents *Z*_{1} and *Z*_{2} is invariant under interventions on variables other than *Z*_{3}. We thus call a model of *Z*_{3} based on *Z*_{1}, *Z*_{2} invariant (cf. Robustness of Causal Models section).

We have demonstrated how an SCM induces a set of observational and interventional distributions. The interventional distributions predict observations of the system upon intervening on some of its variables. As such, a causal model holds additional content compared to a common probabilistic model that amounts to one distribution to describe future observations of the same unchanged system. Sometimes, we are only interested in modeling certain interventions or cannot perform others as there may be no well-defined corresponding real-world implementation. For example, intervening on a person's sense of humor is not well-defined. In these cases, it may be helpful to explicitly restrict ourselves to a set of interventions of interest. Furthermore, the choice of an intervention set puts constraints on the granularity of the model (cf. Challenge 2: Finding the Right Variables section and Rubenstein et al., 2017).

### When Are Causal Models Important?

We do not always need causal models to answer our research question. For some scientific questions, it suffices to consider probabilistic, that is, observational, models. For example, if we wish to develop an algorithm for early diagnosis of Alzheimer's disease from brain scans, we need to model the conditional distribution of Alzheimer's disease given brain activity. Because this can be computed from the joint distribution, a probabilistic model suffices. If, however, we wish to obtain an understanding that allows us to optimally prevent progression of Alzheimer's disease by, for example, cognitive training or brain stimulation, we are in fact interested in a causal understanding of the Alzheimer's disease and require a causal model.

Distinguishing between these types of questions is important as it informs us about the methods we need to employ to answer the question at hand. To elaborate on this distinction, we now discuss scenarios related to our running examples and the relationship between alcohol consumption and RT (cf. Running Examples section). Assume we have access to a powerful OCM machinery that is unaffected by statistical problems such as model misspecification, multiple testing, or small sample sizes. By asking ourselves what queries must be answered by OCM for us to “understand” the cognitive function, the difference between causal and noncausal questions becomes apparent.

Assume, first, we ran the reaction task experiment with multiple subjects, fed all observations to our OCM machinery, and have Kim visiting our laboratory today. Because, using data from Kim's previous visits, OCM yields us the exact conditional distribution of RTs **P**_{Y|T=t} for Kim having consumed *T* = *t* units of alcoholic beer, we (knowing *t*) may be willing to bet against our colleagues on how Kim will perform in the reaction task experiment they are just about to participate in. No causal model for brain activity is necessary.

Assume, second, that we additionally record BOLD responses **X** = [*X*_{1}, …, *X*_{d}]^{⊤} at certain locations and times during the reaction task experiment. We can query OCM for the distribution of BOLD signals that we are about to record, that is, **P**_{X|T=t}, or the distribution of RTs given we measure Kim's BOLD responses **X** = **x**, that is, **P**_{Y|T=t,X=x}. As before, we may bet against our colleagues on how Kim's BOLD signals will look like in the upcoming reaction task experiment or bet on their RT once we observed the BOLD activity **X** = **x** before a reaction cue. Again, no causal model for brain activity is required.

In both of the above situations, we have learned something useful. Given that the data were obtained in an experiment in which alcohol consumption was randomized, we have learned, in the first situation, to predict RTs after an intervention on alcohol consumption. This may be considered an operational model for alcohol consumption and RT. In the second situation, we have learned how the BOLD signal responds to alcohol consumption. Yet, in none of the above situations have we gained understanding of the neuronal underpinnings of the cognitive function and the RTs. Knowing the conditional distributions **P**_{Y|T=t} and **P**_{Y|T=t,X=x} for any *t* yields no insight into any of the following questions. Which brain regions maintain fast RTs? Where in the brain should we release drugs that excite neuronal activity to counterbalance the effect of alcohol? How do we need to update our prediction if we learnt that Kim just took a new drug that lowers blood pressure in PFC? To answer such questions, we require causal understanding.

If we had a causal model, say in the form of an SCM, we could address the above questions. An SCM offers an explicit way to model the system under manipulations. Therefore, a causal model can help to answer questions about where to release an excitatory drug. It may enable us to predict whether medication that lowers blood pressure in PFC will affect Kim's RT; in general, this is the case if the corresponding variables appear in the structural equations for *Y* or any of *Y*'s ancestors.

Instead of identifying conditional distributions, one may formulate the problem as a regression task with the aim to learn the conditional mean functions *t* → 𝔼[**X**|*T* = *t*] and (*t*, **x**) → 𝔼[*Y*|*T* = *t*, **X** = **x**]. These are functions of *t* or *t* and **x** and are often parameterized. We argue in the Equivalences of Models section, Point (2), that such parameters do not carry a causal meaning and thus do not help to answer the questions above.

Promoted by slogans such as “correlation does not imply causation,” careful and associational language is sometimes used in the presentation of cognitive neuroscience studies. We believe, however, that a clear language that states whether a model should be interpreted causally (i.e., as an interventional model) or noncausally (i.e., as an observational model) is needed. This will help to clarify both the real-world processes the model can be applied to and the purported scientific claims.

Furthermore, causal models may generalize better than noncausal models. We expect systematic differences between subjects and between different trials or recording days of the same subject. These different situations, or environments, are presumably not arbitrarily different. If they were, we could not hope to gain any scientific insight from such experiments. The apparent question is which parts of the model we can expect to generalize between environments. It is well known that causal models capture one such invariance property, which is implicit in the definition of interventions. An intervention on one variable leaves the assignments of the other variables unaffected. Therefore, the conditional distributions of these other variables, given their parents, are also unaffected by the intervention (Aldrich, 1989; Haavelmo, 1944). Thus, causal models may enable us to formulate more clearly which mechanisms we assume to be invariant between subjects. For example, we may assume that the mechanism how alcohol intake affects brain activity differs between subjects, whereas the mechanism from signals in certain brain regions to RT is invariant. We discuss the connection between causality and robustness in the Causality and Leveraging Robustness section.

### Equivalences of Models

Causal models entail strictly more information than observational models. We now introduce the notion of equivalence of models (Peters, Janzing, & Schölkopf, 2017; Bongers et al., 2016; Pearl, 2009). This notion allows us to discuss the falsifiability of causal models, which is important when assessing candidate models and their ability to capture cause–effect relationships that govern a cognitive process under investigation.

We call two models observationally equivalent if they induce the same observational distribution. Two models are said to be interventionally equivalent if they induce the same observational and interventional distributions. As discussed above, for some interventions, there may not be a well-defined corresponding experiment in the real world. We therefore also consider interventional equivalence with respect to a restricted set of interventions.

*N*

_{1},

*N*

_{2}, and

*N*

_{3}are mutually independent, standard-normal noise variables. The two SCMs are observationally equivalent as they induce the same observational distribution, the one shown in Equation 1. The models are not interventionally equivalent, however, because $PZ2doZ1\u22543$ = 𝒩(0, 1) and $PZ2doZ1\u22543$ = 𝒩($32$, $12$) for the left and right models, respectively. The two models can be told apart when interventions on

*Z*

_{1}or

*Z*

_{2}are considered. They are interventionally equivalent with respect to interventions on

*Z*

_{3}.

The existence of observationally equivalent models that are not interventionally equivalent has several implications. (1) Without assumptions, it is impossible to learn causal structure from observational data. This is not exclusive to causal inference from data, and an analogous statement holds true for regression (Györfi, Kohler, Krzyżak, & Walk, 2002). The regression problem is solvable only under certain simplicity assumptions, for example, on the smoothness of the regression function, which have been proven useful in real-world applications. Similarly, there are several assumptions that can be exploited for causal discovery. We discuss some of these assumptions in the Causal Discovery section. (2) As a consequence, without further restrictive assumptions on the data generating process, the estimated parameters do not carry any causal meaning. For example, given any finite sample from the observational distribution, both of the above SCMs yield the same likelihood. Therefore, the above structures cannot be told apart by a method that employs the maximum likelihood estimation principle. Instead, which SCM and thus which parameters are selected in such a situation may depend on starting values, optimization technique, or numerical precision. (3) Assume that we are given a probabilistic (observational) model of a data generating process. To falsify it, we may apply a goodness-of-fit test based on an observational sample from that process. An interventional model cannot be falsified based on observational data alone, and one has to also take into account the outcome of interventional experiments. This requires that we are in agreement about how to perform the intervention in practice (see also Challenge 2: Finding the Right Variables section). Interventional data may be crucial in particular for rejecting some of the observationally equivalent models (cf. the example above). The scarcity of interventional data therefore poses a challenge for causality in cognitive neuroscience (cf. Challenge 1: The Scarcity of Targeted Interventional Data section).

### Causal Discovery

The task of learning a causal model from observational (or a combination of observational and interventional) data is commonly referred to as causal discovery or causal structure learning. We have argued in the preceding section that causal discovery from purely observational data is impossible without any additional assumptions or background knowledge. In this section, we discuss several assumptions that render (parts of) the causal structure identifiable from the observational distribution. In short, assumptions concern how causal links manifest in observable statistical dependences, functional forms of the mechanisms, certain invariances under interventions, or the order of time. We briefly outline how these assumptions can be exploited in algorithms. Depending on the application at hand, one may be interested in learning the full causal structure as represented by its graph or in identifying a local structure such as the causes of a target variable *Y*. The methods described below cover either of the two cases. We keep the description brief focusing on the main ideas and intuition, whereas more details can be found in the respective references.

#### Randomization

The often called “gold standard” to establishing whether *T* causes *Y* is to introduce controlled perturbations, that is, targeted interventions, to a system. Without randomization, a dependence between *T* and *Y* could stem from a confounder between *T* and *Y* or from a causal link from *Y* to *T*. If *T* is randomized, it is not further governed by the outcome of any other variable or mechanism. Instead, it only depends on the outcome of a randomization experiment, such as the roll of a die. If we observe that under the randomization, *Y* depends on *T*, say the higher *T* the higher *Y*, then there must be a (possibly indirect) causal influence from *T* to *Y*. In our running examples, this allows us to conclude that the amount of alcoholic beer consumed causes RTs (cf. Running Examples section). When falsifying interventional models, it suffices to consider randomized experiments as interventions (Peters et al., 2017, Proposition 6.48). In practice, however, performing randomized experiments is often infeasible because of cost or ethical concerns, or impossible as, for example, we cannot randomize sense of humor nor fully control neuronal activity in the temporal lobe. Although it is sometimes argued that the experiment conducted by James Lind in 1747 to identify a treatment for scurvy is among the first randomized controlled trials, the mathematical theory and methodology was popularized by Ronald A. Fisher in the early 20th century (Conniffe, 1991).

#### Constraint-based Methods

Constraint-based methods rely on two assumptions that connect properties of the causal graph with conditional independence statements in the induced distribution. The essence of the first assumption is sometimes described as Reichenbach's common cause principle (Reichenbach, 1956): If *X* and *Y* are dependent, then there must be some cause–effect structure that explains the observed dependence, that is, either *X* causes *Y*, or *Y* causes *X*, or another unobserved variable *H* causes both *X* and *Y*, or some combination of the aforementioned. This principle is formalized by the Markov condition (see, e.g., Lauritzen, 1996). This assumption is considered to be mild. Any distribution induced by an acyclic SCM satisfies the Markov condition with respect to the corresponding graph (Pearl, 2009; Lauritzen, Dawid, Larsen, & Leimer, 1990). The second assumption (often referred to as faithfulness) states that any (conditional) independence between random variables is implied by the graph structure (Spirtes et al., 2000). For example, if two variables are independent, then neither does cause the other nor do they share a common cause. Both assumptions together establish a one-to-one correspondence between conditional independences in the distribution and graphical separation properties between the corresponding nodes.

The backbone of the constraint-based causal discovery algorithms such as the PC algorithm is to test for marginal and conditional (in)dependences in observed data and to find all graphs that encode the same list of separation statements (Pearl, 2009; Spirtes et al., 2000). This allows us to infer a so-called Markov equivalence class of graphs: All of its members encode the same set of conditional independences. It has been shown that two directed acyclic graphs (assuming that all nodes are observed) are Markov equivalent if and only if they have the same skeleton and v-structures → ○ ← (Verma & Pearl, 1990). Allowing for hidden variables, as done by the fast causal inference algorithm, for example, enlarges the class of equivalent graphs, and the output is usually less informative (Spirtes et al., 2000).

The following example further illustrates the idea of a constraint-based search. For simplicity, we assume a linear Gaussian setting, so that (conditional) independence coincides with vanishing (partial) correlation. Say we observe *X*, *Y*, and *Z*. Assume that the partial correlation between *X* and *Z* given *Y* vanishes whereas none of the other correlations and partial correlations vanishes. Under the Markov and faithfulness assumptions, there are multiple causal structures that are compatible with those constraints, such as *X* → *Y* → *Z*, *X* ← *Y* ← *Z*, *X* ← *Y* → *Z*, or where *H* is unobserved. Still, the correlation pattern rules out certain other causal structures. For example, neither *X* → *Y* ← *Z* nor *X* ← *H* → *Y* ← *Z* can be the correct graph structure because either case would imply that *X* and *Z* are uncorrelated (and *X* ⫫ *Z*|*Y* is not satisfied).

Variants of the above setting were considered in neuroimaging where a randomized experimental stimulus or time-ordering was used to further disambiguate between the remaining possible structures (Mastakouri, Schölkopf, & Janzing, 2019; Grosse-Wentrup, Janzing, Siegel, & Schölkopf, 2016; Weichwald, Gretton, Schölkopf, & Grosse-Wentrup, 2016; Weichwald, Grosse-Wentrup, & Gretton, 2016). Constraint-based causal inference methodology also clarifies the interpretation of encoding and decoding analyses in neuroimaging and has informed a refined understanding of the neural dynamics of probabilistic reward prediction and an improved functional atlas (Varoquaux et al., 2018; Bach et al., 2017; Weichwald et al., 2015).

Direct applications of this approach in cognitive neuroscience are difficult, not only because of the key challenges discussed in the Two Challenges for Causality in Cognitive Neuroscience section but also because of indirect and spatially smeared neuroimaging measurements that effectively spoil conditional independences. In the linear setting, there are recent advances that explicitly tackle the problem of inferring the causal structure between latent variables, say the neuronal entities, based on observations of recorded variables (Silva, Scheines, Glymour, & Spirtes, 2006). Further practical challenges include the difficulty of testing for nonparametric conditional independence (Shah & Peters, 2020) and near-faithfulness violations (Uhler, Raskutti, Bühlmann, & Yu, 2013).

#### Score-based Methods

Instead of directly exploiting the (conditional) independences to inform our inference about the causal graph structure, score-based methods assess different graph structures by their ability to fit observed data (see, e.g., Chickering, 2002). This approach is motivated by the idea that graph structures that encode the wrong (conditional) independences will also result in bad model fit. Assuming a parametric model class, we can evaluate the log-likelihood of the data and score different candidate graph structures by the Bayesian information criterion, for example. The number of possible graph structures to search over grows super-exponentially. That combinatorial difficulty can be dealt with by applying greedy search procedures that usually, however, do not come with finite sample guarantees. Alternatively, Zheng, Dan, Aragam, Ravikumar, and Xing (2020) exploit an algebraic characterization of graph structures to maximize a score over acyclic graphs by solving a continuous optimization problem. The score-based approach relies on correctly specifying the model class. Furthermore, in the presence of hidden variables, the search space grows even larger, and model scoring is complicated by the need to marginalize over those hidden variables (Jabbari, Ramsey, Spirtes, & Cooper, 2017).

#### Restricted SCMs

Another possibility is to restrict the class of functions in the structural assignments and the noise distributions. Linear non-Gaussian acyclic models (Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006), for example, assume that the structural assignments are linear and the noise distributions are non-Gaussian. As for independent component analysis (ICA), identifiability of the causal graph follows from the Darmois–Skitovich theorem (Skitovič, 1962; Darmois, 1953). Similar results hold for nonlinear models with additive noise (Bühlmann, Peters, & Ernest, 2014; Peters, Mooij, Janzing, & Schölkopf, 2014; Zhang & Hyvärinen, 2009; Hoyer, Janzing, Mooij, Peters, & Schölkopf, 2008) or linear Gaussian models when the error variances of the different variables are assumed to be equal (Peters & Bühlmann, 2014). The additive noise assumption is a powerful, yet restrictive, assumption that may be violated in practical applications.

#### Dynamic Causal Modeling

We may have prior beliefs about the existence and direction of some of the edges. Incorporating these by careful specification of the priors is an explicit modeling step in dynamic causal modeling (Valdes-Sosa et al., 2011). Given such a prior, we may prefer one model over the other among the two observationally equivalent models presented in the Equivalences of Models section, for example. Because the method's outcome relies on this prior information, any disagreement on the validity of that prior information necessarily yields a discourse about the method's outcome (Lohmann, Erfurth, Müller, & Turner, 2012). Furthermore, a simulation study raised concerns regarding the validity of the model selection procedure in dynamic causal modeling (Breakspear, 2013; Friston, Daunizeau, & Stephan, 2013; Lohmann, Müller, & Turner, 2013; Lohmann et al., 2012; Friston, Harrison, & Penny, 2003).

#### Granger Causality

Granger causality is among the most popular approaches for the analysis of connectivity between time-evolving processes. It exploits the existence of time and the fact that causes precede their effects. Together with its nonlinear extensions, it has been considered for the analysis of neuroimaging data with applications to EEG and fMRI data (Stramaglia, Cortes, & Marinazzo, 2014; Stramaglia, Wu, Pellicoro, & Marinazzo, 2012; Marinazzo, Liao, Chen, & Stramaglia, 2011; Marinazzo, Pellicoro, & Stramaglia, 2008). The idea is sometimes wrongly described as follows: If including the past of *Y*_{t} improves our prediction of *X*_{t} compared to a prediction that is only based on the past of *X*_{t} alone, then *Y* Granger-causes *X*. Granger (1969) himself put forward a more careful definition that includes a reference to all the information in the universe: If the prediction of *X*_{t} based on all the information in the universe up to time *t* is better than the prediction where we use all the information in the universe up to time *t* apart from the past of *Y*_{t}, then *Y* Granger-causes *X*. In practice, we may instead resort to a multivariate formulation of Granger causality. If all relevant variables are observed (often referred to as causal sufficiency), there is a close correspondence between Granger causality and the constraint-based approach (Peters et al., 2017, Chap. 10.3.3). Observing all relevant variables, however, is a strong assumption, which is most likely violated for data sets in cognitive neuroscience. Although Granger causality may be combined with a goodness-of-fit test to at least partially detect the existence of confounders (Peters, Janzing, & Schölkopf, 2013), it is commonly applied as a computationally efficient black box approach that always outputs a result. In the presence of instantaneous effects (e.g., because of undersampling) or hidden variables, these results may be erroneous (see, e.g., Sanchez-Romero et al., 2019).

#### Inferring Causes of a Target Variable

*T*,

*Y*,

*X*

_{1}, …,

*X*

_{d}, where

*Y*denotes the target variable. Assume that there is an unknown SCM that includes the variables

*T*,

*Y*,

*X*

_{1}, …,

*X*

_{d}and that describes the data generating process well. To identify the variables among

*X*

_{1}, …,

*X*

_{d}that cause

*Y*, it does not suffice to regress

*Y*on

*X*

_{1}, …,

*X*

_{d}. The following example of an SCM shows that a good predictive model for

*Y*is not necessarily a good interventional model for

*Y*. Considerwhere

*N*

_{1},

*N*

_{2}, and

*N*

_{Y}are mutually independent, standard-normal noise variables.

*X*

_{2}is a good predictor for

*Y*, but

*X*

_{2}does not have any causal influence on

*Y*: The distribution of

*Y*is unchanged upon interventions on

*X*

_{2}.

Recently, causal discovery methods have been proposed that aim to infer the causal parents of *Y* if we are given data from different environments, that is, from different experimental conditions, repetitions, or different subjects. These methods exploit a distributional robustness property of causal models and are described in the Causality and Leveraging Robustness section.

#### Cognitive Function versus Brain Activity as the Target Variable

When we are interested in inferring direct causes of a target variable *Y*, it can be useful to include background knowledge. Consider our Running Example A (cf. Running Examples section and Figure 1A) with RT as the target variable, and assume we are interested in inferring which of the variables measuring neuronal activity are causal for the RT *Y*. We have argued in the preceding paragraph that, if a variable *X*_{j} is predictive of *Y*, it does not necessarily have to be causal for *Y*. Assuming, however, that we can exclude that the cognitive function “RT” causes brain activity (e.g., because of time ordering), we obtain the following simplification: Every *X*_{j} that is predictive of *Y* must be an indirect or direct cause of *Y*, confounded with *Y*, or a combination of both. This is different if our target variable is a neuronal entity as in Running Example B (cf. Figure 1B). Here, predictive variables can be either ancestors of *Y*, confounded with *Y*, descendants of *Y*, or some combination of the aforementioned (these statements follow from the Markov condition).

## TWO CHALLENGES FOR CAUSALITY IN COGNITIVE NEUROSCIENCE

Performing causal inference on measurements of neuronal activity comes with several challenges, many of which have been discussed in the literature (cf. Existing Work on Causality in the Context of Cognitive Neuroscience section). In the following two subsections, we explicate two challenges that we think deserve special attention. In the Causality and Leveraging Robustness section, we elaborate on how distributional robustness across environments, such as different recording sessions or subjects, can serve as a guiding principle for tackling those challenges.

### Challenge 1: The Scarcity of Targeted Interventional Data

In the Equivalences of Models section, we discussed that different causal models may induce the same observational distribution while they make different predictions about the effects of interventions. That is, observationally equivalent models need not be interventionally equivalent. This implies that some models can only be refuted when we observe the system under interventions that perturb some specific variables in our model. In contrast to broad perturbations of the system, we call targeted interventions those for which the intervention target is known and for which we can list the intervened-upon variables in our model, say “*X*_{1}, *X*_{3}, *X*_{8} have been intervened upon.” Even if some targeted interventions are available, there may still be multiple models that are compatible with all observations obtained under those available interventions. In the worst case, a sequence of up to *d* targeted interventional experiments may be required to distinguish between the possible causal structures over *d* observables *X*_{1}, …, *X*_{d} when the existence of unobserved variables cannot be excluded while assuming Markovianity, faithfulness, and acyclicity (Eberhardt, 2013). In general, the more interventional scenarios are available to us, the more causal models we can falsify and the further we can narrow down the set of causal models compatible with the data.

Therefore, the scarcity of targeted interventional data is a barrier to causal inference in cognitive neuroscience. Our ability to intervene on neural entities such as the BOLD level or oscillatory bandpower in a brain region is limited and so is our ability either to identify the right causal model from interventional data or to test causal hypotheses that are made in the literature. One promising avenue are noninvasive brain stimulation techniques such as TMS or direct/alternating current stimulation, which modulate neural activity by creating a field inside the brain (Kar, Ito, Cole, & Krekelberg, 2020; Bestmann & Walsh, 2017; Herrmann, Rach, Neuling, & Strüber, 2013; Nitsche et al., 2008). Because the stimulation acts broadly and its neurophysiological effects are not yet fully understood, transcranial stimulation cannot be understood as targeted intervention on some specific neuronal entity in our causal model (Vosskuhl, Strüber, & Herrmann, 2018; Antal & Herrmann, 2016). The interindividual variability in response to stimulation further impedes its direct use for probing causal pathways between brain regions (López-Alonso, Cheeran, Río-Rodríguez, & Fernández-del-Olmo, 2014). Bergmann and Hartwigsen (this issue) review the obstacles to inferring causality from noninvasive brain stimulation studies and provide guidelines to attenuate the aforementioned. Invasive stimulation techniques, such as deep brain stimulation relying on electrode implants (Mayberg et al., 2005), may enable temporally and spatially more fine-grained perturbations of neural entities. Dubois et al. (2017) exemplify how to revise causal structures inferred from observational neuroimaging data on a larger cohort through direct stimulation of specific brain regions and concurrent fMRI on a smaller cohort of neurosurgical patients with epilepsy. In nonhuman primates, concurrent optogenetic stimulation with whole-brain fMRI had been used to map the wiring of the medial PFC (Liang et al., 2015; Lee et al., 2010). Yet, there are ethical barriers to large-scale invasive brain stimulation studies, and it may not be exactly clear how an invasive stimulation corresponds to an intervention on, say, the BOLD response measured in some voxels. We thus believe that targeted interventional data will remain a scarcity because of physical and ethical limits to noninvasive and invasive brain stimulation.

Consider the following variant of our Running Example B (cf. Running Examples section). Assume that (a) the consumption of alcoholic beer *T* slows neuronal activity in the brain regions *X*_{1}, *X*_{2}, and *Y*; (b) *X*_{2} is a cause of *X*_{1}; and (c) *X*_{2} is a cause of *Y*. Here, (a) could have been established by randomizing *T*, whereas (b) and (c) may be background knowledge. Nothing is known, however, about the causal relationship between *X*_{1} and *Y* (apart from the confounding effect of *X*_{2}). The following graph summarizes these causal relationships between the variables:

Assume we establish on observational data that there is a dependence between *X*_{1} and *Y* and that we cannot render these variables conditionally independent by conditioning on any combination of the remaining observable variables *T* and *X*_{2}. Employing the widely accepted Markov condition, we can conclude that either *X*_{1} → *Y*, *X*_{1} ← *Y*, *X*_{1} ← *H* → *Y* for some unobserved variable *H*, or some combination of the aforementioned settings. Without any further assumptions, however, these models are observationally equivalent. That is, we cannot refute any of the above possibilities based on observational data alone. Even randomizing *T* does not help: The above models are interventionally equivalent with respect to interventions on *T*. We could apply one of the causal discovery methods described in the Causal Discovery section. All of these methods, however, employ further assumptions on the data generating process that go beyond the Markov condition. We may deem some of those assumptions implausible given prior knowledge about the system. Yet, in the absence of targeted interventions on *X*_{1}, *X*_{2,} or *Y*, we can neither falsify candidate models obtained by such methods nor test all of the underlying assumptions. In the Distributional Robustness and Scarcity of Interventional Data section, we illustrate how we may benefit from heterogeneity in the data, that is, from interventional data where the intervention target is unknown.

### Challenge 2: Finding the Right Variables

Causal discovery often starts by considering observations of some variables *Z*_{1}, …, *Z*_{d} among which we wish to infer cause–effect relationships, thereby implicitly assuming that those variables are defined or constructed in a way that they can meaningfully be interpreted as causal entities in our model. This, however, is not necessarily the case in neuroscience. Without knowing how higher-level causal concepts emerge from lower levels, for example, it is hard to imagine how to make sense and use of a causal model of the 86 billion neurons in a human brain (Herculano-Houzel, 2012). One may hypothesize that a model of averaged neuronal activity in distinct functional brain regions may be pragmatically useful to reason about the effect of different treatments and to understand the brain. For such an approach, we need to find the right transformation of the high-dimensional observed variables to obtain the right variables for a causal explanation of the system.

The problem of relating causal models with different granularity and finding the right choice of variable transformations that enable causal reasoning has received attention in the causality literature also outside neuroscience applications. Eberhardt (2016) fleshes out an instructive two-variable example that demonstrates that the choice of variables for causal modeling may be underdetermined even if interventions were available. For a wrong choice of variables, our ability to causally reason about a system breaks. An example of this is the historic debate about whether a high-cholesterol diet was beneficial or harmful with respect to heart disease. It can be partially explained by an ambiguity of how exactly total cholesterol is manipulated. Today, we know that low-density lipoproteins (LDLs) and high-density lipoproteins (HDLs) have opposing effects on heart disease risk. Merging these variables together to total cholesterol does not yield a variable with a well-defined intervention: Referring to an intervention on total cholesterol does not specify what part of the intervention is because of a change in LDLs versus HDLs. As such, only including total cholesterol instead of LDL and HDL may therefore be regarded as a too coarse-grained variable representation that breaks a model's causal semantics, that is, the ability to map every intervention to a well-defined interventional distribution (Truswell, 2010; Steinberg, 2007; Spirtes & Scheines, 2004).

Yet, we may sometimes prefer to transform microvariables into macrovariables. This can result in a concise summary of the causal information that abstracts away detail, is easier to communicate and operationalize, and more effectively represents the information necessary for a certain task (Weichwald, 2019; Hoel, 2017; Hoel, Albantakis, & Tononi, 2013); for example, a causal model over 86 billion neurons may be unwieldy for a brain surgeon aiming to identify and remove malignant brain tissue guided by the cognitive impairments observed in a patient. Rubenstein et al. (2017) formalize a notion of exact transformations that ensures causally consistent reasoning between two causal models where the variables in one model are transformations of the variables in the other. Roughly speaking, two models are considered causally consistent if the following two ways to reason about how the distribution of the macrovariables changes upon a macro-level intervention agree with one another: (a) find an intervention on the microvariables that corresponds to the considered macro-level intervention, and consider the macro-level distribution implied by the micro-level intervention, and (b) obtain the interventional distribution directly within the macro-level SCM sidestepping any need to refer to the micro-level. If the two resulting distributions agree with one another for all (compositions of) interventions, then the two models are said to be causally consistent and we can view the macro-level as an exact transformation of the micro-level causal model that preserves its causal semantics. A formal exposition of the framework and its technical subtleties can be found in the aforementioned work. Here, we revisit a variant of the cholesterol example for an illustration of what it entails for two causal models to be causally consistent and illustrate a failure mode: Consider variables *L* (LDL), *H* (HDL), and *D* (disease), where *D* = *H* − *L* + *N*_{D} for *L*, *H*, and *N*_{d} mutually independent random variables. Then, a model based on the transformed variables *T* = *L* + *H* and *D* ≡ *D* is in general not causally consistent with the original model: For (*l*_{1}, *h*_{1}) ≠ (*l*_{2}, *h*_{2}) with *l*_{1} + *h*_{1} = *l*_{2} + *h*_{2}, the interventional distributions induced by the micro-level model corresponding to setting *L* = *l*_{1} and *H* = *h*_{1} or, alternatively, *L* = *l*_{2} and *H* = *h*_{2} do in general not coincide because of the differing effects of *L* and *H* on *D*. Both interventions correspond to the same level of *T* and the intervention setting *T* = *t* with *t* = *l*_{1} + *h*_{1} = *l*_{2} + *h*_{2} in the macro-level model. Thus, the distributions obtained from reasoning (a) and (b) above do not coincide. If, on the other hand, we had $D\u02dc$ = *H* + *L* + *N*_{D}, then we could indeed use a macro-level model where we consider *T* = *H* + *L* to reason about the distribution of $D\u02dc$ under the intervention do(*T* = *t*) without running into conflict with the interventional distributions implied by all corresponding interventions in the micro-level model. This example can analogously be considered in the context of our running examples (cf. Running Examples section): Instead of LDL, HDL, and disease, one could alternatively think of some neuronal activity (*L*) that delays motor response, some neuronal activity (*H*) that increases attention levels, and the detected RT (*D*) assessed by subjects performing a button press; the scenario then translates into how causal reasoning about the cause of slowed RTs is hampered once we give up on considering *H* and *L* as two separate neural entities and instead try to reason about the average activity *T*. Janzing, Rubenstein, and Schölkopf (2018) observe similar problems for causal reasoning when aggregating variables and show that the observational and interventional stationary distributions of a bivariate autoregressive processes cannot, in general, be described by a two-variable causal model. A recent line of research focuses on developing a notion of approximate transformations of causal models (Beckers, Eberhardt, & Halpern, 2019; Beckers & Halpern, 2019). Although there exist first approaches to learn discrete causal macrovariables from data (Chalupka, Perona, & Eberhardt, 2015, 2016), we are unaware of any method that is generally applicable and learns causal variables from complex high-dimensional data.

In cognitive neuroscience, we commonly treat large-scale brain networks or brain systems as causal entities and then proceed to infer interactions between those (Power et al., 2011; Yeo et al., 2011). Smith et al. (2011) demonstrate that this should be done with caution: Network identification is strongly susceptible to slightly wrong or different definitions of the ROIs or the so-called atlas. Analyses based on Granger causality depend on the level of spatial aggregation and were shown to reflect the intra-areal properties instead of the interactions among brain regions if an ill-suited aggregation level is considered (Chicharro & Panzeri, 2014). Currently, there does not seem to be consensus as to which macroscopic entities and brain networks are the right ones to (causally) reason about cognitive processes (Uddin, Yeo, & Spreng, 2019). Furthermore, the observed variables themselves are already aggregates: A single fMRI voxel or the local field potential at some cortical location reflects the activity of thousands of neurons (Einevoll, Kayser, Logothetis, & Panzeri, 2013; Logothetis, 2008); EEG recordings are commonly considered a linear superposition of cortical electromagnetic activity, which has spurred the development of blind source separation algorithms that try to invert this linear transformation to recover the underlying cortical variables (Nunez & Srinivasan, 2006).

## CAUSALITY AND LEVERAGING ROBUSTNESS

### Robustness of Causal Models

*Y*and covariates

*X*

_{1}, …,

*X*

_{d}, as described in the running examples in the Running Examples section. Suppose that the system is observed in different environments. Suppose further that the generating process can be described by an SCM, that

**PA**(

*Y*) ⊆ {

*X*

_{1}, …,

*X*

_{d}} are the causal parents of

*Y*, and that the different environments correspond to different interventions on some of the covariates, although we neither (need to) know the interventions' targets nor its precise form. In our RT example, the two environments may represent two subjects (say, a left-handed subject right after having dinner and a trained race car driver just before a race) that differ in the mechanisms for

*X*

_{1},

*X*

_{3}, and

*X*

_{7}. Then, the joint distribution over

*Y*,

*X*

_{1}, …,

*X*

_{d}may be different between the environments, and also the marginal distributions may vary. Yet, if the interventions do not act directly on

*Y*, the causal model is invariant in the following sense: The conditional distribution of

*Y*|

**X**

_{PA(Y)}is the same in all environments. In the RT examples, this could translate to the neuronal causes that facilitate fast (vs. slow) RTs to be the same across subjects. This invariance can be formulated in different ways. For example, we have for all

*k*and

*ℓ*, where

*k*and

*ℓ*denote the indices of two environments, and for almost all

*x*

*E*represents the environment. In practice, we often work with model classes such as linear or logistic regression for modeling the conditional distribution

*Y*|

**X**

_{PA(Y)}. For such model classes, the above statements simplify. In case of linear models, for example, Equations 2 and 3 translate to regression coefficients and error variances being equal across different environments.

For an example, consider a system that, for environment *E* = 1, is governed by the following structural assignments:

with *N*_{1}, *N*_{2}, *N*_{3}, *N*_{4}, and *N*_{Y} mutually independent and standard-normal and where environment *E* = −1 corresponds to an intervention changing the weight of *X*_{1} in the assignment for *X*_{2} to −1. Here, for example, {*X*_{1}, *X*_{2}, *X*_{3}} and {*X*_{1}, *X*_{2}, *X*_{4}} are so-called invariant sets: The conditionals *Y*|*X*_{1}, *X*_{2}, *X*_{3} and *Y*|*X*_{1}, *X*_{2}, *X*_{4} are the same in both environments. The invariant models *Y*|*X*_{1}, *X*_{2}, *X*_{3} and *Y*|*X*_{1}, *X*_{2}, *X*_{4} generalize to a new environment *E* = −2, which changes the same weight to −2, in that they would still predict well. Note that *Y*|*X*_{1}, *X*_{2}, *X*_{4} is a noncausal modeland all invariant models include *X*_{1} and *X*_{2}. The lack of invariance of *Y*|*X*_{2} is illustrated by the different regression lines in the scatter plot above.

The validity of Equations 2 and 3 follows from the fact that the interventions do not act on *Y* directly and can be proved using the equivalence of Markov conditions (Peters, Bühlmann, & Meinshausen, 2016, Section 6.6; Lauritzen, 1996). Here, we try to argue that it also makes sense intuitively. Suppose that someone proposes to have found a complete causal model for a target variable *Y*, using certain covariates **X**_{S} (for *Y*, we may again think of the RT in Example A). Suppose that fitting that model for different subjects yields significantly different model fits—maybe even with different signs for the causal effects from variables in **X**_{S} to *Y* such that *E* ⫫ *Y*|**X**_{S} is violated. In this case, we would become skeptical about whether the proposed model is indeed a complete causal model. Instead, we might suspect that the model is missing an important variable describing how RT depends on brain activity.

In practice, environments can represent different sources of heterogeneity. In a cognitive neuroscience setting, environments may be thought of as different subjects who react differently, yet not arbitrarily so (cf. When Are Causal Models Important? section), to varying levels of alcohol consumption. Likewise, different experiments that are thought to involve the same cognitive processes may be thought of as environments; for example, the relationship “neuronal activity → RT” (cf. Example A., Running Examples section) may be expected to translate from an experiment that compares RTs after consumption of alcoholic versus nonalcoholic beers to another experiment where subjects are exposed to Burgundy wine versus grape juice. The key assumption is that the environments do not alter the mechanism of *Y*—that is, *f*_{Y}(**PA**(*Y*), *N*_{Y})—directly or, more formally, there are no interventions on *Y*. To test whether a set of covariates is invariant, as described in Equations 2 and 3, no causal background knowledge is required.

The above invariance principle is also known as “modularity” or “autonomy.” It has been discussed not only in the field of econometrics (Hoover, 2008; Aldrich, 1989; Haavelmo, 1944) but also in philosophy of science. Woodward (2004) discusses how the invariance idea rejects that “either a generalisation is a law or else is purely accidental.” In our notion, the criteria in Equations 2 and 3 depend on the environment *E*. In particular, a model may be invariant with respect to some changes, but not with respect to others. In this sense, robustness and invariance should always be considered with respect to a certain set of changes. Woodward (2004) introduces the possibility to talk about various degrees of invariance, beyond the mere existence or absence of invariance, while acknowledging that mechanisms that are sensitive even to mild changes in the background conditions are usually considered as not scientifically interesting. Cartwright (2003) analyzes the relationship between invariant and causal relations using linear deterministic systems and draws conclusions analogous to the ones discussed above. In the context of the famous Lucas critique (Lucas, 1976), it is debated to which extent invariance can be used for predicting the effect of changes in economic policy (Cartwright, 2009): Economy consists of many individual players who are capable of adapting their behavior to a change in policy. In cognitive neuroscience, we believe that the situation is different. Cognitive mechanisms do change and adapt, but not necessarily arbitrarily quickly. Some cognitive mechanism of an individual at the same day can be assumed to be invariant with respect to changes in the visual input, say. Depending on the precise setup, however, we may expect moderate changes of the mechanisms, say, for example, the development of cognitive function in children or learning effects. In other settings, where mechanisms may be subject to arbitrary large changes, scientific insight seems impossible (see When Are Causal Models Important? section).

Recently, the principle of invariance has also received increasing attention in the statistics and machine learning community (Arjovsky, Bottou, Gulrajani, & Lopez-Paz, 2019; Peters et al., 2016; Schölkopf et al., 2012). It can also be applied to models that do not have the form of an SCM. Examples include dynamical models that are governed by differential equations (Pfister, Bauer, & Peters, 2019).

### Distributional Robustness and Scarcity of Interventional Data

The idea of distributional robustness across changing background conditions may help us to falsify causal hypotheses, even when interventional data are difficult to obtain, and in this sense, may guide us toward models that are closer to the causal ground truth. For this, suppose that the data are obtained in different environments and that we expect a causal model for *Y* to yield robust performance across these environments (see Robustness of Causal Models section). Even if we lack targeted interventional data in cognitive neuroscience and thus cannot test a causal hypothesis directly, we can test the above implication. We can test the invariance, for example, using conditional independence tests or specialized tests for linear models (Chow, 1960). We can, as a surrogate, hold out one environment, train our model on the remaining environments, and evaluate how well that model performs on the held-out data (cf. Figure 2 for a possible procedure); the reasoning is that a noninvariant model may not exhibit robust predictive performance and instead yield a bad predictive performance for one or more of the folds. If a model fails the above, then either (1) we included the wrong variables, (2) we have not observed important variables, (3) the environment directly affects *Y*, or (4) the model class is misspecified. Tackling (1) and (4), we can try to refine our model and search for different variable representations and variable sets that render our model invariant and robust in the postanalysis. In general, there is no way to recover from (2) and (3), however.

Although a model that is not invariant across environments cannot be the complete causal model (assuming the environments do not act directly on the target variable), it may still have nontrivial prediction performance and predict better than a simple baseline method in a new, unseen environment. The usefulness of a model is questionable, however, if its predictive performance on held-out environments is not significantly better than a simple baseline. Conversely, if our model shows robust performance on the held-out data and is invariant across environments, it has the potential of being a causal model (although it need not be; see Robustness of Causal Models section for an example). Furthermore, a model that satisfies the invariance property is interesting in itself as it may enable predictions in new, unseen environments. For this line of argument, it does not suffice to employ a cross-validation scheme that ignores the environment structure and only assesses predictability of the model on data pooled across environments. Instead, we need to respect the environment structure and assess the distributional robustness of the model across these environments.

For an illustration of the interplay between invariance and predictive performance, consider a scenario in which *X*_{1} → *Y* → *H* → *X*_{2}, where *H* is unobserved. Here, we regard different subjects as different environments and suppose that (unknown to us) the environment acts on *H*: One may think of a variable *E* pointing into *H*. Let us assume that our study contains two subjects, one that we use for training and another one that we use as a held-out fold. We compare a model of the form $Y\u02c6$ = *f*(**X**_{PA(Y)}) = *f*(*X*_{1}) with a model of the form $Y\u02dc$ = *g*(**X**) = *g*(*X*_{1}, *X*_{2}). On a single subject, the latter model including all observed variables has more predictive power than the former model that only includes the causes of *Y*. The reason is that *X*_{2} carries information about *H*, which can be leveraged to predict *Y*. As a result, *g*(*X*_{1}, *X*_{2}) may predict *Y* well (and even better than *f*(*X*_{1})) on the held-out subject if it is similar to the training subject in that the distribution of *H* does not change between the subjects. If, however, *H* was considerably shifted for the held-out subject, then the performance of predicting *Y* by *g*(*X*_{1}, *X*_{2}) may be considerably impaired. Indeed, the invariance is violated, and we have *E* ⫫ *Y*|*X*_{1}, *X*_{2}. In contrast, the causal parent model *f*(*X*_{1}) may have worse accuracy on the training subject but satisfies invariance: Even if the distribution of *H* is different for held-out subjects compared to the training subject, the predictive performance of the model *f*(*X*_{1}) does not change. We have *E* ⫫ *Y*|*X*_{1}.

In practice, we often consider more than two environments. We hence have access to several environments when training our model, even if we leave out one of the environments to test on. In principle, we can thus already during training distinguish between invariant and noninvariant models. Although some methods have been proposed that explicitly make use of these different environments during training time (cf. Existing Methods Exploiting Robustness section), we regard this as a mainly unexplored but promising area of research. In the Exemplary Proof-of-concept EEG Analysis: Leave-One-Environment-Out Cross-validation section, we present a short analysis of classifying motor imagery conditions on EEG data that demonstrates how leveraging robustness may yield models that generalize better to unseen subjects.

In summary, employing distributional robustness as a guiding principle prompts us to reject models as noncausal if they are not invariant or if they do not generalize better than a simple baseline to unseen environments, such as sessions, days, neuroimaging modalities, subjects, or other slight variations to the experimental setup. Models that are distributionally robust and do generalize to unseen environments are not necessarily causal but satisfy the prerequisites for being interesting candidate models when it comes to capturing the underlying causal mechanisms.

#### Exemplary Proof-of-concept EEG Analysis: Leave-One-Environment-Out Cross-validation

Here, we illustrate the proposed cross-validation scheme presented in Figure 2 on motor imagery EEG data due to Tangermann et al. (2012). The data consist of EEG recordings of nine subjects performing multiple trials of four different motor imagery tasks. For each subject, 22-channel EEG recordings at a 250-Hz sampling frequency are available for 2 days with six runs of 48 trials each. We analyzed the publicly available data that are bandpass filtered between 0.5 and 100 Hz and 50-Hz notch filtered. The data were further preprocessed by rereferencing to common average reference (car) and projecting onto the orthogonal complement of the null component. Arguably, the full causal structure in this problem is unknown. Instead of assessing the causal nature of a model directly, we therefore evaluate whether distributional robustness of a model across training subjects may help to find models that generalize better to new unseen subjects.

One hundred twenty models were derived on the training data comprising recordings of the first five subjects. Models relied on six different sets of extracted time series components: the rereferenced EEG channels, three different sets of five PCA components with varying variance-explained ratios, and two sets of five confounding-robust ICA (coroICA) components using neighboring covariance pairs and a partition size of 15 sec (cf. Pfister, Weichwald, Bühlmann, & Schölkopf, 2019, for more details on coroICA). Signals were bandpass filtered in four different frequency bands (8–30, 8–20, 20–30, and 58–80 Hz). For each trial and feature set, bandpower features for classification were obtained as the log-variance of the bandpass-filtered signal during Seconds 3–6 of each trial. For each of the 6 × 4 = 24 configurations of trial features, we fitted five different linear discriminant analysis classifiers without shrinkage, with automatic shrinkage based on the Ledoit–Wolf lemma and shrinkage parameter settings of 0.2, 0.5, and 0.8. These 120 pipelines were fitted once on the entire training data and classification accuracies and areas under the receiver operating curve (AUC) scores obtained on four held-out subjects (*y* axes in Figure 3). Classifier performance was cross-validated on the training data following three different cross-validation schemes (cross-validation scores are shown on the *x* axes in Figure 3):

**loso-cv:***Leave-one-subject-out cross-validation*is the proposed cross-validation scheme. We hold out data corresponding to each training subject once, fit an linear discriminant analysis classifier on the remaining training data, and assess the model's accuracy on the held-out training subject. The average of those cross-validation scores reflects how robustly each of the 120 classifier models performs across environments (here, subjects).**lobo-cv:***Leave-one-block-out cross-validation*is a sevenfold cross-validation scheme that is similar to the above loso-cv scheme, where the training data are split into seven random blocks of (roughly) equal size. Not respecting the environment structure within the training data, this cross-validation scheme does not capture a model's robustness across environments.**looo-cv:***Leave-one-observation-out cross-validation*leaves out a single observation and is equivalent to lobo-cv with a block size of one.

In Figure 3, we display the results of the different cross-validation schemes and the Kendall's τ rank correlation between the different cv-scores derived on the training data and a model's classification performance on the four held-out subjects. The loso-cv scores correlate slightly more strongly with held-out model performance and thereby slightly better resolve the relative model performance. Considering the held-out performance for the models with top cv-scores, we observe that selecting models based on the loso-cv score may indeed select models that tend to perform slightly better on new unseen subjects. Furthermore, comparing the displacement of the model scores from the diagonal shows that the loso-cv scheme's estimates are less biased than the lobo and looo cross-validation scores, when used as an estimate for the performance on held-out subjects; this is in line with Varoquaux et al. (2017).

### Robustness and Variable Constructions

*Y*is caused by the two brain signals

*X*

_{1}and

*X*

_{2}via

*N*

_{Y}. Assume further that the environment influences the covariates

*X*

_{1}and

*X*

_{2}via

*X*

_{1}=

*X*

_{2}+

*E*+

*N*

_{1}and

*X*

_{2}=

*E*+

*N*

_{2}but does not influence

*Y*directly. Here,

*X*

_{1}and

*X*

_{2}may represent neuronal activity in two brain regions that are causal for RTs, whereas

*E*may indicate the time of day or respiratory activity. We then have the invariance property

*X*

_{1}+

*X*

_{2}, then whenever α ≠ β, we would find that

*X*

_{1},

*X*

_{2}) leading to the same value of $X\u02dc$: Given $X\u02dc$ = $x\u02dc$, the value of

*E*holds information about whether say (

*X*

_{1},

*X*

_{2}) = ($x\u02dc$, 0) or (

*X*

_{1},

*X*

_{2}) = (0, $x\u02dc$) is more probable and thus—because

*X*

_{1}and

*X*

_{2}enter

*Y*with different weights—holds information about

*Y*;

*E*and

*Y*are conditionally dependent given $X\u02dc$. Thus, the invariance may break down when aggregating variables in an ill-suited way. This example is generic in that the same conclusions hold for all assignments

*X*

_{1}=

*f*

_{1}(

*X*

_{2},

*E*,

*N*

_{1}) and

*X*

_{2}=

*f*

_{2}(

*E*,

*N*

_{2}), as long as causal minimality, a weak form of faithfulness, is satisfied (Spirtes et al., 2000).

Rather than taking the lack of robustness as a deficiency, we believe that this observation has the potential to help us finding the right variables and granularity to model our system of interest. If we are given several environments, the guiding principle of distributional robustness can nudge our variable definition and ROI definition toward the construction of variables that are more suitable for causally modeling some cognitive function. If some ROI activity or some EEG bandpower feature does not satisfy any invariance across environments, then we may conclude that our variable representation is misaligned with the underlying causal mechanisms or that important variables have not been observed (assuming that the environments do not act on *Y* directly).

This idea can be illustrated by a thought experiment that is a variation of the LDL–HDL example in the Challenge 2: Finding the Right Variables section: Assume we wish to aggregate multiple voxel activities and represent them by the activity of an ROI defined by those voxels. For example, let us consider the RT scenario (Example A, Running Examples section) and voxels *X*_{1}, …, *X*_{d}. Then, we may aggregate the voxels $X\xaf$ = $\Sigma i=1d$*X*_{i} to obtain a macro-level model in which we can still sensibly reason about the effect of an intervention on the treatment variable *T* onto the distribution of $X\xaf$, the ROIs' average activity. Yet, the model is in general not causally consistent with the original model. First, our ROI may be chosen too coarsely, such that, for $x\u02c6$ ≠ $x\u02dc$ ∈ ℝ^{d} with $\Sigma i=1d$$x\u02c6$_{i} = $\Sigma i=1d$$x\u02dc$_{i} = $x\xaf$, the interventional distributions induced by the micro-level model corresponding to setting all *X*_{i} = $x\u02dc$_{i} or, alternatively, to *X*_{i} = $x\u02c6$_{i} do not coincide—for example, an ROI that effectively captures the global average voxel activity cannot resolve whether a higher activity is because of increased RT-driving neuronal entities or because of some upregulation of other neuronal processes unrelated to the RT, such as respiratory activity. This ROI would be ill-suited for causal reasoning and nonrobust as there are two micro-level interventions that imply different distributions on the RTs while corresponding to the same intervention setting in the macro-level model. Second, our ROI may be defined too fine grained such that, for example, the variable representation does only pick up on the left-hemisphere hub of a distributed neuronal process relevant for RTs. If the neuronal process has different laterality in different subjects, then predicting the effects of interventions on only the left-hemispherical neuronal activity cannot be expected to translate to all subjects. Here, a macrovariable that averages more voxels, say symmetric of both hemispheres, may be more robust to reason about the causes of RTs than the more fine-grained unilateral ROI. In this sense, seeking for variable constructions that enable distributionally robust models across subjects may nudge us to meaningful causal entities. The spatially refined and finer resolved cognitive atlas obtained by Varoquaux et al. (2018), whose map definition procedure was geared toward an atlas that would be robust across multiple studies and 196 different experimental conditions, may be seen as an indicative manifestation of the above reasoning.

### Existing Methods Exploiting Robustness

We now present some existing methods that explicitly consider the invariance of a model. Although many of these methods are still in their infancy when considering real-world applications, we believe that further development in that area could play a vital role when tackling causal questions in cognitive neuroscience.

#### Robust ICA

ICA is commonly performed in the analysis of magnetoencephalography and EEG data to invert the inevitable measurement transformation that leaves us with observations of a linear superposition of underlying cortical (and noncortical) activity. The basic ICA model assumes our vector of observed variables *X* is being generated as *X* = *AS* where *A* is a mixing matrix and *S* = [*S*_{1}, …, *S*_{d}]^{⊤} is a vector of unobserved mutually independent source signals. The aim is to find the unmixing matrix *V* = *A*^{−1}. If we perform ICA on individual subjects' data separately, the resulting unmixing matrices will often differ between subjects. This not only hampers the interpretation of the resulting sources as some cortical activity that we can identify across subjects, it also hints—in light of the above discussion—at some unexplained variation that is due to shifts in background conditions between subjects such as different cap positioning or neuroanatomical variation. Instead of simply pooling data across subjects, Pfister, Weichwald, et al. (2019) propose a methodology that explicitly exploits the existence of environments, that is, the fact that EEG samples can be grouped by subjects they were recorded from. This way, the proposed coroICA procedure identifies an unmixing of the signals that generalizes to new subjects. The additional robustness resulted, for their considered example, in improved classification accuracies on held-out subjects and can be viewed as a first-order adjustment for subject-specific differences. The application of ICA procedures to pooled data will generally result in components that do not robustly transfer to new subjects and are thus necessarily variables that do not lend themselves for a causal interpretation. The coroICA procedure aims to exploit the environments to identify unmixing matrices that generalize across subjects.

#### Causal Discovery with Exogenous Variation

*Y*within a set of covariates

*X*

_{1}, …,

*X*

_{d}. We have argued that the true causal model is invariant across environments (see Equation 2), if the data are obtained in different environments and the environment does not directly influence

*Y*. That is, when enumerating all invariant models by searching through subsets of

*X*

_{1}, …,

*X*

_{d}, one of these subsets must be the set of causal parents of

*Y*. As a result, the intersection $S\u02c6$ = ∩

_{S:Sinvariant}

*S*of all sets of covariates that yield invariant models is guaranteed to be a subset of the causes

**PA**(

*Y*) of

*Y*. (Here, we define the intersection over the empty index set as the empty set.) Testing invariance with a hypothesis test to the level α, say α = 0.05, one obtains that $S\u02c6$ is contained in the set of parents of

*Y*with high probability

Under faithfulness, the method can be shown to be robust against violation of the above assumptions. If the environment acts on *Y* directly, for example, there is no invariant set, and in the presence of hidden variables, the intersection $S\u02c6$ of invariant models can still be shown to be a subset of the ancestors of *Y* with large probability.

It is further possible to model the environment as a random variable (using an indicator variable, for example), which is often called a context variable. One can then exploit the background knowledge of its exogeneity to identify the full causal structure instead of focusing on identifying the causes of a target variable. Several approaches have been suggested (e.g., Mooij, Magliacane, & Claassen, 2020; Zhang, Huang, Zhang, Glymour, & Schölkopf, 2017; Eaton & Murphy, 2007; Spirtes et al., 2000). Often, these methods first identify the target of the intervention and then exploit known techniques of constraint- or score-based methods. Some of the above methods also make use of time as a context variable or environment (Pfister, Bühlmann, & Peters, 2019; Zhang et al., 2017).

#### Anchor Regression

*Y*, predictor variables

**X**, and so-called anchor variables

**A**= [

*A*

_{1}, …,

*A*

_{q}]

^{⊤}that represent the different environments and are normalized to have unit variance, the anchor regression coefficients are obtained as solutions to the following minimization problem

Higher parameters γ steer the regression toward more invariant predictions (converging against the two-stage least squares solutions in identifiable instrumental variable settings). For γ = 0, we recover the ordinary least square solution. The solution $b\u02c6\gamma $ can be shown to have the best predictive power under shift interventions up to a certain strength that depends on γ. As before, the anchor variables can code time, environments, subjects, or other factors, and we thus obtain a regression that is distributionally robust against shifts in those factors.

### Time Series Data

So far, we have mainly focused on the setting of i.i.d. data. Most of the causal inference literature dealing with time dependency considers discrete-time models. This comes with additional complications for causal inference. For example, there are ongoing efforts to adapt causal inference algorithms and account for subsampling or supsampling and temporal aggregation (Hyttinen, Plis, Järvisalo, Eberhardt, & Danks, 2016; Danks & Plis, 2013). Problems of temporal aggregation relate to Challenge 2 of finding the right variables, which is a conceptual problem in time series models that requires us to clarify our notion of intervention for time-evolving systems (Rubenstein et al., 2017). When we observe time series with nonstationarities, we may consider these as resulting from some unknown interventions. That is, nonstationarities over time may be due to shifts in the background conditions and, as such, can be understood as shifts in environments analogous to the i.i.d. setting. This way, we may again leverage the idea of distributional robustness for inference on time-evolving systems for which targeted interventional data are scarce. Extensions of invariant causal prediction to time series data that aim to leverage such variation have been proposed by Christiansen and Peters (2020) and Pfister, Bühlmann, et al. (2019), and the ICA procedure described in the Existing Methods Exploiting Robustness section also exploits nonstationarity over time. SCMs extend to continuous-time models (Peters, Bauer, & Pfister, 2020), where the idea to trade off prediction and invariance has been applied to the problem of inferring chemical reaction networks (Pfister, Bauer, et al., 2019).

A remark is in order if we wish to describe time-evolving systems by one causal summary graph where each time series component is collapsed into one node: For this to be reasonable, we need to assume a time-homogeneous causal structure. Furthermore, it requires us to carefully clarify its causal semantics: Although summary graphs can capture the existence of cause–effect relationships, they do in general not correspond to an SCM that admits causal semantics or enables interventional predictions that are consistent with the underlying time-resolved SCM (Janzing et al., 2018; Rubenstein et al., 2017). That is, the wrong choice of time-agnostic variables and corresponding interventions may be ill-suited to represent the cause–effect relationships of a time-evolving system (cf. Challenge 2 and Janzing et al., 2018; Rubenstein et al., 2017).

## CONCLUSION AND FUTURE WORK

Causal inference in cognitive neuroscience is ambitious. It is important to continue the open discourse about the many challenges, some of which are mentioned above. Thanks to the open and critical discourse, there is great awareness and caution when interpreting neural correlates (Rees, Kreiman, & Koch, 2002). Yet, “FC [functional connectivity] researchers already work within a causal inference framework, whether they realise it or not” (Reid et al., 2019).

In this article, we have provided our view on the numerous obstacles to a causal understanding of cognitive function. If we, explicitly or often implicitly, ask causal questions, we need to employ causal assumptions and methodology. We advocate distributional robustness as a guiding principle for causality in cognitive neuroscience. In particular, we propose to exploit that causal models using the right variables are distributionally robust. Although causal inference in general and in cognitive neuroscience in particular is a challenging task, we can at least exploit the rational to refute models and variables as noncausal that are frail to shifts in the environment. This guiding principle does not necessarily identify causal variables or causal models, but it nudges our search into the right direction away from frail models and noncausal variables. Although we presented first attempts that aim to leverage observations obtained in different environments (cf. Existing Methods Exploiting Robustness section), this article poses more questions for future research than it answers.

We believe that procedures that exploit environments during training are a promising avenue for future research. Although we saw mildly positive results in our case study, further research needs to show whether this trend persists in studies with many subjects. It may be possible to obtain improvements when combining predictive scores on held-out subjects with other measures of invariance and robustness. The development of such methods may be spurred and guided by field-specific benchmarks (or competitions) that assess models' distributional robustness across a wide range of scenarios, environments, cognitive tasks, and subjects.

When considering robustness or invariance across training environments, the question arises how the ability to infer causal structure and the generalization performance to unseen environments depends on the number of training environments. Although first attempts have been made to theoretically understand that relation (Christiansen, Pfister, Jakobsen, Gnecco, & Peters, 2020; Rothenhäusler et al., 2018), most of the underlying questions are still open. We believe that an answer would depend on the strength of the interventions, the sample size, the number of available training environments, the complexity of the model class, and, possibly, properties of the (meta-)distribution of environments.

We believe that advancements regarding the errors-in-variable problem may have important implications for cognitive neuroscience. Nowadays, we can obtain neuroimaging measurements at various spatial and temporal resolutions using, among others, fMRI, magnetoencephalography and EEG, PET, or near-infrared spectroscopy (Poldrack, 2018; Filler, 2009). Yet, all measurement modalities are imperfect and come with different complications. One general problem is that the observations are corrupted by measurement noise. The errors-in-variable problem complicates even classical regression techniques where we wish to model *Y* ≈ β*X** + ϵ but only have access to observations of a noise-corrupted *X* = *X** + η (Schennach, 2016). This inevitably carries over and hurdles causal inference as the measurement noise spoils conditional independence testing, biases any involved regression steps, and troubles additive noise approaches that aim to exploit noise properties for directing causal edges and methods testing for invariance. First steps addressing these issues in the context of causal discovery have been proposed by (Blom, Klimovskaia, Magliacane, & Mooij, 2018; Zhang et al., 2018; Scheines & Ramsey, 2016).

Summarizing, we believe that there is a need for causal models if we aim to understand the neuronal underpinnings of cognitive function. Only causal models equip us with concepts that allow us to explain, describe, predict, manipulate, deal and interact with, and reason about a system and that allow us to generalize to new, unseen environments. A merely associational model suffices to predict naturally unfolding disease progression, for example. We need to obtain understanding in form of a causal model if our goal is to guide rehabilitation after cognitive impairment or to inform the development of personalized drugs that target specific neuronal populations. Distributional robustness and generalizability to unseen environments is an ambitious goal, in particular, in biological systems and even more so in complex systems such as the human brain. Yet, it may be the most promising way forward.

## Acknowledgments

The authors thank the anonymous reviewers for their constructive and helpful feedback on an earlier version of this article. S. W. was supported by the Carlsberg Foundation.

Reprint requests should be sent to Sebastian Weichwald, Department of Mathematical Sciences, University of Copenhagen, Universitetsparken 5, Copenhagen, Denmark 2100, or via e-mail: sweichwald@math.ku.dk.