## Abstract

In this letter, we study the confounder detection problem in the linear model, where the target variable $Y$ is predicted from its $n$ potential causes $X_n = (x_1, \dots, x_n)^T$. Under the assumption of a rotation-invariant generating process of the model, a recent study shows that the spectral measure induced by the regression coefficient vector with respect to the covariance matrix of $X_n$ is close to a uniform measure in purely causal cases, but deviates from a uniform measure in a characteristic way in the presence of a scalar confounder. Analyzing spectral measure patterns can therefore help detect confounding. In this letter, we propose using the first moment of the spectral measure for confounder detection. We calculate the first moment of the regression-vector-induced spectral measure and compare it with the first moment of a uniform spectral measure, both defined with respect to the covariance matrix of $X_n$. The two moments coincide in nonconfounding cases and differ in the presence of confounding. This statistical causal-confounding asymmetry can be used for confounder detection. Because it does not analyze the spectral measure pattern, our method avoids the difficulty of metric choice and multiple-parameter optimization. Experiments on synthetic and real data show the performance of this method.

## 1  Introduction

In many real-world applications, we face the problem of estimating the causal effects of a set of sources $X_n = (x_1, \dots, x_n)^T$ on one target variable $Y$. We restrict our attention to linear causal models (Hoyer, Shimizu, Kerminen, & Palviainen, 2008; Hyvärinen, Zhang, Shimizu, & Hoyer, 2010; Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006; Shimizu et al., 2011; Hyvärinen & Smith, 2013). To this end, one can fit a linear regression model to the observations and check the coefficients. For variables that are statistically dependent, we would in most cases obtain nonzero regression coefficients.1 If $x_j$ has a significant regression coefficient, it is believed to have a large causal influence on $Y$. However, the correctness of this reasoning rests on the causal sufficiency assumption that there is no hidden confounder of $X_n$ and $Y$, which cannot be verified from the regression procedure itself. Simply checking the coefficient vector does not give us enough information to identify a confounder. With the causal sufficiency assumption unverified, estimating causal effects by regression can be problematic: one never knows whether the coefficients purely describe the influence of $X_n$ on $Y$ or whether they are significant because the variables share a hidden common driving force. Confounder detection is therefore important; it essentially acts as a verification procedure for the causal sufficiency assumption. For further analysis, we write down a mathematical model and denote the nonobservable confounder by $Z$. Following Janzing and Schölkopf (2017), we assume that $Z$ is a one-dimensional variable and consider the model
$X_n = b_n Z + E_n,$
(1)
$Y = a_n^T X_n + cZ + F,$
(2)
where $a_n, b_n$ are $n$-dimensional vectors and $E_n$ is the $n$-dimensional noise. $F$ is the one-dimensional noise, and $c$ is a scalar. When $\|b_n\|$ and $c$ are both nonzero, $Z$ is a confounder of $X_n$ and $Y$ (Janzing & Schölkopf, 2017).2 A least squares regression of $Y$ on $X_n$ gives the regression coefficient
$\tilde{a}_n = \Sigma_{X_n}^{-1} \Sigma_{X_n Y}.$
(3)
The covariance matrices are
$\Sigma_{X_n Y} = (\Sigma_{E_n} + b_n b_n^T) a_n + c\, b_n, \qquad \Sigma_{X_n} = \Sigma_{E_n} + b_n b_n^T.$
Note that we assume the variance of $Z$ is 1 here. This assumption is justified because it can always be made true by rescaling $b_n$ and $c$. Then we get
$\tilde{a}_n = a_n + c\, (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n.$
(4)

The regression coefficient thus consists of two parts: one describing the causal influence of $X_n$ on $Y$ and the other describing the confounding effect. As this decomposition reveals, the regression coefficient behaves differently in confounding and nonconfounding cases. Consider the following two cases:

• Purely causal cases: $\|b_n\| = 0$ or $c = 0$. In this case,
$\tilde{a}_n = a_n.$
• Confounding cases: $\|b_n\| \neq 0$ and $c \neq 0$. In this case,
$\tilde{a}_n = a_n + c\, (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n.$

For ease of explanation, we write $\tilde{a}_n$ as the composition of the causal part and the confounding part:
$\tilde{a}_n = \underbrace{a_n}_{\text{causal part}} + \underbrace{c\, (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n}_{\text{confounding part}}.$
When one conducts a regression, the coefficients obtained may contain both parts. Our goal is to tell if there is a confounder.
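The decomposition in equation 4 can be checked with a quick population-level simulation of the model in equations 1 and 2. The following is an illustrative sketch only; the dimension, seed, and covariance construction are arbitrary choices of ours, not from the letter:

```python
import numpy as np

# Sketch: verify a_tilde = a + c * Sigma_X^{-1} b (equation 4) at the
# population level, using Sigma_XY = (Sigma_E + b b^T) a + c b.
rng = np.random.default_rng(0)
n = 50
a = rng.uniform(-0.5, 0.5, n)            # causal coefficients a_n
b = rng.uniform(-0.5, 0.5, n)            # confounder loadings b_n
c = 2.0                                  # confounder-to-Y strength
G = rng.standard_normal((n, n))
Sigma_E = G @ G.T / n + np.eye(n)        # positive definite noise covariance

Sigma_X = Sigma_E + np.outer(b, b)       # Cov(X_n), with Var(Z) = 1
Sigma_XY = Sigma_X @ a + c * b           # Cov(X_n, Y)

a_tilde = np.linalg.solve(Sigma_X, Sigma_XY)       # regression coefficients
confounding_part = c * np.linalg.solve(Sigma_X, b)
print(np.allclose(a_tilde, a + confounding_part))  # True
```

The regression recovers exactly the causal vector plus the confounding term, which is the asymmetry exploited in the rest of the letter.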

Janzing and Schölkopf (2017) propose a method to achieve this goal. The core idea is based on so-called generic orientation theory, motivated by recent advances in causal discovery that exploit a certain independence between cause and mechanism (Liu & Chan, 2016b, in press; Janzing, Hoyer, & Schölkopf, 2010; Lemeire & Janzing, 2013). The method is built on a core object, the vector-induced spectral measure with respect to $\Sigma_{X_n}$, which intuitively describes the squared lengths of the components of a vector projected onto the eigenspace of $\Sigma_{X_n}$. We mention the spectral measure many times below; by default, it is induced with respect to $\Sigma_{X_n}$. Based on a rotation-invariant model-generating assumption and the concentration-of-measure phenomenon on high-dimensional spheres (Marton, 1996; Talagrand, 1995; Shiffman & Zelditch, 2003; Popescu, Short, & Winter, 2006), Janzing and Schölkopf (2017) posit two asymptotic statements. First, the $a_n$-induced spectral measure and the $c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n$-induced spectral measure each have their own characteristic pattern. Second, the $\tilde{a}_n$-induced spectral measure is a direct summation of the two measures in the first statement. Given the observed joint distribution of $Y$ and $X_n$, we can compute the $\tilde{a}_n$-induced spectral measure. We then match the observed measure with a convex combination of two spectral measures, one approximating the $a_n$-induced spectral measure and the other approximating the $c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n$-induced spectral measure. We tune the weights of the two measures and record, in the best match, the weight of the part approximating the $c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n$-induced spectral measure. This weight, in a certain sense, records the amount of confounding. Although the confounding strength can be quantitatively estimated by this method, its drawbacks are clear:

• The two asymptotic statements are justified by weak convergence only, and the pattern approximations, as well as measure decomposition, should be interpreted in a sufficiently loose sense. As a consequence, the total variation distance may fail to serve as a good metric when one compares the reconstructed spectral measure with the observed one. However, the optimal choice of metrics stays vague, and the method (Janzing & Schölkopf, 2017) depends on unjustified heuristic choices (kernel smoothing), which may lead to a wrong “equal or not” conclusion.

• The method needs to tune two parameters: the weight in the reconstruction and a parameter related to the approximation of the $c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n$-induced spectral measure. Optimizing over a two-parameter space requires a finely grained search, and error control is not easy.

These points reduce the reliability of the conclusions drawn by the method.

Can we identify confounding without reconstructing the whole spectral measure? This letter provides an answer to this question. Recall that an important characteristic of a measure is its moments. We propose to use moment information directly for confounder detection. We focus on the first moment and show that the first moment of the spectral measure induced by $\tilde{a}_n$ already behaves differently in causal and confounding scenarios. To assess this "behavior" quantitatively, we later design a deviation measurement that quantifies the difference between the first moment of the induced spectral measure and that of a uniform measure. That the moment "behaves differently" is then justified by different asymptotic values of the deviation measurement in causal and confounding cases. This statistic is easy to compute and provides enough information for confounder detection. Our method avoids the drawbacks of the method of Janzing and Schölkopf (2017):

• Without the need to match spectral measure patterns, we avoid the vagueness of "interpreting the approximations loosely" and the difficulty of metric choice. Instead, we compare the first moment of the spectral measure induced by $\tilde{a}_n$ with respect to $\Sigma_{X_n}$ with the first moment of the uniform (tracial) spectral measure on $\Sigma_{X_n}$, and draw conclusions from their difference.

• The only parameter we need is a threshold on the deviation measurement. Simultaneously optimizing two parameters, as the spectral measure pattern-matching method (Janzing & Schölkopf, 2017) does, is avoided.

These points justify the usability of our proposed method, which may provide a better solution than the existing one. We detail our method in the following sections, presenting both theoretical and empirical analyses. We begin by describing related work.

## 2  Related Work

We describe the method by Janzing and Schölkopf (2017) for confounder detection. The basic idea is that the causal part and confounding part have their own features in induced spectral measures with respect to the covariance matrix of the cause, such that each part can be approximated and combined. To understand this, we give some basic definitions.

Definition 1
(eigendecomposition). The eigendecomposition of $\Sigma_{X_n}$ is
$\Sigma_{X_n} = U_n \Lambda_n U_n^T = \sum_{i=1}^n \lambda_i u_i u_i^T,$
(2.1)
where the matrices
$U_n = [u_1, \dots, u_n],$
(2.2)
$\Lambda_n = \mathrm{diag}(\lambda_1, \dots, \lambda_n),$
(2.3)
are the matrix of eigenvectors and the diagonal matrix containing all eigenvalues, respectively.
Definition 2
(vector-induced spectral measure). The spectral measure induced by an $n$-dimensional vector $\phi_n$ with respect to the covariance matrix $\Sigma_{X_n}$ in equation 2.1 is defined as
$\mu_{\Sigma_{X_n},\phi_n} = \sum_{i=1}^n \langle \phi_n, u_i \rangle^2 \delta_{\lambda_i},$
(2.4)
where $\lambda_i$, taken from $\Lambda_n$ in equation 2.3, is the $i$th eigenvalue and $u_i$ is the corresponding eigenvector. $\delta_s$ is the point measure at $s \in \mathbb{R}$.

Later we will also need a tracial (uniform) spectral measure, and we here define it formally.

Definition 3
(tracial spectral measure). The normalized tracial spectral measure defined on the covariance matrix $\Sigma_{X_n}$ in equation 2.1 is
$\mu^\tau_{\Sigma_{X_n}} = \frac{1}{n} \sum_{i=1}^n \delta_{\lambda_i},$
(2.5)
where $\delta_s$ is the point measure at $s \in \mathbb{R}$.
Remark 1.

We also call the tracial spectral measure a "uniform spectral measure." Note that we no longer make the nondegeneracy assumption on $\Sigma_{X_n}$ that Janzing and Schölkopf (2017) make. Thus, "uniform" should be interpreted in a more general sense than uniformly spreading over a domain in $\mathbb{R}$: the weight of each point measure is equal, while the point measures are allowed to overlap with each other.

For numerical computations (matching the spectral measure pattern), the induced spectral measure can be represented by two vectors: a vector $\lambda_{\Sigma_{X_n}} = (\lambda_1, \dots, \lambda_n)^T$ containing its support and a vector $\omega_{\Sigma_{X_n},\phi_n}$ containing the values of the spectral measure at those support points. We give the formal definitions.

Definition 4
(vectorized representation of spectral measure). For the vector-induced spectral measure $\mu_{\Sigma_{X_n},\phi_n}$ in equation 2.4, we use two vectors to represent it:
$\lambda_{\Sigma_{X_n}} = (\lambda_1, \dots, \lambda_n)^T,$
(2.6)
$\omega_{\Sigma_{X_n},\phi_n} = (\langle \phi_n, u_1 \rangle^2, \dots, \langle \phi_n, u_n \rangle^2)^T.$
(2.7)
For the tracial spectral measure, we can use
$\lambda_{\Sigma_{X_n}} = (\lambda_1, \dots, \lambda_n)^T,$
(2.8)
$\omega^\tau_{\Sigma_{X_n}} = \left( \tfrac{1}{n}, \dots, \tfrac{1}{n} \right)^T.$
(2.9)
The $i$th element $\omega_{\Sigma_{X_n},\phi_n}(i)$ of the vector records the value of the spectral measure at $\lambda_i$. One can compute this vector as
$\omega_{\Sigma_{X_n},\phi_n} = U_n^T \phi_n \circ U_n^T \phi_n,$
where $\circ$ denotes the Hadamard product. Intuitively, this is a vector describing the squared coordinates of $\phi_n$ with respect to the eigenbasis of $\Sigma_{X_n}$. It records the pattern of the spectral measure and is used for computational tasks like pattern matching. Now consider $\mu_{\Sigma_{X_n},\tilde{a}_n}$, the spectral measure induced by the regression vector $\tilde{a}_n$ with respect to the covariance matrix $\Sigma_{X_n}$. Janzing and Schölkopf (2017) show that when $n$ is large, this spectral measure can be decomposed as
$\mu_{\Sigma_{X_n},\tilde{a}_n} \approx \underbrace{\mu_{\Sigma_{X_n},a_n}}_{\text{causal part}} + \underbrace{\mu_{\Sigma_{X_n},c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n}}_{\text{confounding part}},$
with so-called generic orientation theory. As a consequence, the vectorized representations have the property
$\omega_{\Sigma_{X_n},\tilde{a}_n} \approx \underbrace{\omega_{\Sigma_{X_n},a_n}}_{\text{causal part}} + \underbrace{\omega_{\Sigma_{X_n},c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n}}_{\text{confounding part}}.$
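The vectorized representation is straightforward to compute. A minimal sketch, using an arbitrary random covariance matrix of our own choosing:

```python
import numpy as np

# Sketch of Definition 4: omega = (U^T phi) o (U^T phi), the Hadamard
# square of the coordinates of phi in the eigenbasis of Sigma_X.
rng = np.random.default_rng(1)
n = 20
G = rng.standard_normal((n, n))
Sigma_X = G @ G.T / n                # a random PSD covariance matrix
phi = rng.standard_normal(n)

lam, U = np.linalg.eigh(Sigma_X)     # eigenvalues lam_i, eigenvectors u_i
omega = (U.T @ phi) ** 2             # Hadamard product of U^T phi with itself

# Each entry is <phi, u_i>^2; the entries are nonnegative and sum to ||phi||^2.
print(np.isclose(omega.sum(), phi @ phi))  # True
```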
Then, given the observations, the confounding can be estimated using the following steps:
1. Approximate the spectral measure induced by the causal part as $\omega_{\Sigma_{X_n},a_n} = \omega^\tau_{\Sigma_{X_n}}$ in equation 2.9.

2. Approximate the spectral measure induced by the confounding part as
$\omega^\nu_{\Sigma_{X_n},c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n} = \frac{1}{\|H_\nu^{-1} 1_n\|^2} \omega_{H_\nu, H_\nu^{-1} 1_n}, \qquad H_\nu = \Lambda_n + \frac{\nu}{n} 1_n 1_n^T,$
where $\omega_{H_\nu, H_\nu^{-1} 1_n}$ is the vectorized representation of the spectral measure induced by $H_\nu^{-1} 1_n$ with respect to $H_\nu$, $\Lambda_n$ is the matrix in equation 2.3, $1_n$ is the vector of all 1s, and $\nu$ is a parameter.
3. Compute $\frac{1}{\|\hat{\tilde{a}}_n\|^2} \hat{\omega}_{\Sigma_{X_n},\tilde{a}_n}$ from the observations, and find the parameters $\beta^*, \nu^*$ that minimize the reconstruction error:
$(\beta^*, \nu^*) = \arg\min_{\beta,\nu} \left\| \frac{1}{\|\hat{\tilde{a}}_n\|^2} \hat{\omega}_{\Sigma_{X_n},\tilde{a}_n} - (1-\beta)\, \omega_{\Sigma_{X_n},a_n} - \beta\, \omega^\nu_{\Sigma_{X_n},c(\Sigma_{E_n}+b_n b_n^T)^{-1}b_n} \right\|_K,$
where $\|\cdot\|_K$ is a kernel-smoothed metric (Janzing & Schölkopf, 2017).

The "confounding or not" conclusion then relies on $\beta^*$: if $\beta^*$ is significant, it indicates a clear confounding effect. As mentioned, the method has clear drawbacks. The weak convergence property in infinite dimensions makes the practical success of this method depend heavily on a good distance metric. To illustrate, consider an example of a purely causal model. We generate a coefficient vector $a_n$ of dimension 10, with each entry drawn uniformly from $[-0.5, 0.5]$. We then normalize it to unit norm and calculate the induced spectral measure (vectorized representation) with respect to a random covariance matrix.3 One can see that the practical spectral measure differs considerably from the uniform one. To match the two patterns (vectorized representations) and conclude "purely causal," one must adjust the weights on different dimensions. However, the optimal choice of metric remains unknown in the pattern-matching method (Janzing & Schölkopf, 2017); it relies only on eigenvalue-gap-related heuristic kernel smoothing. In this example, the kernel smoothing matrix would not be a good choice, since the eigenvalue gaps are quite random compared to the spectral pattern. One may then draw the wrong conclusion that the pattern in Figure 1a differs greatly from a uniform measure and that the model is a confounding one.

Figure 1:

Practical spectral measure induced by causal part and the uniform one.


In summary, Janzing and Schölkopf's (2017) method relies on analyzing the pattern of the spectral measure. In purely causal cases, the pattern is uniform; the presence of a confounding vector modifies it in a characteristic way that can be detected. However, the weak convergence property makes it very hard to choose a metric for comparing the reconstructed measure with the practical one, hindering a good understanding of the pattern. Since reconstructing confounding by measure approximation and combination is a hard task, why not directly check the moment information? In this way, we avoid the difficulties of metric choice and multiple-parameter optimization that one faces when using the pattern-matching method. We will focus on the first moment and show that checking the first moment is enough to identify a confounder. This is because, asymptotically, the first moment of the measure induced by the regression vector coincides with that of a uniform measure in purely causal cases, while it does not in confounding cases. This causal-confounding asymmetry can help us identify confounding. Below we present thorough discussions of the identifiability of the confounder using the first moment. We start with definitions related to first moments of spectral measures.

## 3  First Moments of Spectral Measures

We define the first moment of the measures here, and design a moment deviation measurement to test the moment behavior of a vector-induced spectral measure. We start with some definitions.

Definition 5
(first moment). The first moment of the vector-induced spectral measure $\mu_{\Sigma_{X_n},\phi_n}$ in equation 2.4 is defined as
$M(\mu_{\Sigma_{X_n},\phi_n}) = \int_{\mathbb{R}} s \, \mathrm{d}\mu_{\Sigma_{X_n},\phi_n}(s),$
(3.1)
and the first moment of the tracial spectral measure $\mu^\tau_{\Sigma_{X_n}}$ in equation 2.5 is defined as
$M(\mu^\tau_{\Sigma_{X_n}}) = \int_{\mathbb{R}} s \, \mathrm{d}\mu^\tau_{\Sigma_{X_n}}(s).$
(3.2)

For practical computations, one can use the vectorized representations of the measures to compute the moments.

Definition 6
(first moment using vectorized representation). The first moments of the vector-induced spectral measure $\mu_{\Sigma_{X_n},\phi_n}$ in equation 2.4 and the tracial spectral measure $\mu^\tau_{\Sigma_{X_n}}$ in equation 2.5 can be written using the vectorized representations as
$M(\mu_{\Sigma_{X_n},\phi_n}) = \lambda_{\Sigma_{X_n}}^T \omega_{\Sigma_{X_n},\phi_n} = \sum_{i=1}^n \lambda_i \langle \phi_n, u_i \rangle^2,$
(3.3)
$M(\mu^\tau_{\Sigma_{X_n}}) = \lambda_{\Sigma_{X_n}}^T \omega^\tau_{\Sigma_{X_n}} = \frac{1}{n} \sum_{i=1}^n \lambda_i,$
(3.4)
respectively.
We can rewrite equation 3.3 as
$\sum_{i=1}^n \lambda_i \langle \phi_n, u_i \rangle^2 = \sum_{i=1}^n \lambda_i \phi_n^T u_i u_i^T \phi_n = \phi_n^T \left( \sum_{i=1}^n \lambda_i u_i u_i^T \right) \phi_n.$
In other words, we can write the moment as a vector-matrix-vector product. We have the following definition:
Definition 7
(first moment using $\phi_n$ and $\Sigma_{X_n}$). The first moment of the vector-induced spectral measure in equation 3.3 can be written as
$M(\mu_{\Sigma_{X_n},\phi_n}) = \phi_n^T \Sigma_{X_n} \phi_n.$
(3.5)
If we define the renormalized trace as
$\tau_n(\cdot) = \frac{1}{n} \mathrm{tr}(\cdot),$
(3.6)
then equation 3.4 can be written as
$M(\mu^\tau_{\Sigma_{X_n}}) = \tau_n(\Sigma_{X_n}).$
(3.7)
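The equivalence of the formulas in definitions 6 and 7 is easy to confirm numerically. A small sketch on arbitrary test data of our own choosing:

```python
import numpy as np

# Check equations 3.3-3.7: the first moment of the induced measure equals
# phi^T Sigma phi, and the tracial first moment equals tr(Sigma)/n.
rng = np.random.default_rng(2)
n = 30
G = rng.standard_normal((n, n))
Sigma_X = G @ G.T / n
phi = rng.standard_normal(n)

lam, U = np.linalg.eigh(Sigma_X)
omega = (U.T @ phi) ** 2             # squared coordinates in the eigenbasis

M_induced = lam @ omega              # equation 3.3
M_tracial = lam.mean()               # equation 3.4

print(np.isclose(M_induced, phi @ Sigma_X @ phi))    # equation 3.5: True
print(np.isclose(M_tracial, np.trace(Sigma_X) / n))  # equation 3.7: True
```

Note that the matrix form in equation 3.5 requires no eigendecomposition at all, which is what makes the moment cheap to compute.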

As previously sketched, we want to design a measurement quantifying the difference between the first moment of a vector-induced spectral measure and that of a tracial spectral measure. Since the tracial spectral measure is normalized while the vector-induced spectral measure scales with the squared norm of the vector, we also scale the tracial moment by the squared norm. We have the following definition:

Definition 8
(first moment deviation). The deviation of the first moment of the $\phi_n$-induced spectral measure on $\Sigma_{X_n}$ from that of the tracial spectral measure on $\Sigma_{X_n}$ is defined as
$D(\phi_n, \Sigma_{X_n}) = \left| M(\mu_{\Sigma_{X_n},\phi_n}) - \|\phi_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right|.$
(3.8)
Using equation 3.5, one can also write it as
$D(\phi_n, \Sigma_{X_n}) = \left| \phi_n^T \Sigma_{X_n} \phi_n - \|\phi_n\|^2 \tau_n(\Sigma_{X_n}) \right|.$
(3.9)
Since we measure only first-moment deviations of induced measures defined on $\Sigma_{X_n}$, we write, without causing confusion,
$D(\phi_n) = D(\phi_n, \Sigma_{X_n}).$
(3.10)
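Equation 3.9 makes the deviation a one-line computation. A minimal sketch; the helper name `deviation` is ours, not from the letter:

```python
import numpy as np

def deviation(phi, Sigma):
    """First-moment deviation D(phi) of equation 3.9."""
    n = Sigma.shape[0]
    return abs(phi @ Sigma @ phi - (phi @ phi) * np.trace(Sigma) / n)

rng = np.random.default_rng(3)
n = 40
G = rng.standard_normal((n, n))
Sigma_X = G @ G.T / n

# A unit vector aligned with the top eigenvector is maximally "non-generic":
# its induced measure puts all mass on lambda_max, so the deviation is
# exactly lambda_max minus the mean eigenvalue.
lam, U = np.linalg.eigh(Sigma_X)
u_top = U[:, -1]
print(np.isclose(deviation(u_top, Sigma_X), lam[-1] - lam.mean()))  # True
```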

Later we mainly study the asymptotic behavior of the deviation measurement. For ease of presentation, we define the asymptotic deviation measurement:

Definition 9
(asymptotic $D(\phi_n)$). For a sequence of vectors $\{\phi_n\}_{n\in\mathbb{N}}$ and a sequence of positive semidefinite $n \times n$ symmetric matrices $\{\Sigma_{X_n}\}_{n\in\mathbb{N}}$, the asymptotic $D(\phi_n)$ is defined as
$D(\phi_\infty) = \lim_{n\to\infty} D(\phi_n),$
given that the limit exists.

Now everything is ready to proceed. Recall the linear confounding model defined in equations 1 and 2. Given observational data, we can compute the covariance matrix and the regression vector $\tilde{a}_n$, and thus the induced spectral measure on $\Sigma_{X_n}$ and the tracial one. We want to show the behavior of the deviation measurement in equation 3.10 in causal and confounding cases. We begin with causal cases.

## 4  $D(\tilde{a}_\infty)$ in Causal Cases

In this section, we describe our method, starting from the properties of the causal cases. In this case, $c\|b_n\| = 0$. Then $\tilde{a}_n = a_n$, and it is chosen independently of $\Sigma_{X_n}$. This concept of independence between cause and mechanism is realized statistically by an assumption on the generative model. In functional models, one often considers properties of the noise, like the independence between noise and cause (Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006), or a certain invariance of the number of support points of the conditional distributions (Liu & Chan, 2016a). To begin, we state this generative model assumption.

Assumption 1. Let $\{a_n\}_{n\in\mathbb{N}}$ be a sequence of vectors drawn uniformly at random from spheres in $\mathbb{R}^n$ with fixed radius $r_a$. $\{\Sigma_{X_n}\}_{n\in\mathbb{N}}$ is a uniformly bounded sequence of positive semidefinite $n \times n$ symmetric matrices whose tracial spectral measures converge weakly as
$\lim_{n\to\infty} \mu^\tau_{\Sigma_{X_n}} = \mu^\infty.$
(4.1)

Now we quote a lemma from Janzing and Schölkopf (2017).

Lemma 1
(weak measure convergence). Let assumption 1 be satisfied, with the tracial measure converging as in equation 4.1. Then for the model described in equations 1 and 2, we have
$\lim_{n\to\infty} \mu_{\Sigma_{X_n},a_n} = r_a^2 \mu^\infty$
(4.2)
weakly in probability.

Now we can state a theorem on the asymptotic behavior of the deviation measurement in causal cases:

Theorem 1
(asymptotic first moment coincidence). Let assumption 1 be satisfied. For the model described in equations 1 and 2, we have
$D(a_\infty) = 0$
(4.3)
weakly in probability.
Proof.
Using lemma 1, we have
$D(a_\infty) = \lim_{n\to\infty} \left| M(\mu_{\Sigma_{X_n},a_n}) - \|a_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right| = \left| M(r_a^2 \mu^\infty) - r_a^2 M(\mu^\infty) \right| = \left| r_a^2 M(\mu^\infty) - r_a^2 M(\mu^\infty) \right| = 0.$
$\square$
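Theorem 1 can be illustrated numerically: for $a_n$ drawn uniformly from a sphere, the deviation concentrates around 0 as the dimension grows. A sketch under an arbitrary bounded spectrum of our own choosing:

```python
import numpy as np

# Illustrate theorem 1: D(a_n) -> 0 for a_n uniform on a sphere.
rng = np.random.default_rng(4)

def deviation(phi, diag):
    # D(phi) of equation 3.9 for a diagonal Sigma, stored as a vector.
    return abs(phi @ (diag * phi) - (phi @ phi) * diag.mean())

def mean_deviation(n, trials=200, r_a=1.0):
    diag = rng.uniform(0.5, 1.5, n)          # bounded eigenvalues
    devs = []
    for _ in range(trials):
        a = rng.standard_normal(n)
        a *= r_a / np.linalg.norm(a)         # uniform on the radius-r_a sphere
        devs.append(deviation(a, diag))
    return np.mean(devs)

# The average deviation shrinks markedly from dimension 10 to 1000.
print(mean_deviation(10), mean_deviation(1000))
```

This is exactly the concentration-of-measure effect the theorem formalizes: in high dimensions, a random direction distributes its squared coordinates almost evenly over the eigenbasis.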
Interestingly, if we write the deviation measurement using equation 3.9, as
$D(a_n) = \left| a_n^T \Sigma_{X_n} a_n - \|a_n\|^2 \tau_n(\Sigma_{X_n}) \right|,$
we can draw a link from it to the trace condition in the following remark.
Remark 2.
The condition in theorem 1 is equivalent to the trace condition (Janzing & Schölkopf, 2010) when $Y$ is restricted to be one-dimensional. The trace condition is
$\lim_{n\to\infty} \tau_1(a_n^T \Sigma_{X_n} a_n) = \lim_{n\to\infty} \tau_n(\Sigma_{X_n}) \tau_1(a_n^T a_n) = \lim_{n\to\infty} \|a_n\|^2 \tau_n(\Sigma_{X_n}).$
The assumption in the trace method is stated as follows: $\Sigma_{X_n}$ is drawn from a distribution that is invariant under orthogonal transformations $\Sigma_{X_n} \mapsto V_n \Sigma_{X_n} V_n^T$, where $V_n$ is any $n$-dimensional orthogonal matrix. Consequently, $V_n^T a_n$ is a random point drawn uniformly on some sphere.

We have analyzed the behavior of the deviation measurement in causal cases. It is of interest how this deviation measurement would behave in the presence of a confounder. We present analysis about this in the next section.

## 5  $D(\tilde{a}_\infty)$ in Confounding Cases

Recall the equation of the regression coefficient from section 1:
$\tilde{a}_n = a_n + c\, (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n.$
In confounding cases, the second part is no longer 0. We want to know the value of the deviation measurement $D(\tilde{a}_n)$ in this case. If $D(\tilde{a}_n)$ is no longer 0, we already have the feature needed for confounder detection. In fact, this is true. When no confounder exists, the regression coefficient $\tilde{a}_n = a_n$ is assumed to be drawn uniformly at random on some high-dimensional sphere. As a consequence, the spectral measure induced by the vector has exactly the same asymptotic moment behavior as the tracial measure, which justifies a generic orientation of the causal part in the eigenspace of $\Sigma_{X_n}$. However, when there is a confounder, this "independence" no longer holds. The expression
$\tilde{a}_n = \underbrace{a_n}_{\text{causal part}} + \underbrace{c\, (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n}_{\text{confounding part}}$
tells us clearly that the second part typically spoils the generic orientation of the vector $\tilde{a}_n$ with respect to the eigenspace of $\Sigma_{X_n}$ (Janzing & Schölkopf, 2017). We analyze its properties further, starting with model assumptions.

### 5.1  Rotation-Invariant Generating Model

The independence assumption between $\tilde{a}_n$ and $\Sigma_{X_n}$ no longer holds. In analogy to assumption 1, we can still make some assumptions on the confounding model. Note that later we will need vector-induced and tracial spectral measures defined on $\Sigma_{E_n}$, which are defined exactly as those on $\Sigma_{X_n}$. Consider the two assumptions that follow:

Assumption 2. $\{\Sigma_{E_n}\}_{n\in\mathbb{N}}$ and the inverse sequences $\{\Sigma_{E_n}^{-1}\}_{n\in\mathbb{N}}$ and $\{\Sigma_{E_n}^{-2}\}_{n\in\mathbb{N}}$ are uniformly bounded sequences of positive semidefinite $n \times n$ symmetric matrices, and their tracial spectral measures converge weakly as
$\lim_{n\to\infty} \mu^\tau_{\Sigma_{E_n}} = \mu_1^\infty,$
(5.1)
$\lim_{n\to\infty} \mu^\tau_{\Sigma_{E_n}^{-1}} = \mu_{-1}^\infty,$
(5.2)
$\lim_{n\to\infty} \mu^\tau_{\Sigma_{E_n}^{-2}} = \mu_{-2}^\infty,$
(5.3)
respectively.

Assumption 3. $\{a_n\}_{n\in\mathbb{N}}$ is a sequence of vectors drawn uniformly at random from spheres in $\mathbb{R}^n$ with fixed radius $r_a$. $\{b_n\}_{n\in\mathbb{N}}$ is a sequence of vectors drawn uniformly at random from spheres in $\mathbb{R}^n$ with fixed radius $r_b$.

We will use these assumptions to help us refine $D(\tilde{a}_\infty)$. Before we proceed, we list some core lemmas useful in the derivation of the asymptotic form of $D(\tilde{a}_n)$.

### 5.2  Core Lemmas

We here list some core lemmas that are useful for our analysis. They hold when the above model assumptions are satisfied. Some are taken directly from Janzing and Schölkopf (2017).

Lemma 2
(asymptotic norm decomposition). Let assumptions 2 and 3 be satisfied. Then we have
$\lim_{n\to\infty} \|\tilde{a}_n\|^2 = \lim_{n\to\infty} \|a_n\|^2 + \lim_{n\to\infty} \|c\, \Sigma_{X_n}^{-1} b_n\|^2$
almost surely.
Proof.
$\lim_{n\to\infty} \|\tilde{a}_n\|^2 = \lim_{n\to\infty} \|a_n + c\, \Sigma_{X_n}^{-1} b_n\|^2 = \lim_{n\to\infty} \|a_n\|^2 + \lim_{n\to\infty} \|c\, \Sigma_{X_n}^{-1} b_n\|^2 + \lim_{n\to\infty} 2c\, a_n^T \Sigma_{X_n}^{-1} b_n = \lim_{n\to\infty} \|a_n\|^2 + \lim_{n\to\infty} \|c\, \Sigma_{X_n}^{-1} b_n\|^2.$
The last term vanishes because $a_n$ is drawn uniformly at random, independently of $\Sigma_{X_n}^{-1} b_n$; the two vectors are therefore asymptotically orthogonal almost surely. $\square$
Lemma 3
(asymptotic moment decomposition). Let assumptions 2 and 3 be satisfied. Then we have
$\lim_{n\to\infty} M(\mu_{\Sigma_{X_n},\tilde{a}_n}) = \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},a_n}) + \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}),$
(5.4)
weakly in probability.

Lemma 3 is a direct result of the measure decomposition by Janzing and Schölkopf (2017), so we do not prove it here.

Lemma 4
(asymptotic moment of the tracial spectral measure of a rank-one perturbation). Let assumptions 2 and 3 be satisfied. Recalling the equation
$\Sigma_{X_n} = \Sigma_{E_n} + b_n b_n^T,$
we have
$\lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) = M(\mu_1^\infty) + \tau_\infty(r_b^2),$
(5.5)
weakly in probability.
Proof.
$\lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) = \lim_{n\to\infty} \tau_n(\Sigma_{E_n} + b_n b_n^T) = \lim_{n\to\infty} \tau_n(\Sigma_{E_n}) + \lim_{n\to\infty} \tau_n(b_n b_n^T) = M(\mu_1^\infty) + \tau_\infty(r_b^2).$
$\square$

Note that $\tau_\infty(r_b^2) = \lim_{n\to\infty} r_b^2/n$ alone is 0 here. However, we keep this term because later it may multiply an unbounded term in a case study, where it cannot simply be ignored.

Lemma 5
(moment convergence). Let assumptions 2 and 3 be satisfied. We have
$\lim_{n\to\infty} M(\mu_{\Sigma_{X_n},a_n}) = \lim_{n\to\infty} \|a_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) = r_a^2 M(\mu_1^\infty) + r_a^2 \tau_\infty(r_b^2),$
(5.6)
$\lim_{n\to\infty} M(\mu_{\Sigma_{E_n}^{-1},b_n}) = \lim_{n\to\infty} \|b_n\|^2 M(\mu^\tau_{\Sigma_{E_n}^{-1}}) = r_b^2 M(\mu_{-1}^\infty),$
(5.7)
$\lim_{n\to\infty} M(\mu_{\Sigma_{E_n}^{-2},b_n}) = \lim_{n\to\infty} \|b_n\|^2 M(\mu^\tau_{\Sigma_{E_n}^{-2}}) = r_b^2 M(\mu_{-2}^\infty),$
(5.8)
weakly in probability.

Lemma 5 is a direct consequence of lemmas 1 and 4.

Remark 3.

These lemmas are based on a rotation-invariant generating process of the model, which may be violated in some practical scenarios. Bear in mind that weaker assumptions may still lead to the same identities in these statements. This is because the geometry of the high-dimensional sphere concentrates the majority of vectors near any fixed equator, which admits "moment concentration." This is also mentioned in Janzing and Schölkopf (2017).

Finally, we need one more formula to help with the proof of the theorem below.

Lemma 6
(Sherman-Morrison formula). The Sherman-Morrison formula for the inverse of a rank-one perturbation of a matrix (Bunch, Nielsen, & Sorensen, 1978) is
$(\Sigma_{E_n} + b_n b_n^T)^{-1} = \Sigma_{E_n}^{-1} - \frac{\Sigma_{E_n}^{-1} b_n b_n^T \Sigma_{E_n}^{-1}}{1 + b_n^T \Sigma_{E_n}^{-1} b_n}.$
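Lemma 6 is easy to verify numerically on a random rank-one perturbation; a quick sketch:

```python
import numpy as np

# Verify the Sherman-Morrison formula of lemma 6 numerically.
rng = np.random.default_rng(5)
n = 15
G = rng.standard_normal((n, n))
Sigma_E = G @ G.T / n + np.eye(n)    # well-conditioned positive definite matrix
b = rng.standard_normal(n)

inv_E = np.linalg.inv(Sigma_E)
lhs = np.linalg.inv(Sigma_E + np.outer(b, b))
rhs = inv_E - (inv_E @ np.outer(b, b) @ inv_E) / (1.0 + b @ inv_E @ b)
print(np.allclose(lhs, rhs))         # True
```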

Now everything is ready. We next show how these lemmas help us refine the asymptotic expression of the first-moment deviation into a concise form, which can be used for further analysis.

### 5.3  $D(\tilde{a}_\infty)$

In this section, we give the asymptotic $D(\tilde{a}_n)$ in confounding cases, with a detailed proof. We formalize it as theorem 2:

Theorem 2
(asymptotic first moment deviation). Consider the confounding model described by equations 1 and 2. Let assumptions 2 and 3 be satisfied. We have
$D(\tilde{a}_\infty) = \frac{c^2 r_b^2}{(1 + r_b^2 M(\mu_{-1}^\infty))^2} \left| M(\mu_{-1}^\infty) - M(\mu_{-2}^\infty) M(\mu_1^\infty) + r_b^2 M(\mu_{-1}^\infty)^2 - \tau_\infty(r_b^2) M(\mu_{-2}^\infty) \right|$
(5.9)
weakly in probability.
Proof.
We use the preceding lemmas to proceed:
$\lim_{n\to\infty} D(\tilde{a}_n) = \lim_{n\to\infty} \left| M(\mu_{\Sigma_{X_n},\tilde{a}_n}) - \|\tilde{a}_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right| = \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},\tilde{a}_n}) - \lim_{n\to\infty} \|\tilde{a}_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right| \overset{\text{lemma 3}}{=} \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},a_n}) + \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) - \lim_{n\to\infty} \|\tilde{a}_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right|,$
where "lemma 3" above the equality sign indicates that lemma 3 is used in that step. We then proceed to refine $D(\tilde{a}_n)$:
$\lim_{n\to\infty} D(\tilde{a}_n) \overset{\text{lemma 5}}{=} \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) + \lim_{n\to\infty} \|a_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) - \lim_{n\to\infty} \|\tilde{a}_n\|^2 M(\mu^\tau_{\Sigma_{X_n}}) \right| = \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) - \lim_{n\to\infty} (\|\tilde{a}_n\|^2 - \|a_n\|^2) \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) \right| \overset{\text{lemma 2}}{=} \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) - \lim_{n\to\infty} \|c\, \Sigma_{X_n}^{-1}b_n\|^2 \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) \right|.$
Recall the equation
$\Sigma_{X_n} = \Sigma_{E_n} + b_n b_n^T.$
Using definition 7, we have
$M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) = (c\, \Sigma_{X_n}^{-1}b_n)^T \Sigma_{X_n}\, c\, \Sigma_{X_n}^{-1}b_n = c^2 b_n^T \Sigma_{X_n}^{-1} b_n = c^2 b_n^T (\Sigma_{E_n} + b_n b_n^T)^{-1} b_n \overset{\text{lemma 6}}{=} c^2 b_n^T \left( \Sigma_{E_n}^{-1} - \frac{\Sigma_{E_n}^{-1} b_n b_n^T \Sigma_{E_n}^{-1}}{1 + b_n^T \Sigma_{E_n}^{-1} b_n} \right) b_n = \frac{c^2 b_n^T \Sigma_{E_n}^{-1} b_n}{1 + b_n^T \Sigma_{E_n}^{-1} b_n} = \frac{c^2 M(\mu_{\Sigma_{E_n}^{-1},b_n})}{1 + M(\mu_{\Sigma_{E_n}^{-1},b_n})}.$
Now we analyze the second term, applying lemma 6 again to express the inverse:
$(\Sigma_{E_n} + b_n b_n^T)^{-1} b_n = \Sigma_{E_n}^{-1} b_n - \frac{\Sigma_{E_n}^{-1} b_n b_n^T \Sigma_{E_n}^{-1} b_n}{1 + b_n^T \Sigma_{E_n}^{-1} b_n} = \frac{\Sigma_{E_n}^{-1} b_n}{1 + b_n^T \Sigma_{E_n}^{-1} b_n}, \qquad \|c\, \Sigma_{X_n}^{-1} b_n\|^2 = \frac{c^2 b_n^T \Sigma_{E_n}^{-2} b_n}{(1 + b_n^T \Sigma_{E_n}^{-1} b_n)^2} = \frac{c^2 M(\mu_{\Sigma_{E_n}^{-2},b_n})}{(1 + M(\mu_{\Sigma_{E_n}^{-1},b_n}))^2}.$
Then we can get the final form of the deviation measurement as
$\lim_{n\to\infty} D(\tilde{a}_n) = \left| \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},c\Sigma_{X_n}^{-1}b_n}) - \lim_{n\to\infty} \|c\, \Sigma_{X_n}^{-1}b_n\|^2 \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) \right|,$
(5.10)
$= \left| c^2 \lim_{n\to\infty} \frac{M(\mu_{\Sigma_{E_n}^{-1},b_n})}{1 + M(\mu_{\Sigma_{E_n}^{-1},b_n})} - c^2 \lim_{n\to\infty} \frac{M(\mu_{\Sigma_{E_n}^{-2},b_n})}{(1 + M(\mu_{\Sigma_{E_n}^{-1},b_n}))^2} \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{X_n}}) \right| = c^2 \lim_{n\to\infty} \left| \frac{M(\mu_{\Sigma_{E_n}^{-1},b_n}) + M(\mu_{\Sigma_{E_n}^{-1},b_n})^2 - M(\mu_{\Sigma_{E_n}^{-2},b_n}) M(\mu^\tau_{\Sigma_{X_n}})}{(1 + M(\mu_{\Sigma_{E_n}^{-1},b_n}))^2} \right|.$
(5.11)
Using lemma 5, we get
$\lim_{n\to\infty} M(\mu_{\Sigma_{E_n}^{-1},b_n}) = r_b^2 M(\mu_{-1}^\infty), \qquad \lim_{n\to\infty} M(\mu_{\Sigma_{E_n}^{-1},b_n})^2 = r_b^4 M(\mu_{-1}^\infty)^2, \qquad \lim_{n\to\infty} M(\mu_{\Sigma_{E_n}^{-2},b_n}) M(\mu^\tau_{\Sigma_{X_n}}) = r_b^2 M(\mu_{-2}^\infty) \left( M(\mu_1^\infty) + \tau_\infty(r_b^2) \right).$
Finally, we get
$D(\tilde{a}_\infty) = \frac{c^2 r_b^2}{(1 + r_b^2 M(\mu_{-1}^\infty))^2} \left| M(\mu_{-1}^\infty) - M(\mu_{-2}^\infty) M(\mu_1^\infty) + r_b^2 M(\mu_{-1}^\infty)^2 - \tau_\infty(r_b^2) M(\mu_{-2}^\infty) \right|.$
$\square$
Remark 4.

We provide some discussions of the proof:

1. In the derivation, we use multiple times the fact that limits of sums and products can be calculated separately. This, known as the algebraic limit theorem, applies because all limits are assumed to exist by assumptions 2 and 3.

2. One can see from equation 5.10 that $M(\mu_{\Sigma_{X_n},a_n})$ coincides with $\|a_n\|^2 M(\mu^\tau_{\Sigma_{X_n}})$, so the two play no role in the final asymptotic form of the deviation. The first-moment deviation is determined by how much the first moment of the spectral measure induced by the confounding part differs from that of the uniform reference measure.

Now we have the asymptotic value. Since the deviation is 0 in causal cases, the confounder is not identifiable by this method exactly when the deviation in the confounding case is also 0. From equation 5.9, the deviation measurement depends heavily on the asymptotic spectral measures of the noise. We study this thoroughly in the next section.

### 5.4  Identifiability of Confounding

In this section, we study the identifiability of the confounder using our method. The nonidentifiable models are those with $D(\tilde{a}_\infty) = 0$, which is related to the eigenvalues of the covariance matrix of the noise. To start, we consider whether one can determine the sign of the expression inside the absolute value. We assume the eigenvalues of the covariance matrix of the noise satisfy $\infty > \sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_\infty$.

#### 5.4.1  Impossibility of Universal Identifiability

One ideal situation would be that we can universally determine the sign of the expression inside the absolute value, regardless of the model parameters; then $D(\tilde{a}_\infty)$ would always be nonzero. We now show that this is not possible. Consider the following decomposition:
$D(\tilde{a}_\infty) = \frac{c^2 r_b^2}{(1 + r_b^2 M(\mu_{-1}^\infty))^2} \Big| \underbrace{M(\mu_{-1}^\infty) - M(\mu_{-2}^\infty) M(\mu_1^\infty)}_{\text{part 1}} + \underbrace{r_b^2 M(\mu_{-1}^\infty)^2 - \tau_\infty(r_b^2) M(\mu_{-2}^\infty)}_{\text{part 2}} \Big|.$
Now we analyze the properties of the two parts:
1. $M(\mu_{-1}^\infty) - M(\mu_{-2}^\infty) M(\mu_1^\infty) \leq 0$. This is because
$M(\mu^\tau_{\Sigma_{E_n}^{-1}}) = \tau_n(\Sigma_{E_n}^{-1}) = \frac{1}{n} \sum_{i=1}^n \sigma_i^{-1} = \frac{1}{n} \sum_{i=1}^n \sigma_i^{-2} \sigma_i \leq \left( \frac{1}{n} \sum_{i=1}^n \sigma_i^{-2} \right) \left( \frac{1}{n} \sum_{i=1}^n \sigma_i \right) = \tau_n(\Sigma_{E_n}^{-2}) \tau_n(\Sigma_{E_n}) = M(\mu^\tau_{\Sigma_{E_n}^{-2}}) M(\mu^\tau_{\Sigma_{E_n}}),$
by Chebyshev's sum inequality (the sequences $\sigma_i^{-2}$ and $\sigma_i$ are oppositely ordered). Then we have
$M(\mu_{-1}^\infty) = \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{E_n}^{-1}}) \leq \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{E_n}^{-2}}) M(\mu^\tau_{\Sigma_{E_n}}) = M(\mu_{-2}^\infty) M(\mu_1^\infty)$
by the order limit theorem.
2. $r_b^2 M(\mu_{-1}^\infty)^2 - \tau_\infty(r_b^2) M(\mu_{-2}^\infty) \geq 0$. This is because
$M(\mu^\tau_{\Sigma_{E_n}^{-1}})^2 = \tau_n(\Sigma_{E_n}^{-1})^2 = \left( \frac{1}{n} \sum_{i=1}^n \sigma_i^{-1} \right)^2 = \frac{1}{n^2} \left( \sum_{i=1}^n \sigma_i^{-1} \right)^2 \geq \frac{1}{n^2} \sum_{i=1}^n \sigma_i^{-2} = \frac{1}{n} \tau_n(\Sigma_{E_n}^{-2}) = \frac{1}{n} M(\mu^\tau_{\Sigma_{E_n}^{-2}}),$
since all $\sigma_i^{-1}$ are positive. We have
$M(\mu_{-1}^\infty)^2 = \lim_{n\to\infty} M(\mu^\tau_{\Sigma_{E_n}^{-1}})^2 \geq \lim_{n\to\infty} \frac{1}{n} M(\mu^\tau_{\Sigma_{E_n}^{-2}}) = \tau_\infty(1) M(\mu_{-2}^\infty).$
Thus, an "always identifiable" conclusion is not possible, since the sum of the two parts can be either positive or negative, depending on $r_b$. This means we may encounter nonidentifiable models. In the next section, we study the nonidentifiable conditions.
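Both inequalities hold at finite $n$ for any positive spectrum, which can be checked directly; a small sketch over random spectra of our own choosing:

```python
import numpy as np

# Finite-n check of the two parts: for any positive eigenvalues sigma_i,
#   tau(S^-1)   <= tau(S^-2) * tau(S)   (Chebyshev's sum inequality), and
#   tau(S^-1)^2 >= tau(S^-2) / n       ((sum x_i)^2 >= sum x_i^2, x_i > 0).
rng = np.random.default_rng(6)
ok = True
for _ in range(100):
    sigma = rng.uniform(0.1, 5.0, 50)   # a random positive spectrum
    t_pos = np.mean(sigma)              # tau_n(Sigma_E)
    t_m1 = np.mean(1.0 / sigma)         # tau_n(Sigma_E^-1)
    t_m2 = np.mean(1.0 / sigma**2)      # tau_n(Sigma_E^-2)
    ok = ok and t_m1 <= t_m2 * t_pos and t_m1**2 >= t_m2 / sigma.size
print(ok)                               # True
```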

#### 5.4.2  General Nonidentifiable Condition

Without any assumptions on the distribution of the eigenvalues, here we give a general nonidentifiable condition. The confounder is not identifiable for models with $D(a~∞)$ being 0. This would lead to
$M(μ-1∞)1+rb2M(μ-1∞)-M(μ-2∞)M(μ1∞)(1+rb2M(μ-1∞))2-τ∞(rb2)M(μ-2∞)(1+rb2M(μ-1∞))2=0.$
Assuming boundness of the tracial moments as
$M(μ-1∞)<∞,$
we can proceed to analyze the equation as
$M(μ-1∞)-M(μ-2∞)M(μ1∞)+rb2M(μ-1∞)2-τ∞(rb2)M(μ-2∞)=0.$
We then express $r_b$ in terms of the moments:
$$r_b^2\left(M(\mu_{-1}^\infty)^2-\tau_\infty(1)M(\mu_{-2}^\infty)\right)=M(\mu_{-2}^\infty)M(\mu_{1}^\infty)-M(\mu_{-1}^\infty),\qquad r_b^2=\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)-M(\mu_{-1}^\infty)}{M(\mu_{-1}^\infty)^2-\tau_\infty(1)M(\mu_{-2}^\infty)}.$$
(5.12)
If we further assume boundedness of another tracial moment,
$M(μ-2∞)<∞,$
it can be further refined as
$$r_b^2=\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)-M(\mu_{-1}^\infty)}{M(\mu_{-1}^\infty)^2}=\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)}{M(\mu_{-1}^\infty)^2}-\frac{1}{M(\mu_{-1}^\infty)}.$$
(5.13)
Since $r_b$ is the radius of the vector describing the confounding effect and $M(\mu_i^\infty)$, $i\in\{-2,-1,1\}$, are the asymptotic moments of the tracial spectral measure of the noise, the two are generated independently in nature. Thus, one should not expect $r_b$ to be described by the moments of the noise in such a sophisticated way. This condition should rarely be satisfied unless one encounters very special models.
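To illustrate why equation 5.13 is unlikely to hold by chance, one can draw a bounded spectrum and an independently generated confounding vector and compare the required and actual values of $r_b^2$. This is a minimal sketch with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical bounded spectrum, so all three moments are finite.
sigma = rng.uniform(0.5, 2.0, n)

m1, m_1, m_2 = np.mean(sigma), np.mean(1 / sigma), np.mean(1 / sigma**2)

# The r_b^2 that equation 5.13 would require for nonidentifiability.
rb2_required = m_2 * m1 / m_1**2 - 1.0 / m_1

# The r_b^2 of an independently generated confounding vector b_n
# (standard normal entries, scaled so the squared norm is around 1).
b = rng.standard_normal(n) / np.sqrt(n)
rb2_actual = np.dot(b, b)

print(rb2_required, rb2_actual)  # generically different values
```

Because the spectrum and $b_n$ are drawn independently, the two numbers coincide only by coincidence.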

The above condition is a general one without any assumptions on the eigenvalue distribution of $\Sigma_{E_n}$. One may also be interested in $D(\tilde{a}_n)$ when the eigenvalues follow some typical distributions. We consider three: constant, polynomial decay, and exponential decay.

#### 5.4.3  Eigenvalues Are Constant

We first consider constant eigenvalues, $\sigma_i=\sigma_1$ for all $i$. This covers special cases like $\Sigma_{E_n}=\sigma_1 I_n$ ($I_n$ is the $n\times n$ identity matrix). In this case, all moments are clearly bounded, and we can check whether the general nonidentifiable condition can be satisfied:
$$r_b^2=\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)}{M(\mu_{-1}^\infty)^2}-\frac{1}{M(\mu_{-1}^\infty)}=\frac{1}{\sigma_1^{-1}}-\frac{1}{\sigma_1^{-1}}=0.$$
The nonidentifiable condition here is $r_b=0$, which is never satisfied in confounding cases where $r_b>0$. Alternatively, we can directly compute $D(\tilde{a}_n)$ as $n$ approaches infinity:
$$\begin{aligned}\lim_{n\to\infty}D(\tilde{a}_n)&=c^2 r_b^2\left|\frac{M(\mu_{-1}^\infty)}{1+r_b^2 M(\mu_{-1}^\infty)}-\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)}{(1+r_b^2 M(\mu_{-1}^\infty))^2}-\frac{\tau_\infty(r_b^2)M(\mu_{-2}^\infty)}{(1+r_b^2 M(\mu_{-1}^\infty))^2}\right|\\
&=c^2 r_b^2\left|\frac{\sigma_1^{-1}}{1+\sigma_1^{-1}r_b^2}-\frac{\sigma_1^{-1}}{(1+\sigma_1^{-1}r_b^2)^2}-\frac{\sigma_1^{-2}\tau_\infty(r_b^2)}{(1+\sigma_1^{-1}r_b^2)^2}\right|\\
&=c^2 r_b^2\left|\frac{\sigma_1^{-1}+\sigma_1^{-2}r_b^2}{(1+\sigma_1^{-1}r_b^2)^2}-\frac{\sigma_1^{-1}}{(1+\sigma_1^{-1}r_b^2)^2}\right|\\
&=c^2 r_b^2\,\frac{\sigma_1^{-2}r_b^2}{(1+\sigma_1^{-1}r_b^2)^2}=\frac{c^2 r_b^4\sigma_1^{-2}}{(1+\sigma_1^{-1}r_b^2)^2}.\end{aligned}$$
Thus, we have
$$D(\tilde{a}_\infty)=\frac{c^2 r_b^4}{\sigma_1^2+2\sigma_1 r_b^2+r_b^4}.$$
It is clear that the confounder is always identifiable in this case.
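The closed form above is easy to check numerically. The sketch below uses our reading of the deviation measurement, $D(\tilde{a}_n)=|\tilde{a}_n^T\Sigma_{X_n}\tilde{a}_n-\|\tilde{a}_n\|^2\tau_n(\Sigma_{X_n})|$, the population analogue of equation 6.1; the dimension and parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma1, c = 2000, 1.0, 2.0

a = rng.standard_normal(n); a /= np.linalg.norm(a)  # causal coefficients, unit norm
b = rng.standard_normal(n); b /= np.linalg.norm(b)  # confounding vector, r_b^2 = 1
rb2 = 1.0

Sigma_X = sigma1 * np.eye(n) + np.outer(b, b)       # Sigma_En = sigma1 * I_n
a_tilde = a + c * np.linalg.solve(Sigma_X, b)       # regression vector under confounding

D = abs(a_tilde @ Sigma_X @ a_tilde - (a_tilde @ a_tilde) * np.trace(Sigma_X) / n)
D_closed = c**2 * rb2**2 / (sigma1 + rb2) ** 2      # c^2 r_b^4 / (sigma_1 + r_b^2)^2

print(D, D_closed)  # close for large n
```

For large $n$, the finite-dimensional deviation concentrates around the asymptotic closed form.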

#### 5.4.4  Eigenvalues Decay Polynomially

We study the case where the eigenvalues decay polynomially as
$$\sigma_i=\sigma_1 i^{-1}.$$
We write the traces of the covariance matrices here:
$$\begin{aligned}M(\mu^{\tau}_{\Sigma_{E_n}})&=\tau_n(\Sigma_{E_n})=\frac{\sigma_1}{n}\sum_{i=1}^{n}i^{-1}=\frac{\sigma_1}{n}\left(\ln n+\kappa+\epsilon(n)\right),\\
M(\mu^{\tau}_{\Sigma_{E_n}^{-1}})&=\tau_n(\Sigma_{E_n}^{-1})=\frac{\sigma_1^{-1}}{n}\sum_{i=1}^{n}i=\frac{\sigma_1^{-1}}{2}(n+1),\\
M(\mu^{\tau}_{\Sigma_{E_n}^{-2}})&=\tau_n(\Sigma_{E_n}^{-2})=\frac{\sigma_1^{-2}}{n}\sum_{i=1}^{n}i^{2}=\frac{\sigma_1^{-2}}{6}(n+1)(2n+1).\end{aligned}$$
Here we write the harmonic series as
$$\sum_{i=1}^{n}i^{-1}=\ln n+\kappa+\epsilon(n),\qquad \epsilon(n)=O\!\left(\frac{1}{n}\right),$$
where $\kappa$ is the Euler-Mascheroni constant. Then we proceed to the analysis:
$$\lim_{n\to\infty}D(\tilde{a}_n)=c^2 r_b^2\Bigg|\underbrace{\frac{M(\mu_{-1}^\infty)}{1+r_b^2 M(\mu_{-1}^\infty)}}_{\theta_1^\infty}-\underbrace{\frac{M(\mu_{-2}^\infty)M(\mu_{1}^\infty)}{(1+r_b^2 M(\mu_{-1}^\infty))^2}}_{\theta_2^\infty}-\underbrace{\frac{\tau_\infty(r_b^2)M(\mu_{-2}^\infty)}{(1+r_b^2 M(\mu_{-1}^\infty))^2}}_{\theta_3^\infty}\Bigg|,$$
where we introduce the three thetas for ease of representation:
$$\theta_1^\infty=\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-1})}{1+r_b^2\tau_n(\Sigma_{E_n}^{-1})},\qquad \theta_2^\infty=\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n})}{(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2},\qquad \theta_3^\infty=\lim_{n\to\infty}\frac{r_b^2\,\tau_n(\Sigma_{E_n}^{-2})}{n(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2}.$$
Using the traces above,
$$\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n})}{\tau_n(\Sigma_{E_n}^{-1})^2}=\lim_{n\to\infty}\frac{2\sigma_1(\ln n+\kappa+\epsilon(n))(n+1)(2n+1)}{3n(n+1)^2}=0,\qquad \lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})}{n\,\tau_n(\Sigma_{E_n}^{-1})^2}=\lim_{n\to\infty}\frac{2(n+1)(2n+1)}{3n(n+1)^2}=0.$$
Since $\tau_n(\Sigma_{E_n}^{-1})\to\infty$, it follows that
$$\theta_1^\infty=\frac{1}{r_b^2},\qquad \theta_2^\infty=\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n})}{(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2}=0,\qquad \theta_3^\infty=\lim_{n\to\infty}\frac{r_b^2\,\tau_n(\Sigma_{E_n}^{-2})}{n(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2}=0.$$
Then we have
$$D(\tilde{a}_\infty)=c^2 r_b^2\left|\frac{1}{r_b^2}-0-0\right|=c^2.$$

So asymptotically, we still have $D(\tilde{a}_n)>0$. Thus, we claim that the confounder is identifiable by our method in the polynomial decay cases.
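The convergence $D(\tilde{a}_n)\to c^2$ can be checked directly from the trace formulas above, without forming any matrices. The values of $\sigma_1$, $r_b^2$, and $c$ below are hypothetical:

```python
import numpy as np

sigma1, rb2, c = 1.5, 2.0, 1.0

for n in (10**2, 10**4, 10**6):
    i = np.arange(1, n + 1, dtype=float)
    tau_p1 = sigma1 * np.sum(1.0 / i) / n    # tau_n(Sigma_En)
    tau_m1 = np.sum(i) / (sigma1 * n)        # tau_n(Sigma_En^{-1})
    tau_m2 = np.sum(i**2) / (sigma1**2 * n)  # tau_n(Sigma_En^{-2})

    theta1 = tau_m1 / (1 + rb2 * tau_m1)
    theta2 = tau_m2 * tau_p1 / (1 + rb2 * tau_m1) ** 2
    theta3 = rb2 * tau_m2 / (n * (1 + rb2 * tau_m1) ** 2)

    D = c**2 * rb2 * abs(theta1 - theta2 - theta3)
    print(n, D)  # approaches c^2 = 1
```

The $\theta_2$ and $\theta_3$ terms vanish only logarithmically slowly, which is why large $n$ is needed before $D$ is visibly close to $c^2$.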

Remark 5.

1. Since $\sigma_1$ is assumed to be bounded (e.g., a constant), the largest eigenvalue of $\Sigma_{E_n}^{-1}$ becomes unbounded in the limit. We nevertheless base the analysis on the convergence results of lemma 17. The justification comes from postulate 1 in Janzing and Schölkopf (2017) with the boundedness assumption dropped; we can still perform the analysis because we know exactly how the eigenvalues grow, and thus the support of $\mu_{-1}^\infty$.

2. Although the moments $M(\mu_{-1}^\infty)$ and $M(\mu_{-2}^\infty)$ do not exist in the limit, the ratios $\theta_1^\infty$ and $\theta_2^\infty+\theta_3^\infty$ do. To understand what they represent, recall equation 5.10:
$$r_b^2\,\theta_1^\infty=\lim_{n\to\infty}M(\mu_{\Sigma_{X_n},\,\Sigma_{X_n}^{-1}b_n}),\qquad r_b^2\left(\theta_2^\infty+\theta_3^\infty\right)=\lim_{n\to\infty}\|\Sigma_{X_n}^{-1}b_n\|^2\,M(\mu^{\tau}_{\Sigma_{X_n}}).$$
Note that we canceled the effect of the scalar $c$ in the equation. Thus, we have
1. $\theta_1^\infty<\infty$ follows directly from the existence of the first moment of the spectral measure induced by the confounding part.

2. $\theta_2^\infty+\theta_3^\infty<\infty$ follows directly from the existence of the first moment of the normalized tracial spectral measure enlarged by the norm of the confounding part.

#### 5.4.5  Eigenvalues Decay Exponentially

We study the case where the eigenvalues decay exponentially as
$$\sigma_i=\sigma_1 e^{-(i-1)}.$$
We analyze the traces here:
$$\begin{aligned}\tau_n(\Sigma_{E_n})&=\frac{\sigma_1}{n}\sum_{i=1}^{n}e^{-(i-1)}=\frac{\sigma_1(1-e^{-n})}{n(1-e^{-1})},\\
\tau_n(\Sigma_{E_n}^{-1})&=\frac{\sigma_1^{-1}}{n}\sum_{i=1}^{n}e^{\,i-1}=\frac{\sigma_1^{-1}(1-e^{n})}{n(1-e)},\\
\tau_n(\Sigma_{E_n}^{-2})&=\frac{\sigma_1^{-2}}{n}\sum_{i=1}^{n}e^{\,2(i-1)}=\frac{\sigma_1^{-2}(1-e^{2n})}{n(1-e^{2})}.\end{aligned}$$
We then analyze the thetas here:
$$\begin{aligned}\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n})}{\tau_n(\Sigma_{E_n}^{-1})^2}&=\lim_{n\to\infty}\frac{(e^{2n}-1)(e-e^{-n+1})}{(1-e^{n})^2(1+e)}\,\sigma_1=\lim_{n\to\infty}\frac{(e^{n}+1)(e-e^{-n+1})}{(e^{n}-1)(e+1)}\,\sigma_1\\
&=\lim_{n\to\infty}\frac{e^{n+1}-e^{-n+1}}{e^{n+1}+e^{n}-e-1}\,\sigma_1=\frac{\sigma_1}{1+e^{-1}},\\
\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})}{n\,\tau_n(\Sigma_{E_n}^{-1})^2}&=\lim_{n\to\infty}\frac{(1+e^{n})(1-e)}{(1-e^{n})(1+e)}=\frac{e-1}{e+1}.\end{aligned}$$
It follows that
$$\theta_1^\infty=\frac{1}{r_b^2},\qquad \theta_2^\infty=\lim_{n\to\infty}\frac{\tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n})}{(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2}=\frac{\sigma_1}{(1+e^{-1})r_b^4},\qquad \theta_3^\infty=\lim_{n\to\infty}\frac{r_b^2\,\tau_n(\Sigma_{E_n}^{-2})}{n(1+r_b^2\tau_n(\Sigma_{E_n}^{-1}))^2}=\frac{e-1}{(e+1)r_b^2}.$$
Then we have
$$D(\tilde{a}_\infty)=c^2 r_b^2\left|\frac{1}{r_b^2}-\frac{\sigma_1}{(1+e^{-1})r_b^4}-\frac{e-1}{(e+1)r_b^2}\right|=c^2\left|\frac{2}{e+1}-\frac{e\sigma_1}{(e+1)r_b^2}\right|.$$
If we let it be 0, we get
$$r_b^2=\frac{e\sigma_1}{2}.$$
Now we have the nonidentifiable condition at hand: clearly, the confounder is not always identifiable. When $r_b^2=e\sigma_1/2$, the deviation measurement is asymptotically 0 even in the confounding case. However, we do not consider this a frequent case: $b_n$ is generated independently of $\Sigma_{E_n}$, so its norm should rarely align with the largest eigenvalue of $\Sigma_{E_n}$ in exactly this way, although we do not exclude the possibility of nonidentifiable models.
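The nonidentifiable point can be confirmed by evaluating the asymptotic expression for $D(\tilde{a}_\infty)$ derived above at $r_b^2=e\sigma_1/2$; the values of $\sigma_1$ and $c$ are hypothetical:

```python
import numpy as np

e = np.e
sigma1, c = 0.8, 1.0

def D_inf(rb2):
    # Asymptotic deviation for exponentially decaying eigenvalues.
    return c**2 * abs(2.0 / (e + 1.0) - e * sigma1 / ((e + 1.0) * rb2))

print(D_inf(1.0))               # a generic r_b^2: strictly positive
print(D_inf(e * sigma1 / 2.0))  # the nonidentifiable point: 0
```

Away from the single point $r_b^2=e\sigma_1/2$, the expression stays strictly positive.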

After studying these cases, we return to the general analysis. The deviation measurement of the first moment is asymptotically 0 in purely causal cases; it already attains the lower bound of absolute values. Thus, no matter what the covariance matrix of the noise is, we can still claim that the deviation measurement in confounding cases is not less than that in nonconfounding cases. This "not less" condition is, in most situations, enough for our method to work. Analyzing the nonidentifiable condition gives further confidence: when the eigenvalues of the covariance matrix of the noise follow the typical distributions above, the confounder is identifiable (up to the isolated exponential-decay exception), and in other cases it is almost always identifiable, since the nonidentifiable condition is hard to satisfy. In the next section, we describe the method and discuss the empirical estimations.

## 6  Methodology

The practical deviation measurement is based on data observed from the joint distribution of $(X_n,Y)$. We now show how to estimate the empirical deviation; algorithm 1 summarizes the procedure. Our method thresholds the empirical deviation: if it is close to 0, we treat the case as nonconfounding; otherwise, we report the existence of a confounder. The threshold $\gamma$ is included not only because of the estimation error caused by the finite sample size. Since the justifications are made in infinite dimensions, practical moment coincidence should also be interpreted in a loose sense, as approximate coincidence. This is realized by the threshold $\gamma$ when checking whether the deviation measurement is "0 or not." We also claim that our method consistently estimates the true value of $D(\tilde{a}_n)$. To see this, let the sample size be $L$. We rewrite $D(\tilde{a}_n)$ using the estimators as
$$\hat D(\tilde{a}_n)=\left|\hat\Sigma_{YX_n}\hat\Sigma_{X_n}^{-1}\hat\Sigma_{X_nY}-\|\hat\Sigma_{X_n}^{-1}\hat\Sigma_{X_nY}\|^2\,\tau_n(\hat\Sigma_{X_n})\right|.$$
(6.1)
The hatted symbols are the standard estimators of the covariance matrices. These estimators are known to be consistent at a rate of $L^{-1/2}$, which guarantees the consistency of our estimator, since accurate estimation of the covariance matrices leads to an accurate deviation measurement.
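For concreteness, equation 6.1 can be implemented in a few lines. The sketch below is our reading of the estimator with sample covariances, applied to a hypothetical instance of the model with $\Sigma_{E_n}=I_n$; the sample size, dimension, and coefficients are illustrative assumptions:

```python
import numpy as np

def deviation_hat(X, y):
    """Empirical first-moment deviation of eq. 6.1 from samples X (L x n), y (L,)."""
    L, n = X.shape
    Xc, yc = X - X.mean(0), y - y.mean()
    S_xx = Xc.T @ Xc / L                 # hat Sigma_Xn
    S_xy = Xc.T @ yc / L                 # hat Sigma_XnY
    a_hat = np.linalg.solve(S_xx, S_xy)  # hat Sigma_Xn^{-1} hat Sigma_XnY
    return abs(S_xy @ a_hat - (a_hat @ a_hat) * np.trace(S_xx) / n)

rng = np.random.default_rng(3)
n, L = 30, 5000
a = rng.standard_normal(n); a /= np.linalg.norm(a)
b = rng.standard_normal(n); b /= np.linalg.norm(b)

def sample(c):
    # Draw L points from X_n = b_n Z + E_n, Y = a_n^T X_n + c Z + F.
    Z = rng.standard_normal(L)
    X = np.outer(Z, b) + rng.standard_normal((L, n))
    y = X @ a + c * Z + rng.standard_normal(L)
    return X, y

d_causal = deviation_hat(*sample(0.0))  # c = 0: deviation stays near 0
d_conf = deviation_hat(*sample(3.0))    # c = 3: deviation is clearly positive
print(d_causal, d_conf)
```

With a reasonable sample size, the estimated deviation separates the confounded draw from the purely causal one, which is the behavior the threshold rule relies on.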

In the next section, we conduct various experiments to test the performance of our method.

## 7  Experiments

In this section, we test the proposed confounder detection method in various ways. Unless specified otherwise, all model-related variables are drawn from standard normal distributions (one- or multidimensional). We record the distribution of $D(\tilde{a}_n)$ (or $\hat D(\tilde{a}_n)$) over 200 runs, where $n$ is the dimensionality. Note that $D(\tilde{a}_n)$ is the deviation computed using the true model parameters and $\Sigma_{X_n}$, while $\hat D(\tilde{a}_n)$ is the deviation computed from the samples, as shown in algorithm 1. By "distribution" in the figures, we mean the empirical probability of $D(\tilde{a}_n)$ (or $\hat D(\tilde{a}_n)$) exceeding a certain value, calculated as the number of experiments with $D(\tilde{a}_n)$ greater than or equal to that value divided by the total number of experiments. We distinguish causal and confounding versions of the model
$$X_n=b_nZ+E_n,\qquad Y=a_n^{T}X_n+cZ+F,$$
by setting $c=0$ or $c$ nonzero, respectively, instead of changing $b_n$.

### 7.1  First-Moment Deviation and Dimensionality

All the justifications of the asymptotic forms are done for infinite-dimensional models, but practical data have finite dimensions, so the moment deviation behavior in practical cases is important. We first show the empirical first-moment concentration results for randomly generated $a_n$ and $\Sigma_{X_n}$. We generate the random matrix $\Sigma_{X_n}=\Sigma_{E_n}+b_nb_n^{T}$ as follows: we use the method described in note 3 to generate $\Sigma_{E_n}$, sampling the entries of $\Gamma_n$ from a uniform distribution on $[0.5,1]$, and then add $b_nb_n^{T}$ to get $\Sigma_{X_n}$. Here we consider the nonconfounding cases by setting $c=0$ in the model described by equations 1.1 and 1.2, such that
$$\tilde{a}_n=a_n.$$
The probabilities of $D(\tilde{a}_n)$ exceeding certain values are plotted in Figure 2. Besides normal coefficients, we add a test generating the coefficients from a uniform distribution on $[-0.5,0.5]$. We also normalize the norms of $a_n$ and $b_n$ to 1. These results show that although the theoretic moment concentration happens only when the dimension is very high, the empirical results are much better: the first moment of the induced spectral measure almost coincides with that of the tracial measure with only 10 dimensions.
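The covariance generation recipe above (the note-3 construction plus the rank-one confounding term) can be sketched as follows; the dimension is illustrative:

```python
import numpy as np

def random_cov(n, rng, low=0.5, high=1.0):
    """Random covariance matrix following note 3: symmetrize a uniform matrix,
    take its orthogonal eigenbasis V_n, and rescale by a diagonal Gamma_n."""
    A = rng.uniform(-0.5, 0.5, (n, n))
    _, V = np.linalg.eigh(0.5 * (A + A.T))  # orthogonal bases V_n
    gamma = rng.uniform(low, high, n)       # entries of Gamma_n, here on [0.5, 1]
    return V @ np.diag(gamma) @ V.T

rng = np.random.default_rng(4)
n = 10
b = rng.standard_normal(n); b /= np.linalg.norm(b)
Sigma_E = random_cov(n, rng)
Sigma_X = Sigma_E + np.outer(b, b)  # covariance of X_n in the model

print(np.linalg.eigvalsh(Sigma_X).min() > 0)  # True: positive definite by construction
```

Because the entries of $\Gamma_n$ are bounded away from 0, the resulting $\Sigma_{X_n}$ is always well conditioned.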
Figure 2:

Distribution of $D(a~n)$ in causal cases.


We then study the deviations when there is a confounder. In this case,
$$\tilde{a}_n=a_n+c\left(\Sigma_{E_n}+b_nb_n^{T}\right)^{-1}b_n.$$
As the previous theoretical results show, $D(\tilde{a}_n)$ is then larger than 0. The experimental results are in Figure 3. Two observations are consistent with our previous theoretical analysis:
1. When there is a confounder, $D(a~n)$ is no longer 0.

2. The larger $c$ is, the larger $D(\tilde{a}_n)$ tends to be. Clear evidence is that when $c$ is uniform on $[2,3]$ rather than gaussian, $D(\tilde{a}_n)$ becomes larger in general.

Figure 3:

Distribution of $D(a~n)$ in confounding cases.


These experiments deliver important messages. They show that the differences between $D(\tilde{a}_n)$ in confounding and nonconfounding cases are obvious: in the confounding cases, the first-moment deviations are clear. This indicates a behavioral difference of the deviation measurement between confounding and nonconfounding cases. The next message concerns estimation from observations. Note that all the experiments here use the true model parameters $a_n$, $b_n$, $c$ and the true $\Sigma_{X_n}$; in practice, we can only estimate these from observations. We study this in the next section.

### 7.2  Empirical Estimations

In practice, we have only observational points from the model rather than the true model parameters. What if we estimate all the values from observations? We study $\hat D(\tilde{a}_n)$ estimated from the samples. In this section, we first generate model parameters and then sample from the model; algorithm 1 is applied to the observations. We first study the nonconfounding cases, where
$$\tilde{a}_n=a_n.$$
The resulting curves are shown in Figure 4. We also show the curves of $\hat D(\tilde{a}_n)$ when there is a confounder and
$$\tilde{a}_n=a_n+c\left(\Sigma_{E_n}+b_nb_n^{T}\right)^{-1}b_n.$$
These curves are shown in Figures 5a to 5c for $c$ uniformly distributed on $[1,2]$ and in Figures 5d to 5f for $c$ uniformly distributed on $[2,3]$. From the figures, we can see that the distribution of $\hat D(\tilde{a}_n)$ is similar to that of its true value. This is likely because we can consistently estimate the true $D(\tilde{a}_n)$ from observations when the sample size gets large. The empirical estimation error seems acceptable.
Figure 4:

Distribution of $D^(a~n)$ in causal cases.


Figure 5:

Distribution of $D^(a~n)$ in confounding cases.


Another observation is that when $c$ becomes larger, the confounding effects are more obvious and the deviations tend to be larger. This matches our previous analysis of $D(\tilde{a}_n)$: it grows with the scalar $c$.

### 7.3  Comparative Study

We compare our method with the method of Janzing and Schölkopf (2017), which is mainly based on the estimation of $\beta^*$. We adopt the default setting of the algorithm given in Janzing and Schölkopf (2017) (denoted J&S in the tables). For our method, the threshold $\gamma$ for concluding a confounder is set to 0.5. For J&S, we report a confounder if $\beta^*>0.5$. The sample size is fixed at 500, and $c$ is 0 or uniform on $[2,3]$. Table 1 shows the results of applying the algorithms to data sampled from models with normal coefficients (entries of $a_n$ and $b_n$) and uniform coefficients (on $[-0.5,0.5]$); the noises are standard normally distributed. Table 2 shows the results on models with normal coefficients and different noise distributions: (1) normal distribution, (2) multidimensional Student-$t$ distribution with 10 degrees of freedom, (3) log-normal distribution, and (4) a mixture of two normal distributions with equal probability and means uniformly drawn from $[-0.5,0.5]$. Note that for multivariate distributions, we feed in a randomly generated covariance matrix (using the same method as in note 3, with the entries of the diagonal matrix $\Gamma_n$ sampled from a uniform distribution on $(0,1)$). Table 3 shows the results on models with normal noise and eigenvalues of $\Sigma_{E_n}$ following specified distributions. In the exponential decay cases, we use a rate of $e^{-1/5}$ instead of $e^{-1}$ to avoid decay that is too fast. Based on the observations, we note the following:

1. In finite-dimensional cases, when the vector $a_n$ does not lie perfectly on the "center position" of the eigenspace of $\Sigma_{X_n}$, Janzing and Schölkopf's (2017) method tends to include a confounding part because of the variational pattern of the spectral measure.

2. When the noises have random covariance matrices, the performance of our algorithm decreases in cases with log-normal noise but remains good in other cases; the distribution of the noise seems to have an impact on the results. The performance of Janzing and Schölkopf's (2017) method is generally acceptable.

3. Our method performs well when the eigenvalues of $\Sigma_{E_n}$ follow typical spectral decay patterns, which matches our previous theoretic analysis that the confounder is almost always identifiable in these cases. However, an exception is found in the exponential decay cases with $n=30$, owing to unstable estimation caused by extremely small eigenvalues.

Table 1:
Accuracy of Algorithms (Standard Normal Noise).
| | Method | Normal, $n=10$ | Normal, $n=20$ | Normal, $n=30$ | Uniform, $n=10$ | Uniform, $n=20$ | Uniform, $n=30$ |
|---|---|---|---|---|---|---|---|
| $c=0$ | Ours | 98% | 100% | 100% | 99% | 99% | 100% |
| | J&S | 35% | 31% | 37% | 28% | 34% | 35% |
| $c\in[2,3]$ | Ours | 84% | 97% | 93% | 88% | 94% | 95% |
| | J&S | 73% | 75% | 67% | 79% | 66% | 68% |
Table 2:
Accuracy of Algorithms (Random $ΣEn$).
| Noise | | Method | $n=10$ | $n=20$ | $n=30$ |
|---|---|---|---|---|---|
| Normal | $c=0$ | Ours | 97% | 96% | 94% |
| | | J&S | 64% | 88% | 85% |
| | $c\in[2,3]$ | Ours | 86% | 87% | 88% |
| | | J&S | 85% | 95% | 94% |
| Student-$t$ | $c=0$ | Ours | 86% | 89% | 93% |
| | | J&S | 77% | 91% | 96% |
| | $c\in[2,3]$ | Ours | 80% | 87% | 90% |
| | | J&S | 74% | 76% | 82% |
| Log-normal | $c=0$ | Ours | 88% | 98% | 99% |
| | | J&S | 64% | 75% | 81% |
| | $c\in[2,3]$ | Ours | 72% | 58% | 58% |
| | | J&S | 62% | 58% | 60% |
| Mixture of two normals | $c=0$ | Ours | 99% | 99% | 100% |
| | | J&S | 39% | 52% | 73% |
| | $c\in[2,3]$ | Ours | 89% | 96% | 99% |
| | | J&S | 85% | 84% | 91% |
Table 3:
Accuracy of Algorithms ($ΣEn$ with Specified Eigenvalue Distributions).
| Eigenvalues | | Method | $n=10$ | $n=20$ | $n=30$ |
|---|---|---|---|---|---|
| Constant | $c=0$ | Ours | 99% | 100% | 100% |
| | | J&S | 38% | 40% | 42% |
| | $c\in[2,3]$ | Ours | 84% | 96% | 96% |
| | | J&S | 77% | 67% | 66% |
| Polynomial decay | $c=0$ | Ours | 99% | 100% | 100% |
| | | J&S | 40% | 56% | 53% |
| | $c\in[2,3]$ | Ours | 98% | 100% | 100% |
| | | J&S | 93% | 95% | 98% |
| Exponential decay | $c=0$ | Ours | 99% | 99% | 33% |
| | | J&S | 36% | 45% | 12% |
| | $c\in[2,3]$ | Ours | 93% | 93% | 88% |
| | | J&S | 75% | 67% | 65% |

Note that in our experiments, we compute $\hat D(\tilde{a}_n)$ and conclude confounding when it exceeds a certain threshold. The choice of this threshold plays an important role; one question of interest is how large it should be. We study this in the next section.

### 7.4  Threshold $γ$

Our algorithm concludes confounding based on the rule that the computed $\hat D(\tilde{a}_n)$ exceeds a certain threshold $\gamma$. Different thresholds lead to different errors: if the threshold is too small, we have a high false-positive rate, but if it is too large, we have a low true-positive rate. To study this, we conduct some experiments. We use the settings of the experiments in Table 2 (data dimensions 10 and 20) with normal noise and vary the threshold from 0 to 1. For each threshold, we conduct 100 experiments on confounding cases and 100 on nonconfounding cases and record the true-positive and false-positive rates. The results are plotted in Figure 6.

Figure 6:

True-positive rate and false-positive rate versus thresholds.


It shows that both the true-positive and false-positive rates decrease as the threshold becomes larger. The largest gap between them occurs when the threshold is around 0.5: the false-positive rate becomes almost 0 while the true-positive rate is still acceptable. That justifies the threshold setting in our previous experiments. In the next section, we conduct experiments on real-world data to show the capability of our method for solving real-world confounder detection problems.

### 7.5  Real-World Data

We test the method on data sets from the UCI Machine Learning Repository. Notice that we include a preprocessing step to normalize all variables to unit variance.

Remark 6.

We normalize the data to deal with scale variations across features; this avoids dominant values in the covariance estimation. However, Janzing and Schölkopf (2017) do not recommend it, since this normalization jointly changes the covariance matrix and the regression vector and might violate the independence assumption.

The first data set is the wine taste data set. The data contain 11 features and one wine score: $x_1$: fixed acidity, $x_2$: volatile acidity, $x_3$: citric acid, $x_4$: residual sugar, $x_5$: chlorides, $x_6$: free sulfur dioxide, $x_7$: total sulfur dioxide, $x_8$: density, $x_9$: pH, $x_{10}$: sulphates, and $x_{11}$: alcohol. $Y$ is the score. We sample 500 points from the full data set and run 200 deviation tests. We use two settings: one including all features and the other dropping $x_{11}$. The results are plotted in Figure 7. Dropping $x_{11}$ causes a clear enlargement of the deviation measure, which indicates that $x_{11}$ is a confounder of the system. This is consistent with the conclusions of Janzing and Schölkopf (2017), who find the same thing via spectral evidence.

Figure 7:

$D^(a~n)$ on taste of wine data set.


Another data set concerns the compressive strength and ingredients of concrete. The target $Y$ is the strength in megapascals. There are eight features $\{x_1,\dots,x_8\}$ to predict $Y$: $x_1$: cement, $x_2$: blast furnace slag, $x_3$: fly ash, $x_4$: water, $x_5$: superplasticizer, $x_6$: coarse aggregate, $x_7$: fine aggregate, and $x_8$: age. We sample 500 points from the full data set and run 200 deviation tests. The results are plotted in Figure 8. They show clear evidence that this data set has a hidden confounder between $X_n$ and $Y$: the deviations are not small in general, which may indicate an obvious confounding effect. This is, in some sense, consistent with the findings obtained by applying the method of Janzing and Schölkopf (2017), who report a significant $\beta^*$ as evidence of clear confounding.

Figure 8:

$D^(a~n)$ on concrete compressive strength data set.


We also test our method on the Indian liver patient data set. The target $Y$ indicates a liver or nonliver patient. There are 10 features $\{x_1,\dots,x_{10}\}$ to predict $Y$: $x_1$: age of the patient, $x_2$: gender, $x_3$: total bilirubin, $x_4$: direct bilirubin, $x_5$: alkaline phosphatase, $x_6$: alanine aminotransferase, $x_7$: aspartate aminotransferase, $x_8$: total proteins, $x_9$: albumin, and $x_{10}$: albumin-globulin ratio. We sample 500 points from the full data set and run 200 deviation tests. We use two settings: one including all features and the other dropping $x_1$ to $x_4$. The results are plotted in Figure 9. Dropping the features results in a slightly larger deviation, which may indicate that $x_1$ to $x_4$ weakly confound $Y$ and the other features.

Figure 9:

$D^(a~n)$ on Indian liver patient data set.


## 8  Conclusion

In this letter, we propose a confounder detection method for high-dimensional linear models. It relies on the property that the first moment of the $\tilde{a}_n$-induced spectral measure coincides with that of a uniform spectral measure (both with respect to $\Sigma_{X_n}$) in purely causal cases, while the two moments typically differ in the presence of a confounder. We hope that our method, modified from spectral measure pattern matching, provides a simplified yet effective approach to confounder detection. Future work could extend the method to small-sample cases, where estimates of the regression vector and covariance matrix are inaccurate, and to nonlinear models.

## Notes

1

"Regression coefficient" here refers to the correlation coefficient between variables. It is known that dependent variables can be uncorrelated; in that case, the regression coefficient is 0.

2

By default, $\|\cdot\|$ stands for the $L_2$ norm $\|\cdot\|_2$.

3

Here the random covariance matrix is generated as $\Sigma=0.5(A_n+A_n^{T})$, where $A_n$ is an $n$-dimensional matrix whose elements are drawn from a uniform distribution on $[-0.5,0.5]$. We then extract its orthogonal bases $V_n$, generate a diagonal matrix $\Gamma_n$ with entries sampled from a uniform distribution on $(0,1)$, and output $V_n\Gamma_nV_n^{T}$.

## Acknowledgments

We thank the editor and the anonymous reviewers for helpful comments. The work described in this letter was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.

## References

Bunch, J. R., Nielsen, C. P., & Sorensen, D. C. (1978). Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1), 31-48.

Hoyer, P. O., Shimizu, S., Kerminen, A. J., & Palviainen, M. (2008). Estimation of causal effects using linear non-Gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2), 362-378.

Hyvärinen, A., & Smith, S. M. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14, 111-152.

Hyvärinen, A., Zhang, K., Shimizu, S., & Hoyer, P. O. (2010). Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research, 11, 1709-1731.

Janzing, D., Hoyer, P., & Schölkopf, B. (2010). Telling cause from effect based on high-dimensional observations. In Proceedings of the 27th International Conference on Machine Learning (pp. 479-486). Omnipress.

Janzing, D., & Schölkopf, B. (2010). Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10), 5168-5194.

Janzing, D., & Schölkopf, B. (2017). Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6, 20170013. arXiv:1704.01430.

Lemeire, J., & Janzing, D. (2013). Replacing causal faithfulness with algorithmic independence of conditionals. Minds and Machines, 23(2), 227-249.

Liu, F., & Chan, L. (2016a). Causal discovery on discrete data with extensions to mixture model. ACM Transactions on Intelligent Systems and Technology, 7(2), art. 21.

Liu, F., & Chan, L. (2016b). Causal inference on discrete data via estimating distance correlations. Neural Computation, 28(5), 801-814.

Liu, F., & Chan, L. (in press). Causal inference on multidimensional data using free probability theory. IEEE Transactions on Neural Networks and Learning Systems. https://ieeexplore.ieee.org/document/7983426

Marton, K. (1996). Bounding $\bar d$-distance by informational divergence: A method to prove measure concentration. Annals of Probability, 24(2), 857-866.

Popescu, S., Short, A. J., & Winter, A. (2006). Entanglement and the foundations of statistical mechanics. Nature Physics, 2(11), 754-758.

Shiffman, B., & Zelditch, S. (2003). Random polynomials of high degree and Lévy concentration of measure. arXiv:math/0303335.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., & Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7, 2003-2030.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., … Bollen, K. (2011). DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12, 1225-1248.

Talagrand, M. (1995). Concentration of measure and isoperimetric inequalities in product spaces. Publications Mathématiques de l'IHÉS, 81(1), 73-205.