## Abstract

In this letter, we study the confounder detection problem in the linear model, where the target variable $Y$ is predicted using its $n$ potential causes $X_n=(x_1,\dots,x_n)^T$. Based on an assumption of a rotation-invariant generating process of the model, a recent study shows that the spectral measure induced by the regression coefficient vector with respect to the covariance matrix of $X_n$ is close to a uniform measure in purely causal cases but differs from a uniform measure characteristically in the presence of a scalar confounder. Analyzing spectral measure patterns could therefore help detect confounding. We propose to use the first moment of the spectral measure for confounder detection. We calculate the first moment of the regression-vector-induced spectral measure and compare it with the first moment of a uniform spectral measure, both defined with respect to the covariance matrix of $X_n$. The two moments coincide in nonconfounding cases and differ from each other in the presence of confounding. This statistical causal-confounding asymmetry can be used for confounder detection. Because it does not analyze the spectral measure pattern itself, our method avoids the difficulties of metric choice and multiple-parameter optimization. Experiments on synthetic and real data show the performance of the method.

## 1 Introduction

If $x_j$ has a significant regression coefficient,^{1} it is believed to have a large causal influence on $Y$. However, the correctness of this belief rests on the causal sufficiency assumption that there is no hidden confounder of $X_n$ and $Y$, which cannot be verified from the regression procedure. Simply checking the coefficient vector does not give us enough information for identifying a confounder. With the causal sufficiency assumption unverified, estimating causal effects by regression could be problematic: one never knows whether the coefficients purely describe the influence of $X_n$ on $Y$ or whether they are significant because $X_n$ and $Y$ share a hidden common driving force. Thus, confounder detection is important; it acts as a verification procedure for the causal sufficiency assumption. For further analysis, we write a mathematical model and denote the nonobservable confounder as $Z$. Following Janzing and Schölkopf (2017), we assume that $Z$ is a one-dimensional variable and consider the model

Consider a least squares regression of $Y$ on $X_n$ to get the regression coefficient^{2} as

The regression coefficient basically consists of two parts: one part describing the causal influences of $Xn$ on $Y$ and the other part describing confounding effects. As this decomposition reveals, the regression coefficient in confounding and nonconfounding cases could be clearly different. Consider the following points:

- Purely causal cases: $\|b_n\|$ or $c$ is 0. In this case, $\tilde{a}_n = a_n$.
- Confounding cases: $\|b_n\|$ and $c$ are not 0. In this case, $\tilde{a}_n = a_n + c(\Sigma_{E_n} + b_n b_n^T)^{-1} b_n$.
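To make the decomposition concrete, here is a minimal numerical sketch (ours, not the authors' code) in the population setting, assuming $X_n = E_n + b_n Z$ with $Z$ standard and illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

# Hypothetical parameters (illustrative only, not the paper's experiments).
A = rng.uniform(-0.5, 0.5, (n, n))
Sigma_E = A @ A.T + np.eye(n)          # positive-definite noise covariance
a = rng.normal(size=n)                 # causal coefficient vector a_n
b = rng.normal(size=n)                 # confounding vector b_n
c = 2.5                                # confounding strength

# With X_n = E_n + b_n Z and Z ~ N(0, 1): Sigma_Xn = Sigma_En + b b^T,
# and Cov(X_n, Y) = Sigma_Xn a + c b, so the population least squares
# vector is a_tilde = a + c (Sigma_En + b b^T)^{-1} b.
Sigma_X = Sigma_E + np.outer(b, b)
a_tilde = np.linalg.solve(Sigma_X, Sigma_X @ a + c * b)

confounding_part = c * np.linalg.solve(Sigma_E + np.outer(b, b), b)
assert np.allclose(a_tilde, a + confounding_part)
```

Setting `c = 0` in this sketch recovers the purely causal case, where the two vectors coincide.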

Janzing and Schölkopf (2017) propose a method to achieve this goal. The core idea is based on so-called generic orientation theory, motivated by recent advances in causal discovery that discuss certain independence between cause and mechanism (Liu & Chan, 2016b, in press; Janzing, Hoyer, & Schölkopf, 2010; Lemeire & Janzing, 2013). The method is built on a core term, the vector-induced spectral measure with respect to $\Sigma_{X_n}$, which intuitively describes the squared lengths of the components of a vector projected into the eigenspaces of $\Sigma_{X_n}$. We mention the spectral measure multiple times below; by default, it is induced with respect to $\Sigma_{X_n}$. Based on a rotation-invariant model-generating assumption and the concentration-of-measure phenomenon in high-dimensional spheres (Marton, 1996; Talagrand, 1995; Shiffman & Zelditch, 2003; Popescu, Short, & Winter, 2006), Janzing and Schölkopf (2017) posit two asymptotic statements. First, the $a_n$-induced spectral measure and the $c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n$-induced spectral measure each have their own characteristic patterns. Second, the $\tilde{a}_n$-induced spectral measure is the direct sum of the two measures in the first statement. Given the observed joint distribution of $Y$ and $X_n$, we can compute the $\tilde{a}_n$-induced spectral measure. We then use a convex combination of two spectral measures, one approximating the $a_n$-induced spectral measure and the other approximating the $c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n$-induced spectral measure, to match the observed measure. We tune the weights of the two measures and record the weight of the part approximating the $c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n$-induced spectral measure in the best match. That weight, in a certain sense, records the amount of confounding. Although the confounding strength can be quantitatively estimated by this method, its drawbacks are clear.

- •
The two asymptotic statements are justified by weak convergence only, and the pattern approximations, as well as the measure decomposition, should be interpreted in a sufficiently loose sense. As a consequence, the total variation distance may fail to serve as a good metric when one compares the reconstructed spectral measure with the observed one. The optimal choice of metric remains unclear, and the method of Janzing and Schölkopf (2017) depends on unjustified heuristic choices (kernel smoothing), which may lead to a wrong "equal or not" conclusion.

- •
The method needs to tune two parameters: the weight in the reconstruction and a parameter related to the approximation of the $c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n$-induced spectral measure. Optimizing over a two-parameter space requires a finely grained search, and error control is not easy.

These points reduce the reliability of the conclusions drawn by the method.

Can we identify confounding without reconstructing the whole spectral measure? This letter provides an answer to that question. Recall that an important characteristic of a measure is its moments. We propose to use moment information directly for confounder detection. We focus on the first moment and show that the first moment of the spectral measure induced by $\tilde{a}_n$ already behaves differently in causal and confounding scenarios. To assess this behavior in a quantitatively concise way, we design a deviation measurement that quantifies the difference between the first moment of the induced spectral measure and that of a uniform measure. That the moment "behaves differently" is then justified by different asymptotic values of the deviation measurement in causal and confounding cases. This statistic is easy to compute and provides enough information for confounder detection. Our method clearly avoids the drawbacks of the method of Janzing and Schölkopf (2017):

- •
Without the need to match spectral measure patterns, we avoid the vagueness of "interpreting the approximations loosely" and the difficulty of metric choice. Instead, we compare the first moment of the spectral measure induced by $\tilde{a}_n$ with respect to $\Sigma_{X_n}$ with the first moment of the uniform (tracial) spectral measure on $\Sigma_{X_n}$, and draw conclusions based on their difference.

- •
The only parameter we need is a threshold on the deviation measurement. Simultaneously optimizing two parameters, as the spectral measure pattern-matching method (Janzing & Schölkopf, 2017) does, is avoided.

These points justify the usability of our proposed method, which may provide a better solution than the existing one. We detail our method in the following sections and present theoretical and empirical analyses. We begin by describing related work.

## 2 Related Work

We describe the method by Janzing and Schölkopf (2017) for confounder detection. The basic idea is that the causal part and confounding part have their own features in induced spectral measures with respect to the covariance matrix of the cause, such that each part can be approximated and combined. To understand this, we give some basic definitions.

Later we will also need a tracial (uniform) spectral measure, and we here define it formally.

We also call the tracial spectral measure a "uniform spectral measure." Note that we no longer make the nondegeneracy assumption on $\Sigma_{X_n}$ that Janzing and Schölkopf (2017) make. Thus, "uniform" should be interpreted in a more general sense than uniformly spreading over a domain in $\mathbb{R}$: the weight of each point measure is equal, while the point measures are allowed to overlap with one another.

For numerical computations (matching spectral measure patterns), the induced spectral measure can be represented using two vectors: a vector containing its support, $\lambda_{\Sigma_{X_n}} = (\lambda_1, \dots, \lambda_n)^T$, and a vector $\omega_{\Sigma_{X_n}, \varphi_n}$ containing the values of the spectral measure at those support points. We formally give the definitions.
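As an illustration (our sketch, not code from the letter), the vectorized representation can be computed from an eigendecomposition: the support is the eigenvalues of the covariance matrix, and the weights are the squared projections of the vector onto the corresponding eigenvectors:

```python
import numpy as np

def spectral_measure(Sigma, phi):
    """Vectorized representation of the phi-induced spectral measure:
    support = eigenvalues of Sigma, weights = squared projections of phi
    onto the corresponding eigenvectors."""
    lam, V = np.linalg.eigh(Sigma)
    return lam, (V.T @ phi) ** 2

rng = np.random.default_rng(1)
n = 10
A = rng.normal(size=(n, n))
Sigma = A @ A.T                     # a random covariance matrix
phi = rng.normal(size=n)

lam, w = spectral_measure(Sigma, phi)
# The total mass equals ||phi||^2; the tracial (uniform) spectral
# measure would instead put weight 1/n on each eigenvalue.
assert np.isclose(w.sum(), phi @ phi)
```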

Approximate the spectral measure induced by the causal part as $\omega_{\Sigma_{X_n}, a_n} = \omega^{\tau}_{\Sigma_{X_n}}$ in equation 2.9.

- Approximate the spectral measure induced by the confounding part as
$$\omega^{\nu}_{\Sigma_{X_n},\, c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n} = \frac{1}{\|H_\nu^{-1}\mathbf{1}_n\|^2}\,\omega_{H_\nu,\, H_\nu^{-1}\mathbf{1}_n}, \qquad H_\nu = \Lambda_n + \frac{\nu}{n}\mathbf{1}_n\mathbf{1}_n^T,$$
where $\omega_{H_\nu,\, H_\nu^{-1}\mathbf{1}_n}$ is the vectorized representation of the spectral measure induced by $H_\nu^{-1}\mathbf{1}_n$ with respect to $H_\nu$, $\Lambda_n$ is the matrix in equation 2.3, $\mathbf{1}_n$ is the vector of all 1s, and $\nu$ is a parameter.
- Compute $\frac{1}{\|\hat{\tilde{a}}_n\|^2}\hat{\omega}_{\Sigma_{X_n},\tilde{a}_n}$ using the observations, and find the parameters $\beta^*, \nu^*$ that minimize a reconstruction error as
$$(\beta^*,\nu^*) = \operatorname*{argmin}_{\beta,\nu}\left\|\frac{1}{\|\hat{\tilde{a}}_n\|^2}\hat{\omega}_{\Sigma_{X_n},\tilde{a}_n} - (1-\beta)\,\omega_{\Sigma_{X_n},a_n} - \beta\,\omega^{\nu}_{\Sigma_{X_n},\, c(\Sigma_{E_n}+b_nb_n^T)^{-1}b_n}\right\|_K,$$
where $\|\cdot\|_K$ is a kernel-smoothed metric (Janzing & Schölkopf, 2017).
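A toy sketch of this reconstruction step, under simplifying assumptions of ours (a diagonal stand-in for $\Lambda_n$, a plain Euclidean norm instead of the kernel-smoothed metric $\|\cdot\|_K$, and a synthetic "observed" measure), might look like:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10
Lambda = np.diag(np.sort(rng.uniform(0.2, 2.0, n)))   # stand-in for Lambda_n
ones = np.ones(n)

def confounding_template(nu):
    # Normalized measure induced by H_nu^{-1} 1_n with respect to H_nu.
    H = Lambda + (nu / n) * np.outer(ones, ones)
    v = np.linalg.solve(H, ones)
    _, V = np.linalg.eigh(H)
    return (V.T @ v) ** 2 / (v @ v)

uniform = np.full(n, 1.0 / n)                               # causal template
observed = 0.6 * confounding_template(1.0) + 0.4 * uniform  # toy target

# Grid search over (beta, nu), with a plain Euclidean norm standing in
# for the kernel-smoothed metric ||.||_K of the original method.
err, beta_star, nu_star = min(
    (np.linalg.norm(observed - (1 - b) * uniform - b * confounding_template(nu)), b, nu)
    for b in np.linspace(0, 1, 21)
    for nu in np.linspace(0.1, 3.0, 30)
)
assert err < 1e-6 and abs(beta_star - 0.6) < 0.05
```

The two-dimensional grid over $(\beta, \nu)$ is exactly the search whose cost and error control the letter criticizes.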

The "confounding or not" conclusion then relies on $\beta^*$: if $\beta^*$ is significant, it indicates a clear confounding effect. As we mentioned, the method has clear drawbacks. The weak convergence property in infinite dimensions makes the practical success of this method depend heavily on a good distance metric. To make this concrete, consider an example of a purely causal model. We generate the coefficient vector $a_n$ of dimension 10, with each entry uniformly drawn from $[-0.5, 0.5]$. We then normalize it to unit norm and calculate the induced spectral measure (vectorized representation) with respect to a random covariance matrix.^{3} One can see that the practical spectral measure differs considerably from the uniform one. When one wants to match the two patterns (vectorized representations) and conclude "purely causal," one must adjust weights on different dimensions. However, the optimal choice of metric remains unknown in the pattern-matching method (Janzing & Schölkopf, 2017); it relies only on eigenvalue-gap-related heuristic kernel smoothing. In this example, the kernel smoothing matrix would not be a good choice, since the eigenvalue gaps are quite random compared to the spectral pattern. One may reach the wrong conclusion that the pattern in Figure 1a differs greatly from a uniform measure and that the model is a confounding one.

In summary, Janzing and Schölkopf's (2017) method relies on analyzing the pattern of the spectral measure: in purely causal cases it is uniform, but the presence of a confounding vector modifies the pattern in a characteristic way that can be detected. However, the weak convergence property makes it very hard to choose a metric for comparing the reconstructed measure with the practical one, thus hindering a good understanding of the pattern. Since reconstructing confounding by measure approximation and combination is a hard task, why not directly check the moment information? In this way, we avoid the difficulty of metric choice and multiple-parameter optimization that the pattern-matching method faces. We later focus on the first moment and show that checking the first moment is enough to identify a confounder. This is because, asymptotically, the first moment of the measure induced by the regression vector coincides with that of a uniform measure in purely causal cases, while it does not in confounding cases. This causal-confounding asymmetry helps us identify confounding. We present thorough discussions of the identifiability of the confounder using the first moment below, starting with definitions related to first moments of spectral measures.

## 3 First Moments of Spectral Measures

We define the first moment of the measures here, and design a moment deviation measurement to test the moment behavior of a vector-induced spectral measure. We start with some definitions.

For practical computations, one can use vectorized representation of the measures to compute the moments.

As we previously sketched, we want to design a measure to quantify the difference between the first moment of a vector-induced spectral measure and the first moment of a tracial spectral measure. Since the tracial spectral measure is a normalized one while the vector-induced spectral measure enlarges with the norm of the vector, we also enlarge the tracial measure with the norm. We have the following definition:

Later we mainly study the asymptotic behavior of the deviation measurement. For ease of representation, we define the asymptotic deviation measurement:

Now we have everything ready to proceed. Recall the linear confounding model defined in equations 1.1 and 1.2. Given observational data, we can compute the covariance matrix and the regression vector $\tilde{a}_n$, and thus the induced spectral measure on $\Sigma_{X_n}$ and the tracial one. We want to show the behavior of the deviation measure in equation 3.10 in causal and confounding cases. We begin with causal cases.
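As a hedged illustration of the quantities involved (our sketch; the exact normalization in equation 3.10 may differ), the first moments and a deviation of the form $|M(\mu_{\Sigma,\varphi}) - \|\varphi\|^2 M(\mu^\tau_\Sigma)|$ can be computed as:

```python
import numpy as np

def first_moment_induced(Sigma, phi):
    # First moment of the phi-induced measure: sum_i lambda_i (v_i^T phi)^2,
    # which equals phi^T Sigma phi.
    return phi @ Sigma @ phi

def first_moment_tracial(Sigma):
    # First moment of the tracial (uniform) measure: tr(Sigma) / n.
    return np.trace(Sigma) / Sigma.shape[0]

def deviation(Sigma, phi):
    # Compare the induced first moment with the tracial first moment
    # enlarged by ||phi||^2 (eq. 3.10 may normalize differently).
    return abs(first_moment_induced(Sigma, phi)
               - (phi @ phi) * first_moment_tracial(Sigma))

rng = np.random.default_rng(2)
n = 500
Sigma = np.diag(rng.uniform(0.5, 1.5, n))

# A vector with rotation-invariant direction is nearly "centered" in the
# eigenspace, so the relative deviation is small in high dimension.
phi = rng.normal(size=n)
rel_dev = deviation(Sigma, phi) / (phi @ phi)
assert rel_dev < 0.2
```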

## 4 $D(\tilde{a}_\infty)$ in Causal Cases

In this section, we describe our method, starting from the properties of the causal cases. In this case, $c\|b_n\| = 0$. Then $\tilde{a}_n = a_n$, and it is chosen independently of $\Sigma_{X_n}$. This concept of independence between cause and mechanism is realized statistically by an assumption on the generative model. In functional models, one often considers properties of the noise, like the independence between noise and cause (Shimizu, Hoyer, Hyvärinen, & Kerminen, 2006), or certain invariances of the number of support points of the conditional distributions (Liu & Chan, 2016a). To begin, we state this generative model assumption.

**Assumption 1.** Let $\{a_n\}_{n\in\mathbb{N}}$ be a sequence of vectors drawn uniformly at random from a sphere in $\mathbb{R}^n$ with fixed radius $r_a$. $\{\Sigma_{X_n}\}_{n\in\mathbb{N}}$ is a uniformly bounded sequence of positive semidefinite $n\times n$ symmetric matrices, and the tracial spectral measure converges weakly as

Now we quote a lemma from Janzing and Schölkopf (2017).

We can now state a theorem on the asymptotic behavior of the deviation measure in causal cases:

Using lemma 11, we have

This result (theorem 12) is equivalent to the trace condition (Janzing & Schölkopf, 2010) when $Y$ is restricted to be one-dimensional. The trace condition is

We have analyzed the behavior of the deviation measurement in causal cases. It is of interest how this deviation measurement behaves in the presence of a confounder. We present an analysis of this in the next section.

## 5 $D(\tilde{a}_\infty)$ in Confounding Cases

### 5.1 Rotation-Invariant Generating Model

In confounding cases, the independence assumption between $\tilde{a}_n$ and $\Sigma_{X_n}$ no longer holds. In analogy to assumption 1, we can still make some assumptions on the confounding model. Note that later we will need to define the vector-induced and tracial spectral measures on $\Sigma_{E_n}$, which is done in the same way as defining them on $\Sigma_{X_n}$. Consider the two points that follow:

**Assumption 2.** $\{\Sigma_{E_n}\}_{n\in\mathbb{N}}$ and the inverse sequences $\{\Sigma_{E_n}^{-1}\}_{n\in\mathbb{N}}$ and $\{\Sigma_{E_n}^{-2}\}_{n\in\mathbb{N}}$ are uniformly bounded sequences of positive semidefinite $n\times n$ symmetric matrices, and their tracial spectral measures converge weakly as

**Assumption 3.** $\{a_n\}_{n\in\mathbb{N}}$ is a sequence of vectors drawn uniformly at random from a sphere in $\mathbb{R}^n$ with fixed radius $r_a$. $\{b_n\}_{n\in\mathbb{N}}$ is a sequence of vectors drawn uniformly at random from a sphere in $\mathbb{R}^n$ with fixed radius $r_b$.

We use these to help refine $D(\tilde{a}_\infty)$. Before we proceed, we list some core lemmas that are useful in deriving the asymptotic form of $D(\tilde{a}_n)$.

### 5.2 Core Lemmas

We here list some core lemmas that are useful for our analysis. They hold when the above model assumptions are satisfied. Some are taken directly from Janzing and Schölkopf (2017).

Lemma 15 is a direct result of the measure decomposition by Janzing and Schölkopf (2017), so we omit its proof.

Note that $\tau_\infty(r_b^2)$ alone should be 0 here. However, we keep this term because later it may multiply an unbounded term in a case study, and then it cannot simply be ignored.

Lemma 17 is a direct result of lemmas 11 and 16.

These lemmas are based on a rotation-invariant generating process of the model, which may be violated in some practical scenarios. Bear in mind that weaker assumptions may still lead to the same identities in these statements: the geometry of the high-dimensional sphere makes the majority of vectors close to their center, which admits "moment concentration." This is also mentioned in Janzing and Schölkopf (2017).

Finally we need another formula to help us in the proof of the theorem later.

Now we have everything ready. We show how these lemmas help us refine the asymptotic expression of the first-moment deviation to a concise form, which can be used for further analyses.

### 5.3 $D(\tilde{a}_\infty)$

In this section, we give the asymptotic $D(\tilde{a}_n)$ in confounding cases, with a detailed proof. We formalize it as theorem 20:

Here "(lemma 15)" indicates that lemma 15 is used. Then we proceed to refine $D(\tilde{a}_n)$:

By lemma 8, we have

Using lemma 19 to express the inverse again:

By lemma 17, we get

We provide some discussions of the proof:

In the derivation, we use multiple times the fact that limits of sums and products can be calculated separately. This, known as the algebraic limit theorem, applies because all limits exist by assumptions 2 and 3.

One can see from equation 5.10 that $M(\mu_{\Sigma_{X_n},a_n})$ coincides with $\|a_n\|^2 M(\mu^{\tau}_{\Sigma_{X_n}})$, and these terms play no role in the final asymptotic form of the deviation. The first-moment deviation is determined by how the first moment of the spectral measure induced by the confounding part differs from that of a uniform reference measure.

Now we have its asymptotic value. Since the deviation is 0 in causal cases, the confounder is not identifiable by this method exactly when the deviation in the confounding case is also 0. From equation 5.9, the deviation measurement depends heavily on the asymptotic spectral measures of the noise. In the next section, we study this thoroughly.

### 5.4 Identifiability of Confounding

In this section, we study the identifiability of the confounder using our method. The nonidentifiable models are those with $D(\tilde{a}_\infty) = 0$, which is related to the eigenvalues of the covariance matrix of the noise. To start, we consider whether one can determine the sign of the absolute part. We also assume the eigenvalues of the covariance matrix of the noise are $\infty > \sigma_1 \ge \sigma_2 \ge \cdots$.

#### 5.4.1 Impossibility of Universal Identifiability

- $M(\mu_{-1}^\infty) - M(\mu_{-2}^\infty)M(\mu_1^\infty) \le 0$. This is because
$$M(\mu^{\tau}_{\Sigma_{E_n}^{-1}}) = \tau_n(\Sigma_{E_n}^{-1}) = \frac{1}{n}\sum_{i=1}^n \sigma_i^{-1} = \frac{1}{n}\sum_{i=1}^n \sigma_i^{-2}\sigma_i \le \left(\frac{1}{n}\sum_{i=1}^n \sigma_i^{-2}\right)\left(\frac{1}{n}\sum_{i=1}^n \sigma_i\right) = \tau_n(\Sigma_{E_n}^{-2})\,\tau_n(\Sigma_{E_n}) = M(\mu^{\tau}_{\Sigma_{E_n}^{-2}})\,M(\mu^{\tau}_{\Sigma_{E_n}})$$
by Chebyshev's sum inequality. Then we have
$$M(\mu_{-1}^\infty) = \lim_{n\to\infty} M(\mu^{\tau}_{\Sigma_{E_n}^{-1}}) \le \lim_{n\to\infty} M(\mu^{\tau}_{\Sigma_{E_n}^{-2}})\,M(\mu^{\tau}_{\Sigma_{E_n}}) = M(\mu_{-2}^\infty)\,M(\mu_1^\infty)$$
by the order limit theorem.
- $r_b^2 M(\mu_{-1}^\infty)^2 - \tau_\infty(r_b^2) M(\mu_{-2}^\infty) \ge 0$. This is because
$$M(\mu^{\tau}_{\Sigma_{E_n}^{-1}})^2 = \tau_n(\Sigma_{E_n}^{-1})^2 = \left(\frac{1}{n}\sum_{i=1}^n \sigma_i^{-1}\right)^2 = \frac{1}{n^2}\left(\sum_{i=1}^n \sigma_i^{-1}\right)^2 > \frac{1}{n^2}\sum_{i=1}^n \sigma_i^{-2} = \frac{1}{n}\tau_n(\Sigma_{E_n}^{-2}) = \frac{1}{n} M(\mu^{\tau}_{\Sigma_{E_n}^{-2}}).$$
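Both inequalities are easy to check numerically on hypothetical eigenvalues (our sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
# Hypothetical sorted eigenvalues sigma_1 >= ... >= sigma_n > 0.
sigma = np.sort(rng.uniform(0.1, 2.0, n))[::-1]

m1  = np.mean(sigma)          # tau_n(Sigma_En)
m_1 = np.mean(sigma ** -1.0)  # tau_n(Sigma_En^{-1})
m_2 = np.mean(sigma ** -2.0)  # tau_n(Sigma_En^{-2})

# Chebyshev's sum inequality (sigma_i^{-2} and sigma_i oppositely ordered):
assert m_1 <= m_2 * m1
# Square-of-sum bound: (mean sigma_i^{-1})^2 > (1/n) * mean sigma_i^{-2}.
assert m_1 ** 2 > m_2 / n
```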

#### 5.4.2 General Nonidentifiable Condition

The above condition is a general one, without any assumptions on the eigenvalue distribution of $\Sigma_{E_n}$. One may also be interested in $D(\tilde{a}_n)$ when the eigenvalues follow some typical distributions. We consider three typical eigenvalue distributions: constant, polynomial decay, and exponential decay.

#### 5.4.3 Eigenvalues Are Constant

#### 5.4.4 Eigenvalues Decay Polynomially

So asymptotically, we still have $D(\tilde{a}_n) > 0$. Thus, we claim that the confounder is identifiable by our method in the polynomial decay cases.

We offer some comments:

Since $\sigma_1$ is assumed to be bounded (e.g., a constant), the largest eigenvalue of $\Sigma_{E_n}^{-1}$ is unbounded in the limit. We still perform the analysis based on the convergence results of lemma 17. The justification should come from postulate 1 in Janzing and Schölkopf (2017) with the boundedness assumption dropped. We can then carry out the analysis because we know exactly how the eigenvalues grow, and thus the support of $\mu_{-1}^\infty$.
- Although the moments $M(\mu_{-1}^\infty)$ and $M(\mu_{-2}^\infty)$ do not exist in the limit, the ratios $\theta_1^\infty$ and $\theta_2^\infty + \theta_3^\infty$ do exist. To understand what they represent, recall equation 5.10. Note that we canceled the effect of the scalar $c$ in the equation. Thus, we have
$$r_b^2\theta_1^\infty = \lim_{n\to\infty} M(\mu_{\Sigma_{X_n},\,\Sigma_{X_n}^{-1}b_n}), \qquad r_b^2(\theta_2^\infty + \theta_3^\infty) = \lim_{n\to\infty} \|\Sigma_{X_n}^{-1}b_n\|^2\, M(\mu^{\tau}_{\Sigma_{X_n}}).$$
$\theta_1^\infty < \infty$ follows directly from the existence of the first moment of the spectral measure induced by the confounding part.

$\theta_2^\infty + \theta_3^\infty < \infty$ follows directly from the existence of the first moment of the normalized tracial spectral measure enlarged by the norm of the confounding part.

#### 5.4.5 Eigenvalues Decay Exponentially

After studying those cases, we return to the general analysis. The deviation measurement of the first moment is asymptotically 0 in purely causal cases; it already attains the lower bound of absolute values. Thus, no matter what the covariance matrix of the noise is, we can still claim that the deviation measurement in confounding cases is not less than that in nonconfounding cases. This "not less" property is, in most situations, enough for our method to work. Analyzing the nonidentifiable condition gives us further confidence: when the eigenvalues of the covariance matrix of the noise follow typical distributions, the confounder is always identifiable, and in other cases it is almost identifiable, since the nonidentifiable condition is hard to satisfy. In the next section, we describe the method and discuss empirical estimation.

## 6 Methodology

In the next section, we conduct various experiments to test the performance of our method.

## 7 Experiments

### 7.1 First-Moment Deviation and Dimensionality

When there is a confounder, $D(\tilde{a}_n)$ is no longer 0.

The larger $c$ is, the larger $D(\tilde{a}_n)$ tends to be. Clear evidence is that when $c$ is uniform on $[2,3]$ rather than gaussian, $D(\tilde{a}_n)$ becomes larger in general.

These experiments deliver important messages. They show that the differences between $D(\tilde{a}_n)$ in confounding and nonconfounding cases are obvious: in the confounding cases, first-moment deviations are clear. This indicates a behavioral difference of the deviation measurement between confounding and nonconfounding cases. The next message concerns estimation from observations. Note that all the experiments here use the true model parameters $a_n, b_n, c$ and the true $\Sigma_{X_n}$. In practice, we can only estimate these from observations. We study this in the next section.

### 7.2 Empirical Estimations

Another observation is that when $c$ becomes larger, the confounding effects are more obvious and the deviations tend to be larger. This matches our previous analysis of $D(\tilde{a}_n)$: it grows with the scalar $c$.
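A minimal sketch of the empirical pipeline (ours; estimator details such as centering and the plug-in form of the deviation are our assumptions) for a purely causal model:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 10, 500                          # dimension, sample size

# Hypothetical purely causal model (c = 0).
A = rng.normal(size=(n, n))
Sigma_E = A @ A.T / n + np.eye(n)
a = rng.normal(size=n)
a /= np.linalg.norm(a)

X = rng.multivariate_normal(np.zeros(n), Sigma_E, size=m)
Y = X @ a + 0.1 * rng.normal(size=m)

# Empirical covariance and least squares regression vector.
Sigma_hat = np.cov(X, rowvar=False)
a_hat, *_ = np.linalg.lstsq(X - X.mean(0), Y - Y.mean(), rcond=None)

# Plug-in first-moment deviation (one plausible estimator of D-hat).
D_hat = abs(a_hat @ Sigma_hat @ a_hat
            - (a_hat @ a_hat) * np.trace(Sigma_hat) / n)
assert np.isfinite(D_hat) and D_hat >= 0
```

Replacing the true $\Sigma_{X_n}$ and $\tilde{a}_n$ with these estimates introduces the finite-sample error discussed above.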

### 7.3 Comparative Study

We compare our method with the method of Janzing and Schölkopf (2017), which is mainly based on the estimation of $\beta^*$. We adopt the default setting of the algorithm given in Janzing and Schölkopf (2017) (denoted J&S in the tables). For our method, the threshold $\gamma$ for concluding a confounder is set to 0.5. For J&S, we report a confounder if $\beta^* > 0.5$. The sample size is fixed at 500. $c$ is 0 or uniform on $[2,3]$. Table 1 shows the results of applying the algorithms to data sampled from models with normal coefficients (entries of $a_n$ and $b_n$) and uniform coefficients (on $[-0.5, 0.5]$); the noises are standard normally distributed. Table 2 shows the results on models with normal coefficients and different noise distributions: (1) normal distribution, (2) multidimensional Student-$t$ distribution with 10 degrees of freedom, (3) log-normal distribution, and (4) mixture of two normal distributions with equal probability and means uniformly drawn from $[-0.5, 0.5]$. Note that for multivariate distributions, we feed in a randomly generated covariance matrix (using the same method as in note 3, with the entries of the diagonal matrix $\Gamma_n$ sampled from a uniform distribution on $(0,1)$). Table 3 shows the results on models with normal noise, where the eigenvalues of $\Sigma_{E_n}$ follow specified distributions. In the exponential decay cases, we use a rate $e^{-1/5}$ instead of $e^{-1}$ to avoid decay that is too fast. Based on the observations, we note the following:

In finite-dimensional cases, when vector $an$ does not lie perfectly on the “center position” of the eigenspace of $\Sigma Xn$, Janzing and Schölkopf's (2017) method tends to include a confounding part because of the variational pattern of the spectral measure.

When noises have random covariance matrices, the performance of our algorithm decreases in cases with log-normal noise but remains good in the other cases. The distribution of the noise seems to have an impact on the results. The performance of Janzing and Schölkopf's (2017) method is generally acceptable.

Our method performs well when the eigenvalues of $\Sigma_{E_n}$ follow typical spectral decay patterns. This matches our previous theoretical analysis that the confounder is almost identifiable in these cases. However, an exception occurs in the exponential decay cases with $n=30$, because of unstable estimation caused by eigenvalues that are too small.

| | Normal Coefficients | | | Uniform Coefficients | | |
| --- | --- | --- | --- | --- | --- | --- |
| | $n$ = 10 | $n$ = 20 | $n$ = 30 | $n$ = 10 | $n$ = 20 | $n$ = 30 |
| **$c = 0$** | | | | | | |
| Ours | 98% | 100% | 100% | 99% | 99% | 100% |
| J&S | 35% | 31% | 37% | 28% | 34% | 35% |
| **$c \in [2,3]$** | | | | | | |
| Ours | 84% | 97% | 93% | 88% | 94% | 95% |
| J&S | 73% | 75% | 67% | 79% | 66% | 68% |


| | $n$ = 10 | $n$ = 20 | $n$ = 30 | $n$ = 10 | $n$ = 20 | $n$ = 30 |
| --- | --- | --- | --- | --- | --- | --- |
| **$c = 0$** | Normal Noise | | | Student-$t$ Noise | | |
| Ours | 97% | 96% | 94% | 86% | 89% | 93% |
| J&S | 64% | 88% | 85% | 77% | 91% | 96% |
| **$c \in [2,3]$** | | | | | | |
| Ours | 86% | 87% | 88% | 80% | 87% | 90% |
| J&S | 85% | 95% | 94% | 74% | 76% | 82% |
| **$c = 0$** | Log-Normal Noise | | | Mixture of Two Normals | | |
| Ours | 88% | 98% | 99% | 99% | 99% | 100% |
| J&S | 64% | 75% | 81% | 39% | 52% | 73% |
| **$c \in [2,3]$** | | | | | | |
| Ours | 72% | 58% | 58% | 89% | 96% | 99% |
| J&S | 62% | 58% | 60% | 85% | 84% | 91% |


| | Constant | | | Polynomial Decay | | | Exponential Decay | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | $n$ = 10 | $n$ = 20 | $n$ = 30 | $n$ = 10 | $n$ = 20 | $n$ = 30 | $n$ = 10 | $n$ = 20 | $n$ = 30 |
| **$c = 0$** | | | | | | | | | |
| Ours | 99% | 100% | 100% | 99% | 100% | 100% | 99% | 99% | 33% |
| J&S | 38% | 40% | 42% | 40% | 56% | 53% | 36% | 45% | 12% |
| **$c \in [2,3]$** | | | | | | | | | |
| Ours | 84% | 96% | 96% | 98% | 100% | 100% | 93% | 93% | 88% |
| J&S | 77% | 67% | 66% | 93% | 95% | 98% | 75% | 67% | 65% |


Note that in our experiments, we compute $\hat{D}(\tilde{a}_n)$ and conclude confounding when it exceeds a certain threshold. The choice of the threshold plays an important role, and one question of interest is how large it should be. We study this in the next section.

### 7.4 Threshold $\gamma $

Our algorithm concludes confounding based on the rule that the computed $\hat{D}(\tilde{a}_n)$ exceeds a certain threshold $\gamma$. Different thresholds lead to different errors: if the threshold is too small, we have a high false-positive rate, but if it is too large, we have a low true-positive rate. To study this, we conduct some experiments. We use the settings of the experiments in Table 2 (data dimensions 10 and 20) with normal noise and vary the threshold from 0 to 1. For each threshold, we conduct 100 experiments on confounding cases and 100 on nonconfounding cases and record the true-positive and false-positive rates. We plot the results in Figure 6.

Figure 6 shows that both the true-positive and false-positive rates decrease as the threshold becomes larger. The largest gap between them occurs when the threshold is around 0.5, where the false-positive rate becomes almost 0 while the true-positive rate remains acceptable. This justifies the threshold setting in our previous experiments. In the next section, we conduct experiments on real-world data to show the capability of our method for solving real-world confounder detection problems.

### 7.5 Real-World Data

We test the method on data sets from the UCI Machine Learning Repository. Notice that we include a preprocessing step to normalize all variables to unit variance.

We normalize the data to deal with scale variations across features; this avoids dominant values in the covariance estimation. However, Janzing and Schölkopf (2017) do not recommend it, since this normalization jointly changes the covariance matrix and the regression vector and might violate the independence assumption.

The first data set is the wine taste data set. The data contain 11 features and one quality score: $x_1$: fixed acidity, $x_2$: volatile acidity, $x_3$: citric acid, $x_4$: residual sugar, $x_5$: chlorides, $x_6$: free sulfur dioxide, $x_7$: total sulfur dioxide, $x_8$: density, $x_9$: pH, $x_{10}$: sulphates, and $x_{11}$: alcohol. $Y$ is the score. We sample 500 points from the full data set and run 200 deviation tests. We use two settings: one including all features and the other dropping $x_{11}$. The results are plotted in Figure 7. Dropping $x_{11}$ causes a clear enlargement of the deviation measure, which indicates that $x_{11}$ is a confounder of the system. This is consistent with the conclusions of Janzing and Schölkopf (2017), who find the same thing via spectral evidence.

Another data set concerns the compressive strength and ingredients of concrete. The target $Y$ is the strength in megapascals. There are eight features $\{x_1,\dots,x_8\}$ to predict $Y$: $x_1$: cement, $x_2$: blast furnace slag, $x_3$: fly ash, $x_4$: water, $x_5$: superplasticizer, $x_6$: coarse aggregate, $x_7$: fine aggregate, and $x_8$: age. We sample 500 points from the full data set and run 200 deviation tests. The results are plotted in Figure 8. They show clear evidence that this data set has a hidden confounder between $X_n$ and $Y$: the deviations are not small in general, which may indicate an obvious confounding effect. This is, in some sense, consistent with the findings obtained by applying the method of Janzing and Schölkopf (2017), who report a significant $\beta^*$ as evidence of clear confounding.

We also test our method on the Indian liver patient data set. The target $Y$ indicates a liver or nonliver patient. There are 10 features $\{x_1,\dots,x_{10}\}$ to predict $Y$: $x_1$: age of the patient, $x_2$: gender, $x_3$: total bilirubin, $x_4$: direct bilirubin, $x_5$: alkaline phosphatase, $x_6$: alanine aminotransferase, $x_7$: aspartate aminotransferase, $x_8$: total proteins, $x_9$: albumin, and $x_{10}$: albumin-to-globulin ratio. We sample 500 points from the full data set and run 200 deviation tests. We use two settings: one including all features and the other dropping $x_1$ to $x_4$. The results are plotted in Figure 9. Dropping the features results in a slightly larger deviation, which may indicate that $x_1$ to $x_4$ weakly confound $Y$ and the other features.

## 8 Conclusion

In this letter, we propose a confounder detection method for high-dimensional linear models. It relies on the property that the first moment of the $\tilde{a}_n$-induced spectral measure coincides with that of a uniform spectral measure (both on $\Sigma_{X_n}$) in purely causal cases, while the two moments typically differ in the presence of a confounder. We hope that our method, which modifies spectral measure pattern matching, provides a simplified yet effective approach to confounding detection. Future work includes extending the method to small-sample cases, where estimates of the regression vector and the covariance matrix are inaccurate, and to nonlinear models.

## Notes

^{1}

"Regression coefficient" here refers to the correlation coefficient between variables. It is known that dependent variables can also be uncorrelated; in that case, the regression coefficient is 0.

^{2}

By default, $\|\cdot\|$ stands for the $L_2$ norm $\|\cdot\|_2$.

^{3}

Here the random covariance matrix is generated by $\Sigma = 0.5(A_n + A_n^T)$, where $A_n$ is an $n\times n$ matrix whose elements are drawn from a uniform distribution on $[-0.5, 0.5]$. Then we extract its orthogonal bases $V_n$ and generate a diagonal matrix $\Gamma_n$ with entries sampled from a uniform distribution on $(0,1)$. Finally, we output $V_n\Gamma_nV_n^T$.
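A sketch of this construction (our code, following the note's description):

```python
import numpy as np

def random_covariance(n, rng):
    # Symmetrize a random matrix, take its orthogonal eigenbasis V_n,
    # then rescale with eigenvalues drawn uniformly from (0, 1).
    A = rng.uniform(-0.5, 0.5, (n, n))
    S = 0.5 * (A + A.T)
    _, V = np.linalg.eigh(S)            # orthogonal bases of the symmetric part
    gamma = rng.uniform(0.0, 1.0, n)    # diagonal entries of Gamma_n
    return V @ np.diag(gamma) @ V.T

rng = np.random.default_rng(4)
Sigma = random_covariance(10, rng)
assert np.allclose(Sigma, Sigma.T)
assert np.all(np.linalg.eigvalsh(Sigma) >= -1e-10)
```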

## Acknowledgments

We thank the editor and the anonymous reviewers for helpful comments. The work described in this letter was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administration Region, China.