## Abstract

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.

## 1. Introduction

Figure 1 shows the geographical location of 1,000 Twitter posts containing the word *hella*, an intensifier used in expressions like *I got hella studying to do* and *my eyes got hella big* (Eisenstein et al. 2014). Although the word appears in major population centers throughout the United States, the map suggests that it enjoys a particularly high level of popularity on the west coast, in the area around San Francisco. But does this represent a real geographical difference in American English, or is it the result of chance fluctuation in a finite data set?

Regional variation of language has been extensively studied in sociolinguistics and dialectology (Lee and Kretzschmar, Jr. 1993; Chambers and Trudgill 1998; Grieve, Speelman, and Geeraerts 2011, 2013; Szmrecsanyi 2012; Nerbonne and Kretzschmar, Jr. 2013). A common approach involves mapping the geographic distribution of a linguistic variable (e.g., the choice of *soda*, *pop*, or *coke* to refer to a soft drink) and identifying boundaries between regions based on the data. The identification of linguistic variables that exhibit regional variation is therefore the first step in many studies of regional dialects. Traditionally, this step has been based on the manual judgment of the researcher; depending on the quality of the researcher's intuitions, the most interesting or important variables might be missed.

The increasing amount of data available to study dialectal variation suggests a turn towards data-driven alternatives for variable selection. For example, researchers can mine social media data such as Twitter (Eisenstein et al. 2010; Doyle 2014; Huang et al. 2016) or product reviews (Hovy, Johannsen, and Søgaard 2015) to identify and test thousands of dialectal variables. Despite the large scale of available data, the well-known “long tail” phenomenon of language ensures that there will be many potential variables with low counts. A statistical metric for comparing the strength of geographical associations across potential linguistic variables would allow linguists to determine whether finite geographical samples—such as the one shown in Figure 1—reveal a statistically meaningful association.

The use of statistical methods to analyze spatial dependence has been only lightly studied in sociolinguistics and dialectology. Existing approaches use classical statistics such as Moran's I (e.g., Grieve, Speelman, and Geeraerts 2011), join count analysis (e.g., Lee and Kretzschmar Jr. 1993), and the Mantel Test (e.g., Scherrer 2012); we review these statistics in Section 2. These classical approaches suffer from a common problem: Each type of test can capture only a specific parametric form of spatial linguistic variation. As a result, these tests can incorrectly fail to reject the null hypothesis if the nature of the geolinguistic dependence does not match the underlying assumptions of the test.

To address these limitations, we propose a new test statistic that builds on a rich and growing literature on kernel embeddings for nonparametric statistics (Shawe-Taylor and Cristianini 2004). In these methods, probability distributions, such as the distribution over geographical locations for each linguistic variable, are embedded in a Reproducing Kernel Hilbert Space (RKHS). Specifically, we use the Hilbert-Schmidt Independence Criterion (HSIC; Gretton et al. 2005a). Because of its ability to compare arbitrarily high-order moments of probability distributions, HSIC can be used to compare arbitrary probability measures by computing kernel functions on finite samples. Unlike prior approaches, HSIC is statistically consistent: In the limit of a sufficient amount of data, it will correctly determine whether the distribution of a linguistic feature is geographically dependent. As a further convenience, because it is built on kernel similarity functions, HSIC can be applied with equal ease to any type of linguistic data, as long as an appropriate kernel function can be constructed.

To validate this approach, we compare it against three alternative spatial statistics: Moran's I, the Mantel test, and join count analysis. For a controlled comparison, we use synthetic data to simulate different types of regional variation, and different types of linguistic data. This allows us to measure the capability of each approach to recover true geolinguistic associations, and to avoid type I errors even in noisy and sparse data. Next, we apply these approaches to three real linguistic data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and letters to the editor in North American newspapers.

To summarize, the contributions of this article are:

- •
We show how the Hilbert-Schmidt Independence Criterion can be applied to linguistic data. HSIC is a nonparametric test statistic that can handle both frequency and categorical data. It requires no discretization of geographic data, and is capable of detecting arbitrary geolinguistic dependencies (Section 3).

- •
We use synthetic data to compare the power and calibration of HSIC against three alternatives: Moran's I, the Mantel Test, and join count analysis (Section 4).

- •
We apply these methods to analyze dialectal variation in three empirical data sets, in both English and Dutch, across a variety of registers (Section 5).

## 2. Prior Work

This section describes prior work on global methods for quantifying the degree of spatial dependence in a geotagged corpus.^{1} Although other global spatial statistics exist, we focus on the following three methods because they have been used in previous work on dialect analysis: Moran's I (Grieve, Speelman, and Geeraerts 2011), join count analysis (Lee and Kretzschmar, Jr. 1993), and the Mantel test (Scherrer 2012).

We define a consistent notation across methods. Let *x*_{i} represent a scalar linguistic observation for unit *i* ∈ {1 … *n*} (typically, the presence or frequency of a linguistic variable), and let *y*_{i} represent a corresponding geolocation. For convenience, we define *d*_{ij} as the spatial distance between *y*_{i} and *y*_{j}. Suppose we have *n* observations, so that the data $D={(x1,y1),(x2,y2),\u2026,(xn,yn)}$. Our goal is to test the strength of association between *X* and *Y*, against the null hypothesis that there is no association.

### 2.1 Moran's I

*W*= {

*w*

_{ij}}

_{i,j∈{1…n}}represent a

*spatial weighting matrix*, such that larger values of

*w*

_{ij}indicate greater proximity, and

*w*

_{ii}= 0. In their application of Moran's I to a corpus of newspaper letters to the editor, Grieve et al. define

*W*as

*x*

_{i}and

*x*

_{j}are more similar when

*w*

_{ij}= 1 than when

*w*

_{ij}= 0.

^{2}

Moran's I is based on a hypothesized autoregressive process *X* = ρ*WX* + ϵ, where *X* is a vector of the linguistic observations *x*_{1}, … *x*_{n}, and ϵ is a vector of uncorrelated noise. Because *X* and *W* are given, the estimation problem is to find ρ so as to minimize the magnitude of ϵ. To take a probabilistic interpretation, it is typical to assume that ϵ consists of independent and identically distributed normal random variables with zero mean (Ord 1975). Under the null hypothesis of no spatial dependence between the observations in *X*, we would have ρ = 0. Note, however, that we may fail to reject the possibility that ρ = 0 even in the presence of spatial dependence, if the form of this dependence is not monotonic or nonlinear in *W*.

*X*; the ratio on the right corresponds to the covariance between points

*i*and

*j*that are spatially similar. Thus, the statistic rescales a spatially reweighted covariance (the ratio on the right of Equation (2)) by the overall variance (the ratio on the left of Equation (2)), giving an estimate of the overall spatial dependence of

*X*. A compact alternative notation is to rewrite the statistic in terms of the vector of

*residuals*

*R*= {

*r*

_{i}}

_{i∈1…n}, where the residual $ri=xi\u2212x\xaf$. This yields the form $I=R\u22a4WRR\u22a4R$, with

*R*

^{⊤}indicating the transpose of the column vector

*R*, and with

*W*assumed to be normalized so that $\u2211i,jwij=n$. Moran's I values often lie between −1 and 1, but the exact range depends on the weight matrix

*W*, and is theoretically unbounded (de Jong, Sprenger, and van Veen 1984).

In hypothesis testing, our goal is to determine the p-value representing the likelihood that a value of Moran's I at least as extreme as the observed value would arise by chance under the null hypothesis. The expected value of Moran's I in the case of no spatial dependence is $\u22121n\u22121$. Grieve et al. compute p-values from a closed-form approximation of the variance under the null hypothesis of total randomization. A nonparametric alternative is to perform a **permutation test**, calculating the empirical p-value by comparing the observed test statistic against the values that arise across multiple random permutations of the original data.

In either case, Moran's I does not test the null hypothesis of no statistical dependence between the linguistic features *X* and the geo-coordinates *Y*. Rather, it tests whether the estimated value of ρ would be likely to arise if there were no such dependence. If the nature of the geolinguistic dependence defies the assumptions of the statistic, however, then we risk incorrectly failing to reject the null hypothesis, a type II error. Put another way, there are forms of strong spatial dependence for which ρ = 0, such as non-monotonic spatial relationships. This risk can be somewhat ameliorated by careful choice of the spatial weighting matrix *W*, which could in theory account for non-linear or even non-monotonic dependencies. However, an exhaustive search for some *W* that obtains a low p-value would clearly be an invalid hypothesis test, and so *W* must be fixed *before* any test is performed. In some cases, the researcher may bring substantive insights to the determination of *W*, and so the flexibility of Moran's I in this sense could be regarded as a positive feature. But there is little theoretical guidance, and a poor selection of *W* will result in inflated type II error rates.

From a practical standpoint, Moran's I is applicable to only some types of linguistic data. In the study of dialect, *X* typically represents the frequency or presence of some linguistic variable, such as the use of *soda* versus *pop*. We are unaware of applications of Moran's I to variables with more than two possibilities (e.g., *soda*, *pop*, *coke*). One possible solution would be to perform multiple tests, with each alternant pitted against all the others. But it is not clear how the p-values from these multiple tests should be combined. For example, selecting the minimum p-value across the alternants would mean that the null hypothesis would be more likely to be rejected for variables with more alternants; averaging the p-values across alternants would have the opposite problem.

### 2.2 Join Count Analysis

*X*consist of discrete observations,

**join count analysis**is another approach for detecting spatial dependence. For each pair of points (

*i*,

*j*), we compute

*w*

_{ij}

*δ*(

*x*

_{i}=

*x*

_{j}), where

*δ*(

*x*

_{i}=

*x*

_{j}) returns a value of 1 if

*x*

_{i}and

*x*

_{j}are identical, and 0 otherwise. As in Moran's I,

*w*

_{ij}is an element of a spatial weighting matrix, which could be binary or continuous. The global sum of the counts is computed as

*X*

^{⊤}indicating the transpose of the column vector

*X*. Note the similarity to the numerator of Moran's I, which can be written as

*R*

^{⊤}

*WR*. The number of agreements can be compared with its expectation under the null hypothesis, yielding a hypothesis test for global autocorrelation (Cliff and Ord 1981).

Join count analysis has been applied to the study of dialect by Lee and Kretzschmar, Jr. (1993), who take each linguistic observation *x*_{i} ∈ {0,1} to be a binary variable indicating the presence or absence of a dialect feature. They then build a binary spatial weighting matrix by performing a Delaunay triangulation over the geolocations of participants in dialect interviews, with *w*_{ij} = 1 if the edge (*i*,*j*) appears in the Delaunay triangulation. A nice property of Delaunay triangulation is that points tend to be connected to their closest neighbors, regardless of how distant or near those neighbors are: In high-density regions, the edges will tend to be short, whereas in low-density regions, the edges will be long. The method is therefore arguably more suitable to data in which the density of observations is highly variable—for example, between densely populated cities and sparsely populated hinterlands.

Because join count statistics are based on counts of agreements, this form of analysis requires that each *x*_{i} is a categorical variable—possibly non-binary—rather than a frequency. In this sense, it is the complement of Moran's I, which can be applied to frequencies, but not to non-binary discrete variables. Thus, join count analysis is best suited to cases where observations correspond to individual utterances (e.g., Twitter data, dialect interviews), rather than cases where observations correspond to longer texts (e.g., newspaper corpora).

### 2.3 The Mantel Test

The Mantel test can in principle be used to measure the dependence between any two arbitrary signals. In this test, we compute *distances* for each pair of linguistic variables, *d*_{x}(*x*_{i},*x*_{j}), and each pair of spatial locations, *d*_{y}(*y*_{i},*y*_{j}), forming a pair of distance matrices *D*_{x} and *D*_{y}. We then estimate the element-wise correlation (usually, the Pearson correlation) between these two matrices. Scherrer (2012) uses the Mantel test to correlate linguistic distance with geographical distance, and Gooskens and Heeringa (2006) correlate perceptual distance with linguistic distance. The Mantel test has also been applied to non-human dialect analysis, revealing regional differences in the call structures of Amazonian parrots by computing a linguistic distance matrix *D*_{x} directly from spectral measurements (Wright 1996).

Because it is built around distance functions, the Mantel test is applicable to binary, categorical, and frequency data—any kind of data for which a distance function can be constructed. For spatial locations, a typical choice is to compute the distance matrix based on the Euclidean distance between each pair of points. For binary or categorical linguistic data, the entries of the linguistic distance matrix can be set to 0 if *x*_{i} = *x*_{j}, and 1 otherwise. For linguistic frequency data, we use the absolute difference between the frequency values.

The role of hypothesis testing in the Mantel test is to determine the likelihood that the observed test statistic—in this case, the correlation between the distance matrices *D*_{x} and *D*_{y}—could have arisen by chance under the null hypothesis. In the ideal case of perfect correlation, twice as much geographical distance should imply twice as much as linguistic distance. But this situation is highly unlikely to obtain in practice. In fact, as noted by Grieve (2014), such a perfect correlation *cannot* arise from any linguistic data involving a single variable: Even if a linguistic variable obeys a perfect dialect continuum (e.g., varying in frequency from east to west), the distances in the orthogonal north–south direction would diminish the resulting correlation. In realistic settings in which the geolinguistic dependence is obscured by noise, this can dramatically diminish the power of the test. Note that even non-linear transformations of the distance metric would not correct this issue. The key problem, as identified by Legendre, Fortin, and Borcard (2015), is that the Mantel test is not designed to test for independence between *X* and *Y*, but rather, the correlation between *distances* on *X* and *Y*. When distances are the primary units of analysis—as, for example, in the work of Gooskens and Heeringa (2006)—the test is applicable. For the task of determining whether a specific linguistic variable is geographically dependent, however, the test is incorrectly applied; as we show in Section 4, this results in inflated type II error rates.

### 2.4 Other Related Work

Several computational studies have attempted to characterize linguistic differences across geographical regions, although most of these studies do not perform hypothesis testing on geographical dependence. A common approach is to aggregate geotagged social media content into geographical bins. Some studies rely on politically defined units such as nations and states (Hovy, Johannsen, and Søgaard 2015); however, **isoglosses** (the geographical boundaries between linguistic features) need not align with politically defined geographical units (Nerbonne and Kretzschmar, Jr. 2013). Other approaches rely on automatically defined geographical units, induced by computational methods such as geodesic grids (Wing and Baldridge 2011), KD-trees (Roller et al. 2012), Gaussians (Eisenstein et al. 2010), and mixtures of Gaussians (Hong et al. 2012). While these approaches offer insights about the nature of geographical language variation, they do not provide test statistics that allow us to quantify the geographical dependence of various linguistic features.

As described in the next section, our approach is based on Reproducing Kernel Hilbert Spaces, which enable us to nonparametrically compare probability distributions. Another way in which kernel methods can be applied to spatial analysis is in Gaussian Processes, which are often used to represent spatial data (Cressie 1988; Ecker and Gelfand 1997). Specifically, we can define a kernel over space, so that a response variable is distributed as a Gaussian with covariance defined by the kernel function. For example, it might be possible to model the popularity of linguistic features as a Gaussian Process, using the spatial covariance kernel to make smooth predictions at unknown locations. Our approach in this article is different, as we are interested in hypothesis testing, rather than modeling and prediction. Another difference is that we apply kernels to both the geographical and linguistic data sources, whereas a Gaussian Process approach would make the parametric assumption that the linguistic signal is Gaussian distributed with covariance defined by the spatial covariance kernel.

## 3. Hilbert-Schmidt Independence Criterion

Moran's I, join count analysis, and the Mantel test share an important drawback: They do not directly test the independence of language and geography, but rather, they test for autocorrelation between *X* and *Y*, or between distances on these variables. Moran's I tests whether the parameter of a linear autoregressive model is nonzero; the Mantel test is performed on the correlations between pairwise distances; join count statistics enable tests of whether spatially adjacent units tend to have the same linguistic features. In each case, rejection of the null hypothesis implies dependence between the geographical and linguistic signals. However, each test can incorrectly fail to reject the null hypothesis if its assumptions are violated, even if given an arbitrarily large amount of data.

We propose an alternative approach: directly test for the independence of georaphical and linguistic variables, $PXY=?PXPY$. Our approach, which is based on the HSIC (Gretton et al. 2005a, 2008), makes no parametric assumptions about the form of these distributions. The proposed test is *consistent*, in the sense that it will always reach the right decision, if provided enough data (Fukumizu et al. 2007).^{3}

To test independence for arbitrary distributions *P*_{XY}, *P*_{X}, and *P*_{Y}, HSIC uses the framework of RKHS. This framework will be familiar to the computational linguistics community through its application to support vector machines (Collins and Duffy 2001; Lodhi et al. 2002), where **kernel similarity functions** between pairs of instances are used to induce classifiers with highly non-linear decision boundaries. HSIC is a kernelized independence test, and it offers an analogous advantage: By computing kernel similarity functions on pairs of observations, it is possible to implicitly compare probability distributions across high-order moments, enabling nonparametric independence tests that are statistically consistent. An additional advantage of the RKHS framework is that it can be applied to arbitrary linguistic data—including dichotomous, polytomous, continuous, and vector-valued observations—as long as an appropriate kernel similarity function can be defined.

Intuitively, HSIC tests independence by approximating a measure of the discrepancy between the joint geolinguistic distribution *P*_{XY} and the product of independent distributions *P*_{X}*P*_{Y}. The forms of these distributions are unknown; for example, *P*_{Y} might be Gaussian, or it might be some complicated multimodal distribution. The maximum mean discrepancy is a scalar function of the discrepancy between a pair of distributions, which makes no assumption about the distributions' parametric forms. The maximum mean discrepancy will be large when linguistic similarity tends to co-occur with geographical similarity, indicating an association between language and geography that is unlikely to arise by chance. The key insight is that it is possible to approximate the maximum mean discrepancy from a finite sample of observations, by rewriting it as a sum of kernel similarity functions. These kernel functions should quantify the similarity between each pair of instances as a scalar; they must also obey some more technical properties, enumerated in Section 3.3. If the kernel functions are appropriately chosen, then the approximation is asymptotically consistent, meaning that it will approach the exact maximum mean discrepancy in the limit of infinite data, regardless of the forms of *P*_{X}, *P*_{Y}, and *P*_{XY}. (This property is not shared by the Mantel test, which is superficially similar in that it operates on distances between pairs of observations.) We now present the mathematical details of the method.

### 3.1 Comparing Probability Distributions

**maximum mean discrepancy**(mmd) is a nonparametric statistic that compares two arbitrary probability distributions. In the HSIC test, this statistic is used to compare the joint distribution

*P*

_{XY}with the product of marginal distributions

*P*

_{X}

*P*

_{Y}. The mmd is defined as

*f*over a set of possible functions, and compute the difference in the expected values under the distributions

*P*and

*Q*. Clearly, if

*P*=

*Q*, then mmd = 0, but the challenge is to estimate mmd for arbitrary

*P*and

*Q*, based only on finite samples from these distributions.

To explain how to do this, we introduce some concepts from RKHS. Let $k:X\xd7X\u21a6R+$ denote a kernel function, mapping from pairs of observations (*x*_{i},*x*_{j}) to reals. A classical example is the **radial basis function** (RBF) kernel on vectors, where $k(x\u2192i,x\u2192j)=exp\u2212\gamma ||x\u2192i\u2212x\u2192j||22$. Many other kernel functions are possible; the conditions for valid kernels are enumerated in Section 3.3.

*x*, the kernel function

*k*defines a corresponding

**feature map**$k\xb7x:X\u21a6R+$, which is the function that arises by fixing one of the arguments of the kernel function to the value

*x*. The “reproducing” property of RKHS implies an identity between kernel functions and inner products of feature maps:

*x*, we can compute inner products of feature maps by computing the kernel similarity function over the associated instances. The mmd can be expressed in terms of such inner products, and thus, can be computed in terms of kernel similarity functions.

*P*, the

**mean element**of

*P*is defined as the expected feature map, $\mu P=EPk\xb7x$. The mmd can then be computed in terms of kernel functions of the mean elements,

If μ_{P} = μ_{Q}, then the mmd is zero. The key observation is that μ_{P} = μ_{Q} if and only if *P* = *Q*, so long as an appropriate kernel similarity function is chosen (Fukumizu et al. 2007); see Section 3.3 for more on the choice of kernel functions.

*x*

_{1},

*x*

_{2}, … ,

*x*

_{m}} and {

*y*

_{1},

*y*

_{2}, … ,

*y*

_{n}}:

### 3.2 Derivation of HSIC

*X*and

*Y*, we test the mmd between the joint distribution

*P*

_{XY}and the product of marginals

*P*

_{X}

*P*

_{Y}. In this setting, each observation

*i*corresponds to a pair (

*x*

_{i},

*y*

_{i}), so we require a kernel function on paired observations,

*k*((

*x*

_{i},

*y*

_{i}), (

*x*

_{j},

*y*

_{j})). We define this as a

**product kernel**,

Using the product kernel, we can define mean embeddings for the distributions *P*_{XY} and *P*_{X}*P*_{Y}, enabling the application of the mmd estimator from Equation (9). The HSIC is precisely this application of maximum mean discrepancy to compare the joint distribution against the product of marginal distributions.

**Gram matrix**

*K*

_{x}so that $Kxi,j=kXxixj$ for all pairs

*i*,

*j*in the sample. Analogously, $(Ky)i,j=kY(yi,yj)$. Then the HSIC can be estimated from a finite sample of

*m*observations as

**tr**indicates the matrix trace and

*H*is a centering matrix, $H=Im\u22121n11\u22a4$. With this definition of

*H*, we have

These two terms can be seen as mean-centered Gram matrices. By computing the trace of their matrix product, we obtain a cross-covariance between the Gram matrices. This trace is directly proportional to the maximum mean discrepancy between *P*_{XY} and *P*_{X}*P*_{Y}. If *X* and *Y* are dependent, then large values of $kXxixj$ will imply large values of $kYyiyj$—similar geography implies similar language—and so the cross-covariance will be greater than zero. If *X* and *Y* are independent, then large values of $kXxixj$ are not any more likely to correspond to large values of $kYyiyj$, and so the expectation of this cross-covariance will be zero.

### 3.3 Kernel Functions

To apply HSIC to the problem of detecting geolinguistic dependence, we must define the kernel functions $kX$ and $kY$. In the RKHS framework, valid kernel functions must be symmetric and positive semi-definite. To ensure consistency of the kernel-based estimator for mmd, the kernel must also be **characteristic**, meaning that it induces an injective mapping between probability measures and their corresponding mean elements (Fukumizu et al. 2007): Each probability measure *P* must correspond to a single unique mean element μ_{P}. Muandet et al. elaborate these and other properties of several well-known kernels (Muandet et al., 2016, Table 3.1).

For the spatial kernel $kY$, we use a Gaussian RBF, which is a widely used choice for vector data. Specifically, we define $kY(yi,yj;\gamma )=exp(\u2212\gamma di,j2)$, where $dij2$ is the squared Euclidean distance between *y*_{i} and *y*_{j}, and γ is a parameter of the kernel function. We also use the RBF in $kX$ when the linguistic observations take on continuous values, such as frequencies or phonetic data. The RBF kernel is symmetric, positive semi-definite, and characteristic.^{4}

The parameter γ corresponds to the “length-scale” of the kernel. Intuitively, as this parameter increases, the kernel similarity drops off more quickly with distance. In this article, we follow the popular heuristic of setting γ to the median of the data (*y*_{1}, … *y*_{n}), as proposed by Gretton et al. (2005b). We empirically test the sensitivity of HSIC to this parameter in Section 4. More recent work offers optimization-based approaches for setting this parameter (Gretton et al. 2012), but we do not consider this possibility here.

Linguistic data are often binary or categorical. In this case, we use a **Delta kernel** (also sometimes called a Dirac kernel). This kernel is simply defined as $kXxixj=1$ if *x*_{i} = *x*_{j} and 0 otherwise. The Delta kernel has been used successfully in combination with HSIC for high-dimensional feature selection (Song et al. 2012; Yamada et al. 2014), and is symmetric, positive semi-definite, and characteristic for discrete data. For continuous or vector-valued linguistic variables, the RBF kernel can again be applied.

### 3.4 Scalability

*r*

_{A}and

*r*

_{B}, which are set to ensure that the magnitudes of the residuals

*K*−

*AA*

^{T}and

*L*−

*BB*

^{T}are below a predefined threshold. HSIC may then be approximated as:

*HA*can be computed without explicitly forming the

*n*×

*n*matrix

*H*, because of its simple structure. Alternative methods for scaling the computation of HSIC are discussed in a recent note by Zhang et al. (2016).

## 4. Synthetic Data

Real linguistic data sets lack ground truth about which features are geographically distinct, making it impossible to use such data to quantitatively evaluate the proposed approaches. We therefore use synthetic data to compare the power and sensitivity of the various approaches we have described. Our main goals are: (1) to calibrate the p-values produced by each approach in the event that the null hypothesis is true, using completely randomized data; (2) to test the power of each approach to capture spatial dependence, particularly under conditions in which the spatial dependence is obscured by noise.

We compare HSIC with specific instantiations of the methods described in Section 2, focusing on previous published applications of these methods to dialect analysis. Specifically, we consider the following methods:

**Moran's I**We follow Grieve, Speelman, and Geeraerts (2011), using a binary spatial weighting matrix with a distance threshold τ, usually set to the median of the distances between points in the data set.^{5}This method is not applicable to categorical data.**Join counts**We follow the approach of Lee and Kretzschmar, Jr. (1993), who define a binary spatial weighting matrix from a Delaunay triangulation, and then compute join counts for linked pairs of observations. This method is not applicable to frequency data.**Mantel test**We use Euclidean distance for the geographical distance matrix. For continuous linguistic data, we also use Euclidean distance; for discrete data, we use a delta function.

For all approaches, a one-tailed significance test is appropriate, because in nearly all conceivable dialectological scenarios we are testing only for the possibility that geographically proximate units are *more* similar than they would be under the null hypothesis. For some methods, it is possible to calculate a p-value from the test statistic using a closed form estimate of the variance. However, for consistency, we use a permutation approach to characterize the null distribution over the test statistic values. We permute the linguistic data *x*, breaking any link between geography and the language data, and then compute the distribution of the test statistic under many such permutations.

### 4.1 Data Generation

To ensure the verisimilitude of our synthetic data, we target the scenario of geotagged tweets in the Netherlands. For each municipality *i*, we stochastically determine the number and location of the tweets as follows:

**Number of data points**For each municipality, the number of tweets*n*_{i}is chosen to be proportional to the population, as estimated by Statistics Netherlands. Specifically, we draw $n~i\u223cPoisson(\mu obs\xd7populationi)$ and then set $ni=n~i+1$, ensuring that each municipality has at least one data point. The parameter μ_{obs}controls the frequency of the linguistic variable. For example, a common orthographic variable (e.g., “g-deletion”) might have a high value of μ_{obs}, whereas a rare lexical variable (e.g.,*soda*versus*pop*) might have a much lower value. Note that μ_{obs}is shared across all municipalities.**Locations**Next, for each tweet*t*, we determine the location*y*_{t}by sampling without replacement from the set of real tweet locations in municipality*i*(the data set is described in Section 5.3). This ensures that the distribution of geolocations in the synthetic data matches the real geographical distribution of tweets, rather than drawing from a parametric distribution that may not match the complexity of true geographical population distributions. Each location is represented as a latitude and longitude pair (Figure 2).

For each variable, each municipality is assigned a frequency vector θ_{i}, indicating the relative frequency of each variable form: for example 70% *soda*, 30% *pop*. We discuss methods for setting θ_{i} subsequently, which enable the simulation of a range of dialectal phenomena.

We simulate both counts data and frequency data. In counts data—such as geotagged tweets—the data points in each instance in municipality *i* are drawn from a binomial or multinomial distribution with parameter θ_{i}. In frequency data, we observe only the relatively frequency of each variable form for each municipality. In this case, we draw the frequency from a Dirichlet distribution with expected value equal to θ_{i}, drawing ϕ_{t} ∼ Dirichlet(*s*θ_{i}), where the scale parameter *s* controls the variance within each municipality.

### 4.2 Calibration

Our first use of synthetic data is to examine the p-values obtained from each method when the null hypothesis is true—that is, when there is no geographical variation in the data. The p-value corresponds to the likelihood of seeing a test statistic at least as extreme as the observed value, under the null hypothesis. Thus, if we repeatedly generate data under the null hypothesis, a well-calibrated test will return a distribution of p-values that is uniform in the interval [0,1]. We would expect to observe p < 0.05 in exactly 5% of cases, corresponding to the allowed rate of type I errors (incorrect rejection of the null hypothesis) at the threshold α = 0.05.

To measure the calibration of each of the proposed tests, we generate 1,000 random data sets using this procedure, and then compute the p-values under each test. In these random data sets, the relative frequency parameters θ_{i} are the same for all municipalities, which is the null hypothesis of complete randomization. To generate the binary and categorical data, we use μ_{obs} = 10^{−5}, meaning that the expected number of observations is 1 per 100,000 individuals in the municipality; for comparison, this corresponds roughly to the tweet frequency of the lengthened spelling *hellla* in the 2009–2012 Twitter data set gathered by Eisenstein et al. (2014).

To visualize the calibration of each test, we use quantile-quantile (Q-Q) plots, comparing the obtained p-values with a uniform distribution. A well-calibrated test should give a diagonal line from the origin to (1,1). Figure 3 shows the Q-Q plots obtained from each method on each relevant type of data (recall that not all methods can be applied to all types of data, as described in the previous section).

HSIC and Moran's I each have tuning parameters that control the behavior of the test: the kernel bandwidth in HSIC, and the distance cutoff in Moran's I. A simple heuristic is to use the median Euclidian distance $d\xaf$: In Moran's I, we use $d\xaf$ as the distance threshold for constructing the neighborhood matrix *W*; in HSIC, we use $1d\xaf2$ as the kernel bandwidth parameter. Figure 3 shows that by basing these parameters on the median distance between pairs of points, we obtain well-calibrated results. However, some prior work takes an alternative approach, sweeping over parameter values to obtain the most significant results (Grieve, Speelman, and Geeraerts 2011). In our experiments we sweep across the distance cutoff for Moran's I, and the bandwidth for the spatial distances in HSIC. This distorts the calibration, meaning that the resulting p-values are not reliable. This is most severe for Moran's I with type I error rates of 11.7% (binary data) and 14.3% (frequency data) when the significance threshold α is set to 5%. Given that such parameter sweeps are explicitly designed to maximize the number of positive test results—and not the overall calibration of the test—this is unsurprising. We therefore avoid parameter sweeps in the remainder of this article, and rely instead on median distance as a simple heuristic alternative.

### 4.3 Power

Next, we consider synthetic data in which there is geographical variation by construction. We assess the **power** of each approach by computing the fraction of simulations for which the approaches correctly rejected the null hypothesis of no spatial dependence, given a significance threshold of α = 0.05. We again use the Netherlands as the stage for all simulations, and consider two types of geographical variation.

**Dialect continua**We generate data such that the frequency of a linguistic variant increases linearly through space, as in a dialect continuum (Heeringa and Nerbonne 2001). In most of the synthetic data experiments herein, we average across a range of angles, from 0° to 357° with step sizes of 3°, yielding 120 distinct angles in total. Each angle aligns differently with the population distribution of the Netherlands, so we also assess sensitivity of each method to the angle itself. Figure 2 shows two synthetic data sets with dialect continua in different angles.**Geographical centers**Second, we consider a setting in which variation is based on one or more geographical**centers**. In this setting, all cities within some specified range of the center (or one of the centers) have some maximal frequency value θ_{i}; in other cities, this value decreases as distance from the nearest center grows. This corresponds to the dialectological scenario in which a variable form is centered on one specific city, as in, say, the association of the word*hella*with the San Francisco metropolitan area. We average across 25 possible centers: the capitals of each of the 12 provinces of the Netherlands; the national capital of the Netherlands (Amsterdam); the*two*most populous cities in each of the 12 provinces. For each setting, we randomly generate synthetic data four times, resulting in a total of 100 synthetic data sets for this condition.

*Parameter settings.* We use these data generation scenarios to test the sensitivity of HSIC and Moran's I to their hyperparameters, by varying the kernel bandwidths in HSIC (Figures 4a and 4b) and the distance threshold in Moran's I (Figures 4c and 4d). The sensitivity of HSIC to the bandwidth value decreases as the number of data points increases (as governed by μ_{obs}), especially in the case of dialect continua. The sensitivity of Moran's I to the distance cutoff value decreases with the amount of data in the case of dialect continua, but in the case of center-based variation, Moran's I becomes *more* sensitive to this parameter as there is more data. For both methods, the same trends regarding the best performing parameters can be observed. In the case of dialect continua, larger cutoffs and bandwidths perform best, but in the case of variation based on centers, smaller cutoffs and bandwidths lead to higher power. Overall, there is no single best parameter setting, but the median heuristics perform reasonably well for both types of variation.

*Direction of dialect continua.* We simulate dialect continua by varying the frequency of linguistic variables linearly through space. Because of the heterogeneity of population density, different spatial angles will have very different properties: For example, one choice of angle would imply a continuum cutting through several major cities, whereas another choice might imply a rural–urban distinction. Figure 5 shows the power of the methods on binary data (there are two variant forms, and each instance contains exactly one of them), in which we vary the angle of the continuum. HSIC is insensitive to the angle of variation, demonstrating the advantage of this kernel nonparametric method. Moran's I is relatively robust, whereas join count analysis performs poorly across the entire range of settings. The Mantel test is remarkably sensitive to the angle of variation, attaining nearly zero power for some scenarios of dialect continua. This is caused by the complex interaction between the underlying linguistic phenomenon and the east–west variation of the population density of the Netherlands. For example, when the dialect continuum is simulated at an angle of 105 degrees, the southeast of the Netherlands has a higher usage of the variable, but this is only a very small region because of the shape of the country. The Mantel test apparently has great difficulty in detecting geographical variation in such cases.

*Outliers.* In the frequency-based synthetic data, each instance uses each variable form with some continuous frequency—this is based on the scenario of letters to the editors of regional newspapers, as explored in prior work (Grieve, Speelman, and Geeraerts 2011). We test the robustness of each approach by introducing **outliers**: Randomly selected data points whose variable frequencies are replaced at random with extreme values of either 0 or 1. As shown in Figure 6, HSIC is the most robust against outliers, and the performance of the Mantel test is the most affected by outliers (recall that join count analysis applies only to discrete observations, so it cannot be compared on this measure).

*Overall.* We now compare the methods by averaging across various settings simulating dialect continua (Figure 7) and variation based on centers (Figure 8). To generate the categorical data, we vary μ_{obs} in our experiments, with a higher μ_{obs} resulting in more tweets and consequently less variation on the municipality level. As expected, the power of the approaches increases as μ_{obs} increases in the experiments on the categorical data, and the power of the approaches decreases as *σ* increases in the experiments on the frequency data. The experiments on the binary and categorical data show the same trends: HSIC performs the best across all settings. Join count analysis does well when the variation is based on centers, and Moran's I does best for dialect continua. Moran's I performs best on the frequency data, especially in the case of variation based on centers.

### 4.4 Summary

In this section, we evaluated each statistical test for geographical language variation on a battery of synthetic data. HSIC and the Mantel test are the only approaches applicable to all data types (binary, categorical and frequency data). Overall, HSIC is more effective than the Mantel test, which is much more sensitive to the specifics of the synthetic data scenario, such as the angle of the dialect continuum. HSIC is robust against outliers, and performs particularly well when the number of data points increases. Join count analysis is suitable for capturing non-linear variation, but its power is low compared with other approaches in the analysis of dialect continua. Conversely, when the linguistic data is binary, Moran's I performs well on dialect continua, but its power is low in situations of variation based on centers. In our experiments on frequency data, Moran's I performs well in both scenarios.

## 5. Empirical Data

We now assess the spatial dependence of linguistic variables on three real linguistic data sets: letters to the editor (English), a syntactic atlas of the Dutch dialects, and Dutch geotagged tweets. In each data set, we compute statistical significance for the geolinguistic dependence of multiple linguistic variables. To adjust the significance thresholds for multiple comparisons, we use the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) to bound the overall false discovery rate (FDR).

### 5.1 Letters to the Editor

In their application of Moran's I to English dialects in the United States, Grieve, Speelman, and Geeraerts 2011 compile a corpus of letters to the editors of newspapers to measure the presence of dialect variables in text. To compute the frequency of the lexical variables, letters are aggregated to core-based statistical areas, which are defined by the United States to capture the geographical region around an urban core. The frequency of 40 manually selected lexical variables is computed for each of 206 cities.

We use the Mantel test, HSIC, and Moran's I to assess the spatial dependence of variables in this data set. Join count analysis was excluded, because it is not suitable for frequency data. We verified our implementation of Moran's I by following the approach taken by Grieve et al.: we computed Moran's I for cutoffs in the range of 200 to 1,000 miles and selected the cutoff that yielded the lowest p-value. The obtained cutoffs and test statistics closely followed the values reported in the analysis by Grieve et al., with slight deviations possibly due to our use of a permutation test rather than a closed-form approximation to compute the p-values.

After adjusting the p-values using the FDR procedure, a 500-mile cutoff results in three significant linguistic variables.^{6} However, recall that the approach of selecting parameters by maximizing the number of positive test results tends to produce a large number of type I errors. When setting the distance cutoff to the median distance between data points, none of the linguistic variables were found to have a significant geographical association. Similarly, HSIC and the Mantel test also found no significant associations after adjusting for multiple comparisons. Figure 9 shows the proportion of significant variables according to Moran's I based on different thresholds. The numbers vary considerably depending on the threshold. The figure also suggests that the median distance (921 miles) may not be a suitable threshold for this data set.

### 5.2 Syntactic Atlas of the Dutch Dialects (SAND)

Dialect atlases are frequently used in the study of dialect. In this section we demonstrate the use of the discussed methods on SAND (Barbiers et al. 2005, 2008), an online electronic atlas that maps syntactic variation of Dutch varieties in the Netherlands, Belgium, and France.^{7} The data were collected between the years of 2000 and 2005. SAND has been used to measure the distances between dialects, and to discover dialect regions (Spruit 2006; Tjong Kim Sang 2015). To our knowledge, we are the first to use statistical methods to quantify the degree of spatial dependence of the linguistic variables in this atlas. Compared with the other empirical data sets that we consider, this data set is smaller and many variables contain more than two variants.

In our experiments, we consider only locations within the Netherlands (157 locations). The number of variants per linguistic variable ranges from one (due to our restriction to the Netherlands) to 11. We do not include Moran's I in our experiments, because it is not applicable to linguistic variables with more than two variants. We apply the remaining methods to all linguistic variables with 20 or more data points and at least two variants, resulting in a total of 143 variables.

Table 1 lists the 10 variables with the highest HSIC values. Statistical significance at a level of α = 0.05 is detected for 65.0% of the linguistic variables using HSIC, 78.3% when using join count analysis, and 52.4% when using the Mantel test. The three methods agree on 99 out of the 143 variables, and HSIC and join count analysis agree on 118 variables. From manual inspection, it seems that the non-linearity of the geographical patterns may have caused difficulties for the Mantel test. Figure 10 is an example of a variable where HSIC and join count analysis both had an FDR-adjusted p < 0.05, but the Mantel test did not detect a significant association.

Map id
. | Description
. |
---|---|

1:84b | Free relative, complementizer following relative pronoun |

1:84a | Short subject and object relative, complementizer following relative pronoun |

1:80b | ONE pronominalization |

1:33a | Complementizer agreement 3 plural |

1:76a | Reflexive pronouns; synthesis |

2:36b | Form of the participle of the modal verb willen ‘want’ |

1:29a | Complementizer agreement 1 plural |

1:69a | Correlation weak reflexive pronouns |

2:30b | Interruption of the verbal cluster; synthesis I |

2:61a | Forms for iemand ‘somebody’ |

Map id
. | Description
. |
---|---|

1:84b | Free relative, complementizer following relative pronoun |

1:84a | Short subject and object relative, complementizer following relative pronoun |

1:80b | ONE pronominalization |

1:33a | Complementizer agreement 3 plural |

1:76a | Reflexive pronouns; synthesis |

2:36b | Form of the participle of the modal verb willen ‘want’ |

1:29a | Complementizer agreement 1 plural |

1:69a | Correlation weak reflexive pronouns |

2:30b | Interruption of the verbal cluster; synthesis I |

2:61a | Forms for iemand ‘somebody’ |

### 5.3 Twitter

Our Twitter data set consists of 4,039,786 geotagged tweets from the Netherlands, written between 1 January 2015 and 31 October 2015. We manually selected a set of linguistic variables (Table 2), covering examples of lexical variation (e.g., two different words for referring to french fries), phonological variation (e.g., t-deletion), and syntactic variation (e.g., *heb gedaan* [‘have done’] vs. *gedaan heb* [‘done have’]). We are not aware of any previous work on dialectal variation in the Netherlands that uses spatial dependency testing on Twitter data. The number of tweets per municipality varies dramatically, and for the less frequent linguistic variables there are no tweets at all in some municipalities. In our computation of Moran's I, we only include municipalities with at least one tweet.

Linguistic variables
. | Description
. | N
. | Moran's I
. | HSIC
. | Mantel
. | Join counts
. |
---|---|---|---|---|---|---|

Friet / patat | french fries | 842 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Proficiat / gefeliciteerd | congratulations | 14,474 | 0.0004 | 0.0002 | 0.0080 | 0.0003 |

Iedereen / een ieder | everyone | 13,009 | 0.8542 | 0.0002 | 0.8769 | 0.0432 |

Doei / aju | bye | 4,427 | 0.7163 | 0.0050 | 0.2570 | 0.3868 |

Efkes / eventjes | for a little while | 969 | 0.0036 | 0.0002 | 0.0003 | 0.0003 |

Naar huis / naar huus | to home | 3, 942 | 0.8542 | 0.1090 | 0.1245 | 0.9426 |

Niet meer / nie meer | not anymore | 11,596 | 0.0793 | 0.0002 | 0.5590 | 0.0329 |

Of niet / of nie | or not | 1,882 | 0.8357 | 0.1010 | 0.4191 | 0.9426 |

-oa- / -ao- | e.g., jao versus joa | 754 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Even weer / weer even | for a little while again | 921 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Have + participle | e.g., heb gedaan (‘have done’) vs. gedaan heb (‘done have’) | 1,122 | 0.8587 | 0.2849 | 0.6668 | 0.0255 |

Be + participle | e.g., ben geweest (‘have been’) vs. geweest ben (‘been have’) | 1,597 | 0.0793 | 0.2849 | 0.7862 | 0.0051 |

Spijkerbroek / jeans | jeans | 1,170 | 0.7796 | 0.0002 | 0.0080 | 0.0003 |

Doei/ houdoe | bye | 4,491 | 0.5016 | 0.0002 | 0.6668 | 0.0047 |

Bellen / telefoneren | to call by telephone | 4,689 | 0.2730 | 0.0003 | 0.9781 | 0.5941 |

Linguistic variables
. | Description
. | N
. | Moran's I
. | HSIC
. | Mantel
. | Join counts
. |
---|---|---|---|---|---|---|

Friet / patat | french fries | 842 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Proficiat / gefeliciteerd | congratulations | 14,474 | 0.0004 | 0.0002 | 0.0080 | 0.0003 |

Iedereen / een ieder | everyone | 13,009 | 0.8542 | 0.0002 | 0.8769 | 0.0432 |

Doei / aju | bye | 4,427 | 0.7163 | 0.0050 | 0.2570 | 0.3868 |

Efkes / eventjes | for a little while | 969 | 0.0036 | 0.0002 | 0.0003 | 0.0003 |

Naar huis / naar huus | to home | 3, 942 | 0.8542 | 0.1090 | 0.1245 | 0.9426 |

Niet meer / nie meer | not anymore | 11,596 | 0.0793 | 0.0002 | 0.5590 | 0.0329 |

Of niet / of nie | or not | 1,882 | 0.8357 | 0.1010 | 0.4191 | 0.9426 |

-oa- / -ao- | e.g., jao versus joa | 754 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Even weer / weer even | for a little while again | 921 | 0.0004 | 0.0002 | 0.0003 | 0.0003 |

Have + participle | e.g., heb gedaan (‘have done’) vs. gedaan heb (‘done have’) | 1,122 | 0.8587 | 0.2849 | 0.6668 | 0.0255 |

Be + participle | e.g., ben geweest (‘have been’) vs. geweest ben (‘been have’) | 1,597 | 0.0793 | 0.2849 | 0.7862 | 0.0051 |

Spijkerbroek / jeans | jeans | 1,170 | 0.7796 | 0.0002 | 0.0080 | 0.0003 |

Doei/ houdoe | bye | 4,491 | 0.5016 | 0.0002 | 0.6668 | 0.0047 |

Bellen / telefoneren | to call by telephone | 4,689 | 0.2730 | 0.0003 | 0.9781 | 0.5941 |

Table 2 shows the output of each statistical test for this data. Some of these linguistic variables exhibit strong spatial variation, and are identified as statistically significant by all approaches. An example is the different ways of referring to french fries ( *friet* versus *patat*, Figure 11a), where the figure shows a striking difference between the south and the north of the Netherlands. Another example is Figure 11b, which shows two different ways of saying ‘for a little while’ (*efkes* versus *eventjes*). The less common form, *efkes* is mostly used in Friesland, a province in the north of the Netherlands.

Examples of linguistic variables where the approaches disagree are shown in Figure 12. The first case (Figure 12a) is an example of lexical variation, with two different ways of saying *bye* in the Netherlands. A commonly used form is *doei*, while *houdoe* is known to be specific to North-Brabant, a Dutch province in the south of the Netherlands. HSIC and join count analysis both detect a significant pattern, but Moran's I and the Mantel test do not. The trend is less strong than in the previous examples, but the figure does suggest a higher usage of *houdoe* in the south of the Netherlands.

Another example is t-deletion for a specific phrase (*niet meer* versus *nie meer*), as shown in Figure 12b. Previous dialect research has found that geography is the most important external factor for t-deletion in the Netherlands, with contact zones, such as the Rivers region in the Netherlands (at the intersection of the dialects of the southern province of North-Brabant, the south-west province of Zuid-Holland, and the Veluwe region), having high frequencies of t-deletion (Goeman 1999). Both HSIC and join count analysis report an FDR-adjusted p < 0.05, whereas for Moran's I, the geographical association does not reach the threshold of significance.

We also present preliminary results on using HSIC as an exploratory tool on the same Twitter corpus. To focus on active users who are most likely tweeting on a personal basis, we exclude users with 1,000 or more followers and users who have fewer than 50 tweets, resulting in 8,333 users. We exclude infrequent words (used by fewer than 100 users) and very frequent words (used by 1,000 users or more), resulting in a total of 5,183 candidate linguistic variables. We represent the usage of a word by each author as a binary variable, and use HSIC to compute the level of spatial dependence for each word.

The top 10 words with the highest HSIC scores are *groningen* (city), *zwolle* (city), *eindhoven* (city) *arnhem* (city), *breda* (city), *enschede* (city), *nijmegen* (city), *leiden* (city), *twente* (region), and *delft* (city). Although these words do not reflect dialectal variation as it is normally construed, we expect their distribution to be heavily influenced by geography. Manual inspection revealed that many English words (e.g., *his, very*) have high geographical dependence. English speakers are more likely to visit tourist and commercial centers, so it is unsurprising that these words should show a strong geographical association. The top-ranked non-topical word is *proficiat*, occurring at rank 34 according to HSIC. *Proficiat* had previously been identified as a candidate dialect variable, and was included in our analysis in Table 2; this replication of prior dialectological knowledge validates the usage of HSIC as an exploratory tool. Less well-known are *joh* (an interjection) and *dadelijk* (‘immediately’/‘just a second’), which are ranked respectively at position 60 and 71 by HSIC. The geographical distributions of these words are shown in Figures 13a and 13b; both seem to distinguish the southern part of the Netherlands from the rest of the country. The identification of these words speaks to the potential of HSIC to guide the study of dialect by revealing geographically associated terms.^{8}

## 6. Conclusions

We have reviewed four methods for quantifying the spatial dependence of linguistic variables: Moran's I, which is perhaps the best known in sociolinguistics and dialectology; join count analysis; the Mantel test; and the Hilbert-Schmidt Independence Criterion (HSIC), which we introduce to linguistics in this paper.^{9} Of these methods, only HSIC is consistent, meaning that it converges to an accurate measure of the statistical dependence between *x* and *y* in the limit of sufficient data. In contrast, the other approaches are based on parametric models. When the assumptions of these models are violated, the power to detect significant geolinguistic associations is diminished. All three of these methods can be modified to account for various geographical distributions: For example, the spatial weighting matrix used in Moran's I and join count analysis can be constructed as a non-linear or non-monotonic function of distance (Getis and Aldstadt 2010), the distances in the Mantel test can be censored at some maximum value (Legendre, Fortin, and Borcard 2015), and so on. However, such modifications require the user to have strong prior expectations of the form of the geolinguistic dependence, and open the door to p-value hacking through iterative “improvements” to the test. In contrast, HSIC can be applied directly to any geotagged corpus, with minimal tuning. By representing the underlying probability distributions in a Hilbert space, HSIC implicitly makes a comparison across high-order moments of the distributions, thus recovering evidence of probabilistic dependence without parametric assumptions.

These theoretical advantages are borne out in an analysis of synthetic data in Section 4. We consider a range of realistic scenarios, finding that the power of Moran's I, the Mantel test, and join count analysis depends on the nature of the geographical variation (e.g., dialect continua versus centers), and in some cases, even on the direction of variation. Overall, we find that HSIC, although not the most powerful test in every scenario, offers the broadest applicability and the least potential for catastrophic failure of any of the proposed approaches.

We then showed how to apply these tests to a diverse range of real data sets in Section 5: Frequency observations in letters to the editor, binary and categorical observations in a dialect atlas, and binary observations in social media. We find that previous results on newspaper data were dependent on the procedure of selecting the geographical distance cutoff to maximize the number of positive test results; using all other test procedures, the significance of these results disappears (Section 5.1). On the dialect atlas, we find that the fraction of statistically significant variables ranges from 55.2% to 78.3%, depending on the statistical approach (Section 5.2). On the social media data, we obtain largely similar results from the four different tests, but HSIC detects the largest number of significant associations, identifying cases in which geography and population density were closely intertwined (Section 5.3).

To conclude, we believe that kernel embeddings of probability measures offer a powerful new approach for corpus analysis. In this article, we have focused on measuring geographical dependence, which can be used to test and discover new dialectal linguistic variables. But the underlying mathematical ideas may find application in other domains, such as tracking change over time (Štajner and Mitkov 2011; Popescu and Strapparava 2014), or between groups of authors (Koppel, Argamon, and Shimoni 2002). Of particular interest for future research is the use of structured kernels, such as tree kernels (Collins and Duffy 2001) or *n*-gram kernels (Lampos et al. 2015), which could test for structured linguistic phenomena such as variation in syntax (Johannsen, Hovy, and Søgaard 2015) or phonological change (Bouchard-Côté et al. 2007).

## Acknowledgments

Thanks to Jack Grieve for sharing the corpus of dialect variables from letters to the editor in North American newspapers, Arthur Gretton for advice about how best to use HSIC, Erik Tjong Kim Sang for help on using the SAND data, the DB group of the University of Twente for sharing the Dutch geotagged tweets, and Leonie Cornips and Sjef Barbiers for advice on selecting the Dutch linguistic variables. D. N. was supported by the Netherlands Organization for Scientific Research (NWO), grant 640.005.002 (FACT). This material is based upon work supported by the National Science Foundation under grant number RI-1452443, and by a grant from the Air Force Office of Scientific Research to J. E. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

## Notes

Global methods test for dependence over the entire data set. In some cases, there will be local dependence in a few “hot spots,” even when global dependence is not detected, and local autocorrelation statistics have been proposed to capture such dependences (Anselin 1995). For example, Grieve (2016) uses the Getis-Ord *G*_{i} statistic (Getis and Ord 1992) in his analysis of regional American English. Local statistics are particularly useful as an exploratory tool, but Grieve argues that the associated p-values are difficult to interpret because of the issue of multiple comparisons. We therefore focus on global tests in this article. The adaptation of the proposed HSIC statistic into a local measure of dependence is an intriguing topic for future work.

The matrix *W* can be defined in other ways. We can define a continuous-valued version of *W* by setting $wij=exp(\u2212\gamma dij)$, with *d*_{ij} equal to the geographical distance between units *i* and *j*. Alternatively, we could define a topological spatial weighting matrix by setting *w*_{ij} = 1 when *j* is one of the *k* nearest neighbors of *i* (Getis and Aldstadt 2010).

HSIC was introduced as a test of independence by Gretton et al. (2008). In this paper, we present the first application to computational linguistics.

Flaxman recently proposed a “kernelized Mantel test,” in which correlations are taken between kernel similarities rather than distances (Flaxman 2015). The resulting test statistic is similar, but not identical to, HSIC. Specifically, whereas HSIC centers the kernel matrix against the local mean kernel similarities for each point, the kernelized Mantel test centers against the global mean kernel similarity. This makes the test more sensitive to distant outliers. We implemented the kernelized Mantel test, and found its performance to be similar to the classical Mantel test, with lower statistical power than HSIC. Flaxman made similar observations in his analysis of the spatiotemporal distribution of crime events.

Other types of spatial weighting matrices might provide different results. We leave this for future work.

Grieve et al. report five significant variables. In our analysis, there are two variables with FDR-adjusted p-values of 0.0559.

The top 10 words with the highest Moran's I scores are similar: *groningen, eindhoven, friesland* (province), *leeuwarden* (city), *zwolle*, *proficiat*, *drachten* (city), *carnaval* (a festival), *brabant* (province), *enschede* (city).

Code is available at https://github.com/dongpng/geo-independence-testing.