Geographic Adaptation of Pretrained Language Models

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.

Humans, however, additionally make use of a rich spectrum of extralinguistic features when they learn and process language, including gender (Lass et al., 1979), ethnicity (Trent, 1995), and geography (Clopper and Pisoni, 2004). Despite the growing awareness of the importance of such factors in NLP (Hovy and Yang, 2021), extralinguistic features have so far typically been introduced in the fine-tuning phase, i.e., when specializing PLMs for a concrete task (e.g., Rosin et al., 2022). This prevents PLMs from forming generalizable representations the way humans do, impeding the exploitation of extralinguistic knowledge for tasks other than the fine-tuning task itself.
In this work, we focus on geographic knowledge, and more specifically geolinguistic knowledge, i.e., knowledge about geographic variation in language, the most salient type of extralinguistic variation in language (Wieling and Nerbonne, 2015). We present what we believe to be the first attempt to incorporate geolinguistic knowledge into PLMs in a pretraining step, i.e., before task-specific fine-tuning, making it possible to exploit it in any task for which it is expected to be useful. Specifically, we conduct an intermediate training step (Glavaš and Vulić, 2021) in the form of task-agnostic adaptation, dubbed geoadaptation, that couples language modeling with predicting the geographic location (i.e., longitude and latitude) of geolocated texts. We choose adaptation as opposed to pretraining from scratch for three reasons: (i) intermediate training on language modeling (i.e., adaptation) before task-specific fine-tuning has proved beneficial for many NLP tasks (Gururangan et al., 2020), (ii) adaptation has a lower computational cost than pretraining (Strubell et al., 2019), and (iii) PLMs encoding general-purpose linguistic knowledge are readily available (Wolf et al., 2020).1 The specific method we introduce for geoadaptation combines language modeling with token-level geolocation prediction via multi-task learning, with task weights based on the homoscedastic uncertainties of the task losses (Kendall et al., 2018).
We evaluate our geoadaptation framework on three groups of closely related languages, each with a corresponding PLM: (i) the German dialects spoken in Austria, Germany, and Switzerland (AGS) and GermanBERT, (ii) Bosnian-Croatian-Montenegrin-Serbian (BCMS) and BERTić, and (iii) Danish, Norwegian, and Swedish (DNS) and ScandiBERT. These groups exhibit strong geographic differences, providing an ideal testbed for geoadaptation.2 We further test geoadaptation at scale by adapting mBERT, a multilingual PLM, on the union of AGS, BCMS, and DNS.
We evaluate the effectiveness of geoadaptation on five downstream tasks expected to benefit from geolinguistic knowledge: (i) fine-tuned (i.e., supervised) geolocation prediction, (ii) zero-shot (i.e., unsupervised) geolocation prediction, (iii) fine-tuned language identification, (iv) zero-shot language identification, and (v) zero-shot prediction of dialect features. Geoadaptation leads to consistent performance gains compared to baseline models adapted on the same data using only language modeling, with particularly striking improvements on all zero-shot tasks. On two popular benchmarks for geolocation prediction and language identification, geoadaptation establishes a new state of the art. Furthermore, we show that geoadaptation geographically retrofits the representation space of the PLMs. Overall, we see our study as an exciting step towards grounding PLMs in geography.3

1 Notice that for the language areas we consider, there is currently also not enough geotagged data that would allow us to geographically pretrain models from scratch.
2 Our focus on AGS, BCMS, and DNS also contributes to the recent call for more work on languages other than English in NLP (Joshi et al., 2020; Razumovskaia et al., 2022).

3 We make our code available at https://github.com/valentinhofmann/geoadaptation.

Related Work
Adaptation of PLMs. Continued language modeling training (i.e., adaptation) on data that comes from a similar distribution as the task-specific target data has been shown to improve the performance of PLMs for many NLP tasks (Glavaš et al., 2020; Gururangan et al., 2020) as well as in various language (Pfeiffer et al., 2020; Parović et al., 2022) and domain adaptation scenarios (Chronopoulou et al., 2021; Hung et al., 2022). Adaptation can be seen as a special case of intermediate training, which aims at improving the target-task performance of PLMs by carrying out additional training between pretraining and fine-tuning (Phang et al., 2018; Vu et al., 2020; Glavaš and Vulić, 2021).
Intermediate training has also been conducted in a multi-task fashion, encompassing two or more training objectives (Liu et al., 2019a; Aghajanyan et al., 2021). Our work differs from these efforts in that it injects geolinguistic knowledge, a type of extralinguistic knowledge, into PLMs.
Extralinguistic knowledge. Leaving aside the large body of work on injecting visual (e.g., Bugliarello et al., 2022) and structured knowledge (e.g., Lauscher et al., 2020) into PLMs, a few studies have examined the interplay of PLM adaptation and extralinguistic factors (Luu et al., 2021; Röttger and Pierrehumbert, 2021). However, they focus on time and adapt PLMs to individual extralinguistic contexts (i.e., time points). In contrast, we inject geographic information from all contexts into the PLM, forcing it to learn links between linguistic variability and a language-external variable, in our case geography. This is fundamentally different from adapting the PLM only to certain realizations of the language-external variable.
Most other studies introduce the extralinguistic information during task-specific fine-tuning (Dhingra et al., 2021; Hofmann et al., 2021; Karpov and Kartashev, 2021; Kulkarni et al., 2021; Rosin et al., 2022). In contrast, we leverage geographic information only in the task-agnostic adaptation step. In task fine-tuning, the geoadapted PLM does not require any extralinguistic signal and is fine-tuned in the same manner as standard PLMs.
Geography in NLP. We also build upon the long line of NLP research on geography, which roughly falls into two camps. On the one hand, many studies model geographically conditioned differences in language, pointing to lexical variation as the most conspicuous manifestation (Eisenstein et al., 2010, 2011; Doyle, 2014; Eisenstein et al., 2014; Huang et al., 2016; Hovy and Purschke, 2018; Hovy et al., 2020), although phonological (Hulden et al., 2011; Blodgett et al., 2016), syntactic (Dunn, 2019; Demszky et al., 2021), and semantic properties (Bamman et al., 2014; Kulkarni et al., 2016) have been shown to exhibit geographic variation as well. On the other hand, there exists a large body of work on predicting geographic location from text, a task referred to as geolocation prediction (Rahimi et al., 2015a,b, 2017; Salehi et al., 2017; Rahimi et al., 2018; Scherrer and Ljubešić, 2020, 2021). To the best of our knowledge, we are the first to geographically adapt PLMs in a task-agnostic fashion, making them more effective for any downstream task for which geolinguistic knowledge is relevant, from geolocation prediction to dialect-related tasks and language identification.

Geoadaptation
Let D be a geotagged dataset consisting of sequences of tokens X = (x_1, ..., x_n) and corresponding geotags T = (t_lon, t_lat), where t_lon and t_lat denote the geographic longitude and latitude. We want to adapt a PLM in such a way that it encodes the geographically conditioned linguistic variability in D. Acknowledging the prominence of lexical variation among geographic differences in language (see §2), we accomplish this by combining masked language modeling (i.e., the pretraining objective) with token-level geolocation prediction in a multi-task setup that pushes the PLM to learn associations between linguistic phenomena and geolocations on the lexical level.4

Masked language modeling. We replace some tokens x_i in X with masked tokens x̃_i. Following Devlin et al. (2019), x̃_i can be a special mask token ([MASK]), a random vocabulary token, or the original token itself. X is fed into the PLM, which outputs a sequence of representations E = (e(x_1), ..., e(x_n)). The representations of the masked tokens e(x̃_i) are then fed into a classification head. We compute the masked language modeling loss L_mlm as the negative log-likelihood of the true token under the predicted distribution.
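As a minimal illustration (our own sketch, not the authors' implementation), the masked language modeling loss over the masked positions can be written as follows; the function name and toy distributions are ours:

```python
import math

def mlm_loss(probs_at_masks, true_token_ids):
    """Masked language modeling loss: the mean negative log-likelihood
    of the probability the model assigns to each masked-out true token.
    probs_at_masks: one probability distribution (over the vocabulary)
    per masked position; true_token_ids: the original tokens there."""
    nll = [-math.log(p[t]) for p, t in zip(probs_at_masks, true_token_ids)]
    return sum(nll) / len(nll)

# Two masked positions over a toy 4-token vocabulary:
probs = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
loss = mlm_loss(probs, [0, 3])  # mean of -log(0.7) and -log(0.25)
```

In the actual model, the distributions come from a softmax classification head over the PLM vocabulary; here they are given directly to keep the sketch self-contained.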
Geolocation prediction. We additionally feed the vectors of the masked tokens e(x̃_i) into a feed-forward regression head that predicts two real values: longitude and latitude. The geolocation prediction loss L_geo is the mean of the absolute prediction errors for longitude and latitude. Note that the gold geolocation is the same for all masked tokens from the same input sequence. We inject geographic information at the token level because lexical variation represents the most prominent type of geographic language variation (see §2).
Composite multi-task loss. We experiment with two different ways to compute the composite multi-task loss L_mt. First, we straightforwardly sum the two task-specific losses: L_mt = L_mlm + L_geo. In multi-task training, however, a simple sum of the losses can be a suboptimal choice, especially if the losses are not of the same order of magnitude. In our case, L_mlm and L_geo are measured on different scales, and relatively small values of L_geo may still be multiples of relatively large values of L_mlm (or vice versa). In a similar vein, the model might be more confident about one task than about the other (e.g., associating contextual token representations with geolocations may be easier than language modeling, i.e., predicting the correct token). To account for both factors, as a second method we compute the weights with which L_geo and L_mlm contribute to the joint loss based on their homoscedastic (i.e., task-dependent) uncertainties σ_mlm and σ_geo (Kendall and Gal, 2017). σ_mlm and σ_geo are learned as part of the model training. The dynamic weighting ensures that the objectives are given equal importance with respect to the overall optimization. Defining l ∈ {mlm, geo}, we follow Kendall et al. (2018) and replace L_l with:

L̃_l = 1/(2σ_l²) L_l + log σ_l   (1)

Equation 1 holds for both regression (e.g., mean absolute error as for L_geo) and classification losses (e.g., categorical cross-entropy as for L_mlm) and can be derived from their Bayesian formulations (Kendall et al., 2018). Notice that L̃_l is smoothly differentiable and well-formed: log σ_l ensures that the task weight 1/σ_l² does not converge to zero (or σ_l² diverge to infinity), which would be the trivial solution to minimizing 1/(2σ_l²) L_l. For numerical stability, we set η_l = 2 log σ_l and compute L̃_l as:

L̃_l = (1/2) exp(−η_l) L_l + η_l/2

The final multi-task loss is the sum of the two uncertainty-weighted losses: L̃_mt = L̃_mlm + L̃_geo.

Experimental Setup

Models. We examine four PLMs in this paper.
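The η_l parameterization of the uncertainty-weighted composite loss can be sketched as follows. This is a minimal illustration with our own names; in an actual implementation, eta_mlm and eta_geo would be learnable scalars optimized jointly with the PLM weights:

```python
import math

def uncertainty_weighted_loss(l_mlm, l_geo, eta_mlm, eta_geo):
    """Combine the two task losses via learned homoscedastic
    uncertainties (Kendall et al., 2018). Each task loss is scaled by
    0.5 * exp(-eta_l), where eta_l = 2 * log(sigma_l); the additive
    eta_l / 2 term (= log sigma_l) keeps the task weight from
    collapsing to zero."""
    weighted_mlm = 0.5 * math.exp(-eta_mlm) * l_mlm + 0.5 * eta_mlm
    weighted_geo = 0.5 * math.exp(-eta_geo) * l_geo + 0.5 * eta_geo
    return weighted_mlm + weighted_geo

# With eta = 0 (i.e., sigma = 1), the composite loss reduces to the
# plain half-sum of the task losses:
combined = uncertainty_weighted_loss(2.0, 4.0, 0.0, 0.0)  # 3.0
```

Increasing eta_l downweights a noisy task's loss at the cost of the eta_l / 2 penalty, which is what lets the optimizer balance the two objectives automatically.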
For AGS, we use GermanBERT, a German BERT (Devlin et al., 2019) model.5 For BCMS, we use BERTić (Ljubešić and Lauc, 2021), a BCMS ELECTRA (Clark et al., 2020) model;6 we specifically use the generator, i.e., a BERT model. For DNS, we resort to ScandiBERT, an XLM-RoBERTa (Conneau et al., 2020) model pretrained on corpora from five Scandinavian languages.7 Since we are interested in seeing whether geoadaptation can be expanded to a larger geographical area (e.g., an entire continent), we also geoadapt mBERT, a multilingual BERT (Devlin et al., 2019) model, on the union of the AGS, BCMS, and DNS areas.8 We refer to this setting as EUR.
Data. We start with a general overview of the data used for the experiments. Details about data splits are provided when describing the setup for geoadaptation as well as the evaluation tasks. Figure 1 shows the geographic distribution of the data. Tables 1 and 2 list summary statistics.
For AGS, we use the German data of the 2021 VarDial shared task on geolocation prediction (Chakravarthi et al., 2021), which consist of geotagged Jodel posts from the AGS area. We merge the Austrian/German and Swiss portions of the data. For BCMS, we use the BCMS data of the 2021 VarDial shared task on geolocation prediction (Chakravarthi et al., 2021), which consist of geotagged tweets from the BCMS area. To remedy the sparsity of the data for some regions, we retrieve an additional set of geotagged tweets from the BCMS area posted between 2008 and 2021 using the Twitter API, ensuring that there is no overlap with the VarDial data. For evaluation, we additionally draw upon SETimes, a news dataset for discriminating between Bosnian, Croatian, and Serbian (Rupnik et al., 2023). For DNS, we use geotagged tweets from the Nordic Tweet Stream (Laitinen et al., 2018), confining geotags to the DNS area.9 For evaluation, we additionally use the DNS portion of NordicDSL, a dataset of Wikipedia snippets for discriminating between Nordic languages (Haas and Derczynski, 2021). For EUR, we mix the AGS, BCMS, and DNS data.
Geoadaptation. For AGS, we create a balanced subset of the VarDial train posts (5,000 per country).10 For BCMS, we draw upon the union of the VarDial train posts and the newly collected posts to create a balanced subset (20,000 per country). For DNS, we similarly create a balanced subset of the posts (100,000 per country). For EUR, we sample balanced subsets of the AGS, BCMS, and DNS geoadaptation data (5,000 per country). Using these four datasets, we adapt the PLMs via the proposed multi-task learning approach (see §3). We geoadapt the PLMs for 25 epochs and save the model snapshots after each epoch. To track progress, we measure perplexity and token-level median distance on the VarDial development sets for AGS and BCMS, a separate set of 75,000 posts for DNS, and a separate set of 10,000 posts for EUR.

The five evaluation tasks, described next, target different aspects of the learned associations between linguistic phenomena and geography.

Fine-tuned geolocation prediction (FT-Geoloc).
We fine-tune the geoadapted PLMs for geolocation prediction. For AGS and BCMS, we use the train, dev, and test splits from VarDial. For DNS, we create separate sets of train, dev, and test posts; we do the same for EUR, drawing train, dev, and test posts from the union of the AGS, BCMS, and DNS data (see Table 1). We make sure that there is no overlap between the geoadaptation posts and the dev and test posts of any of the downstream evaluation tasks. Following prior work by Scherrer and Ljubešić (2021), we cast geolocation prediction as a multi-class classification task: we first map all geolocations in the train sets into k clusters using k-means and assign each geotagged post to its closest cluster.11 Concretely, we pass the contextualized vector of the [CLS] token to a single-layer softmax classifier that outputs a probability distribution over the k geographic clusters.
In line with prior work, we use the median of the Euclidean distance between the predicted and true geolocation as the evaluation metric. Note that FT-Geoloc is different from geolocation prediction in geoadaptation (see §3): there, we (i) cast geolocation prediction as a regression task (i.e., predict the exact longitude and latitude) and (ii) predict the geolocation from the masked tokens, rather than the representation of the whole post.
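To make the setup concrete, here is a small sketch (our own code, not the authors') of the cluster assignment and the median-distance metric. The coordinates are hypothetical, and distances here are in coordinate degrees; a real evaluation would convert to km:

```python
import numpy as np

def assign_clusters(geolocations, centroids):
    """Map each (lon, lat) pair to the index of its nearest centroid.
    In the paper's setup, the centroids come from running k-means on
    the training geolocations; the cluster index then serves as the
    class label for the softmax classifier on top of [CLS]."""
    dists = np.linalg.norm(
        geolocations[:, None, :] - centroids[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)

def median_distance(predicted, gold):
    """Evaluation metric: median Euclidean distance between predicted
    and true geolocations (the paper reports it in km)."""
    return float(np.median(np.linalg.norm(predicted - gold, axis=1)))

# Hypothetical centroids and posts as (lon, lat) pairs:
centroids = np.array([[16.4, 48.2], [13.4, 52.5], [8.5, 47.4]])
posts = np.array([[16.3, 48.1], [8.6, 47.5]])
labels = assign_clusters(posts, centroids)  # nearest centroid per post
```

At prediction time, a classified post is scored against its gold geolocation using the centroid of the predicted cluster, which is what makes the classification formulation compatible with the distance-based metric.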

Zero-shot geolocation prediction (ZS-Geoloc).
Given the central objective of geoadaptation (i.e., to induce mappings between linguistic variation and geography), we next test whether the geoadapted models can predict geographic information from text without any fine-tuning. To this end, we directly probe the PLMs for geolinguistic associations: with the help of prompts, we ask the PLMs to generate the correct toponym corresponding to a post's geolocation using their language modeling head, which has not been trained on geolocation prediction in any way (see §3). We do this at the most fine-grained geographic resolution possible, i.e., cities for BCMS/DNS and states for AGS.12 For EUR, we draw upon the union of AGS, BCMS, and DNS, resulting in a mix of cities and states.
For zero-shot prediction, we append prompts with the meaning 'This is [MASK]' to the post (AGS: Das ist [MASK]; BCMS: To je [MASK]; DNS: Dette er [MASK]).13 For EUR, we just append [MASK] to the post. We pass the whole sequence to the PLM and forward the output representation of the [MASK] token into the language modeling head. Following common practice (Xiong et al., 2020), we restrict the output vocabulary to the set of candidate labels, i.e., we select the city or state name with the highest logit. We measure the performance in terms of accuracy.
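The restricted-vocabulary selection can be sketched as follows; this is a toy illustration with hypothetical token ids, not the authors' code:

```python
import numpy as np

def zero_shot_predict(mask_logits, candidate_ids):
    """Restrict the language modeling head's output to a candidate set
    (here, toponym tokens) and return the highest-scoring candidate.
    mask_logits: vocabulary-sized logit vector at the [MASK] position;
    candidate_ids: vocabulary indices of the candidate label tokens."""
    candidate_logits = mask_logits[candidate_ids]
    return int(candidate_ids[int(np.argmax(candidate_logits))])

# Toy vocabulary with hypothetical ids for three city tokens:
vocab = {"Zagreb": 101, "Beograd": 102, "Sarajevo": 103}
logits = np.zeros(200)
logits[102] = 5.0   # suppose the PLM scores "Beograd" highest
logits[50] = 9.0    # a non-candidate token with an even higher logit
pred = zero_shot_predict(logits, np.array(list(vocab.values())))
```

Because the argmax is taken only over the candidate set, high-logit tokens outside it (function words, punctuation, etc.) cannot be predicted, which is what makes the masked LM usable as a zero-shot classifier.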
Fine-tuned language identification (FT-Lang). Next, we consider language identification, a task of great importance for many applications that is particularly challenging in the case of closely related languages (Zampieri et al., 2014; Haas and Derczynski, 2021). While arguably less directly tied to geography than geolocation prediction, we believe that language identification should also benefit from geoadaptation: since one or (in the case of multilingual communities) few languages are used at any given location, knowledge about geolinguistic variation should make it easier to distinguish different languages.
We start by fine-tuning the PLMs for language identification. For AGS, BCMS, and DNS, we reuse the respective FT-Geoloc datasets and sample 15,000 train, 1,500 dev, and 1,500 test posts per language (determined based on their geolocation). For EUR, we reuse the exact FT-Geoloc train, dev, and test split. To test how well the effects of geoadaptation generalize to out-of-domain data, we also fine-tune BERTić on SETimes (i.e., news articles) and ScandiBERT on NordicDSL (i.e., Wikipedia snippets). In terms of modeling, we formulate language identification as a multi-class classification task, with three classes for AGS/DNS, four classes for BCMS, and 10 classes for EUR. We again pass the contextualized vector of the [CLS] token to a single-layer softmax classifier that outputs a probability distribution over the languages. We measure the performance in terms of accuracy.
Zero-shot language identification (ZS-Lang). Similarly to geolocation prediction, we are interested to see how well the geoadapted PLMs can identify the language of a text without fine-tuning. We reuse the FT-Lang test sets for this task. The setup follows ZS-Geoloc, i.e., we append the same prompts to the posts, pass the full sequences through the PLMs, and feed the output representations of the [MASK] token into the language modeling head. However, instead of city/state names, we now consider language names, specifically bosanski ('Bosnian'), crnogorski ('Montenegrin'), hrvatski ('Croatian'), and srpski ('Serbian') in the case of BCMS, and dansk ('Danish'), norsk ('Norwegian'), and svensk ('Swedish') in the case of DNS.14 We select the language name with the highest logit and measure the performance in terms of accuracy.
Zero-shot dialect feature prediction (ZS-Dialect). The fifth evaluation tests whether geoadaptation increases the PLMs' awareness of dialectal variation. We only conduct this task for BCMS, which exhibits many well-documented dialectal variants that exist as tokens in the BERTić vocabulary.
We consider two subtasks. In the first subtask (Phon), we test whether BERTić can select the correct variant for a phonological variable, specifically the reflex of the Old Slavic vowel ě. This feature exhibits geographic variation in BCMS: in the (north-)west, the reflexes ije and je are predominantly used, whereas the (south-)east mostly uses e (Ljubešić et al., 2018), e.g., lijepo vs. lepo ('nice'). Drawing upon words for which both ije/je and e variants exist in the BERTić vocabulary, we filter out words that appear in fewer than 10 posts in the merged VarDial dev and test data, resulting in a set of 64 words (i.e., 32 pairs). Subsequently, we randomly sample 10 posts for each of the words. For the second subtask (Lex), we evaluate the recognition of lexical variation that is not tied to a phonological feature (Alexander, 2006), e.g., porodica vs. obitelj ('family'). Based on a Croatian-Serbian comparative dictionary,15 we select all pairs for which both words are in the BERTić vocabulary. We remove words that occur in fewer than 10 VarDial dev and test posts and sample 10 posts for each of the remaining 61 words.
For prediction, we mask out the phonological/lexical variant and follow the same approach as for ZS-Geoloc and ZS-Lang, with the difference that we restrict the vocabulary to the two relevant variants (e.g., porodica vs. obitelj). We measure the performance in terms of accuracy.

Table 3: Results on fine-tuned geolocation prediction (FT-Geoloc) and zero-shot geolocation prediction (ZS-Geoloc). Measure for FT-Geoloc: median distance (in km); measure for ZS-Geoloc: prediction accuracy. For FT-Geoloc and BCMS, the first row shows the current state-of-the-art performance (Scherrer and Ljubešić, 2021). For ZS-Geoloc, the first row shows random performance. Bold: best score in each column; underline: second-best score. We highlight scores that are significantly (p < .05) worse than the best score with a † and scores that are significantly (p < .05) worse than the two best scores with a ‡. We indicate with a ? scores for which we cannot test for statistical significance since we do not have access to the distribution of output predictions.
Model variants. We evaluate the two geoadaptation variants, minimizing the simple sum of L_mlm and L_geo (GeoAda-S) and the weighted sum based on homoscedastic uncertainty (GeoAda-W). To quantify the effects of geoadaptation compared to standard adaptation, we adapt the PLMs on the same data using only L_mlm as the primary baseline (MLMAda), i.e., the MLMAda models are adapted on the exact same text data as GeoAda-S and GeoAda-W, but using continued language modeling training without geolocation prediction. Where possible (i.e., BCMS FT-Geoloc and out-of-domain BCMS FT-Lang), we compare against the current state-of-the-art (SotA) performances (Scherrer and Ljubešić, 2021; Rupnik et al., 2023), i.e., BERTić fine-tuned on the train data. On the zero-shot tasks, we also report random performance (Rand).
Language identification is a task that is not typically addressed using PLMs. Instead, most state-of-the-art systems are less expensive models trained on character n-grams (Zampieri et al., 2017; Haas and Derczynski, 2021; Rupnik et al., 2023). To get a sense of whether PLMs in general and geoadapted PLMs in particular are competitive with such custom-built systems, we evaluate GlotLID (Kargaran et al., 2023), a strong language identification tool based on FastText (Bojanowski et al., 2017; Joulin et al., 2017), on FT-Lang. Since GlotLID was not specifically trained on the domains examined in FT-Lang, we also train new FastText models on the data used to fine-tune the PLMs.

Results and Analysis
Tables 3, 5, 6, and 7 compare the performance of the geoadapted PLMs against the baselines. To test for statistical significance of the performance differences, we use paired, two-sided Student's t-tests in the case of FT-Geoloc and McNemar's tests for binary data (McNemar, 1947) in the case of ZS-Geoloc, FT-Lang, ZS-Lang, and ZS-Dialect, as recommended by Dror et al. (2018). We correct the resulting p-values for each evaluation using the Holm-Bonferroni method (Holm, 1979).
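Holm's step-down correction is simple to implement; the following sketch (our own code, with made-up p-values) shows the logic:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm's step-down procedure: sort p-values ascending, compare the
    k-th smallest (0-indexed rank k) against alpha / (m - k), and stop
    rejecting at the first failure. Returns one reject/retain flag per
    original position."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # all larger p-values are retained as well
    return reject

# Three hypothetical pairwise comparisons within one evaluation:
flags = holm_bonferroni([0.01, 0.04, 0.30])  # only 0.01 survives
```

Unlike the plain Bonferroni correction, Holm's method uses progressively less strict thresholds for the larger p-values, so it is uniformly more powerful while still controlling the family-wise error rate.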
Overall, the geoadapted models consistently and substantially outperform the baselines: out of the 30 main evaluations, it is always one of the two geoadapted models that achieves the best score, a result that is highly unlikely to occur by chance if there is no underlying performance difference between the geoadapted and non-geoadapted models.16 Furthermore, in the two cases where we can directly compare to a prior state of the art, one or both geoadapted models outperform it. These findings strongly suggest that geoadaptation successfully induces associations between language variation and geographic location.
Fine-tuned geolocation prediction. PLMs geoadapted with uncertainty weighting (GeoAda-W) predict the geolocation most precisely (see Table 3). On BCMS, GeoAda-W improves the previous state of the art, achieved by a directly fine-tuned BERTić model, by 3.3 km on test and by over 6 km on dev. On EUR (arguably the most challenging setting), GeoAda-W improves upon MLMAda (i.e., a model adapted without geographic signal) by more than 10 km on both dev and test. MLMAda always performs worse than the two geoadapted models, despite the fact that task-specific fine-tuning likely compensates for some of the geographic knowledge GeoAda-W and GeoAda-S obtain in geoadaptation. This shows that geoadaptation drives the performance improvements, and that language modeling adaptation alone does not suffice. Loss weighting based on homoscedastic uncertainties seems beneficial for FT-Geoloc: while GeoAda-S already outperforms the baselines, GeoAda-W brings further significant gains in seven out of eight cases. We also observe that all models reach peak performance in the first few fine-tuning epochs (not shown), and that geoadaptation is useful even when the geoadaptation data are a subset of the fine-tuning data (as is the case for AGS). This confirms that the performance gains come from the geoadaptation and are not merely the result of longer training on geolocation prediction.

16 Assuming equal underlying performance for MLMAda, GeoAda-S, and GeoAda-W (and ignoring other baselines), the probability of this result is p = (2/3)^30 < 10^-5.
Zero-shot geolocation prediction. In this task, the PLMs have to predict the token of the correct toponym (i.e., city or state). Notice that the PLMs receive information about exact geolocations during geoadaptation and do not leverage toponym tokens in any direct way. ZS-Geoloc is thus an ideal litmus test, as it shows how well the link between language variation and geography, injected into the PLMs via geoadaptation, generalizes. The results (see Table 3) strongly suggest that geoadaptation leads to such generalization: both geoadapted model variants bring massive and statistically significant gains in prediction accuracy over MLMAda (e.g., GeoAda-W vs. MLMAda: +17.5% on BCMS, +8.3% on EUR). As on FT-Geoloc, uncertainty weighting (GeoAda-W) overall outperforms simple loss summation (GeoAda-S).
Figure 2 shows the confusion matrices for the three methods on BCMS, offering further insights. MLMAda assigns most posts from a country to the corresponding capital (e.g., posts from Croatian cities to Zagreb). These tokens are the most frequent ones out of all the considered cities, which seems to heavily affect MLMAda. In contrast, the predictions of GeoAda-S and GeoAda-W are much more nuanced, i.e., more diverse and less tied to the frequency of the toponym tokens: the geoadapted models are not only able to correctly assign posts from smaller, less frequently mentioned cities (e.g., Dubrovnik, Zenica), but their errors also reflect regional linguistic consistency and geographic proximity. For example, GeoAda-S predicts Rijeka as the origin of many Pula posts, and Bar as the origin of many Dubrovnik posts; similarly, GeoAda-W assigns posts from Split to Dubrovnik and posts from Bar to Podgorica.17

Table 4: Results on zero-shot geolocation prediction (ZS-Geoloc) with calibration (Zhao et al., 2021). Measure: prediction accuracy. Besides the results, we give the changes compared to vanilla ZS-Geoloc and indicate with a * if they are significant (p < .05). See Table 3 for an explanation of the other symbols used in the table.

One common method to alleviate the impact of different prior probabilities in the zero-shot setting (a potential reason for the bad performance of MLMAda) is to calibrate the PLM predictions (Holtzman et al., 2021; Zhao et al., 2021). Following Zhao et al. (2021), we measure the prior probabilities of all toponym tokens using a neutral prompt (specifically, 'This is [MASK]' for AGS/BCMS/DNS and a [MASK] token for EUR) and repeat the ZS-Geoloc evaluation, dividing the output probabilities by the prior probabilities (Table 4). We find that all models (both geoadapted and non-geoadapted) improve as a result of calibration, i.e., the output probabilities seem to be miscalibrated if not specifically adjusted by means of the prior probabilities. However, refuting the hypothesis that miscalibration causes the inferior performance of MLMAda, the average gain due to calibration is larger for the geoadapted models (GeoAda-S: +4.8%, GeoAda-W: +3.0%) than for the non-geoadapted model (MLMAda: +1.9%). This suggests that a miscalibration of the toponym probabilities, rather than disproportionately affecting the non-geoadapted models, generally impairs the geolinguistic capabilities of a PLM. The consequences of such an impairment seem to be the more detrimental the more profound the underlying geolinguistic knowledge is.
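The calibration step can be sketched as follows; this is our own illustration with hypothetical probabilities, not the authors' code:

```python
import numpy as np

def calibrate(label_probs, prior_probs):
    """Calibration in the spirit of Zhao et al. (2021): divide each
    candidate label's probability under the task prompt by its
    probability under a neutral prompt, then renormalize. Labels the
    PLM favors regardless of the input are thereby downweighted."""
    scores = np.asarray(label_probs) / np.asarray(prior_probs)
    return scores / scores.sum()

# Hypothetical probabilities for three toponyms (Zagreb, Beograd,
# Sarajevo); "Zagreb" has a high prior, so calibration flips the
# prediction to "Sarajevo":
task_probs = np.array([0.60, 0.10, 0.30])
priors = np.array([0.70, 0.10, 0.20])
calibrated = calibrate(task_probs, priors)
```

In this toy example, the uncalibrated argmax is the high-prior label (index 0), while the calibrated argmax is index 2, whose task probability exceeds its prior by the largest factor.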
Taken together, these observations indicate that GeoAda-S and GeoAda-W possess detailed knowledge of geographic variation in language. Since geoadaptation provides no supervision in the form of toponym names, this implies an impressive generalization, i.e., the association of linguistic constructs with toponyms, with geolocations (specifically, scalar longitude-latitude pairs) as the intermediary signal driving the generalization.
Fine-tuned language identification. The geoadapted PLMs are best at identifying the language in which a text is written: both GeoAda-S and GeoAda-W consistently show a higher accuracy than MLMAda (e.g., GeoAda-W vs. MLMAda: +5% on BCMS dev, +1.9% on EUR test), and the difference in performance is statistically significant in six out of eight cases (see Table 5). As opposed to the two geolocation tasks, where uncertainty weighting (GeoAda-W) clearly leads to better results than summing the losses (GeoAda-S), the difference is less pronounced for FT-Lang and significant only in one case (EUR test), even though GeoAda-W numerically outperforms GeoAda-S overall. Compared to the language identification models operating on the level of character n-grams (GlotLID, FastText), geoadaptation always brings statistically significant performance gains. Even MLMAda outperforms GlotLID and FastText in all cases, indicating that PLMs are generally competitive with more traditional systems on this task. We further notice that the relative disadvantage is particularly pronounced for GlotLID on BCMS. Upon inspection, we find that GlotLID's inferior performance on BCMS is due to the fact that it predicts more than 80% of the examples as Croatian. This imbalance can be explained as a result of the domain difference between GlotLID's training data and the FT-Lang evaluation data: while GlotLID was mostly trained on formal texts such as Wikipedia articles and government documents (Kargaran et al., 2023), we test it on data from Twitter. Crucially, while Croatian is the only BCMS language that consistently uses Latin script in formal contexts, with Cyrillic script being preferred especially in Serbian, Latin script is everywhere much more common on social media, even in Serbia (George, 2019). GlotLID seems to be heavily affected by this script mismatch and is only very rarely able to correctly predict the language of non-Croatian posts written in Latin script.

Table 5: Results on fine-tuned language identification (FT-Lang) and zero-shot language identification (ZS-Lang). Measure: prediction accuracy. See Table 3 for an explanation of the symbols used in the table.

Table 6: Results on out-of-domain fine-tuned language identification (FT-Lang) and zero-shot language identification (ZS-Lang). Measure for FT-Lang and BCMS: macro-average F1-score (for comparability); measure elsewhere: prediction accuracy. For FT-Lang and BCMS, the first row shows the current state-of-the-art performance (Rupnik et al., 2023). For ZS-Lang, the first row shows random performance. See Table 3 for an explanation of the symbols used in the table.
These trends are also reflected by the results on the out-of-domain language identification benchmarks: geoadaptation always outperforms both adaptation based on language modeling alone and the models operating on the level of character n-grams (see Table 6). On the SETimes benchmark (BCMS), GeoAda-W further establishes a new state of the art, almost halving the error rate from 0.5% to 0.3%. As for in-domain FT-Lang, the two geoadaptation variants perform on par. GlotLID again predicts many non-Croatian examples in Latin script as Croatian, leading to substantially worse performance on BCMS.
The superior performance of the geoadapted models in language identification - a task that is distinct from geolocation prediction and not typically addressed by means of PLMs - suggests that the geolinguistic knowledge acquired during geoadaptation is highly generalizable: it benefits a broader set of tasks with a connection to geography, not only the task used as an auxiliary objective during geoadaptation itself.
Zero-shot language identification. Here, the PLMs have to predict the token corresponding to the language in which a text is written, e.g., hrvatski ('Croatian'). This task requires generalization on two levels: first (similarly to FT-Lang), the PLMs have not been trained on language identification and are thus required to draw upon the geolinguistic knowledge they have formed during geoadaptation; second (similarly to ZS-Geoloc), the geolinguistic knowledge has not been provided to them in a form that would make it readily usable in a zero-shot setting: recall that the geographic information is presented in the form of longitude-latitude pairs (i.e., two scalars), whereas the language modeling head (which is used for the zero-shot predictions) is not trained differently than for vanilla adaptation (MLMAda). Despite these challenges, we find that geoadaptation substantially improves the performance of the PLMs on ZS-Lang (see Tables 5 and 6). The fact that the performance gains are equally pronounced on in-domain (e.g., GeoAda-W vs. MLMAda: +4.2% on DNS) and out-of-domain examples (e.g., GeoAda-W vs. 
MLMAda: +5.3% on DNS) highlights again that geoadaptation endows PLMs with knowledge that allows for a high degree of generalization.

Zero-shot dialect feature prediction. The results on ZS-Dialect, phonological (Phon) and lexical (Lex), generally follow the trends from the other four tasks (see Table 7): the geoadapted PLMs clearly (and statistically significantly) outperform MLMAda, albeit with overall narrower margins than in most other zero-shot tasks for BCMS (e.g., GeoAda-S vs. MLMAda: +8.6% on Phon, GeoAda-W vs. MLMAda: +4.1% on Lex). MLMAda is expectedly more competitive here: selecting the word variant that better fits into the linguistic context is essentially a language modeling task, for which additional language modeling training intuitively helps. For example, typical future tense constructions in Serbian vs. Croatian (ja ću da okupim vs. ja ću okupiti, 'I'll gather') have strong selectional preferences on subsequent lexical units (Alexander, 2006; e.g., porodicu vs. 
obitelj for 'family'). We further verify this by comparing the zero-shot performance on BCMS for different model checkpoints obtained during training. The performance curves over 25 (geo-)adaptation epochs, shown in Figure 3, confirm our hypothesis: longer language modeling adaptation substantially improves the performance of MLMAda on predicting dialect features, but its benefits for geolocation prediction and language identification remain limited. While prolonged language modeling adaptation allows MLMAda to eventually learn the dialectal associations, the inductive bias of the knowledge injected via geoadaptation allows GeoAda-S and GeoAda-W to reach high performance much sooner, after merely two to three epochs.
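Mechanically, all three zero-shot tasks (ZS-Geoloc, ZS-Lang, ZS-Dialect) can be read as masked-slot prediction restricted to a set of task-relevant candidate tokens. The following is a minimal sketch of that selection step, assuming precomputed logits from the language modeling head at the masked position; the function name and toy values are illustrative, not taken from the paper.

```python
import numpy as np

def zero_shot_predict(mask_logits, candidates):
    """Pick the most probable candidate token for a masked slot.

    mask_logits: 1-D array of language-modeling-head logits over the
                 vocabulary at the [MASK] position.
    candidates:  dict mapping a label (e.g., a language name or a
                 dialect variant) to its vocabulary index.
    """
    labels = list(candidates)
    ids = np.array([candidates[label] for label in labels])
    # Argmax over the candidate subset only, ignoring the rest of the vocabulary.
    return labels[int(np.argmax(mask_logits[ids]))]

# Toy 4-token vocabulary; the head assigns the highest logit to index 2.
logits = np.array([0.1, 1.3, 2.7, -0.5])
candidates = {"hrvatski": 1, "srpski": 2}
print(zero_shot_predict(logits, candidates))  # → srpski
```

Restricting the argmax to candidates is what makes the evaluation zero-shot: the head itself is never trained for the task.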
Effects of loss weighting. The dynamic weighting of L_mlm and L_geo (i.e., GeoAda-W) clearly outperforms the simple summation of the losses (i.e., GeoAda-S) on the geolocation prediction tasks (FT-Geoloc, ZS-Geoloc), but the difference between the two geoadaptation variants is less pronounced for FT-Lang, ZS-Lang, and ZS-Dialect. While geographic knowledge is beneficial for all five tasks, geolocation prediction arguably demands a more direct exploitation of that knowledge. Comparing the model variants in terms of the two task losses, we observe that GeoAda-S reaches lower L_mlm levels, whereas GeoAda-W ends with lower L_geo levels (see Figure 4 for the example of BCMS), which would explain the differences in their performance. We inspect GeoAda-W's task uncertainty weights after geoadaptation and observe η_mlm = 0.29 and η_geo = −0.35 for AGS, η_mlm = 1.12 and η_geo = −1.22 for BCMS, η_mlm = 0.84 and η_geo = −1.23 for DNS, and η_mlm = 0.90 and η_geo = −1.95 for EUR. Thus, GeoAda-W consistently assigns more importance to L_geo. The fact that the divergence of the task uncertainty weights is smallest for AGS explains why the difference between GeoAda-S and GeoAda-W on FT-Geoloc/ZS-Geoloc is least pronounced for that language group.

Table 8: Comparison between sequence-level geoadaptation (GeoAda-Seq) and token-level geoadaptation (GeoAda-Tok) for BCMS. GeoAda-Tok stands for the better-performing model between GeoAda-S and GeoAda-W on each task (see Tables 3, 5, and 7). See Table 3 for an explanation of the symbols used in the table.
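As a rough sketch, uncertainty-based loss weighting in the style of Kendall et al. (2018) combines the two losses via learned log-variance parameters (the η weights reported above); whether GeoAda-W uses exactly this parameterization is an assumption here, not a detail from the paper.

```python
import math

def weighted_loss(l_mlm, l_geo, eta_mlm, eta_geo):
    """Combine two task losses with learned log-variance weights (etas).

    Each loss is scaled by exp(-eta); the additive eta terms act as a
    regularizer that keeps the learned weights from growing without bound.
    A negative eta therefore upweights the corresponding loss.
    """
    return (math.exp(-eta_mlm) * l_mlm + eta_mlm
            + math.exp(-eta_geo) * l_geo + eta_geo)

# With the BCMS weights reported above, the geolocation loss is scaled by
# exp(1.22) ≈ 3.39 versus exp(-1.12) ≈ 0.33 for language modeling, i.e.,
# roughly an order of magnitude more importance for L_geo.
w_geo, w_mlm = math.exp(-(-1.22)), math.exp(-1.12)
print(w_geo / w_mlm)
```

Under this parameterization, the reported negative η_geo values directly correspond to GeoAda-W prioritizing the geolocation objective.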
Sequence-level geoadaptation. The decision to inject geographical information at the level of tokens was motivated by the central importance of the lexicon for geographically-conditioned linguistic variability (see §2). A plausible alternative - one less tied to lexical variation alone - is to geoadapt the PLMs by predicting the geolocation from the representation of the whole input text, i.e., to feed the contextualized representation of the [CLS] token to the regressor that predicts longitude and latitude. For comparison, we evaluate this variant too (GeoAda-Seq) and compare it against the best token-level geoadapted model (GeoAda-Tok; e.g., GeoAda-W for BCMS FT-Geoloc) on all PLMs and tasks. For reasons of space, we only present BCMS here, but the overall trends for AGS, DNS, and EUR are very similar. Sequence-level geoadaptation trails token-level geoadaptation on all tasks except for fine-tuned geolocation prediction (see Table 8). In general, while the difference is small for the fine-tuned tasks, it is large (and always significant) for the zero-shot tasks - for example, GeoAda-Seq performs only slightly better than MLMAda on ZS-Geoloc (see Table 3). This suggests that injecting geographic information on the level of tokens allows the PLMs to acquire more nuanced geolinguistic knowledge. Nonetheless, sequence-level geoadaptation still outperforms the non-geoadapted baselines.
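The two regressor placements can be contrasted in a small sketch: the sequence-level variant predicts one (longitude, latitude) pair from the [CLS] vector, while the token-level variant predicts one pair per token. Hidden size, sequence length, and the treatment of special tokens below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 768))   # contextualized vectors: [CLS] + 11 tokens
W = rng.normal(size=(768, 2)) * 0.01  # linear regressor -> (longitude, latitude)
b = np.zeros(2)

# Sequence-level variant (GeoAda-Seq): a single prediction from [CLS];
# the regression loss is applied once per input text.
seq_pred = hidden[0] @ W + b          # shape (2,)

# Token-level variant (GeoAda-S / GeoAda-W): one prediction per token;
# the regression loss is applied to every (non-special) token position.
tok_pred = hidden[1:] @ W + b         # shape (11, 2)

print(seq_pred.shape, tok_pred.shape)
```

The token-level loss gives every lexical item its own geolocation gradient, which is one plausible reading of why it injects more fine-grained geolinguistic knowledge.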
Geoadaptation as geographic retrofitting. Even though it makes intuitive sense that minimizing L_geo improves the geolinguistic knowledge of PLMs, we want to determine the exact mechanism by which it does so. Based on the results described so far, we formulate the following hypothesis: geoadaptation changes the representation space of the PLMs in such a way that tokens indicative of a certain location are brought close to each other, i.e., it has the effect of geographic retrofitting (Hovy and Purschke, 2018). We examine this hypothesis by analyzing (i) how the representations of toponyms and lexical variants change in relation to each other, and (ii) how the representations of toponyms change internally. We examine the PLM output embeddings (which directly impact the zero-shot predictions) and focus on BCMS.
For the first question, we use the geoadaptation data to compute type-level embeddings for the five largest Croatian (Zagreb, Split, Rijeka, Osijek, Zadar) and Serbian (Beograd, Niš, Kragujevac, Subotica, Pančevo) cities as well as the ije/e variants used for ZS-Dialect. Following established practice (e.g., Vulić et al., 2020; Litschko et al., 2022), we obtain type-level vectors for words (i.e., city name or phonological variant) by averaging the contextualized output representations of their token occurrences. We then resort to WEAT (Caliskan et al., 2017), a measure that quantifies the difference in association strength between a word (in our case, a city name) and two word sets (in our case, ije vs. e phonological variants). A positive or negative score indicates that a city name is associated more strongly with the ije or e variants, respectively. Figure 5 shows that during geoadaptation (GeoAda-W), the Croatian city names develop a strong association with the ije variants (i.e., positive WEAT scores), whereas the Serbian city names develop a strong association with the e variants (i.e., negative WEAT scores), which is exactly in line with their geographic distribution (Alexander, 2006). By contrast, the associations created during adaptation based on language modeling alone (MLMAda) are substantially weaker.
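The occurrence-averaging step and the per-word association component of WEAT can be sketched as follows. This is a toy 2-D illustration with made-up vectors; the full WEAT statistic additionally includes a permutation-based significance test over two target sets, which is omitted here.

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def type_embedding(occurrences):
    """Average the contextualized vectors of a word's token occurrences."""
    return np.mean(occurrences, axis=0)

def weat_association(w, set_a, set_b):
    """Mean cosine similarity of w to set A minus mean similarity to set B.

    Positive -> w is associated more with A (here: ije variants);
    negative -> more with B (here: e variants).
    """
    return (np.mean([cos(w, a) for a in set_a])
            - np.mean([cos(w, b) for b in set_b]))

# Toy example: a "city" embedding that lies close to the ije vectors
# yields a positive association score, mirroring Figure 5.
ije = [np.array([1.0, 0.1]), np.array([0.9, 0.0])]
e = [np.array([0.0, 1.0]), np.array([0.1, 0.9])]
zagreb = type_embedding([np.array([1.0, 0.2]), np.array([0.8, 0.0])])
print(weat_association(zagreb, ije, e) > 0)  # → True
```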
We then use the same set of 10 Croatian and Serbian cities and compare their pairwise geodesic distances against the pairwise cosine distances of the city name embeddings at the end of geoadaptation. The correlation between the two sets of distances (Pearson's r) is only 0.577 for MLMAda, but 0.881 for GeoAda-W, indicating a very strong correspondence between geographic and embedding distances. Furthermore, after only five epochs, the correlation is already 0.845 for GeoAda-W (vs. only 0.124 for MLMAda). This striking correspondence between real-world geography and the topology of the embedding space of geoadapted PLMs can also be seen by plotting the first two principal components of the city name embeddings on top of a geographic map, where we use orthogonal Procrustes (Schönemann, 1966; Hamilton et al., 2016) to align the points (see Figure 6). These results are strong evidence that geoadaptation indeed acts as a form of geographic retrofitting. The geographically restructured representation space of the PLMs can then be further refined via fine-tuning (as in FT-Geoloc and FT-Lang) or directly probed in a zero-shot manner (as in ZS-Geoloc, ZS-Lang, and ZS-Dialect).
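The distance-correlation analysis can be sketched as follows. The city coordinates below are rough approximations and the toy embeddings are constructed (not model outputs), chosen so that embedding geometry mirrors geography and the correlation comes out high.

```python
import numpy as np

def haversine(p, q):
    """Geodesic distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    lat1, lon1, lat2, lon2 = map(np.radians, (*p, *q))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))

def cosine_dist(u, v):
    return 1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def geo_embedding_correlation(coords, embs):
    """Pearson's r between pairwise geodesic and cosine distances."""
    n = len(coords)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    geo = [haversine(coords[i], coords[j]) for i, j in pairs]
    emb = [cosine_dist(embs[i], embs[j]) for i, j in pairs]
    return float(np.corrcoef(geo, emb)[0, 1])

# Toy check: embeddings that mirror geography yield a high correlation.
coords = [(45.8, 16.0), (44.8, 20.5), (43.5, 16.4)]  # approx. Zagreb, Beograd, Split
embs = [np.array([lat, lon, 1.0]) for lat, lon in coords]
r = geo_embedding_correlation(coords, embs)
print(r > 0.8)  # → True
```

In the paper's analysis, the embeddings are the type-level city vectors of the PLMs, and the gap between r = 0.577 (MLMAda) and r = 0.881 (GeoAda-W) quantifies the retrofitting effect.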

Conclusion
We introduce geoadaptation, an approach for task-agnostic continued pretraining of PLMs that forces them to learn associations between linguistic phenomena and geographic locations. The method we propose for geoadaptation couples language modeling and token-level geolocation prediction via multi-task learning. While we focus on PLMs pretrained via masked language modeling, geoadaptation can in principle be applied to autoregressive PLMs as well. We geoadapt four PLMs and obtain consistent gains on five tasks, establishing a new state of the art on established benchmarks. We further show that geoadaptation acts as a form of geographic retrofitting. Overall, we see our study as an exciting step towards NLP technology that takes into account extralinguistic aspects in general and geographic aspects in particular.

Figure 1 :
Figure 1: Geographic distribution of the data for AGS (left), BCMS (middle), and DNS (right). Each point represents a Jodel post (AGS) or tweet (BCMS, DNS). Point density correlates with population density, with the densest areas corresponding to urban centers. For DNS, we exclude the Svalbard islands, which do not have any points.

Figure 2 :
Figure 2: Confusion matrices for MLMAda (a), GeoAda-S (b), and GeoAda-W (c) on ZS-Geoloc (BCMS). While MLMAda always predicts one of the three most frequent city tokens (Beograd, Sarajevo, or Zagreb), the predictions of GeoAda-S and GeoAda-W are much more diverse and less tied to frequency.

Figure 3 :
Figure 3: Performance on BCMS ZS-Geoloc (a), ZS-Lang (b), and ZS-Dialect (c, d) for different numbers of epochs. In stark contrast to geoadaptation (GeoAda-S, GeoAda-W), language modeling adaptation alone (MLMAda) barely helps in acquiring geographic knowledge (a), which is also reflected by the consistently worse performance on ZS-Lang (b). MLMAda does form dialectal associations after several epochs, but the inductive bias of geoadaptation allows GeoAda-S and GeoAda-W to establish those associations more quickly (c, d).
Figure 4: (Geo-)adaptation diagnostics. The figure illustrates how the log perplexity of language modeling (a) and the median distance of token-level geolocation prediction (b) change on dev during BCMS geoadaptation.

Figure 5 :
Figure 5: Association strength between the BERTić embeddings of Croatian/Serbian cities and ije/e variants for MLMAda (top) and GeoAda-W (bottom), measured using WEAT (Caliskan et al., 2017). A positive or negative score indicates that a city is associated more strongly with the ije or e variants, respectively.

Figure 6 :
Figure 6: The first two principal components of the city output embeddings (points), plotted on top of a geographic map of Croatia and Serbia. The x-marks indicate the actual geographic locations of the cities.

Evaluation tasks. Inspired by existing NLP research on geography (see §2), we evaluate the geoadapted PLMs on five tasks that probe different aspects of their geolinguistic knowledge.

Table 1: Data statistics. The table provides the number of Jodel posts (AGS), tweets (BCMS, DNS), or both (EUR) used for (geo-)adaptation and the five evaluation tasks (FT-Geoloc, ZS-Geoloc, FT-Lang, ZS-Lang, ZS-Dialect). There is no overlap between the Jodel posts/tweets used for (geo-)adaptation and the ones used for evaluation. The FT-Geoloc splits for AGS and BCMS are the original VarDial (Chakravarthi et al., 2021) splits.

Table 2 :
The table provides the number of news articles (BCMS) and Wikipedia snippets (DNS) used for out-of-domain FT-Lang and ZS-Lang. The FT-Lang splits are the original SETimes (Rupnik et al., 2023) and NordicDSL (Haas and Derczynski, 2021) splits.

Table 7 :
Results on zero-shot dialect feature prediction (ZS-Dialect), which is only conducted for BCMS. Measure: prediction accuracy. See Table 3 for an explanation of the symbols used in the table.