Abstract
While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
1 Introduction
Pretrained language models (PLMs; Devlin et al., 2019; Liu et al., 2019b; Radford et al., 2019; Brown et al., 2020; Clark et al., 2020; Raffel et al., 2020; Chowdhery et al., 2022; Hoffmann et al., 2022; Touvron et al., 2023, inter alia), which are trained using language modeling objectives on large text corpora, have become the de facto default tool for the majority of NLP tasks. Despite the conceptual simplicity of language modeling, pretraining induces complex forms of linguistic knowledge in PLMs, at various levels (Rogers et al., 2020; Mahowald et al., 2023): morphological (Edmiston, 2020; Hofmann et al., 2020; Weissweiler et al., 2023), lexical (Ethayarajh, 2019; Vulić et al., 2020), syntactic (Hewitt and Manning, 2019; Jawahar et al., 2019; Wei et al., 2021; Weissweiler et al., 2022), and semantic (Wiedemann et al., 2019; Ettinger, 2020). This general linguistic knowledge is then (re-)shaped for concrete tasks via fine-tuning, i.e., supervised training on task-specific labeled data.
Humans, however, additionally make use of a rich spectrum of extralinguistic features when they learn and process language, including gender (Lass et al., 1979), ethnicity (Trent, 1995), and geography (Clopper and Pisoni, 2004). Despite the growing awareness of the importance of such factors in NLP (Hovy and Yang, 2021), extralinguistic features have so far typically been introduced only in the fine-tuning phase, i.e., when specializing PLMs for a concrete task (e.g., Rosin et al., 2022). This prevents PLMs from forming generalizable representations the way humans do, impeding the exploitation of extralinguistic knowledge for tasks other than the fine-tuning task itself.
In this work, we focus on geographic knowledge, and more specifically geolinguistic knowledge, i.e., knowledge about geographic variation in language—the most salient type of extralinguistic variation in language (Wieling and Nerbonne, 2015). We present what we believe to be the first attempt to incorporate geolinguistic knowledge into PLMs in a pretraining step, i.e., before task-specific fine-tuning, making it possible to exploit it in any task for which it is expected to be useful. Specifically, we conduct an intermediate training step (Glavaš and Vulić, 2021) in the form of task-agnostic adaptation—dubbed geoadaptation—that couples language modeling with predicting the geographic location (i.e., longitude and latitude) on geolocated texts. We choose adaptation as opposed to pretraining from scratch for three reasons: (i) intermediate training on language modeling (i.e., adaptation) before task-specific fine-tuning has proved beneficial for many NLP tasks (Gururangan et al., 2020), (ii) adaptation has a lower computational cost than pretraining (Strubell et al., 2019), and (iii) PLMs encoding general-purpose linguistic knowledge are readily available (Wolf et al., 2020).1 The specific method we introduce for geoadaptation combines language modeling with token-level geolocation prediction via multi-task learning, with task weights based on the homoscedastic uncertainties of the task losses (Kendall et al., 2018).
We evaluate our geoadaptation framework on three groups of closely related languages, each with a corresponding PLM: (i) the German dialects spoken in Austria, Germany, and Switzerland (AGS) and GermanBERT; (ii) Bosnian-Croatian-Montenegrin-Serbian (BCMS) and BERTić; and (iii) Danish, Norwegian, and Swedish (DNS) and ScandiBERT. These groups exhibit strong geographic differences, providing an ideal testbed for geoadaptation.2 We further test geoadaptation at scale by adapting mBERT, a multilingual PLM, on the union of AGS, BCMS, and DNS.
We evaluate the effectiveness of geoadaptation on five downstream tasks expected to benefit from geolinguistic knowledge: (i) fine-tuned (i.e., supervised) geolocation prediction, (ii) zero-shot (i.e., unsupervised) geolocation prediction, (iii) fine-tuned language identification, (iv) zero-shot language identification, and (v) zero-shot prediction of dialect features. Geoadaptation leads to consistent performance gains compared to baseline models adapted on the same data using only language modeling, with particularly striking improvements on all zero-shot tasks. On two popular benchmarks for geolocation prediction and language identification, geoadaptation establishes a new state of the art. Furthermore, we show that geoadaptation geographically retrofits the representation space of the PLMs. Overall, we see our study as an exciting step towards grounding PLMs in geography.3
2 Related Work
Adaptation of PLMs.
Continued language modeling training (i.e., adaptation) on data that comes from a similar distribution as the task-specific target data has been shown to improve the performance of PLMs for many NLP tasks (Glavaš et al., 2020; Gururangan et al., 2020) as well as in various language (Pfeiffer et al., 2020; Parović et al., 2022) and domain adaptation scenarios (Chronopoulou et al., 2021; Hung et al., 2022). Adaptation can be seen as a special case of intermediate training, which aims at improving the target-task performance of PLMs by carrying out additional training between pretraining and fine-tuning (Phang et al., 2018; Vu et al., 2020; Glavaš and Vulić, 2021). Intermediate training has also been conducted in a multi-task fashion, encompassing two or more training objectives (Liu et al., 2019a; Aghajanyan et al., 2021). Our work differs from these efforts in that it injects geolinguistic knowledge—a type of extralinguistic knowledge—into PLMs.
Extralinguistic Knowledge.
Leaving aside the large body of work on injecting visual (e.g., Bugliarello et al., 2022) and structured knowledge (e.g., Lauscher et al., 2020) into PLMs, a few studies have examined the interplay of PLM adaptation and extralinguistic factors (Luu et al., 2021; Röttger and Pierrehumbert, 2021). However, they focus on time and adapt PLMs to individual extralinguistic contexts (i.e., time points). In contrast, we inject geographic information from all contexts into the PLM, forcing it to learn links between linguistic variability and a language-external variable—in our case, geography. This is fundamentally different from adapting the PLM only to certain realizations of the language-external variable.
Most other studies introduce the extralinguistic information during task-specific fine-tuning (Dhingra et al., 2021; Hofmann et al., 2021; Karpov and Kartashev, 2021; Kulkarni et al., 2021; Rosin et al., 2022). In contrast, we leverage geographic information only in the task-agnostic adaptation step. In task fine-tuning, the geoadapted PLM does not require any extralinguistic signal and is fine-tuned in the same manner as standard PLMs.
Geography in NLP.
We also build upon the long line of NLP research on geography, which roughly falls into two camps. On the one hand, many studies model geographically conditioned differences in language, pointing to lexical variation as the most conspicuous manifestation (Eisenstein et al., 2010; Eisenstein et al., 2011; Doyle, 2014; Eisenstein et al., 2014; Huang et al., 2016; Hovy and Purschke, 2018; Hovy et al., 2020), although phonological (Hulden et al., 2011; Blodgett et al., 2016), syntactic (Dunn, 2019; Demszky et al., 2021), and semantic properties (Bamman et al., 2014; Kulkarni et al., 2016) have been shown to exhibit geographic variation as well. On the other hand, there exists a large body of work on predicting geographic location from text, a task referred to as geolocation prediction (Rahimi et al., 2015a, b, 2017; Salehi et al., 2017; Rahimi et al., 2018; Scherrer and Ljubešić, 2020, 2021). To the best of our knowledge, we are the first to geographically adapt PLMs in a task-agnostic fashion, making them more effective for any downstream task for which geolinguistic knowledge is relevant, from geolocation prediction to dialect-related tasks and language identification.
3 Geoadaptation
Let 𝒟 be a geotagged dataset consisting of sequences of tokens X = (x_1,…, x_n) and corresponding geotags T = (t_lon, t_lat), where t_lon and t_lat denote the geographic longitude and latitude. We want to adapt a PLM in such a way that it encodes the geographically conditioned linguistic variability in 𝒟. Acknowledging the prominence of lexical variation among geographic differences in language (see §2), we accomplish this by combining masked language modeling (i.e., the pretraining objective) with token-level geolocation prediction in a multi-task setup that pushes the PLM to learn associations between linguistic phenomena and geolocations on the lexical level.4
Masked Language Modeling.
We replace some tokens x_i in X with masked tokens x̃_i. Following Devlin et al. (2019), x̃_i can be a special mask token ([MASK]), a random vocabulary token, or the original token itself. X is fed into the PLM, which outputs a sequence of representations E = (e(x_1),…, e(x_n)). The representations of the masked tokens are then fed into a classification head. We compute the masked language modeling loss ℒ_mlm as the negative log-likelihood of the true token under the predicted distribution.
Geolocation Prediction.
We additionally feed the vectors of the masked tokens into a feed-forward regression head that predicts two real values: longitude and latitude. The geolocation prediction loss ℒ_geo is the mean of the absolute prediction errors for longitude and latitude. Note that the gold geolocation is the same for all masked tokens from the same input sequence. We inject geographic information at the token level because lexical variation represents the most prominent type of geographic language variation (see §2).
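To make this concrete, the following PyTorch sketch shows how such a token-level regression head and loss could look. This is our illustration, not the authors' implementation: the depth of the feed-forward head and all names are assumptions.

```python
import torch
import torch.nn as nn

class TokenGeoHead(nn.Module):
    """Feed-forward regression head mapping masked-token representations
    to (longitude, latitude) predictions."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 2),  # longitude, latitude
        )

    def forward(self, hidden_states, mask_positions):
        # hidden_states: (batch, seq_len, hidden); mask_positions: boolean mask
        masked_vecs = hidden_states[mask_positions]  # (num_masked, hidden)
        return self.regressor(masked_vecs)           # (num_masked, 2)

def geo_loss(pred, gold):
    # The gold geolocation is repeated for every masked token of a sequence;
    # the loss is the mean absolute error over longitude and latitude.
    return torch.mean(torch.abs(pred - gold))
```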
Composite Multi-task Loss.
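The body of this subsection is missing from our copy. Based on the descriptions of GeoAda-S and GeoAda-W in §4 and §5 and the homoscedastic-uncertainty weighting of Kendall et al. (2018), the composite loss plausibly takes the following form; this is a reconstruction, not the paper's original Equation 2.

```latex
% Simple summation (GeoAda-S):
\mathcal{L} = \mathcal{L}_{\mathrm{mlm}} + \mathcal{L}_{\mathrm{geo}}

% Homoscedastic-uncertainty weighting (GeoAda-W), following Kendall et al. (2018),
% with learnable parameters \eta_{\mathrm{mlm}} and \eta_{\mathrm{geo}}:
\mathcal{L} = e^{-\eta_{\mathrm{mlm}}}\,\mathcal{L}_{\mathrm{mlm}}
            + e^{-\eta_{\mathrm{geo}}}\,\mathcal{L}_{\mathrm{geo}}
            + \eta_{\mathrm{mlm}} + \eta_{\mathrm{geo}}
```

Under this form, a smaller η_l corresponds to a larger weight on task l, which is consistent with the note below that GeoAda-W, ending with negative η_geo values, assigns more importance to ℒ_geo.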
4 Experimental Setup
Models.
We examine four PLMs in this paper. For AGS, we use GermanBERT, a German BERT (Devlin et al., 2019) model.5 For BCMS, we use BERTić (Ljubešić and Lauc, 2021), a BCMS ELECTRA (Clark et al., 2020) model.6 We specifically use the generator, i.e., a BERT model. For DNS, we resort to ScandiBERT, an XLM-RoBERTa (Conneau et al., 2020) model pretrained on corpora from five Scandinavian languages.7 Since we are interested in whether geoadaptation can be expanded to a larger geographical area (e.g., an entire continent), we also geoadapt mBERT, a multilingual BERT (Devlin et al., 2019) model, on the union of the AGS, BCMS, and DNS areas.8 We refer to this setting as EUR.
Data.
We start with a general overview of the data used for the experiments. Details about data splits are provided when describing the setup for geoadaptation as well as the evaluation tasks. Figure 1 shows the geographic distribution of the data. Tables 1 and 2 list summary statistics.
| Language | Adaptation | FT-Geoloc Train | FT-Geoloc Dev | FT-Geoloc Test | ZS-Geoloc | FT-Lang Train | FT-Lang Dev | FT-Lang Test | ZS-Lang | ZS-Dialect Phon | ZS-Dialect Lex |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AGS | 15,000 | 343,748 | 31,538 | 33,953 | 1,600 | 45,000 | 4,500 | 4,500 | – | – | – |
| BCMS | 80,000 | 353,953 | 38,013 | 4,189 | 1,400 | 60,000 | 6,000 | 6,000 | 6,000 | 640 | 610 |
| DNS | 300,000 | 150,000 | 75,000 | 75,000 | 3,900 | 45,000 | 4,500 | 4,500 | 4,500 | – | – |
| EUR | 50,000 | 100,000 | 10,000 | 10,000 | 4,500 | 100,000 | 10,000 | 10,000 | – | – | – |
| Language | FT-Lang Train | FT-Lang Dev | FT-Lang Test | ZS-Lang |
|---|---|---|---|---|
| BCMS | 7,374 | 963 | 921 | 921 |
| DNS | 22,796 | 5,699 | 1,497 | 1,497 |
For AGS, we use the German data of the 2021 VarDial shared task on geolocation prediction (Chakravarthi et al., 2021), which consist of geotagged Jodel posts from the AGS area. We merge the Austrian/German and Swiss portions of the data. For BCMS, we use the BCMS data of the 2021 VarDial shared task on geolocation prediction (Chakravarthi et al., 2021), which consist of geotagged tweets from the BCMS area. To remedy the sparsity of the data for some regions, we retrieve an additional set of geotagged tweets from the BCMS area posted between 2008 and 2021 using the Twitter API, ensuring that there is no overlap with the VarDial data. For evaluation, we additionally draw upon SETimes, a news dataset for discriminating between Bosnian, Croatian, and Serbian (Rupnik et al., 2023). For DNS, we use geotagged tweets from the Nordic Tweet Stream (Laitinen et al., 2018), confining geotags to the DNS area.9 For evaluation, we additionally use the DNS portion of NordicDSL, a dataset of Wikipedia snippets for discriminating between Nordic languages (Haas and Derczynski, 2021). For EUR, we mix the AGS, BCMS, and DNS data.
Geoadaptation.
For AGS, we create a balanced subset of the VarDial train posts (5,000 per country).10 For BCMS, we draw upon the union of the VarDial train posts and the newly collected posts to create a balanced subset (20,000 per country). For DNS, we similarly create a balanced subset of the posts (100,000 per country). For EUR, we sample balanced subsets of the AGS, BCMS, and DNS geoadaptation data (5,000 per country). Using these four datasets, we adapt the PLMs via the proposed multi-task learning approach (see §3). We geoadapt the PLMs for 25 epochs and save the model snapshots after each epoch. To track progress, we measure perplexity and token-level median distance on the VarDial development sets for AGS and BCMS, a separate set of 75,000 posts for DNS, and a separate set of 10,000 posts for EUR.
Evaluation Tasks.
Inspired by existing NLP research on geography (see §2), we evaluate the geoadapted PLMs on five tasks that probe different aspects of the learned associations between linguistic phenomena and geography.
Fine-tuned Geolocation Prediction (FT-Geoloc).
We fine-tune the geoadapted PLMs for geolocation prediction. For AGS and BCMS, we use the train, dev, and test splits from VarDial. For DNS, we create separate sets of train, dev, and test posts; we do the same for EUR, drawing train, dev, and test posts from the union of the AGS, BCMS, and DNS data (see Table 1). We make sure that there is no overlap between the geoadaptation posts and dev and test posts of any of the downstream evaluation tasks. Following prior work by Scherrer and Ljubešić (2021), we cast geolocation prediction as a multi-class classification task: We first map all geolocations in the train sets into k clusters using k-means and assign each geotagged post to its closest cluster.11 Concretely, we pass the contextualized vector of the [CLS] token to a single-layer softmax classifier that outputs probability distributions over the k geographic clusters.
In line with prior work, we use the median of the Euclidean distance between the predicted and true geolocation as the evaluation metric. Note that FT-Geoloc is different from geolocation prediction in geoadaptation (see §3): there, we (i) cast geolocation prediction as a regression task (i.e., predict the exact longitude and latitude) and (ii) predict the geolocation from the masked tokens, rather than the representation of the whole post.
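As an illustration of this setup, the sketch below clusters the training geolocations and computes the median-distance metric. scikit-learn/NumPy are our tooling choices, `train_coords`, `pred_cluster_ids`, and `true_coords` are placeholder variables, and the assumption that a post's predicted geolocation is the centroid of its predicted cluster is ours (consistent with, but not stated in, the text above).

```python
import numpy as np
from sklearn.cluster import KMeans

# train_coords: (N, 2) array of standardized (longitude, latitude) pairs.
k = 75  # following Scherrer and Ljubešić (2021)
kmeans = KMeans(n_clusters=k, random_state=0).fit(train_coords)
train_labels = kmeans.labels_  # cluster ids used as classification targets

def median_distance(pred_cluster_ids, true_coords):
    # Predicted geolocation = centroid of the predicted cluster (assumption).
    pred_coords = kmeans.cluster_centers_[pred_cluster_ids]
    dists = np.linalg.norm(pred_coords - true_coords, axis=1)  # Euclidean distance
    return np.median(dists)
```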
Zero-shot Geolocation Prediction (ZS-Geoloc).
Given the central objective of geoadaptation (i.e., to induce mappings between linguistic variation and geography), we next test if the geoadapted models can predict geographic information from text without any fine-tuning. To this end, we directly probe the PLMs for geolinguistic associations: With the help of prompts, we ask the PLMs to generate the correct toponym corresponding to a post’s geolocation using their language modeling head, which has not been trained on geolocation prediction in any way (see §3). We do this on the most fine-grained geographic resolution possible, i.e., cities for BCMS/DNS and states for AGS.12 For EUR, we draw upon the union of AGS, BCMS, and DNS, resulting in a mix of cities and states.
To create the data for ZS-Geoloc, we start by reverse-geocoding all posts and then select cities/states that contain at least 100 posts and have names existing in the PLM vocabulary. We randomly sample 100 posts from each of these cities/states (AGS: Bayern, Bern, Brandenburg, Bremen, Hessen, Kärnten, Luzern, Niedersachsen, Oberösterreich, Saarland, Sachsen, Salzburg, Steiermark, Thüringen, Tirol, Zürich; BCMS: Bar, Beograd, Bor, Dubrovnik, Kragujevac, Niš, Podgorica, Pula, Rijeka, Sarajevo, Split, Tuzla, Zagreb, Zenica; DNS: Aalborg, Aarhus, Arendal, Bergen, Drammen, Fredrikstad, Göteborg, Halmstad, Haugesund, Helsingborg, Kalmar, Karlstad, Kristiansand, København, Linköping, Luleå, Lund, Moss, Nora, Norrköping, Odense, Oslo, Porsgrunn, Roskilde, Sala, Sandefjord, Sarpsborg, Skien, Stavanger, Stockholm, Södertälje, Tromsø, Trondheim, Tønsberg, Uddevalla, Umeå, Uppsala, Ålesund, Örebro; EUR: the 45 of the above cities/states whose names are in the mBERT vocabulary).
For zero-shot prediction, we append prompts with the meaning ‘This is [MASK]’ to the post (AGS: Das ist [MASK]; BCMS: To je [MASK]; DNS: Dette er [MASK]).13 For EUR, we just append [MASK] to the post. We pass the whole sequence to the PLM and forward the output representation of the [MASK] token into the language modeling head. Following common practice (Xiong et al., 2020), we restrict the output vocabulary to the set of candidate labels, i.e., we select the city or state name with the highest logit. We measure the performance in terms of accuracy.
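A minimal sketch of this zero-shot procedure with HuggingFace Transformers follows; the model path, prompt, and candidate list are placeholders, and we assume each candidate toponym is a single token in the vocabulary (as ensured by the data construction above).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("path/to/geoadapted-plm")  # placeholder
model = AutoModelForMaskedLM.from_pretrained("path/to/geoadapted-plm")

candidates = ["Zagreb", "Beograd", "Split", "Sarajevo"]  # illustrative subset
candidate_ids = tokenizer.convert_tokens_to_ids(candidates)

def zero_shot_toponym(post: str, prompt: str = "To je") -> str:
    text = f"{post} {prompt} {tokenizer.mask_token}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    # Restrict the output vocabulary to the candidate toponyms.
    cand_logits = logits[0, mask_pos, candidate_ids]
    return candidates[cand_logits.argmax().item()]
```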
Fine-tuned Language Identification (FT-Lang).
Next, we consider language identification, a task of great importance for many applications that is particularly challenging in the case of closely related languages (Zampieri et al., 2014; Haas and Derczynski, 2021). While arguably less directly tied to geography than geolocation prediction, we believe that language identification should also benefit from geoadaptation since one or (in the case of multilingual communities) few languages are used at any given location—having knowledge about geolinguistic variation should thus make it easier to distinguish different languages.
We start by fine-tuning the PLMs for language identification. For AGS, BCMS, and DNS, we reuse the respective FT-Geoloc datasets and sample 15,000 train, 1,500 dev, and 1,500 test posts per language (determined based on their geolocation). For EUR, we reuse the exact FT-Geoloc train, dev, and test split. To test how well the effects of geoadaptation generalize to out-of-domain data, we also fine-tune BERTić on SETimes (i.e., news articles) and ScandiBERT on NordicDSL (i.e., Wikipedia snippets). In terms of modeling, we formulate language identification as a multi-class classification task, with three classes for AGS/DNS, four classes for BCMS, and 10 classes for EUR. We again pass the contextualized vector of the [CLS] token to a single-layer softmax classifier that outputs probability distributions over the languages. We measure the performance in terms of accuracy.
Zero-shot Language Identification (ZS-Lang).
Similarly to geolocation prediction, we are interested to see how well the geoadapted PLMs can identify the language of a text without fine-tuning. We reuse the FT-Lang test sets for this task. The setup follows ZS-Geoloc, i.e., we append the same prompts to the posts, pass the full sequences through the PLMs, and feed the output representations of the [MASK] token into the language modeling head. However, instead of city/state names, we now consider language names, specifically, bosanski (‘Bosnian’), crnogorski (‘Montenegrin’), hrvatski (‘Croatian’), and srpski (‘Serbian’) in the case of BCMS, and dansk (‘Danish’), norsk (‘Norwegian’), and svensk (‘Swedish’) in the case of DNS.14 We select the language name with the highest logit and measure the performance in terms of accuracy.
Zero-shot Dialect Feature Prediction (ZS-Dialect).
The fifth evaluation tests whether geoadaptation increases the PLMs’ awareness of dialectal variation. We only conduct this task for BCMS, which exhibits many well-documented dialectal variants that exist as tokens in the BERTić vocabulary.
We consider two subtasks. In the first subtask (Phon), we test whether BERTić can select the correct variant for a phonological variable, specifically the reflex of the Old Slavic vowel ě. This feature exhibits geographic variation in BCMS: In the (north-)west, the reflexes ije and je are predominantly used, whereas the (south-)east mostly uses e (Ljubešić et al., 2018), e.g., lijepo vs. lepo (‘nice’). Drawing upon words for which both ije/je and e variants exist in the BERTić vocabulary, we filter out words that appear in fewer than 10 posts in the merged VarDial dev and test data, resulting in a set of 64 words (i.e., 32 pairs). Subsequently, we randomly sample 10 posts for each of the words. For the second subtask (Lex), we evaluate the recognition of lexical variation that is not tied to a phonological feature (Alexander, 2006), e.g., porodica vs. obitelj (‘family’). Based on a Croatian-Serbian comparative dictionary,15 we select all pairs for which both words are in the BERTić vocabulary. We remove words that occur in fewer than 10 VarDial dev and test posts and sample 10 posts for each of the remaining 61 words.
For prediction, we mask out the phonological/lexical variant and follow the same approach as for ZS-Geoloc and ZS-Lang, with the difference that we restrict the vocabulary to the two relevant variants (e.g., porodica vs. obitelj). We measure the performance in terms of accuracy.
Model Variants.
We evaluate the two geoadaptation variants: one minimizing the simple sum of ℒ_mlm and ℒ_geo (GeoAda-S) and one minimizing the weighted sum based on homoscedastic uncertainty (GeoAda-W). To quantify the effects of geoadaptation compared to standard adaptation, we adapt the PLMs on the same data using only ℒ_mlm as the primary baseline (MLMAda), i.e., the MLMAda models are adapted on the exact same text data as GeoAda-S and GeoAda-W, but using continued language modeling training without geolocation prediction. Where possible (i.e., BCMS FT-Geoloc and out-of-domain BCMS FT-Lang), we compare against the current state-of-the-art (SotA) performances (Scherrer and Ljubešić, 2021; Rupnik et al., 2023)—BERTić fine-tuned on the train data. On the zero-shot tasks, we also report random performance (Rand).
Language identification is a task that is not typically addressed using PLMs. Instead, most state-of-the-art systems are less expensive models trained on character n-grams (Zampieri et al., 2017; Haas and Derczynski, 2021; Rupnik et al., 2023). To get a sense of whether PLMs in general and geoadapted PLMs in particular are competitive with such custom-built systems, we evaluate GlotLID (Kargaran et al., 2023), a strong language identification tool based on FastText (Bojanowski et al., 2017; Joulin et al., 2017), on FT-Lang. Since GlotLID was not specifically trained on the domains examined in FT-Lang, we also train new FastText models on the data used to fine-tune the PLMs.
Hyperparameters.
For geoadaptation, we use a batch size of 32 (16 for mBERT) and perform grid search for the learning rate r ∈ {1 × 10^−5, 3 × 10^−5, 1 × 10^−4}. We always geoadapt the PLMs for 25 epochs. For FT-Geoloc, we use a batch size of 32 (16 for mBERT) and perform grid search for the number of epochs n ∈ {1,…,10} and the learning rate r ∈ {1 × 10^−5, 3 × 10^−5, 1 × 10^−4}. For FT-Lang, we use a batch size of 32 (16 for mBERT) and perform grid search for the number of epochs n ∈ {1,…,5} and the learning rate r ∈ {1 × 10^−5, 3 × 10^−5, 1 × 10^−4}. For all training settings (geoadaptation, FT-Geoloc, FT-Lang), we tune r for MLMAda only and use the best configuration for GeoAda-W and GeoAda-S. This means that the overall number of hyperparameter trials is three times larger for MLMAda than for GeoAda-W and GeoAda-S, i.e., we give a substantial advantage to the models that serve as baselines. We use Adam (Kingma and Ba, 2015) as the optimizer. All experiments are performed on a GeForce GTX 1080 Ti GPU (11GB). For the FastText models trained on FT-Lang, we perform grid search for the number of epochs n ∈ {5, 10, 15, 20, 25} as well as the minimum and maximum lengths of included character n-grams.
5 Results and Analysis
Tables 3, 5, 6, and 7 compare the performance of the geoadapted PLMs against the baselines. To test for statistical significance of the performance differences, we use paired, two-sided Student’s t-tests in the case of FT-Geoloc and McNemar’s tests for binary data (McNemar, 1947) in the case of ZS-Geoloc, FT-Lang, ZS-Lang, and ZS-Dialect, as recommended by Dror et al. (2018). We correct the resulting p-values for each evaluation using the Holm-Bonferroni method (Holm, 1979).
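For concreteness, a sketch of these tests with SciPy and statsmodels is given below; the per-example error arrays and contingency counts are placeholders.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

# Paired, two-sided t-test for FT-Geoloc (per-example distance errors of two systems).
_, p_geoloc = ttest_rel(errors_system_a, errors_system_b)

# McNemar's test for binary correctness (zero-shot and classification tasks).
table = np.array([[n_both_correct, n_only_a_correct],
                  [n_only_b_correct, n_both_wrong]])
p_binary = mcnemar(table, exact=False, correction=True).pvalue

# Holm-Bonferroni correction over all comparisons within one evaluation.
reject, p_corrected, _, _ = multipletests([p_geoloc, p_binary], method="holm")
```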
| Method | FT-Geoloc ↓ (AGS, Dev) | FT-Geoloc ↓ (AGS, Test) | FT-Geoloc ↓ (BCMS, Dev) | FT-Geoloc ↓ (BCMS, Test) | FT-Geoloc ↓ (DNS, Dev) | FT-Geoloc ↓ (DNS, Test) | FT-Geoloc ↓ (EUR, Dev) | FT-Geoloc ↓ (EUR, Test) | ZS-Geoloc ↑ (AGS) | ZS-Geoloc ↑ (BCMS) | ZS-Geoloc ↑ (DNS) | ZS-Geoloc ↑ (EUR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SotA / Rand | — | — | ?30.11 | ?15.49 | — | — | — | — | ‡.071 | ‡.070 | ‡.026 | ‡.021 |
| MLMAda | ‡193.51 | ‡196.18 | ‡29.36 | ‡16.72 | ‡101.15 | ‡101.15 | ‡107.20 | ‡107.41 | ‡.142 | ‡.144 | ‡.106 | ‡.108 |
| GeoAda-S | †190.21 | 193.18 | †26.02 | †13.98 | †98.82 | †97.63 | †98.00 | †101.76 | .192 | †.287 | †.135 | †.159 |
| GeoAda-W | 189.06 | †194.85 | 23.90 | 12.13 | 95.80 | 97.06 | 97.18 | 97.18 | .193 | .319 | .149 | .191 |
Overall, the geoadapted models consistently and substantially outperform the baselines—out of the 30 main evaluations, it is always one of the two geoadapted models that achieves the best score, a result that is highly unlikely to occur by chance if there is no underlying performance difference between the geoadapted and non-geoadapted models.16 Furthermore, in the two cases where we can directly compare to a prior state of the art, one or both geoadapted models outperform it. These findings strongly suggest that geoadaptation successfully induces associations between language variation and geographic location.
Fine-tuned Geolocation Prediction.
PLMs geoadapted with uncertainty weighting (GeoAda-W) predict the geolocation most precisely (see Table 3). On BCMS, GeoAda-W improves the previous state of the art—achieved by a directly fine-tuned BERTić model—by 3.3 km on test and by over 6 km on dev. On EUR (arguably the most challenging setting), GeoAda-W improves upon MLMAda (i.e., a model adapted without geographic signal) by more than 10 km on both dev and test. MLMAda always performs worse than the two geoadapted models, despite the fact that task-specific fine-tuning likely compensates for some of the geographic knowledge GeoAda-W and GeoAda-S obtain in geoadaptation. This shows that geoadaptation drives the performance improvements, and that language modeling adaptation alone does not suffice. Loss weighting based on homoscedastic uncertainties seems beneficial for FT-Geoloc: While GeoAda-S already outperforms the baselines, GeoAda-W in seven out of eight cases brings further significant gains. We also observe that all models reach peak performance in the first few fine-tuning epochs (not shown), and that geoadaptation is useful even when the geoadaptation data are a subset of the fine-tuning data (as is the case for AGS). This confirms that the performance gains come from the geoadaptation and are not merely the result of longer training on geolocation prediction.
Zero-shot Geolocation Prediction.
In this task, the PLMs have to predict the token of the correct toponym (i.e., city or state). Notice that the PLMs receive information about exact geolocations during geoadaptation and do not leverage toponym tokens in any direct way. ZS-Geoloc is thus an ideal litmus test as it shows how well the link between language variation and geography, injected into the PLMs via geoadaptation, generalizes. The results (see Table 3) strongly suggest that geoadaptation leads to such generalization: Both geoadapted model variants bring massive and statistically significant gains in prediction accuracy over MLMAda (e.g., GeoAda-W vs. MLMAda: +17.5% on BCMS, +8.3% on EUR). As on FT-Geoloc, uncertainty weighting (GeoAda-W) overall outperforms simple loss summation (GeoAda-S).
Figure 2 shows the confusion matrices for the three methods on BCMS, offering further insights. MLMAda assigns most posts from a country to the corresponding capital (e.g., posts from Croatian cities to Zagreb). These tokens are the most frequent ones out of all considered cities, which seems to heavily affect MLMAda. In contrast, predictions of GeoAda-S and GeoAda-W are much more nuanced, i.e., more diverse and less tied to the frequency of the toponym tokens: The geoadapted models are not only able to correctly assign posts from smaller, less frequently mentioned cities (e.g., Dubrovnik, Zenica), but their errors also reflect regional linguistic consistency and geographic proximity. For example, GeoAda-S predicts Rijeka as the origin of many Pula posts, and Bar as the origin of many Dubrovnik posts; similarly, GeoAda-W assigns posts from Split to Dubrovnik and posts from Bar to Podgorica.17
One common method to alleviate the impact of different prior probabilities in the zero-shot setting (a potential reason for the poor performance of MLMAda) is to calibrate the PLM predictions (Holtzman et al., 2021; Zhao et al., 2021). Following Zhao et al. (2021), we measure the prior probabilities of all toponym tokens using a neutral prompt (specifically, ‘This is [MASK]’ for AGS/BCMS/DNS and a [MASK] token for EUR) and repeat the ZS-Geoloc evaluation, dividing the output probabilities by the prior probabilities (Table 4). We find that all models (both geoadapted and non-geoadapted) improve as a result of calibration, i.e., the output probabilities seem to be miscalibrated if not specifically adjusted by means of the prior probabilities. However, refuting the hypothesis that miscalibration causes the inferior performance of MLMAda, the average gain due to calibration is larger for the geoadapted models (GeoAda-S: +4.8%, GeoAda-W: +3.0%) than for the non-geoadapted models (MLMAda: +1.9%). This suggests that a miscalibration of the toponym probabilities—rather than disproportionately affecting the non-geoadapted models—generally impairs the geolinguistic capabilities of a PLM. The more profound a model's geolinguistic knowledge, the more detrimental the consequences of such an impairment appear to be.
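A sketch of this calibration step (following Zhao et al., 2021) is shown below; it reuses the `tokenizer`, `model`, `candidates`, and `candidate_ids` placeholders from the ZS-Geoloc sketch above.

```python
import torch

def candidate_probs(text: str) -> torch.Tensor:
    """Probability of each candidate toponym at the [MASK] position."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    return logits[0, mask_pos, candidate_ids].softmax(dim=-1)

# Prior probabilities from the content-free ("neutral") prompt.
priors = candidate_probs(f"To je {tokenizer.mask_token}")

def calibrated_prediction(post: str) -> str:
    probs = candidate_probs(f"{post} To je {tokenizer.mask_token}")
    return candidates[(probs / priors).argmax().item()]
```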
Taken together, these observations indicate that GeoAda-S and GeoAda-W possess detailed knowledge of geographic variation in language. Since geoadaptation provides no supervision in the form of toponym names, this implies an impressive generalization, i.e., the association of linguistic constructs to toponyms, with geolocations (specifically, scalar longitude-latitude pairs) as the intermediary signal driving the generalization.
Fine-tuned Language Identification.
The geoadapted PLMs are best at identifying the language in which a text is written: Both GeoAda-S and GeoAda-W consistently show a higher accuracy than MLMAda (e.g., GeoAda-W vs. MLMAda: +5% on BCMS dev, +1.9% on EUR test), and the difference in performance is statistically significant in six out of eight cases (see Table 5). As opposed to the two geolocation tasks where uncertainty weighting (GeoAda-W) clearly leads to better results than summing the losses (GeoAda-S), the difference is less pronounced for FT-Lang and significant only in one case (EUR test), even though GeoAda-W numerically outperforms GeoAda-S overall. Compared to the language identification models operating on the level of character n-grams (GlotLID, FastText), geoadaptation always brings statistically significant performance gains. Even MLMAda outperforms GlotLID and FastText in all cases, indicating that PLMs are generally competitive with more traditional systems on this task. We further notice that the relative disadvantage is particularly pronounced for GlotLID on BCMS. Upon inspection, we find that GlotLID’s inferior performance on BCMS is due to the fact that it predicts more than 80% of the examples as Croatian. This imbalance can be explained as a result of the domain difference between GlotLID’s training data and the FT-Lang evaluation data: While GlotLID was mostly trained on formal texts such as Wikipedia articles and government documents (Kargaran et al., 2023), we test it on data from Twitter. Crucially, while Croatian is the only BCMS language that consistently uses Latin script in formal contexts, with Cyrillic script being preferred especially in Serbian, Latin script is everywhere much more common on social media, even in Serbia (George, 2019). GlotLID seems to be heavily affected by this script mismatch and is only very rarely able to correctly predict the language of non-Croatian posts written in Latin script.
| Method | FT-Lang ↑ (AGS, Dev) | FT-Lang ↑ (AGS, Test) | FT-Lang ↑ (BCMS, Dev) | FT-Lang ↑ (BCMS, Test) | FT-Lang ↑ (DNS, Dev) | FT-Lang ↑ (DNS, Test) | FT-Lang ↑ (EUR, Dev) | FT-Lang ↑ (EUR, Test) | ZS-Lang ↑ (BCMS) | ZS-Lang ↑ (DNS) |
|---|---|---|---|---|---|---|---|---|---|---|
| Rand | – | – | – | – | – | – | – | – | ‡.245 | ‡.339 |
| GlotLID | – | – | ‡.323 | ‡.316 | ‡.927 | ‡.931 | – | – | – | – |
| FastText | ‡.843 | ‡.840 | ‡.598 | ‡.588 | ‡.948 | ‡.959 | ‡.757 | ‡.762 | – | – |
| MLMAda | .851 | .855 | ‡.693 | ‡.694 | ‡.964 | ‡.966 | ‡.776 | ‡.777 | ‡.417 | ‡.885 |
| GeoAda-S | .861 | .856 | .734 | .726 | .972 | .975 | .789 | †.786 | .553 | †.896 |
| GeoAda-W | .861 | .858 | .743 | .734 | .973 | .976 | .792 | .796 | †.543 | .927 |
These trends are also reflected in the results on the out-of-domain language identification benchmarks: Geoadaptation always outperforms both adaptation based on language modeling alone and the models operating on the level of character n-grams (see Table 6). On the SETimes benchmark (BCMS), GeoAda-W further establishes a new state of the art, almost halving the error rate from 0.5% to 0.3%. As with in-domain FT-Lang, the two geoadaptation variants perform comparably. GlotLID again predicts many non-Croatian examples in Latin script as Croatian, leading to substantially worse performance on BCMS.
| Method | FT-Lang ↑ (BCMS, Dev) | FT-Lang ↑ (BCMS, Test) | FT-Lang ↑ (DNS, Dev) | FT-Lang ↑ (DNS, Test) | ZS-Lang ↑ (BCMS) | ZS-Lang ↑ (DNS) |
|---|---|---|---|---|---|---|
| SotA / Rand | – | ?.995 | – | – | ‡.311 | ‡.351 |
| GlotLID | ‡.692 | ‡.697 | ‡.932 | ‡.931 | – | – |
| FastText | .992 | †.983 | ‡.957 | ‡.949 | – | – |
| MLMAda | .992 | .992 | .962 | †.957 | ‡.604 | †.822 |
| GeoAda-S | .993 | .995 | .964 | .962 | .640 | †.826 |
| GeoAda-W | .994 | .997 | .964 | .961 | †.631 | .875 |
The superior performance of the geoadapted models in language identification—a task that is distinct from geolocation prediction and not typically addressed by means of PLMs—suggests that the geolinguistic knowledge acquired during geoadaptation is highly generalizable, making it beneficial for a broader set of tasks with a connection to geography, and not only the task used as an auxiliary objective for geoadaptation itself.
Zero-shot Language Identification.
Here, the PLMs have to predict the token corresponding to the language in which a text is written, e.g., hrvatski (‘Croatian’). This task requires generalization on two levels: First (similarly to FT-Lang), the PLMs have not been trained on language identification and are thus required to draw upon the geolinguistic knowledge they have formed during geoadaptation; second (similarly to ZS-Geoloc), the geolinguistic knowledge has not been provided to them in a form that would make it readily usable in a zero-shot setting—recall that the geographic information is presented in the form of longitude-latitude pairs (i.e., two scalars), whereas the language modeling head (which is used for the zero-shot predictions) is not trained differently than for vanilla adaptation (MLMAda). Despite these challenges, we find that geoadaptation substantially improves the performance of the PLMs on ZS-Lang (see Tables 5 and 6). The fact that the performance gains are equally pronounced on in-domain (e.g., GeoAda-W vs. MLMAda: +4.2% on DNS) and out-of-domain examples (e.g., GeoAda-W vs. MLMAda: +5.3% on DNS) highlights again that geoadaptation endows PLMs with knowledge that allows for a high degree of generalization.
Zero-shot Dialect Feature Prediction.
The results on ZS-Dialect—phonological (Phon) and lexical (Lex)—generally follow the trends from the other four tasks (see Table 7): The geoadapted PLMs clearly (and statistically significantly) outperform MLMAda, albeit with overall narrower margins than in most other zero-shot tasks for BCMS (e.g., GeoAda-S vs. MLMAda: +8.6% on Phon, GeoAda-W vs. MLMAda: +4.1% on Lex). MLMAda is expectedly more competitive here: Selecting the word variant that better fits into the linguistic context is essentially a language modeling task, for which additional language modeling training intuitively helps. For example, typical future tense constructions in Serbian vs. Croatian (ja ću da okupim vs. ja ću okupiti, ‘I’ll gather’) have strong selectional preferences on subsequent lexical units (Alexander, 2006; e.g., porodicu vs. obitelj for ‘family’).
| Method | ZS-Dialect ↑ (Phon) | ZS-Dialect ↑ (Lex) |
|---|---|---|
| Rand | ‡.501 | ‡.499 |
| MLMAda | ‡.784 | ‡.872 |
| GeoAda-S | .870 | .910 |
| GeoAda-W | .858 | .913 |
We further verify this by comparing the zero-shot performance on BCMS for different model checkpoints obtained during training. The performance curves over 25 (geo-)adaptation epochs, shown in Figure 3, confirm our hypothesis that longer language modeling adaptation substantially improves the performance of MLMAda on predicting dialect features, but its benefits for geolocation prediction and language identification remain limited. While prolonged language modeling adaptation allows MLMAda to eventually learn the dialectal associations, the inductive bias of the knowledge injected via geoadaptation allows GeoAda-S and GeoAda-W to reach high performance much sooner, after merely two to three epochs.
Effects of Loss Weighting.
The dynamic weighting of ℒ_mlm and ℒ_geo (i.e., GeoAda-W) clearly outperforms the simple summation of the losses (i.e., GeoAda-S) on the geolocation prediction tasks (FT-Geoloc, ZS-Geoloc), but the difference between the two geoadaptation variants is less pronounced for FT-Lang, ZS-Lang, and ZS-Dialect. While geographic knowledge is beneficial for all five tasks, geolocation prediction arguably demands a more direct exploitation of that knowledge. Comparing the model variants in terms of the two task losses, we observe that GeoAda-S reaches lower ℒ_mlm levels, whereas GeoAda-W ends with lower ℒ_geo levels (see Figure 4 for the example of BCMS), which would explain the differences in their performance. We inspect GeoAda-W's task uncertainty weights after geoadaptation and observe η_mlm = 0.29 and η_geo = −0.35 for AGS, η_mlm = 1.12 and η_geo = −1.22 for BCMS, η_mlm = 0.84 and η_geo = −1.23 for DNS, and η_mlm = 0.90 and η_geo = −1.95 for EUR. Thus, GeoAda-W consistently assigns more importance to ℒ_geo.18 The fact that the divergence of the task uncertainty weights is smallest for AGS explains why the difference between GeoAda-S and GeoAda-W on FT-Geoloc/ZS-Geoloc is least pronounced for that language group.
Sequence-level Geoadaptation.
The decision to inject geographical information at the level of tokens was motivated by the central importance of the lexicon for geographically conditioned linguistic variability (see §2). A plausible alternative—one less tied to lexical variation alone—is to geoadapt the PLMs by predicting the geolocation from the representation of the whole input text, i.e., to feed the contextualized representation of the [CLS] token to the regressor that predicts longitude and latitude. For comparison, we evaluate this variant too (GeoAda-Seq) and compare it against the best token-level geoadapted model (GeoAda-Tok; e.g., GeoAda-W for BCMS FT-Geoloc) on all PLMs and tasks. For reasons of space, we only present BCMS here, but the overall trends for AGS, DNS, and EUR are very similar.
Sequence-level geoadaptation trails token-level geoadaptation on all tasks except for fine-tuned geolocation prediction (see Table 8). In general, while the difference is small for the fine-tuned tasks, it is large (and always significant) for the zero-shot tasks—for example, GeoAda-Seq performs only slightly better than MLMAda on ZS-Geoloc (see Table 3). This suggests that injecting geographic information on the level of tokens allows the PLMs to acquire more nuanced geolinguistic knowledge. Nonetheless, sequence-level geoadaptation still outperforms the non-geoadapted baselines.
| Model | FT-Geoloc ↓ (Dev) | FT-Geoloc ↓ (Test) | ZS-Geoloc ↑ | FT-Lang ↑ (Dev) | FT-Lang ↑ (Test) | ZS-Lang ↑ | ZS-Dialect ↑ (Phon) | ZS-Dialect ↑ (Lex) |
|---|---|---|---|---|---|---|---|---|
| GeoAda-Seq | †27.35 | 12.13 | †.188 | .737 | .730 | †.542 | †.844 | †.885 |
| GeoAda-Tok | 23.90 | 12.13 | .319 | .743 | .734 | .553 | .870 | .913 |
Geoadaptation as Geographic Retrofitting.
Even though it makes intuitive sense that minimizing geo improves the geolinguistic knowledge of PLMs, we want to determine the exact mechanism by which it does so. Based on the results described so far, we make the following hypothesis: Geoadaptation changes the representation space of the PLMs in such a way that tokens indicative of a certain location are brought close to each other, i.e., it has the effect of geographic retrofitting (Hovy and Purschke, 2018). We examine this hypothesis by analyzing (i) how the representations of toponyms and lexical variants change in relation to each other, and (ii) how the representations of toponyms change internally. We examine the PLM output embeddings (which directly impact the zero-shot predictions) and focus on BCMS.
For the first question, we use the geoadaptation data to compute type-level embeddings for the five largest Croatian (Zagreb, Split, Rijeka, Osijek, Zadar) and Serbian (Beograd, Niš, Kragujevac, Subotica, Pančevo) cities as well as the ije/e variants used for ZS-Dialect. Following established practice (e.g., Vulić et al., 2020; Litschko et al., 2022), we obtain type-level vectors for words (i.e., city name or phonological variant) by averaging the contextualized output representations of their token occurrences. We then resort to WEAT (Caliskan et al., 2017), a measure that quantifies the difference in association strength between a word (in our case, a city name) and two word sets (in our case, ije vs. e phonological variants). A positive or negative score indicates that a city name is associated more strongly with the ije or e variants, respectively. Figure 5 shows that during geoadaptation (GeoAda-W), the Croatian city names develop a strong association with the ije variants (i.e., positive WEAT scores), whereas the Serbian city names develop a strong association with the e variants (i.e., negative WEAT scores), which is exactly in line with their geographic distribution (Alexander, 2006). By contrast, the associations created during adaptation based on language modeling alone (MLMAda) are substantially weaker.
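A sketch of the type-level embeddings and the per-word WEAT association used here follows; the occurrence vectors are placeholders extracted from the PLM's output layer, and variable names are illustrative.

```python
import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def type_embedding(occurrence_vectors):
    """Average the contextualized output vectors of a word's token occurrences."""
    return np.mean(np.asarray(occurrence_vectors), axis=0)

def weat_association(city_vec, ije_vecs, e_vecs):
    """Positive: the city name is associated more strongly with the ije variants;
    negative: more strongly with the e variants."""
    return (np.mean([cos(city_vec, v) for v in ije_vecs])
            - np.mean([cos(city_vec, v) for v in e_vecs]))
```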
We then use the same set of 10 Croatian and Serbian cities and compare their pairwise geodesic distances against the pairwise cosine distances of the city name embeddings at the end of geoadaptation. The correlation between the two sets of distances (Pearson's r) is only 0.577 for MLMAda, but 0.881 for GeoAda-W, indicating that embedding-space distances closely track geographic distances. Furthermore, after only five epochs, the correlation is already 0.845 for GeoAda-W (vs. only 0.124 for MLMAda). This striking correspondence between real-world geography and the topology of the embedding space of geoadapted PLMs can also be seen by plotting the first two principal components of the city name embeddings on top of a geographic map, where we use orthogonal Procrustes (Schönemann, 1966; Hamilton et al., 2016) to align the points (see Figure 6).
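A sketch of this distance-correlation analysis is shown below; geopy and SciPy are our tooling choices, and `cities`, `coords`, and `emb` are placeholders (city names, their latitude/longitude pairs, and their type-level output embeddings).

```python
import numpy as np
from itertools import combinations
from geopy.distance import geodesic
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine

geo_d, cos_d = [], []
for a, b in combinations(cities, 2):
    geo_d.append(geodesic(coords[a], coords[b]).km)  # real-world distance
    cos_d.append(cosine(emb[a], emb[b]))             # embedding-space distance

r, _ = pearsonr(geo_d, cos_d)
print(f"Pearson's r between geodesic and cosine distances: {r:.3f}")
```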
These results are strong evidence that geoadaptation indeed acts as a form of geographic retrofitting. The geographically restructured representation space of the PLMs can then be further refined via fine-tuning (as in FT-Geoloc and FT-Lang) or directly probed in a zero-shot manner (as in ZS-Geoloc, ZS-Lang and ZS-Dialect).
6 Conclusion
We introduce geoadaptation, an approach for task-agnostic continued pretraining of PLMs that forces them to learn associations between linguistic phenomena and geographic locations. The method we propose for geoadaptation couples language modeling and token-level geolocation prediction via multi-task learning. While we focus on PLMs pretrained via masked language modeling, geoadaptation can in principle be applied to autoregressive PLMs as well. We geoadapt four PLMs and obtain consistent gains on five tasks, establishing a new state of the art on established benchmarks. We further show that geoadaptation acts as a form of geographic retrofitting. Overall, we see our study as an exciting step towards NLP technology that takes into account extralinguistic aspects in general and geographic aspects in particular.
Acknowledgments
This work was funded by the European Research Council (grant #740516 awarded to LMU Munich) and the Engineering and Physical Sciences Research Council (grant EP/T023333/1 awarded to University of Oxford). Valentin Hofmann was also supported by the German Academic Scholarship Foundation. Goran Glavaš was supported by the EUINACTION grant from NORFACE Governance and German Science Foundation (462-19-010, GL950/2-1). Nikola Ljubešić was supported by the Slovenian Research and Innovation Agency (P6-0411, J7-4642, L2-50070). We thank the reviewers and action editor for their very helpful comments.
Notes
Notice that for the language areas we consider, there is currently also not enough geotagged data that would allow us to geographically pretrain models from scratch.
We make our code available at https://github.com/valentinhofmann/geoadaptation.
In this work, we focus on PLMs pretrained via masked language modeling. However, geoadaptation can in principle also be applied to autoregressive PLMs.
For the sake of simplicity, in the following we will refer to both Jodel posts and tweets as posts.
In preliminary experiments, we found that geographically balanced sampling is beneficial for geoadaptation.
We standardize longitude and latitude values and use the Euclidean distance as the clustering metric. Following Scherrer and Ljubešić (2021), we choose k = 75.
Most posts in the AGS data come from rural areas.
We experimented with other prompts (e.g., ‘This is in [MASK]’) and obtained similar results.
We do not conduct ZS-Lang for AGS and EUR since the names of the German dialects (e.g., Schweizerdeutsch) are not in the GermanBERT and mBERT vocabularies.
Assuming equal underlying performance for MLMAda, GeoAda-S, and GeoAda-W (and ignoring other baselines), the probability of this result is p = (2/3)^30 < 10^−5.
Note that Bar and Dubrovnik are not in the same country.
Because the weight on task l decreases as η_l grows (see Equation 2), the smaller the value of η_l, the larger the emphasis on task l.