## Abstract

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex–style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.

## 1 Introduction

The lack of annotated training and evaluation data for many tasks and domains hinders the development of computational models for the majority of the world’s languages (Snyder and Barzilay 2010; Adams et al. 2017; Ponti et al. 2019a; Joshi et al. 2020). The necessity to guide and advance multilingual and crosslingual NLP through annotation efforts that follow crosslingually consistent guidelines has been recently recognized by collaborative initiatives such as the Universal Dependencies (UD) project (Nivre et al. 2019). The latest version of UD (as of July 2020) covers about 90 languages. Crucially, this resource continues to steadily grow and evolve through the contributions of annotators from across the world, extending UD’s reach to a wide array of typologically diverse languages. Besides steering research in multilingual parsing (Zeman et al. 2018; Kondratyuk and Straka 2019; Doitch et al. 2019) and crosslingual parser transfer (Rasooli and Collins 2017; Lin et al. 2019; Rotman and Reichart 2019), the consistent annotations and guidelines have also enabled a range of insightful comparative studies focused on the languages’ syntactic (dis)similarities (Chen and Gerdes 2017; Bjerva and Augenstein 2018; Bjerva et al. 2019; Ponti et al. 2018a; Pires, Schlinger, and Garrette 2019).

Inspired by the UD work and its substantial impact on research in (multilingual) syntax, in this article we introduce Multi-SimLex, a suite of manually and consistently annotated semantic data sets for 12 different languages, focused on the fundamental lexical relation of semantic similarity on a continuous scale (i.e., gradience/strength of semantic similarity) (Budanitsky and Hirst 2006; Hill, Reichart, and Korhonen 2015). For any pair of words, this relation measures whether (and to what extent) their referents share the same (functional) features (e.g., lion – cat), as opposed to general cognitive association (e.g., lion – zoo) captured by co-occurrence patterns in texts (i.e., the distributional information).1 Data sets that quantify the strength of semantic similarity between concept pairs such as SimLex-999 (Hill, Reichart, and Korhonen 2015) or SimVerb-3500 (Gerz et al. 2016) have been instrumental in improving models for distributional semantics and representation learning. Discerning between semantic similarity and relatedness/association is not only crucial for theoretical studies on lexical semantics (see §2), but has also been shown to benefit a range of language understanding tasks in NLP. Examples include dialog state tracking (Mrkšić et al. 2017; Ren et al. 2018), spoken language understanding (Kim et al. 2016; Kim, de Marneffe, and Fosler-Lussier 2016), text simplification (Glavaš and Vulić 2018; Ponti et al. 2018b; Lauscher et al. 2019), and dictionary and thesaurus construction (Cimiano, Hotho, and Staab 2005; Hill et al. 2016).

Despite the proven usefulness of semantic similarity data sets, they are available only for a small and typologically narrow sample of resource-rich languages such as German, Italian, and Russian (Leviant and Reichart 2015), whereas some language types and low-resource languages typically lack similar evaluation data. Even if some resources do exist, they are limited in their size (e.g., 500 pairs in Turkish (Ercan and Yıldız 2018), 500 in Farsi (Camacho-Collados et al. 2017), or 300 in Finnish (Venekoski and Vankka 2017)) and coverage (e.g., all data sets that originated from the original English SimLex-999 contain only high-frequency concepts, and are dominated by nouns). This is why, as our departure point, we introduce a larger and more comprehensive English word similarity data set spanning 1,888 concept pairs (see §4).

Most importantly, semantic similarity data sets in different languages have been created using heterogeneous construction procedures with different guidelines for translation and annotation, as well as different rating scales. For instance, some data sets were obtained by directly translating the English SimLex-999 in its entirety (Leviant and Reichart 2015; Mrkšić et al. 2017) or in part (Venekoski and Vankka 2017). Other data sets were created from scratch (Ercan and Yıldız 2018) and yet others sampled English concept pairs differently from SimLex-999 and then translated and reannotated them in target languages (Camacho-Collados et al. 2017). This heterogeneity makes these data sets incomparable and precludes systematic crosslinguistic analyses. In this article, consolidating the lessons learned from previous data set construction paradigms, we propose a carefully designed translation and annotation protocol for developing monolingual Multi-SimLex data sets with aligned concept pairs for typologically diverse languages. We apply this protocol to a set of 12 languages, including a mixture of major languages (e.g., Mandarin, Russian, and French) as well as several low-resource ones (e.g., Kiswahili, Welsh, and Yue Chinese). We demonstrate that our proposed data set creation procedure yields data with high inter-annotator agreement rates (e.g., the average mean inter-annotator agreement over all 12 languages is Spearman’s ρ = 0.740, ranging from ρ = 0.667 for Russian to ρ = 0.812 for French).

The unified construction protocol and alignment between concept pairs enable a series of quantitative analyses. Preliminary studies on the influence that polysemy and crosslingual variation in lexical categories (see §2.3) have on similarity judgments are provided in §5. Data created according to the Multi-SimLex protocol also allow for probing into whether similarity judgments are universal across languages, or rather depend on linguistic affinity (in terms of linguistic features, phylogeny, and geographical location). We investigate this question in §5.4. Naturally, Multi-SimLex data sets can be used as an intrinsic evaluation benchmark to assess the quality of lexical representations based on monolingual, joint multilingual, and transfer learning paradigms. We conduct a systematic evaluation of several state-of-the-art representation models in §7, showing that there are large gaps between human and system performance in all languages. The proposed construction paradigm also supports the automatic creation of 66 crosslingual Multi-SimLex data sets by interleaving the monolingual ones. We outline the construction of the crosslingual data sets in §6, and then present a quantitative evaluation of a series of cutting-edge crosslingual representation models on this benchmark in §8.
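As a quick check on the figure of 66, assuming that one crosslingual data set is built for each unordered pair of the 12 languages, the count is the number of ways of choosing 2 languages out of 12:

$$
\binom{12}{2} = \frac{12 \cdot 11}{2} = 66
$$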

### Contributions.

We now summarize the main contributions of this work:

1. Building on lessons learned from prior work, we create a more comprehensive lexical semantic similarity data set for the English language spanning a total of 1,888 concept pairs balanced with respect to similarity, frequency, and concreteness, and covering four word classes: nouns, verbs, adjectives and, for the first time, adverbs. This data set serves as the main source for the creation of equivalent data sets in several other languages.

2. We present a carefully designed and rigorous language-agnostic translation and annotation protocol. These well-defined guidelines will facilitate the development of future Multi-SimLex data sets for other languages. The proposed protocol eliminates some crucial issues with prior efforts focused on the creation of multilingual semantic resources, namely: i) limited coverage; ii) heterogeneous annotation guidelines; and iii) concept pairs that are semantically incomparable across different languages.

3. We offer to the community manually annotated evaluation sets of 1,888 concept pairs across 12 typologically diverse languages, and 66 large crosslingual evaluation sets. To the best of our knowledge, Multi-SimLex is the most comprehensive evaluation resource to date focused on the relation of semantic similarity.

4. We benchmark a wide array of recent state-of-the-art monolingual and crosslingual word representation models across our sample of languages. The results can serve as strong baselines that lay the foundation for future improvements.

5. We present a first large-scale evaluation study on the ability of encoders pretrained on language modeling (such as BERT (Devlin et al. 2019) and XLM (Conneau and Lample 2019)) to reason over word-level semantic similarity in different languages. To our surprise, the results show that monolingual pretrained encoders, even when presented with word types out of context, are sometimes competitive with static word embedding models such as fastText (Bojanowski et al. 2017) or word2vec (Mikolov et al. 2013). The results also reveal a huge gap in performance between massively multilingual pretrained encoders and language-specific encoders in favor of the latter: Our findings support other recent empirical evidence related to the “curse of multilinguality” (Conneau et al. 2019; Bapna and Firat 2019) in representation learning.

6. We make all of these resources available on a Web site that facilitates easy creation, submission, and sharing of Multi-SimLex–style data sets for a larger number of languages. We hope that this will yield an even larger repository of semantic resources that inspire future advances in NLP within and across languages.

In light of the success of UD (Nivre et al. 2019), we hope that our initiative will instigate a collaborative public effort with established and clear-cut guidelines that will result in additional Multi-SimLex data sets in a large number of languages in the near future. Moreover, we hope that it will provide means to advance our understanding of distributional and lexical semantics across a large number of languages. All monolingual and crosslingual Multi-SimLex data sets—along with detailed translation and annotation guidelines—are available online at: https://multisimlex.com/.

## 2 Lexical Semantic Similarity

### 2.1 Similarity and Association

The focus of the Multi-SimLex initiative is on the lexical relation of “true/pure” semantic similarity, as opposed to the broader conceptual association. For any pair of words, this relation measures whether their referents share the same features. For instance, graffiti and frescos are similar to the extent that they are both forms of painting and appear on walls. This relation can be contrasted with the cognitive association between two words, which often depends on how much their referents interact in the real world, or are found in the same situations. For instance, a painter is easily associated with frescos, although they lack any physical commonalities. Association is also known in the literature under other names: relatedness (Budanitsky and Hirst 2006), topical similarity (McKeown et al. 2002), and domain similarity (Turney 2012).

Semantic similarity and association overlap to some degree, but do not coincide (Kiela, Hill, and Clark 2015; Vulić, Kiela, and Korhonen 2017). In fact, there exist plenty of pairs that are intuitively associated but not similar. Pairs where the converse is true can also be encountered, although more rarely. One example is a pair of synonyms in which one word is common and the other infrequent, such as to seize and to commandeer. Hill, Reichart, and Korhonen (2015) revealed that although similarity measures based on the WordNet graph (Wu and Palmer 1994) and human judgments of association in the University of South Florida Free Association Database (Nelson, McEvoy, and Schreiber 2004) do correlate, a number of pairs follow opposite trends. Several studies on human cognition also point in the same direction. For instance, semantic priming can be triggered by similar words without association (Lucas 2000). On the other hand, a connection with cue words is established more quickly for topically related words than for similar words in free association tasks (De Deyne and Storms 2008).

A key property of semantic similarity is its gradience: Pairs of words can be similar to a different degree. On the other hand, the relation of synonymy is binary: Pairs of words are synonyms if they can be substituted in all contexts (or most contexts, in a looser sense); otherwise they are not. Although synonyms can be conceived as lying on one extreme of the semantic similarity continuum, it is crucial to note that their definition is stated in purely relational terms, rather than invoking their referential properties (Lyons 1977; Cruse 1986; Coseriu 1967). This makes behavioral studies on semantic similarity fundamentally different from lexical resources like WordNet (Miller 1995), which include paradigmatic relations (such as synonymy).

### 2.2 Similarity for NLP: Intrinsic Evaluation and Semantic Specialization

The ramifications of the distinction between similarity and association are profound for distributional semantics. This paradigm of lexical semantics is grounded in the distributional hypothesis, formulated by Firth (1957) and Harris (1951). According to this hypothesis, the meaning of a word can be recovered empirically from the contexts in which it occurs within a collection of texts. Because both pairs of topically related words and pairs of purely similar words tend to appear in the same contexts, their associated meaning confounds the two distinct relations (Hill, Reichart, and Korhonen 2015; Schwartz, Reichart, and Rappoport 2015; Vulić et al. 2017b). As a result, distributional methods obscure a crucial facet of lexical meaning.

This limitation also carries over to word embeddings (WEs), representations of words as low-dimensional vectors that have become indispensable for a wide range of NLP applications (Collobert et al. 2011; Chen and Manning 2014; Melamud et al. 2016, inter alia). In particular, it affects both static WEs learned from co-occurrence patterns (Mikolov et al. 2013; Levy and Goldberg 2014; Bojanowski et al. 2017) and contextualized WEs learned from modeling word sequences (Peters et al. 2018; Devlin et al. 2019, inter alia). As a result, in the induced representations, geometrical closeness (measured, e.g., through cosine distance) conflates genuine similarity with broad relatedness. For instance, the vectors for antonyms such as sober and drunk, by definition dissimilar, might be neighbors in the semantic space under the distributional hypothesis. Similar to work on distributional representations that predated the WE era (Sahlgren 2006), Turney (2012), Kiela and Clark (2014), and Melamud et al. (2016) demonstrated that different choices of hyperparameters in WE algorithms (such as context window) emphasize different relations in the resulting representations. Likewise, Agirre et al. (2009) and Levy and Goldberg (2014) discovered that WEs learned from texts annotated with syntactic information mirror similarity better than simple local bag-of-words neighborhoods.

The failure of WEs to capture semantic similarity, in turn, affects model performance in several NLP applications where such knowledge is crucial. In particular, Natural Language Understanding tasks such as statistical dialog modeling, text simplification, or semantic text similarity (Mrkšić et al. 2016; Kim et al. 2016; Ponti et al. 2019c), among others, suffer the most. As a consequence, resources providing clean information on semantic similarity are key in mitigating the side effects of the distributional signal. In particular, such databases can be used for the intrinsic evaluations of specific WE models as a proxy for their reliability for downstream applications (Collobert and Weston 2008; Baroni and Lenci 2010; Hill, Reichart, and Korhonen 2015); intuitively, the more WEs are misaligned with human judgments of similarity, the more their performance on actual tasks is expected to be degraded. Moreover, word representations can be specialized (a.k.a. retrofitted) by disentangling word relations of similarity and association. In particular, linguistic constraints sourced from external databases (such as synonyms from WordNet) can be injected into WEs (Faruqui et al. 2015; Wieting et al. 2015; Mrkšić et al. 2017; Lauscher et al. 2019; Kamath et al. 2019, inter alia) in order to enforce a particular relation in a distributional semantic space while preserving the original adjacency properties.
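To make the notion of intrinsic evaluation concrete, the following is a minimal sketch of how a set of word vectors can be scored against human similarity judgments: cosine similarity is computed for each word pair and then correlated with the human scores using Spearman's ρ. The toy vectors, the toy scores, the function names, and the choice to skip pairs with out-of-vocabulary words are all illustrative assumptions, not the evaluation code or data used in this work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(embeddings, gold_pairs):
    """Correlate model cosine similarities with human scores.

    gold_pairs is a list of (word1, word2, human_score) triples.
    Assumption: pairs containing an out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in gold_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    return spearman(model_scores, human_scores)


# Hypothetical 2-dimensional vectors and made-up human scores.
toy_vectors = {"lion": [1.0, 0.0], "cat": [0.9, 0.1], "zoo": [0.5, 0.5]}
toy_gold = [
    ("lion", "cat", 4.0),
    ("lion", "zoo", 1.0),
    ("cat", "zoo", 0.5),
    ("lion", "unicorn", 2.0),  # skipped: "unicorn" has no vector
]
print(evaluate(toy_vectors, toy_gold))
```

A rank correlation is used here because only the relative ordering of pairs is compared, so the human rating scale and the range of the cosine scores do not need to match.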

### 2.3 Similarity and Language Variation: Semantic Typology

In this work, we tackle the concept of (true and gradient) semantic similarity from a multilingual perspective. Although the same meaning representations may be shared by all human speakers at a deep cognitive level, there is no one-to-one mapping between the words in the lexicons of different languages. This makes the comparison of similarity judgments across languages difficult, because the meaning overlap of translationally equivalent words is sometimes far less than exact. This results from the fact that the way languages “partition” semantic fields is partially arbitrary (Trier 1931), although constrained crosslingually by common cognitive biases (Majid et al. 2007). For instance, consider the field of colors: English distinguishes between green and blue, whereas Murle (South Sudan) has a single word for both (Kay and Maffi 2013).2

In general, semantic typology studies the variation in lexical semantics across the world’s languages. According to Evans (2011), the ways languages categorize concepts into the lexicon follow three main axes: 1) granularity: what is the number of categories in a specific domain?; 2) boundary location: where do the lines marking different categories lie?; 3) grouping and dissection: what are the membership criteria of a category; which instances are considered to be more prototypical? Different choices with respect to these axes lead to different lexicalization patterns.3 For instance, distinct senses in a polysemous word in English, such as skin (referring to both the body and fruit), may be assigned separate words in other languages such as Italian pelle and buccia, respectively (Rzymski et al. 2020). We later analyze whether similarity scores obtained from native speakers also loosely follow the patterns described by semantic typology.

## 3 Previous Work and Evaluation Data

### Word Pair Data Sets.

Rich expert-created resources such as WordNet (Miller 1995; Fellbaum 1998), VerbNet (Kipper Schuler 2005; Kipper et al. 2008), or FrameNet (Baker, Fillmore, and Lowe 1998) encode a wealth of semantic and syntactic information, but are expensive and time-consuming to create. The scale of this problem becomes multiplied by the number of languages in consideration. Therefore, crowd-sourcing with non-expert annotators has been adopted as a quicker alternative to produce smaller and more focused semantic resources and evaluation benchmarks. This alternative practice has had a profound impact on distributional semantics and representation learning (Hill, Reichart, and Korhonen 2015). Whereas some prominent English word pair data sets such as WordSim-353 (Finkelstein et al. 2002), MEN (Bruni, Tran, and Baroni 2014), or Stanford Rare Words (Luong, Socher, and Manning 2013) did not discriminate between similarity and relatedness, the importance of this distinction was established by Hill, Reichart, and Korhonen (2015) (see again the discussion in §2.1) through the creation of SimLex-999. This inspired other similar data sets that focused on different lexical properties. For instance, SimVerb-3500 (Gerz et al. 2016) provided similarity ratings for 3,500 English verbs, whereas CARD-660 (Pilehvar et al. 2018) aimed at measuring the semantic similarity of infrequent concepts.

### Semantic Similarity Data Sets in Other Languages.

Motivated by the impact of data sets such as SimLex-999 and SimVerb-3500 on representation learning in English, a line of related work put focus on creating similar resources in other languages. The dominant approach is translating and reannotating the entire original English SimLex-999 data set, as done previously for German, Italian, and Russian (Leviant and Reichart 2015), Hebrew and Croatian (Mrkšić et al. 2017), and Polish (Mykowiecka, Marciniak, and Rychlik 2018). Venekoski and Vankka (2017) applied this process only to a subset of 300 concept pairs from the English SimLex-999. On the other hand, Camacho-Collados et al. (2017) sampled a new set of 500 English concept pairs to ensure wider topical coverage and balance across similarity spectra, and then translated those pairs to German, Italian, Spanish, and Farsi (SEMEVAL-500). A similar approach was followed by Ercan and Yıldız (2018) for Turkish, by Huang et al. (2019) for Mandarin Chinese, and by Sakaizawa and Komachi (2018) for Japanese. Netisopakul, Wohlgenannt, and Pulich (2019) translated the concatenation of SimLex-999, WordSim-353, and the English SEMEVAL-500 into Thai and then reannotated it. Finally, Barzegar et al. (2018) translated English SimLex-999 and WordSim-353 to 11 resource-rich target languages (German, French, Russian, Italian, Dutch, Chinese, Portuguese, Swedish, Spanish, Arabic, Farsi), but they did not provide details concerning the translation process and the resolution of translation disagreements. More importantly, they also did not reannotate the translated pairs in the target languages. As we discussed in §2.3 and reiterate later in §5, semantic differences among languages can have a profound impact on the annotation scores; particularly, we show in §5.4 that these differences even roughly define language clusters based on language affinity.

A core issue with the current data sets concerns a lack of one unified procedure that ensures the comparability of resources in different languages. Further, concept pairs for different languages are sourced from different corpora (e.g., direct translation of the English data versus sampling from scratch in the target language). Moreover, the previous SimLex-based multilingual data sets inherit the main deficiencies of the English original version, such as the focus on nouns and highly frequent concepts. Finally, prior work mostly focused on languages that are widely spoken and does not account for the variety of the world’s languages. Our long-term goal is devising a standardized methodology to extend the coverage also to languages that are resource-lean and/or typologically diverse (e.g., Welsh, Kiswahili, as in this work).

### Multilingual Data Sets for Natural Language Understanding.

The Multi-SimLex initiative and corresponding data sets are also aligned with the recent efforts on procuring multilingual benchmarks that can help advance computational modeling of natural language understanding across different languages. For instance, pretrained multilingual language models such as multilingual BERT (Devlin et al. 2019) or XLM (Conneau and Lample 2019) are typically probed on XNLI test data (Conneau et al. 2018b) for cross-lingual natural language inference. XNLI was created by translating examples from the English MultiNLI data set, and projecting its sentence labels (Williams, Nangia, and Bowman 2018). Other recent multilingual data sets target the task of question answering based on reading comprehension: i) MLQA (Lewis et al. 2019) includes 7 languages; ii) XQuAD (Artetxe, Ruder, and Yogatama 2019) 10 languages; and iii) TyDiQA (Clark et al. 2020) 9 widely spoken typologically diverse languages. While MLQA and XQuAD result from the translation from an English data set, TyDiQA was built independently in each language. Another multilingual data set, PAWS-X (Yang et al. 2019), focused on the paraphrase identification task and was created by translating the original English PAWS (Zhang, Baldridge, and He 2019) into 6 languages. XCOPA (Ponti et al. 2020) is a crosslingual data set for the evaluation of crosslingual causal commonsense reasoning, obtained through translation of the English COPA data (Roemmele, Bejan, and Gordon 2011) to 11 target languages. A large number of tasks have been recently integrated into unified multilingual evaluation suites: XTREME (Hu et al. 2020) and XGLUE (Liang et al. 2020). We believe that Multi-SimLex can substantially contribute to this endeavor by offering a comprehensive multilingual benchmark for the fundamental lexical-level relation of semantic similarity. In future work, Multi-SimLex also offers an opportunity to investigate the correlations between word-level semantic similarity and performance in downstream tasks such as QA and NLI across different languages.

## 4 The Base for Multi-SimLex: Extending English SimLex-999

In this section, we discuss the design principles behind the English (eng) Multi-SimLex data set, which is the basis for all the Multi-SimLex data sets in other languages, as detailed in §5. We first argue that a new, more balanced, and more comprehensive evaluation resource for lexical semantic similarity in English is necessary. We then describe how the 1,888 word pairs contained in the eng Multi-SimLex were selected in such a way as to represent various linguistic phenomena within a single integrated resource.

### Construction Criteria.

The following criteria have to be satisfied by any high-quality semantic evaluation resource, as argued by previous studies focused on the creation of such resources (Hill, Reichart, and Korhonen 2015; Gerz et al. 2016; Vulić et al. 2017a; Camacho-Collados et al. 2017, inter alia):

(C1) Representative and diverse. The resource must cover the full range of diverse concepts occurring in natural language, including different word classes (e.g., nouns, verbs, adjectives, adverbs), concrete and abstract concepts, a variety of lexical fields, and different frequency ranges.

(C2) Clearly defined. The resource must provide a clear understanding of which semantic relation exactly is annotated and measured, possibly contrasting it with other relations. For instance, the original SimLex-999 and SimVerb-3500 explicitly focus on true semantic similarity and distinguish it from broader relatedness captured by data sets such as MEN (Bruni, Tran, and Baroni 2014) or WordSim-353 (Finkelstein et al. 2002).

(C3) Consistent and reliable. The resource must ensure consistent annotations obtained from non-expert native speakers following simple and precise annotation guidelines.

In choosing the word pairs and constructing eng Multi-SimLex, we adhere to these requirements. Moreover, we follow good practices established by the research on related resources. In particular, since the introduction of the original SimLex-999 data set (Hill, Reichart, and Korhonen 2015), follow-up work has improved its construction protocol across several aspects, including: 1) coverage of more lexical fields, for example, by relying on a diverse set of Wikipedia categories (Camacho-Collados et al. 2017), 2) infrequent/rare words (Pilehvar et al. 2018), 3) focus on particular word classes, for example, verbs (Gerz et al. 2016), and 4) annotation quality control (Pilehvar et al. 2018). Our goal is to make use of these improvements toward a larger, more representative, and more reliable lexical similarity data set in English and, consequently, in all other languages.

### The Final Output: English Multi-SimLex.

In order to ensure that criterion C1 is satisfied, we consolidate and integrate the data already carefully sampled in prior work into a single, comprehensive, and representative data set. This way, we can control for diversity, frequency, and other properties while avoiding performing this time-consuming selection process from scratch. Note that, on the other hand, the word pairs chosen for English are scored from scratch as part of the entire Multi-SimLex annotation process, introduced later in §5. We now describe the external data sources for the final set of word pairs:

1. Source: SimLex-999 (Hill, Reichart, and Korhonen 2015). The English Multi-SimLex has been initially conceived as an extension of the original SimLex-999 data set. Therefore, we include all 999 word pairs from SimLex, which span 666 noun pairs, 222 verb pairs, and 111 adjective pairs. While SimLex-999 already provides examples representing different POS classes, it does not have a sufficient coverage of different linguistic phenomena: For instance, it contains only very frequent concepts, and it does not provide a representative set of verbs (Gerz et al. 2016).

2. Source: SemEval-17: Task 2 (henceforth SEMEVAL-500; Camacho-Collados et al. 2017). We start from the full data set of 500 concept pairs to extract a total of 334 concept pairs for English Multi-SimLex a) which contain only single-word concepts, b) which are not named entities, c) where POS tags of the two concepts are the same, d) where both concepts occur in the top 250K most frequent word types in the English Wikipedia, and e) which do not already occur in SimLex-999. The original concepts were sampled so as to span all the 34 domains available as part of BabelDomains (Camacho-Collados and Navigli 2017), which roughly correspond to the main high-level Wikipedia categories. This ensures topical diversity in our sub-sample.

3. Source: CARD-660 (Pilehvar et al. 2018). Sixty-seven word pairs are taken from this data set focused on rare word similarity, applying the same selection criteria a to e utilized for SEMEVAL-500. Words are controlled for frequency based on their occurrence counts from the Google News data and the ukWaC corpus (Baroni et al. 2009). CARD-660 contains some words that are very rare (logboat), domain-specific (erythroleukemia), and slang (2mrw), which might be difficult to translate and annotate across a wide array of languages. Hence, we opt for retaining only the concept pairs above the threshold of the top 250K most frequent Wikipedia concepts, as above.

4. Source: SimVerb-3500 (Gerz et al. 2016). Because both CARD-660 and SEMEVAL-500 are heavily skewed toward noun pairs, and nouns also dominate the original SimLex-999, we also extract additional verb pairs from the verb-specific similarity data set SimVerb-3500. We randomly sample 244 verb pairs from SimVerb-3500 that represent all similarity spectra. In particular, we add 61 verb pairs for each of the similarity intervals: [0, 1.5), [1.5, 3), [3, 4.5), [4.5, 6]. Because verbs in SimVerb-3500 were originally chosen from VerbNet (Kipper, Snyder, and Palmer 2004; Kipper et al. 2008), they cover a wide range of verb classes and their related linguistic phenomena.
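The interval-balanced sampling described for SimVerb-3500 (61 pairs from each of four similarity intervals, 244 in total) can be sketched as a simple stratified sample. This is only an illustration under stated assumptions: each candidate pair is assumed to already carry a score on the 0 to 6 scale implied by the intervals, and the function names, the fixed random seed, and the error raised for an under-filled interval are hypothetical choices rather than details of the actual procedure.

```python
import random

# Similarity intervals from the text: [0, 1.5), [1.5, 3), [3, 4.5), [4.5, 6].
# Only the last interval is closed on the right.
INTERVALS = [(0.0, 1.5), (1.5, 3.0), (3.0, 4.5), (4.5, 6.0)]
PAIRS_PER_INTERVAL = 61


def interval_index(score):
    """Return the index of the interval containing score, or None."""
    for i, (low, high) in enumerate(INTERVALS):
        is_last = i == len(INTERVALS) - 1
        if low <= score < high or (is_last and score == high):
            return i
    return None


def stratified_sample(scored_pairs, per_interval=PAIRS_PER_INTERVAL, seed=0):
    """Draw per_interval pairs uniformly at random from each interval.

    scored_pairs is an iterable of (word1, word2, score) triples.
    The seed is a hypothetical choice made only for reproducibility.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in INTERVALS]
    for pair in scored_pairs:
        idx = interval_index(pair[2])
        if idx is not None:
            buckets[idx].append(pair)

    sample = []
    for (low, high), bucket in zip(INTERVALS, buckets):
        if len(bucket) < per_interval:
            raise ValueError(f"fewer than {per_interval} pairs in [{low}, {high}]")
        sample.extend(rng.sample(bucket, per_interval))
    return sample
```

Sampling the same number of pairs per interval keeps the added verbs from clustering at one end of the similarity scale, which is the balance that the phrase "represent all similarity spectra" refers to.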

### Fulfillment of Construction Criteria.

The final eng Multi-SimLex data set spans 1,051 noun pairs, 469 verb pairs, 245 adjective pairs, and 123 adverb pairs.4 As mentioned earlier, the criterion C1 has been fulfilled by relying only on word pairs that already underwent meticulous sampling processes in prior work, integrating them into a single resource. As a consequence, Multi-SimLex allows for fine-grained analyses over different POS classes, concreteness levels, similarity spectra, frequency intervals, relation types, morphology, and lexical fields; and it also includes some challenging orthographically similar examples (e.g., infection – inflection).5 We ensure that criteria C2 and C3 are satisfied by using annotation guidelines similar to those of SimLex-999, SimVerb-3500, and SEMEVAL-500, which explicitly target semantic similarity. In what follows, we outline the carefully tailored process of translating and annotating Multi-SimLex data sets in all target languages.

## 5 Multi-SimLex: Translation and Annotation

We now detail the development of the final Multi-SimLex resource, describing our language selection process, as well as translation and annotation of the resource, including the steps taken to ensure and measure the quality of this resource. We also provide key data statistics and preliminary crosslingual comparative analyses.

### Language Selection.

Multi-SimLex comprises eleven languages in addition to English. The main objective for our inclusion criteria has been to balance language prominence (by number of speakers of the language) for maximum impact of the resource, while simultaneously having a diverse suite of languages based on their typological features (such as morphological type and language family). Table 1 summarizes key information about the languages currently included in Multi-SimLex. We have included a mixture of fusional, agglutinative, isolating, and introflexive languages that come from eight different language families. This includes languages that are very widely used such as Chinese Mandarin and Spanish, and low-resource languages such as Welsh and Kiswahili. We acknowledge that, despite a good balance between typological diversity and language prominence, the initial Multi-SimLex language sample still contains some gaps as it does not cover languages from some language families and geographical regions such as the Americas or Australia. This is mostly due to added difficulty for the authors to reach trusted translators and annotators for these languages. This also indicates why Multi-SimLex has been envisioned as a collaborative community project: We hope to further include additional languages and inspire other researchers that work more closely with under-resourced languages to contribute to the effort over the lifetime of this project.

Table 1
The list of 12 languages in the Multi-SimLex multilingual suite along with their corresponding language family (IE = Indo-European), broad morphological type, and their ISO 639-3 code. The number of speakers is based on the total count of L1 and L2 speakers, according to ethnologue.com.

| Language | ISO 639-3 | Family | Type | # Speakers |
|---|---|---|---|---|
| Chinese Mandarin | cmn | Sino-Tibetan | Isolating | 1.116 B |
| Welsh | cym | IE: Celtic | Fusional | 0.7 M |
| English | eng | IE: Germanic | Fusional | 1.132 B |
| Estonian | est | Uralic | Agglutinative | 1.1 M |
| Finnish | fin | Uralic | Agglutinative | 5.4 M |
| French | fra | IE: Romance | Fusional | 280 M |
| Hebrew | heb | Afro-Asiatic | Introflexive | 9 M |
| Polish | pol | IE: Slavic | Fusional | 50 M |
| Russian | rus | IE: Slavic | Fusional | 260 M |
| Spanish | spa | IE: Romance | Fusional | 534.3 M |
| Kiswahili | swa | Niger-Congo | Agglutinative | 98 M |
| Yue Chinese | yue | Sino-Tibetan | Isolating | 73.5 M |

The work on data collection can be divided into two crucial phases: 1) a translation phase, where the extended English language data set with 1,888 pairs (described in §4) is translated into eleven target languages, and 2) an annotation phase, where human raters score each pair in the translated sets as well as in the English set. Detailed guidelines for both phases are available online at https://multisimlex.com.

### 5.1 Word Pair Translation

Translators for each target language were instructed to find direct or approximate translations for the 1,888 word pairs that satisfy the following rules. (1) All pairs in the translated set must be unique (i.e., no duplicate pairs); (2) Translating two words from the same English pair into the same word in the target language is not allowed (e.g., it is not allowed to translate car and automobile to the same Spanish word coche). (3) The translated pairs must preserve the semantic relations between the two words when possible. This means that, when multiple translations are possible, the translation that best conveys the semantic relation between the two words found in the original English pair is selected. (4) If it is not possible to use a single-word translation in the target language, then a multiword expression can be used to convey the nearest possible semantics given the above points (e.g., the English word homework is translated into the Polish multiword expression praca domowa).
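Rules (1) and (2) are mechanical and can be checked automatically; rules (3) and (4) require human judgment. The following is a minimal sketch of such a check, assuming that translations are stored as ordered `(word1, word2)` string tuples and that duplicates are detected by exact string match. Both assumptions, and the function name, are illustrative.

```python
def check_translations(translated_pairs):
    """Return (index, message) tuples for violations of rules (1) and (2).

    `translated_pairs` is a list of (word1, word2) tuples in the target
    language (hypothetical layout).
    """
    problems = []
    seen = set()
    for index, (word1, word2) in enumerate(translated_pairs):
        # Rule (2): the two words of a pair must not share one translation.
        if word1 == word2:
            problems.append((index, "rule 2: both words map to the same translation"))
        # Rule (1): every pair in the translated set must be unique.
        if (word1, word2) in seen:
            problems.append((index, "rule 1: duplicate pair"))
        seen.add((word1, word2))
    return problems
```

A translation set passes the check when the returned list is empty; any flagged index points the translator to a pair that needs an alternative variant.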

Satisfying these rules when finding appropriate translations for each pair, while keeping to the spirit of the intended semantic relation in the English version, is not always straightforward. For instance, kinship terminology in Sinitic languages (Mandarin and Yue) uses different terms depending on whether the family member is older or younger, and whether the family member comes from the mother’s side or the father’s side. In Mandarin, brother has no direct translation and must be rendered with a term meaning either older brother or younger brother. In such cases, the translators are asked to choose the best option given the semantic context (relation) expressed by the pair in English, or otherwise to select one of the translations arbitrarily. The same strategy is used to remove duplicate pairs in the translated set, by differentiating the duplicates with a different variant at each instance. Further, many translation instances were resolved using near-synonymous terms in the translation. For example, the words in the pair wood – timber can only be directly translated into Estonian as puit, and are not distinguishable. Therefore, the translators approximated the translation of timber with the compound noun puitmaterjal (literally: wood material) in order to produce a valid pair in the target language. In some cases, a less formal yet frequent variant is used as a translation. For example, the words in the pair physician – doctor both translate to the same word in Estonian (arst); the less formal word doktor is used as a translation of doctor to generate a valid pair.

We measure the quality of the translated pairs by using a random sample set of 100 pairs (from the 1,888 pairs) to be translated by an independent translator for each target language. The sample is proportionally stratified according to the part-of-speech categories. The independent translator is given identical instructions to the main translator; we then measure the percentage of matched translated words between the two translations of the sample set. Table 2 summarizes the inter-translator agreement results for all languages and by part-of-speech subsets. Overall across all languages, the agreement is 84.8%, which is similar to prior work (Camacho-Collados et al. 2017; Vulić, Ponzetto, and Glavaš 2019).
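As an illustration of the agreement measure, the sketch below counts matched words over all word slots of the sample (two per pair). Exact string matching and the tuple layout are assumptions; the text does not specify how near-identical translations (e.g., spelling variants) were treated.

```python
def translator_agreement(main, independent):
    """Percentage of word slots translated identically by both translators.

    `main` and `independent` are equally long lists of (word1, word2)
    tuples for the same English sample (hypothetical layout).
    """
    if len(main) != len(independent):
        raise ValueError("both translations must cover the same sample")
    matched = 0
    total = 0
    for (main1, main2), (indep1, indep2) in zip(main, independent):
        # Each pair contributes two word slots to the comparison.
        matched += int(main1 == indep1) + int(main2 == indep2)
        total += 2
    return 100.0 * matched / total
```

On a 100-pair sample this compares 200 word slots, so each matched or mismatched word changes the score by 0.5 percentage points.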

Table 2
Inter-translator agreement (% of matched translated words) by independent translators using a randomly selected 100-pair English sample from the Multi-SimLex data set, and the corresponding 100-pair samples from the other data sets.

| Languages: | cmn | cym | est | fin | fra | heb | pol | rus | spa | swa | yue | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nouns | 84.5 | 80.0 | 90.0 | 87.3 | 78.2 | 98.2 | 90.0 | 95.5 | 85.5 | 80.0 | 77.3 | 86.0 |
| Adjectives | 88.5 | 88.5 | 61.5 | 73.1 | 69.2 | 100.0 | 84.6 | 100.0 | 69.2 | 88.5 | 84.6 | 82.5 |
| Verbs | 88.0 | 74.0 | 82.0 | 76.0 | 78.0 | 100.0 | 74.0 | 100.0 | 74.0 | 76.0 | 86.0 | 82.5 |
| Adverbs | 92.9 | 100.0 | 57.1 | 78.6 | 92.9 | 100.0 | 85.7 | 100.0 | 85.7 | 85.7 | 78.6 | 87.0 |
| Overall | 86.5 | 81.0 | 82.0 | 82.0 | 78.0 | 99.0 | 85.0 | 97.5 | 80.5 | 81.0 | 80.5 | 84.8 |

### 5.2 Guidelines and Word Pair Scoring

Across all languages, 145 human annotators were asked to score all 1,888 pairs (in their given language). We finally collect at least ten valid annotations for each word pair in each language. All annotators were required to abide by the following instructions:

1. Each annotator must assign an integer score between 0 and 6 (inclusive) indicating how semantically similar the two words in a given pair are. A score of 6 indicates very high similarity (i.e., perfect synonymy), and zero indicates no similarity.

2. Each annotator must score the entire set of 1,888 pairs in the data set. The pairs must not be shared between different annotators.

3. Annotators are able to break the workload over a period of approximately 2–3 weeks, and are able to use external sources (e.g., dictionaries, thesauri, WordNet) if required.

4. Annotators are kept anonymous, and are not able to communicate with each other during the annotation process.

The selection criteria required that all annotators be native speakers of the target language. Preference was given to annotators with university education, but this was not a requirement. Annotators were asked to complete a spreadsheet containing the translated pairs of words, their part-of-speech, and a column to enter the score. The annotators did not have access to the original pairs in English.

To ensure the quality of the collected ratings, we have used an adjudication protocol similar to the one proposed and validated by Pilehvar et al. (2018). It consists of the following three rounds:

Round 1: All annotators are asked to follow the instructions outlined above, and to rate all 1,888 pairs with integer scores between 0 and 6.

Round 2: We compare the scores of all annotators and identify the pairs for each annotator that have shown the most disagreement. We ask the annotators to reconsider the assigned scores for those pairs only. The annotators may choose to either change or keep the scores. As in Round 1, the annotators have no access to the scores of the other annotators, and the process is anonymous. This process gives annotators a chance to correct errors or reconsider their judgments, and has been shown to be very effective in reaching consensus, as reported by Pilehvar et al. (2018). We used a procedure very similar to that of Pilehvar et al. (2018) to identify the pairs with the most disagreement: for each annotator, we marked the $i$th pair if the rated score $s_i$ satisfies $s_i \geq \mu_i + 1.5$ or $s_i \leq \mu_i - 1.5$, where $\mu_i$ is the mean of the other annotators’ scores.
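The Round 2 selection step can be sketched as follows, assuming that a pair is flagged for an annotator when the annotator's score deviates from the mean of the other annotators' scores by at least 1.5. The nested-list layout `scores[annotator][pair]` and the function name are assumptions made for illustration.

```python
def flag_disagreements(scores, threshold=1.5):
    """Map each annotator index to the pair indices they should reconsider.

    `scores[a][i]` is the score given by annotator `a` to pair `i`
    (hypothetical layout; requires at least two annotators).
    """
    n_annotators = len(scores)
    n_pairs = len(scores[0])
    flagged = {a: [] for a in range(n_annotators)}
    for i in range(n_pairs):
        total = sum(scores[a][i] for a in range(n_annotators))
        for a in range(n_annotators):
            # Mean of the other annotators' scores for this pair.
            mean_others = (total - scores[a][i]) / (n_annotators - 1)
            if abs(scores[a][i] - mean_others) >= threshold:
                flagged[a].append(i)
    return flagged
```

Because the mean excludes the annotator's own score, a single outlying rating is flagged only for the annotator who produced it, as long as the remaining annotators stay within the threshold of their own leave-one-out means.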

Round 3: We compute the average agreement for each annotator (with the other annotators) by measuring the average Spearman’s correlation against all other annotators. We discard the scores of annotators that have shown the least average agreement with all other annotators, while we maintain at least ten annotators per language by the end of this round. The actual process is done in multiple iterations: (S1) we measure the average agreement for each annotator with every other annotator (this corresponds to the APIAA measure, see later); (S2) if we still have more than 10 valid annotators and the lowest average score is higher than in the previous iteration, we remove the lowest one, and rerun S1. Table 3 shows the number of annotators at both the start (Round 1) and end (Round 3) of our process for each language.

Table 3
Number of human annotators. R1 = Annotation Round 1, R3 = Round 3.

| Languages: | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R1: Start | 13 | 12 | 14 | 12 | 13 | 10 | 11 | 12 | 12 | 12 | 11 | 13 |
| R3: End | 11 | 10 | 13 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 11 |

We measure the agreement between annotators using two metrics: average pairwise inter-annotator agreement (APIAA) and average mean inter-annotator agreement (AMIAA). Both use Spearman’s correlation (ρ) between annotators’ scores; the only difference is how the correlations are averaged. They are computed as follows:
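$$\text{APIAA} = \frac{2\sum_{i,j}\rho(s_i, s_j)}{N(N-1)}, \qquad \text{AMIAA} = \frac{\sum_{i}\rho(s_i, \mu_i)}{N}, \quad \text{where } \mu_i = \frac{\sum_{j,\, j \neq i} s_j}{N-1} \tag{1}$$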
where ρ(si,sj) is the Spearman’s correlation between annotators i and j’s scores (si,sj) for all pairs in the data set, and N is the number of annotators. APIAA has been used widely as the standard measure for inter-annotator agreement, including in the original SimLex paper (Hill, Reichart, and Korhonen 2015). It simply averages the pairwise Spearman’s correlation between all annotators. On the other hand, AMIAA compares the average Spearman’s correlation of one held-out annotator with the average of all the other N − 1 annotators, and then averages across all N ‘held-out’ annotators. It smooths individual annotator effects and arguably serves as a better upper bound than APIAA (Gerz et al. 2016; Vulić et al. 2017a; Pilehvar et al. 2018, inter alia).
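To make the two measures concrete, the following is a minimal pure-Python sketch. It assumes that the pairwise sum in APIAA runs over unordered annotator pairs (which is what the normalization by N(N−1)/2 implies) and that tied scores receive average ranks. The function names and the `scores[annotator][pair]` layout are illustrative and not taken from the original evaluation scripts.

```python
from itertools import combinations
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def apiaa(scores):
    """Average Spearman correlation over all unordered annotator pairs."""
    pairs = list(combinations(range(len(scores)), 2))
    return sum(spearman(scores[i], scores[j]) for i, j in pairs) / len(pairs)


def amiaa(scores):
    """Average correlation of each annotator with the mean of the others."""
    n = len(scores)
    total = 0.0
    for i in range(n):
        others = [s for j, s in enumerate(scores) if j != i]
        mean_others = [sum(column) / (n - 1) for column in zip(*others)]
        total += spearman(scores[i], mean_others)
    return total / n
```

Averaging the other annotators' scores before correlating reduces the influence of any single rater's noise, which is one way to see why AMIAA tends to come out higher than APIAA.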

We present the respective APIAA and AMIAA scores in Table 4 and Table 5 for all part-of-speech subsets, as well as the agreement for the full data sets. As reported in prior work (Gerz et al. 2016; Vulić et al. 2017a), AMIAA scores are typically higher than APIAA scores. Crucially, the results indicate “strong agreement” (across all languages) using both measurements. The languages with the highest annotator agreement were French (fra) and Yue Chinese (yue), while Russian (rus) had the lowest overall IAA scores. These scores, however, are still considered to be “moderately strong agreement.”

Table 4
Average pairwise inter-annotator agreement (APIAA). A score of 0.6 and above indicates strong agreement.

| Languages: | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nouns | 0.661 | 0.622 | 0.659 | 0.558 | 0.647 | 0.698 | 0.538 | 0.606 | 0.524 | 0.582 | 0.626 | 0.727 |
| Adjectives | 0.757 | 0.698 | 0.823 | 0.695 | 0.721 | 0.741 | 0.683 | 0.699 | 0.625 | 0.640 | 0.658 | 0.785 |
| Verbs | 0.694 | 0.604 | 0.707 | 0.580 | 0.644 | 0.691 | 0.615 | 0.593 | 0.555 | 0.588 | 0.631 | 0.760 |
| Adverbs | 0.699 | 0.593 | 0.695 | 0.579 | 0.646 | 0.595 | 0.561 | 0.543 | 0.535 | 0.563 | 0.562 | 0.716 |
| Overall | 0.680 | 0.619 | 0.698 | 0.583 | 0.646 | 0.697 | 0.572 | 0.609 | 0.530 | 0.576 | 0.623 | 0.733 |

Table 5
Average mean inter-annotator agreement (AMIAA). A score of 0.6 and above indicates strong agreement.

| Languages: | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nouns | 0.757 | 0.747 | 0.766 | 0.696 | 0.766 | 0.809 | 0.680 | 0.717 | 0.657 | 0.710 | 0.725 | 0.804 |
| Adjectives | 0.800 | 0.789 | 0.865 | 0.790 | 0.792 | 0.831 | 0.754 | 0.792 | 0.737 | 0.743 | 0.686 | 0.811 |
| Verbs | 0.774 | 0.733 | 0.811 | 0.715 | 0.757 | 0.808 | 0.720 | 0.722 | 0.690 | 0.710 | 0.702 | 0.784 |
| Adverbs | 0.749 | 0.693 | 0.777 | 0.697 | 0.748 | 0.729 | 0.645 | 0.655 | 0.608 | 0.671 | 0.623 | 0.716 |
| Overall | 0.764 | 0.742 | 0.794 | 0.715 | 0.760 | 0.812 | 0.699 | 0.723 | 0.667 | 0.703 | 0.710 | 0.792 |

### 5.3 Data Analysis

#### Similarity Score Distributions.

Across all languages, the average score (mean = 1.61, median = 1.1) is on the lower side of the similarity scale. However, looking closer at the scores of each language in Table 6, we observe notable differences in both the averages and the spread of scores. In particular, French has the highest average of similarity scores (mean = 2.61, median = 2.5), and Kiswahili has the lowest average (mean = 1.28, median = 0.5). Russian has the lowest spread (σ = 1.37), and Polish has the largest (σ = 1.62). All of the languages are strongly correlated with each other, as shown in Figure 1, where all of the Spearman’s correlation coefficients are greater than 0.6 for all language pairs. Languages that share the same language family are highly correlated (e.g., cmn-yue, rus-pol, est-fin). In addition, we observe high correlations between English and most other languages, as expected. This is due to the effect of using English as the base/anchor language to create the data set. Put simply, if one translates to two languages L1 and L2 starting from the same set of pairs in English, it is highly likely that L1 and L2 will diverge from English in different ways. Therefore, the similarity between L1-eng and L2-eng is expected to be higher than between L1-L2, especially if L1 and L2 are typologically dissimilar languages (e.g., heb-cmn, see Figure 1). This phenomenon is well documented in related prior work (Leviant and Reichart 2015; Camacho-Collados et al. 2017; Mrkšić et al. 2017; Vulić, Ponzetto, and Glavaš 2019). Although we acknowledge this as an artifact of the data set design, it would otherwise be impossible to construct a semantically aligned and comprehensive data set across a large number of languages.

Table 6
Fine-grained distribution of concept pairs over different rating intervals in each Multi-SimLex language, reported as percentages. The total number of concept pairs in each data set is 1,888.

| Interval | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [0,1) | 56.99 | 52.01 | 50.95 | 35.01 | 47.83 | 17.69 | 28.07 | 49.36 | 50.21 | 43.96 | 61.39 | 57.89 |
| [1,2) | 8.74 | 19.54 | 17.06 | 30.67 | 21.35 | 20.39 | 35.86 | 17.32 | 22.40 | 22.35 | 11.86 | 7.84 |
| [2,3) | 13.72 | 11.97 | 12.66 | 16.21 | 12.02 | 22.03 | 16.74 | 11.86 | 11.81 | 14.83 | 9.11 | 11.76 |
| [3,4) | 11.60 | 8.32 | 8.16 | 10.22 | 10.17 | 17.64 | 8.47 | 8.95 | 8.10 | 9.38 | 7.10 | 12.98 |
| [4,5) | 6.41 | 5.83 | 6.89 | 6.25 | 5.61 | 12.55 | 6.62 | 7.57 | 5.88 | 6.78 | 6.30 | 6.89 |
| [5,6] | 2.54 | 2.33 | 4.29 | 1.64 | 2.97 | 9.64 | 4.24 | 4.93 | 1.59 | 2.70 | 4.24 | 2.65 |

Figure 1

Spearman’s correlation coefficient (ρ) of the similarity scores for all languages in Multi-SimLex.

We also report differences in the distribution of the frequency of words among the languages in Multi-SimLex. Figure 2 shows six example languages, where each bar segment shows the proportion of words in each language that occur in the given frequency range. For example, the 10K–20K segment of the bars represents the proportion of words in the data set that occur in the list of most frequent words between the frequency rank of 10,000 and 20,000 in that language; likewise with other intervals. Frequency lists for the presented languages are derived from Wikipedia and Common Crawl corpora.7 Although many concept pairs are direct or approximate translations of English pairs, we can see that the frequency distribution does vary across different languages, and is also related to inherent language properties. For instance, in Finnish and Russian, although we use infinitive forms of all verbs, conjugated verb inflections are often more frequent in raw corpora than the corresponding infinitive forms. The variance can also be partially explained by the difference in monolingual corpora size used to derive the frequency rankings in the first place: Absolute vocabulary sizes are expected to fluctuate across different languages. However, it is also important to note that the data sets also contain subsets of lower-frequency and rare words, which can be used for rare word evaluations in multiple languages, in the spirit of Pilehvar et al. (2018)’s English rare word data set.

Figure 2

A distribution over different frequency ranges for words from Multi-SimLex data sets for selected languages. Multiword expressions are excluded from the analysis.

#### Crosslinguistic Differences.

Table 7 shows some examples of average similarity scores of English, Spanish, Kiswahili, and Welsh concept pairs. Remember that the scores range from 0 to 6: The higher the score, the more similar the participants found the concepts in the pair. The examples from Table 7 show evidence of both the stability of average similarity scores across languages (unlikely – friendly, book – literature, and vanish – disappear), as well as language-specific differences (care – caution). Some differences in similarity scores seem to group languages into clusters. For example, the word pair regular – average has an average similarity score of 4.0 and 4.1 in English and Spanish, respectively, whereas in Kiswahili and Welsh the average similarity score of this pair is 0.5 and 0.8. We analyze this phenomenon in more detail in §5.4.

Table 7
Examples of concept pairs with their similarity scores from four languages. For brevity, only the original English concept pair is included, but note that the pair is translated to all target languages, see §5.1.

| Word Pair | POS | eng | spa | swa | cym |
|---|---|---|---|---|---|
| *Similar average rating* | | | | | |
| book – literature | | 2.5 | 2.3 | 2.1 | 2.3 |
| vanish – disappear | | 5.2 | 5.3 | 5.5 | 5.3 |
| *Different average rating* | | | | | |
| regular – average | ADJ | 4.0 | 4.1 | 0.5 | 0.8 |
| care – caution | | 4.1 | 5.7 | 0.2 | 3.1 |
| *One language higher* | | | | | |
| large – big | ADJ | 5.9 | 2.7 | 3.8 | 3.8 |
| bank – seat | | ≤ 0.1 | 5.1 | ≤ 0.1 | ≤ 0.1 |
| sunset – evening | | 1.6 | 1.5 | 5.5 | 2.8 |
| purely – completely | ADV | 2.3 | 2.3 | 1.1 | 5.4 |
| *One language lower* | | | | | |
| woman – wife | | 0.9 | 2.9 | 4.1 | 4.8 |
| amazingly – fantastically | ADV | 5.1 | 0.4 | 4.1 | 4.1 |
| wonderful – terrific | ADJ | 5.3 | 5.4 | 0.9 | 5.7 |
| promise – swear | | 4.8 | 5.3 | 4.3 | 0 |

There are also examples for each of the four languages having a notably higher or lower similarity score for the same concept pair than the three other languages. For example, large – big in English has an average similarity score of 5.9, whereas Spanish, Kiswahili, and Welsh speakers rate the closest concept pair in their native language to have a similarity score between 2.7 and 3.8. What is more, woman – wife receives an average similarity of 0.9 in English, 2.9 in Spanish, and greater than 4.0 in Kiswahili and Welsh. The examples from Spanish include banco – asiento (bank – seat), which receives an average similarity score of 5.1, whereas in the other three languages the similarity score for this word pair does not exceed 0.1. At the same time, the average similarity score of espantosamente – fantásticamente (amazingly – fantastically) is much lower in Spanish (0.4) than in other languages (4.1–5.1). In Kiswahili, an example of a word pair with a higher similarity score than the rest would be machweo – jioni (sunset – evening), having an average score of 5.5, while the other languages receive 2.8 or less, and a notably lower similarity score is given to wa ajabu – mkubwa sana (wonderful – terrific), getting 0.9, while the other languages receive 5.3 or more. Welsh examples include yn llwyr – yn gyfan gwbl (purely – completely), which scores 5.4 among Welsh speakers but 2.3 or less in other languages, whereas addo – tyngu (promise – swear) is rated as 0 by all Welsh annotators, but in the other three languages 4.3 or more on average.

There can be several explanations for the differences in similarity scores across languages, including but not limited to cultural context, polysemy, metonymy, translation, regional and generational differences, and most commonly, the fact that words and meanings do not exactly map onto each other across languages. For example, it is likely that the other three languages do not have two separate words for describing the concepts in the concept pair large – big, and the translators had to opt for similar lexical items that were more distant in meaning, explaining why in English the concept pair received a much higher average similarity score than in other languages. A similar issue related to the mapping problem across languages arose in the Welsh concept pair yn llwyr – yn gyfan gwbl, where Welsh speakers agreed that the two concepts are very similar. When asked, bilingual speakers considered the two Welsh concepts more similar than the English equivalents purely – completely, potentially explaining why a higher average similarity score was reached in Welsh. The example of woman – wife can illustrate cultural differences or another translation-related issue where the word ‘wife’ did not exist in some languages (for example, Estonian), and therefore had to be described using other words, affecting the comparability of the similarity scores. This was also the case with the football – soccer concept pair. The pair bank – seat demonstrates the effect of the polysemy mismatch across languages: Although ‘bank’ has two different meanings in English, neither of them is similar to the word ‘seat’, whereas in Spanish, ‘banco’ can mean ‘bank’, but it can also mean ‘bench’. Quite naturally, Spanish speakers gave the pair banco – asiento a higher similarity score than the speakers of languages where this polysemy did not occur.

An example of metonymy affecting the average similarity score can be seen in the Kiswahili version of the word pair sunset – evening (machweo – jioni). The average similarity score for this pair is much higher in Kiswahili, likely because the word ‘sunset’ can act as a metonym of ‘evening’. The low similarity score of wonderful – terrific in Kiswahili (wa ajabu – mkubwa sana) can be explained by the fact that while ‘mkubwa sana’ can be used as ‘terrific’ in Kiswahili, it technically means ‘very big’, adding to the examples of translation- and mapping-related effects. The word pair amazingly – fantastically (espantosamente – fantásticamente) brings out another translation-related problem: the accuracy of the translation. Although ‘espantosamente’ could arguably be translated to ‘amazingly’, more common meanings include ‘frightfully’, ‘terrifyingly’, and ‘shockingly’, explaining why the average similarity score differs from the rest of the languages. Another problem was brought out by addo – tyngu (promise – swear) in Welsh, where ‘tyngu’ may not have been a commonly used or even a known word choice for annotators, pointing to potential regional or generational differences in language use.

Table 8 presents examples of concept pairs from English, Spanish, Kiswahili, and Welsh on which the participants agreed the most. For example, in English all participants rated the similarity of trial – test to be 4 or 5. In Spanish and Welsh, all participants rated start – begin to correspond to a score of 5 or 6. In Kiswahili, money – cash received a similarity rating of 6 from every participant. Although there are numerous examples of concept pairs in these languages where the participants agreed on a similarity score of 4 or higher, it is worth noting that none of these languages had a single pair where all participants agreed on either 1-2, 2-3, or 3-4 similarity rating. Interestingly, in English all pairs where all the participants agreed on a 5-6 similarity score were adjectives.

Table 8
Examples of concept pairs with their similarity scores from four languages where all participants show strong agreement in their rating.

| Language | Word Pair | POS | Rating all participants agree with |
|---|---|---|---|
| eng | trial – test | | 4–5 |
| swa | archbishop – bishop | | 4–5 |
| spa, cym | start – begin | | 5–6 |
| eng | smart – intelligent | ADJ | 5–6 |
| eng, spa | quick – rapid | ADJ | 5–6 |
| spa | circumstance – situation | | 5–6 |
| cym | football – soccer | | 5–6 |
| swa | football – soccer | | |
| swa | pause – wait | | |
| swa | money – cash | | 6 |
| cym | friend – buddy | | |

### 5.4 Effect of Language Affinity on Similarity Scores

Based on the analysis in Figure 1 and inspecting the anecdotal examples in the previous section, it is evident that the correlation between similarity scores across languages is not random. To corroborate this intuition, we visualize the vectors of similarity scores for each single language by reducing their dimensionality to 2 via principal component analysis (Pearson 1901). The resulting scatter plot in Figure 3 reveals that languages from the same family or branch have similar patterns in the scores. In particular, Russian and Polish (both Slavic), Finnish and Estonian (both Uralic), Cantonese and Mandarin Chinese (both Sinitic), and Spanish and French (both Romance) are all neighbors.

Figure 3

Principal component analysis of the language vectors resulting from the concatenation of similarity judgments for all pairs.

In order to quantify exactly the effect of language affinity on the similarity scores, we run correlation analyses between these and language features. In particular, we extract feature vectors from URIEL (Littell et al. 2017), a massively multilingual typological database that collects and normalizes information compiled by grammarians and field linguists about the world’s languages. Specifically, we focus on information about geography (the areas where the language speakers are concentrated), family (the phylogenetic tree each language belongs to), and typology (including syntax, phonological inventory, and phonology).8 Moreover, we consider typological representations of languages that are not manually crafted by experts, but rather learned from texts (Östling and Tiedemann 2017). We experiment with the vectors of Malaviya, Neubig, and Littell (2017), who proposed to construct such representations by training language-identifying vectors end-to-end as part of neural machine translation models.

The vector for similarity judgments and the vector of linguistic features for a given language have different dimensionality. Hence, we first construct a distance matrix for each vector space, such that both columns and rows are language indices, and each cell value is the cosine distance between the vectors of the corresponding language pair. Given a set of languages $L$, each resulting matrix $S \in \mathbb{R}^{|L| \times |L|}$ is symmetric. To estimate the correlation between the matrix for similarity judgments and each of the matrices for linguistic features, we run a Mantel test (Mantel 1967), a non-parametric statistical test based on matrix permutations that takes into account inter-dependencies among pairwise distances.
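The two steps described above, building cosine-distance matrices and comparing them with a permutation-based Mantel test, can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: it uses a one-sided p-value with 999 random permutations and a fixed seed, and it does not compute the z statistic; none of these choices are taken from the original analysis code, and all function names are hypothetical.

```python
import random
from math import sqrt


def cosine_distance(u, v):
    """1 - cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_matrix(vectors):
    """Symmetric |L| x |L| matrix of pairwise cosine distances."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


def upper_triangle(matrix, order):
    """Flatten the strict upper triangle after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel(matrix_a, matrix_b, permutations=999, seed=0):
    """Return (r, p): observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(matrix_a)))
    flat_b = upper_triangle(matrix_b, identity)
    observed = pearson(upper_triangle(matrix_a, identity), flat_b)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute rows and columns of matrix_a together
        if pearson(upper_triangle(matrix_a, order), flat_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Permuting rows and columns jointly, rather than shuffling individual cells, preserves the dependency structure among the pairwise distances of each language, which is the reason a Mantel test is preferred over a plain correlation test on the flattened matrices.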

The results of the Mantel test reported in Table 9 show that there exist statistically significant correlations between similarity judgments and geography, family, and syntax, given that p < 0.05 and z > 1.96. The correlation coefficient is particularly strong for geography (r = 0.647) and syntax (r = 0.649). The former result is intuitive, because languages in contact easily borrow and loan lexical units, and cultural interactions may result in similar cognitive categorizations. The result for syntax, instead, cannot be explained so easily, as formal properties of language do not affect lexical semantics. Instead, we conjecture that, although no causal relation is present, both syntactic features and similarity judgments might be linked to a common explanatory variable (such as geography). In fact, several syntactic properties are not uniformly spread across the globe. For instance, languages with Verb–Object–Subject word order are mostly concentrated in Oceania (Dryer 2013). In turn, geographical proximity leads to similar judgment patterns, as mentioned above. On the other hand, we find no statistically significant correlation with phonology and inventory, as expected, nor with the bottom–up typological features from Malaviya, Neubig, and Littell (2017).

Table 9
Mantel test on the correlation between similarity judgments from Multi-SimLex and linguistic features from typological databases.

| Features | Dimension | Mantel r | Mantel p | Mantel z |
|---|---|---|---|---|
| geography | 299 | 0.647 | 0.007* | 3.443 |
| family | 3718 | 0.329 | 0.023* | 2.711 |
| syntax | 103 | 0.649 | 0.007* | 3.787 |
| inventory | 158 | 0.155 | 0.459 | 0.782 |
| phonology | 28 | 0.397 | 0.046 | 1.943 |
| Malaviya, Neubig, and Littell (2017) | 512 | −0.431 | 0.264 | −1.235 |

## 6 Crosslingual Multi-SimLex Data Sets

A crucial advantage of having semantically aligned monolingual data sets across different languages is the potential to create crosslingual semantic similarity data sets. Such data sets allow for probing the quality of crosslingual representation learning algorithms (Camacho-Collados et al. 2017; Conneau et al. 2018a; Chen and Cardie 2018; Doval et al. 2018; Ruder, Vulić, and Søgaard 2019; Conneau and Lample 2019; Ruder, Søgaard, and Vulić 2019) as an intrinsic evaluation task. However, the crosslingual data sets previous work relied upon (Camacho-Collados et al. 2017) were limited to a homogeneous set of high-resource languages (e.g., English, German, Italian, Spanish) and a small number of concept pairs (fewer than 1K pairs each). We address both problems by 1) using a typologically more diverse language sample, and 2) relying on a substantially larger English data set as a source for the crosslingual data sets: 1,888 pairs in this work versus 500 pairs in the work of Camacho-Collados et al. (2017). As a result, each of our crosslingual data sets contains a substantially larger number of concept pairs, as shown in Table 11. The crosslingual Multi-SimLex data sets are constructed automatically, leveraging word pair translations and annotations collected in all 12 languages. This yields a total of 66 crosslingual data sets, one for each possible pair of languages. Table 11 provides the final number of concept pairs, which lies between 2,031 and 3,480 pairs for each crosslingual data set, whereas Table 10 shows some sample pairs with their corresponding similarity scores.

Table 10
Example concept pairs with their scores from a selection of crosslingual Multi-SimLex data sets.

| Pair | Concept-1 | Concept-2 | Score | Pair | Concept-1 | Concept-2 | Score |
|---|---|---|---|---|---|---|---|
| cym-eng | rhyddid | liberty | 5.37 | cmn-est | | optimistlikult | 0.83 |
| cym-pol | plentynaidd | niemądry | 2.15 | fin-swa | psykologia | sayansi | 2.20 |
| swa-eng | kutimiza | accomplish | 5.24 | eng-fra | normally | quotidiennement | 2.41 |
| cmn-fra | | flexible | 4.08 | fin-spa | auto | bicicleta | 0.85 |
| fin-spa | tietämättömyys | inteligencia | 0.55 | cmn-yue | | | 4.78 |
| spa-fra | ganador | candidat | 2.15 | cym-swa | sefyllfa | mazingira | 1.90 |
| est-yue | takso | | 2.08 | est-spa | armee | legión | 3.25 |
| eng-fin | orange | sitrushedelmä | 3.43 | fin-est | halveksuva | põlglik | 5.55 |
| spa-pol | palabra | wskazówka | 0.55 | cmn-cym | | disgybl | 4.45 |
| pol-swa | prawdopodobnie | uwezekano | 4.05 | pol-eng | grawitacja | meteor | 0.27 |

Table 11
The sizes of all monolingual (main diagonal) and crosslingual data sets.

| | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cmn | 1,888 | – | – | – | – | – | – | – | – | – | – | – |
| cym | 3,085 | 1,888 | – | – | – | – | – | – | – | – | – | – |
| eng | 3,151 | 3,380 | 1,888 | – | – | – | – | – | – | – | – | – |
| est | 3,188 | 3,305 | 3,364 | 1,888 | – | – | – | – | – | – | – | – |
| fin | 3,137 | 3,274 | 3,352 | 3,386 | 1,888 | – | – | – | – | – | – | – |
| fra | 2,243 | 2,301 | 2,284 | 2,787 | 2,682 | 1,888 | – | – | – | – | – | – |
| heb | 3,056 | 3,209 | 3,274 | 3,358 | 3,243 | 2,903 | 1,888 | – | – | – | – | – |
| pol | 3,009 | 3,175 | 3,274 | 3,310 | 3,294 | 2,379 | 3,201 | 1,888 | – | – | – | – |
| rus | 3,032 | 3,196 | 3,222 | 3,339 | 3,257 | 2,219 | 3,226 | 3,209 | 1,888 | – | – | – |
| spa | 3,116 | 3,205 | 3,318 | 3,312 | 3,256 | 2,645 | 3,256 | 3,250 | 3,189 | 1,888 | – | – |
| swa | 2,807 | 2,926 | 2,828 | 2,845 | 2,900 | 2,031 | 2,775 | 2,819 | 2,855 | 2,811 | 1,888 | – |
| yue | 3,480 | 3,062 | 3,099 | 3,080 | 3,063 | 2,313 | 3,005 | 2,950 | 2,966 | 3,053 | 2,821 | 1,888 |

The automatic creation and verification of crosslingual data sets closely follows the procedure first outlined by Camacho-Collados, Pilehvar, and Navigli (2015) and later adopted by Camacho-Collados et al. (2017) (for semantic similarity) and Vulić, Ponzetto, and Glavaš (2019) (for graded lexical entailment). First, given two languages, we intersect their aligned concept pairs obtained through translation. For instance, starting from the aligned pairs attroupement – foule in French and rahvasumm – rahvahulk in Estonian, we construct two crosslingual pairs attroupement – rahvahulk and rahvasumm – foule. The scores of crosslingual pairs are then computed as averages of the two corresponding monolingual scores. Finally, in order to filter out concept pairs whose semantic meaning was not preserved during this operation, we retain only crosslingual pairs for which the corresponding monolingual scores $(s_s, s_t)$ differ at most by one fifth of the full scale (i.e., $|s_s - s_t| \leq 1.2$). This heuristic mitigates the noise due to crosslingual semantic shifts (Camacho-Collados et al. 2017; Vulić, Ponzetto, and Glavaš 2019). We refer the reader to the work of Camacho-Collados, Pilehvar, and Navigli (2015) for a detailed technical description of the procedure.
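
As an illustration, the construction steps can be sketched as follows. The data layout (a dictionary from a shared concept-pair identifier to a word pair and its score), the function name, and the scores in the usage example are hypothetical; only the pairing scheme, the averaging, and the 1.2 threshold come from the description above.

```python
MAX_GAP = 1.2  # one fifth of the 0-6 rating scale


def build_crosslingual(source, target, max_gap=MAX_GAP):
    """Combine two aligned monolingual data sets into crosslingual pairs.

    `source` and `target` map a shared concept-pair id to
    (word_1, word_2, score). This layout is an assumption for the sketch.
    """
    pairs = []
    for concept_id in sorted(source.keys() & target.keys()):
        s1, s2, score_s = source[concept_id]
        t1, t2, score_t = target[concept_id]
        # Drop pairs whose monolingual scores diverge too much; the small
        # tolerance only guards against floating-point rounding.
        if abs(score_s - score_t) > max_gap + 1e-9:
            continue
        score = (score_s + score_t) / 2
        pairs.append((s1, t2, score))  # e.g., attroupement - rahvahulk
        pairs.append((t1, s2, score))  # e.g., rahvasumm - foule
    return pairs


# Usage with made-up scores (the words follow the example in the text).
fra = {1: ("attroupement", "foule", 4.0), 2: ("ganador_fr", "candidat", 5.0)}
est = {1: ("rahvasumm", "rahvahulk", 4.5), 2: ("w1_et", "w2_et", 2.0)}
crosslingual = build_crosslingual(fra, est)
```

Each retained aligned pair contributes two crosslingual pairs, which is consistent with the crosslingual data sets in Table 11 being larger than the monolingual ones but smaller than twice their size.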

To assess the quality of the resulting crosslingual data sets, we have conducted a verification experiment similar to Vulić, Ponzetto, and Glavaš (2019). We randomly sampled 300 concept pairs in the English-Spanish, English-French, and English-Mandarin crosslingual data sets. Subsequently, we asked bilingual native speakers to provide similarity judgments of each pair. The Spearman’s correlation score ρ between automatically induced and manually collected ratings achieves ρ ≥ 0.90 on all samples, which confirms the viability of the automatic construction procedure.

### Score and Class Distributions.

Summaries of the score and class distributions across all 66 crosslingual data sets are provided in Figure 4a and Figure 4b, respectively. First, the distribution over the four POS classes largely adheres to that of the original monolingual Multi-SimLex data sets, and the variance is quite low: For example, the eng-fra data set contains the lowest proportion of nouns (49.21%) and the highest proportion of verbs (27.1%), adjectives (15.28%), and adverbs (8.41%). On the other hand, the distribution over similarity intervals in Figure 4a shows a much greater variance. This is again expected, as the same pattern appears in the monolingual data sets (see Table 6). The data are also skewed toward lower-similarity concept pairs. However, due to the joint size of all crosslingual data sets (see Table 11), even the least represented intervals contain a substantial number of concept pairs. For instance, the rus-yue data set contains the fewest highly similar concept pairs (in the interval [4,6]) of all 66 crosslingual data sets. Nonetheless, the absolute number of pairs (138) in that interval for rus-yue is still substantial. If needed, this makes it possible to create smaller data sets that are balanced across the similarity spectrum through sub-sampling.

Figure 4

(a) Rating distribution and (b) distribution of pairs over the four POS classes in crosslingual Multi-SimLex data sets averaged across each of the 66 language pairs (y-axes plot percentages as the total number of concept pairs varies across different crosslingual data sets). Minimum and maximum percentages for each rating interval and POS class are also plotted.

## 7 Monolingual Evaluation of Representation Learning Models

After the numerical and qualitative analyses of the Multi-SimLex data sets provided in §§5.3–5.4, we now benchmark a series of representation learning models on the new evaluation data. We evaluate standard static word embedding algorithms such as fastText (Bojanowski et al. 2017), as well as a range of more recent text encoders pretrained on language modeling such as multilingual BERT (Devlin et al. 2019). These experiments provide strong baseline scores on the new Multi-SimLex data sets and offer a first large-scale analysis of pretrained encoders on word-level semantic similarity across diverse languages. In addition, the experiments, now enabled by Multi-SimLex, aim to answer several important questions. (Q1) Is it viable to extract high-quality word-level representations from pretrained encoders receiving subword-level tokens as input? Are such representations competitive with standard static word-level embeddings? (Q2) What are the implications of monolingual pretraining versus (massively) multilingual pretraining for performance? (Q3) Do lightweight unsupervised post-processing techniques improve word representations consistently across different languages? (Q4) Can we effectively transfer available external lexical knowledge from resource-rich languages to resource-lean languages in order to learn word representations that distinguish between true similarity and conceptual relatedness (see the discussion in §2.3)?

### 7.1 Models in Comparison

#### Static Word Embeddings in Different Languages.

First, we evaluate a standard method for inducing non-contextualized (i.e., static) word embeddings across a plethora of different languages: fastText (ft) vectors (Bojanowski et al. 2017) are currently the most popular and robust choice given 1) the availability of pretrained vectors in a large number of languages (Grave et al. 2018) trained on large Common Crawl (CC) plus Wikipedia (Wiki) data, and 2) their superior performance across a range of NLP tasks (Mikolov et al. 2018). In fact, fastText is an extension of the standard word-level CBOW and skip-gram word2vec models (Mikolov et al. 2013) that takes into account subword-level information, namely, the constituent character n-grams of each word (Zhu, Vulić, and Korhonen 2019). For this reason, fastText is also more suited for modeling rare words and morphologically rich languages.9

We rely on 300-dimensional ft word vectors trained on CC+Wiki and available online for 157 languages.10 The word vectors for all languages are obtained by CBOW with position-weights, with character n-grams of length 5, a window of size 5, 10 negative examples, and 10 training epochs. We also probe another (older) collection of ft vectors, pretrained on full Wikipedia dumps of each language.11 The vectors are 300-dimensional, trained with the skip-gram objective for 5 epochs, with 5 negative examples, a window size set to 5, and relying on all character n-grams from length 3 to 6. Following prior work, we trim the vocabularies for all languages to the 200K most frequent words and compute representations for multiword expressions by averaging the vectors of their constituent words.
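
The vocabulary trimming and the treatment of multiword expressions can be sketched as below. The function names are illustrative, the input is assumed to be already sorted by descending word frequency, and returning `None` when any constituent is out of vocabulary is an assumption of this sketch rather than a documented detail.

```python
def trim_vocabulary(entries_by_frequency, limit=200_000):
    """Keep the `limit` most frequent words.

    `entries_by_frequency` is a list of (word, vector) tuples, assumed to be
    sorted from most to least frequent.
    """
    return dict(entries_by_frequency[:limit])


def embed_item(item, vocabulary):
    """Represent a word or multiword expression by averaging word vectors.

    Returns None if any constituent word is out of vocabulary (an assumption
    made for this sketch).
    """
    words = item.split()
    if any(word not in vocabulary for word in words):
        return None
    dim = len(vocabulary[words[0]])
    return [sum(vocabulary[word][k] for word in words) / len(words)
            for k in range(dim)]


# Toy usage with 2-dimensional vectors instead of 300-dimensional ones.
toy_vocabulary = trim_vocabulary(
    [("ice", [1.0, 0.0]), ("cream", [0.0, 1.0]), ("rare", [5.0, 5.0])],
    limit=2,
)
ice_cream = embed_item("ice cream", toy_vocabulary)
```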

#### Unsupervised Post-Processing.

Further, we consider a variety of unsupervised post-processing steps that can be applied post-training on top of any pretrained input word embedding space without any external lexical semantic resource. So far, the usefulness of such methods has been verified only on the English language through benchmarks for lexical semantics and sentence-level tasks (Mu, Bhat, and Viswanath 2018). In this article, we assess whether unsupervised post-processing is beneficial also in other languages. To this end, we apply the following post hoc transformations on the initial word embeddings:

1. Mean centering (mc) is applied after unit length normalization to ensure that all vectors have a zero mean, and is commonly applied in data mining and analysis (Bro and Smilde 2003; van den Berg et al. 2006).

2. All-but-the-top (abtt) (Mu, Bhat, and Viswanath 2018; Tang, Mousavi, and de Sa 2019) eliminates the common mean vector and a few top dominating directions (according to principal component analysis) from the input distributional word vectors, since they do not contribute toward distinguishing the actual semantic meaning of different words. The method contains a single (tunable) hyperparameter $dd_A$, which denotes the number of the dominating directions to remove from the initial representations. Previous work has verified the usefulness of abtt in several English lexical semantic tasks such as semantic similarity, word analogies, and concept categorization, as well as in sentence-level text classification tasks (Mu, Bhat, and Viswanath 2018).

3. uncovec (Artetxe et al. 2018) adjusts the similarity order of an arbitrary input word embedding space, and can emphasize either syntactic or semantic information in the transformed vectors. In short, it transforms the input space $X$ into an adjusted space $XW_\alpha$ through a linear map $W_\alpha$ controlled by a single hyperparameter $\alpha$. The $n$th-order similarity transformation of the input word vector space $X$ (for which $n = 1$) can be obtained as $M_n(X) = M_1(XW_{(n-1)/2})$, with $W_\alpha = Q\Gamma^\alpha$, where $Q$ and $\Gamma$ are the matrices obtained via eigendecomposition of $X^TX = Q\Gamma Q^T$. $\Gamma$ is a diagonal matrix containing eigenvalues of $X^TX$; $Q$ is an orthogonal matrix with eigenvectors of $X^TX$ as columns. While the motivation for the uncovec methods does originate from adjusting discrete similarity orders, note that $\alpha$ is in fact a continuous real-valued hyperparameter that can be carefully tuned. For more technical details we refer the reader to the original work of Artetxe et al. (2018).

As mentioned, all post-processing methods can be seen as unsupervised retrofitting methods that, given an arbitrary input vector space $X$, produce a perturbed/transformed output vector space $X'$, but, unlike common retrofitting methods (Faruqui et al. 2015; Mrkšić et al. 2017), the perturbation is completely unsupervised (i.e., self-contained) and does not inject any external (semantic similarity-oriented) knowledge into the vector space. Note that different perturbations can also be stacked: For example, we can apply uncovec and then use abtt on top of the output uncovec vectors. When using uncovec and abtt we always length-normalize and mean-center the data first (i.e., we apply the simple mc normalization). Finally, we tune the two hyperparameters $dd_A$ (for abtt) and $\alpha$ (for uncovec) on the English Multi-SimLex and use the same values on the data sets of all other languages; we report results with $dd_A = 3$ or $dd_A = 10$, and $\alpha = −0.3$.
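
A minimal NumPy sketch of the three transformations is given below, assuming that the rows of the input matrix are word vectors. The function names, the use of SVD to obtain the principal directions, and the clipping of near-zero eigenvalues are choices made for this illustration and are not taken from the original implementations.

```python
import numpy as np


def mean_center(X):
    """mc: unit-length normalization followed by dimension-wise mean centering."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X - X.mean(axis=0, keepdims=True)


def abtt(X, dd_a=3):
    """All-but-the-top: remove the mean and the top `dd_a` principal directions."""
    Xc = mean_center(X)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:dd_a]
    return Xc - Xc @ top.T @ top


def uncovec(X, alpha=-0.3):
    """Similarity-order adjustment X W_alpha with W_alpha = Q Gamma^alpha."""
    Xc = mean_center(X)
    eigenvalues, Q = np.linalg.eigh(Xc.T @ Xc)
    # Guard against raising (numerically) zero eigenvalues to a negative power.
    eigenvalues = np.clip(eigenvalues, 1e-12, None)
    return Xc @ (Q * eigenvalues ** alpha)


# Toy usage: 50 random "word vectors" with 5 dimensions.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5)) + 3.0
X_abtt = abtt(X_toy, dd_a=2)
X_unc = uncovec(X_toy, alpha=-0.3)
```

In this sketch each function re-applies the mc step to its input, so stacking (for example, `abtt(uncovec(X))`) also re-normalizes the intermediate vectors; whether the original pipeline does so is not specified here.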

#### Contextualized Word Embeddings.

We also evaluate the capacity of unsupervised pretraining architectures based on language modeling objectives to reason over lexical semantic similarity. To the best of our knowledge, our article is the first study performing such analyses. State-of-the-art models such as bert (Devlin et al. 2019), xlm (Conneau and Lample 2019), or RoBERTa (Liu et al. 2019b) are typically very deep neural networks based on the Transformer architecture (Vaswani et al. 2017). They receive subword-level tokens as inputs (such as WordPieces (Schuster and Nakajima 2012)) to tackle data sparsity. As output, they return contextualized embeddings, that is, dynamic representations for words in context.

To represent words or multiword expressions through a pretrained model, we follow prior work (Liu et al. 2019a) and compute an input item’s representation by 1) feeding it to a pretrained model in isolation; then 2) averaging the H hidden representations (bottom-to-top) for each of the item’s constituent subwords; and then finally 3) averaging the resulting subword representations to produce the final d-dimensional representation, where d is the embedding and hidden-layer dimensionality (e.g., d = 768 with bert). We opt for this approach due to its proven viability and simplicity (Liu et al. 2019a), as it does not require any additional corpora to condition the induction of contextualized embeddings.12 Other ways to extract the representations from pretrained models (Aldarmaki and Diab 2019; Wu et al. 2019; Cao, Kitaev, and Klein 2020) are beyond the scope of this work, and we will experiment with them in the future.

In other words, we treat each pretrained encoder enc as a black-box function to encode a single word or a multiword expression x in each language into a d-dimensional contextualized representation $\mathbf{x}_{enc} = enc(x) \in \mathbb{R}^d$ (e.g., d = 768 with bert). As multilingual pretrained encoders, we experiment with the multilingual bert model (m-bert) (Devlin et al. 2019) and xlm (Conneau and Lample 2019). m-bert is pretrained on monolingual Wikipedia corpora of 102 languages (comprising all Multi-SimLex languages) with a 12-layer Transformer network, and yields 768-dimensional representations. Because the concept pairs in Multi-SimLex are lowercased, we use the uncased version of m-bert.13 m-bert covers all Multi-SimLex languages, and its evident ability to perform crosslingual transfer (Pires, Schlinger, and Garrette 2019; Wu and Dredze 2019; Wang et al. 2020) also makes it a convenient baseline model for crosslingual experiments later in §8. The second multilingual model we consider, xlm-100,14 is pretrained on Wikipedia dumps of 100 languages, and encodes each concept into a 1,280-dimensional representation. In contrast to m-bert, xlm-100 drops the next-sentence prediction objective and adds a crosslingual masked language modeling objective. For both encoders, the representations of each concept are computed as averages over the first H = 4 hidden layers in all experiments.15
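
The two averaging steps can be sketched independently of any particular encoder. The sketch below assumes that the hidden states of one input item have already been extracted as nested lists indexed by layer (bottom first) and subword, with special tokens removed; the function name and this layout are illustrative, and whether the embedding layer counts among the first H layers is left open.

```python
def pool_item(hidden_states, num_layers=4):
    """Average over the first `num_layers` layers, then over subwords.

    hidden_states[l][s] is the d-dimensional vector of subword s at layer l,
    with layers ordered bottom-to-top (assumed layout).
    """
    layers = hidden_states[:num_layers]
    num_subwords = len(layers[0])
    dim = len(layers[0][0])
    # Step 2: one vector per subword, averaged over the selected layers.
    per_subword = [
        [sum(layer[s][k] for layer in layers) / len(layers) for k in range(dim)]
        for s in range(num_subwords)
    ]
    # Step 3: average the subword vectors into a single item vector.
    return [sum(vec[k] for vec in per_subword) / num_subwords
            for k in range(dim)]


# Toy usage: 3 layers, 2 subwords, d = 2; only the first 2 layers are pooled.
toy_states = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[1.0, 2.0], [3.0, 2.0]],
    [[100.0, 100.0], [100.0, 100.0]],
]
item_vector = pool_item(toy_states, num_layers=2)
```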

Besides m-bert and xlm, covering multiple languages, we also analyze the performance of “language-specific” bert and xlm models for the languages where they are available: Finnish, Spanish, English, Mandarin Chinese, and French. The main goal of this comparison is to study the differences in performance between multilingual “one-size-fits-all” encoders and language-specific encoders. For all experiments, we rely on the pretrained models released in the Transformers repository (Wolf et al. 2019).16

Unsupervised post-processing steps devised for static word embeddings (i.e., mean-centering, abtt, uncovec) can also be applied on top of contextualized embeddings if we predefine a vocabulary of word types V that will be represented in a word vector space X. We construct such V for each language as the intersection of word types covered by the corresponding CC+Wiki fastText vectors and the (single-word or multi-word) expressions appearing in the corresponding Multi-SimLex data set.

Finally, note that it is not feasible to evaluate a full range of available pretrained encoders within the scope of this work. Our main intention is to provide the first set of baseline results on Multi-SimLex by benchmarking a sample of the most popular encoders, at the same time also investigating other important questions such as performance of static versus contextualized word embeddings, or multilingual versus language-specific pretraining. Another purpose of the experiments is to outline the wide potential and applicability of the Multi-SimLex data sets for multilingual and crosslingual representation learning evaluation.

### 7.2 Results and Discussion

The results we report are Spearman’s ρ coefficients of the correlation between the ranks derived from the scores of the evaluated models and the human scores provided in each Multi-SimLex data set. The main results with static and contextualized word vectors for all test languages are summarized in Table 12. The scores reveal several interesting patterns, and also pinpoint the main challenges for future work.
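
The evaluation protocol can be sketched as follows. The sketch assumes that model scores are cosine similarities between the vectors of the two concepts and that a pair is skipped when either concept is out of vocabulary; both are assumptions of this illustration, and the counter here tallies skipped pairs rather than individual OOV concepts. Function names are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Ranks starting at 1, with ties receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def evaluate(dataset, vectors):
    """Return (rho, skipped) for a list of (word_1, word_2, human_score)."""
    model_scores, human_scores, skipped = [], [], 0
    for word_1, word_2, gold in dataset:
        if word_1 not in vectors or word_2 not in vectors:
            skipped += 1  # pair excluded because of an OOV concept
            continue
        model_scores.append(cosine(vectors[word_1], vectors[word_2]))
        human_scores.append(gold)
    return spearman(model_scores, human_scores), skipped


# Toy usage with made-up vectors and scores.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_data = [("a", "b", 5.5), ("a", "d", 3.0), ("a", "c", 0.5), ("a", "zzz", 2.0)]
rho, skipped = evaluate(toy_data, toy_vectors)
```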

Table 12
A summary of results (Spearman’s ρ correlation scores) on the full monolingual Multi-SimLex data sets for 12 languages. We benchmark fastText word embeddings trained on two different corpora (CC+Wiki and only Wiki) as well as the multilingual m-bert model (see §7.1). Results with the initial word vectors are reported (i.e., without any unsupervised post-processing), as well as with different unsupervised post-processing methods, described in §7.1. The language codes are provided in Table 1. The numbers in the parentheses (gray rows) refer to the number of OOV concepts excluded from the computation. The highest scores for each language and per model are in bold.

| Languages: | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fastText (CC+Wiki) | (272) | (151) | (12) | (319) | (347) | (43) | (66) | (326) | (291) | (46) | (222) | (–) |
| (1) ft:init | .534 | .363 | .528 | .469 | .607 | .578 | .450 | .405 | .422 | .511 | .439 | – |
| (2) ft:+mc | .539 | **.393** | .535 | .473 | .621 | .584 | .480 | .412 | .424 | .516 | .469 | – |
| (3) ft:+abtt (−3) | .557 | .389 | .536 | **.495** | .642 | .610 | .501 | .427 | .459 | .523 | **.473** | – |
| (4) ft:+abtt (−10) | **.583** | .384 | **.551** | .476 | .651 | **.623** | .503 | .455 | **.500** | **.542** | .462 | – |
| (5) ft:+uncovec | .572 | .387 | .550 | .465 | .642 | .595 | .501 | .435 | .437 | .525 | .437 | – |
| (1)+(2)+(5)+(3) | .574 | .386 | .549 | .476 | **.655** | .604 | .503 | .442 | .452 | .528 | .432 | – |
| (1)+(2)+(5)+(4) | .577 | .376 | .542 | .455 | .652 | .613 | **.510** | **.466** | .491 | .540 | .424 | – |
| fastText (Wiki) | (429) | (282) | (6) | (343) | (345) | (73) | (62) | (354) | (343) | (57) | (379) | (677) |
| (1) ft:init | .315 | .318 | .436 | .400 | .575 | .444 | .428 | .370 | .359 | .432 | .332 | .376 |
| (2) ft:+mc | .373 | .337 | .445 | **.404** | .583 | .463 | .447 | .383 | .378 | .447 | .373 | .427 |
| (3) ft:+abtt (−3) | .459 | **.343** | .453 | **.404** | **.584** | .487 | .447 | .387 | .394 | .456 | **.423** | **.429** |
| (4) ft:+abtt (−10) | .496 | .323 | .460 | .385 | .581 | .494 | **.460** | **.401** | **.400** | **.477** | .406 | .399 |
| (5) ft:+uncovec | .518 | .328 | .469 | .375 | .568 | .483 | .449 | .389 | .387 | .469 | .386 | .394 |
| (1)+(2)+(5)+(3) | **.526** | .323 | .470 | .369 | .564 | **.495** | .448 | .392 | .392 | .473 | .388 | .388 |
| (1)+(2)+(5)+(4) | **.526** | .307 | **.471** | .355 | .548 | **.495** | .450 | .394 | .394 | .476 | .382 | .396 |
| m-bert | (0) | (0) | (0) | (0) | (0) | (0) | (0) | (0) | (0) | (0) | (0) | (0) |
| (1) m-bert:init | .408 | .033 | .138 | .085 | .162 | .115 | .104 | .069 | .085 | .145 | .125 | .404 |
| (2) m-bert:+mc | .458 | .044 | .256 | .122 | .173 | .183 | .128 | .097 | .123 | .203 | .128 | .469 |
| (3) m-bert:+abtt (−3) | **.487** | .056 | .321 | .137 | .200 | .287 | .144 | .126 | .197 | .299 | .135 | **.492** |
| (4) m-bert:+abtt (−10) | .456 | .056 | **.329** | .122 | .164 | **.306** | .121 | .126 | .183 | **.315** | .136 | .467 |
| (5) m-bert:+uncovec | .464 | .063 | .317 | **.144** | **.213** | .288 | **.164** | **.144** | .198 | .287 | .143 | .464 |
| (1)+(2)+(5)+(3) | .464 | .083 | .326 | .130 | .201 | .304 | .149 | .122 | **.199** | .295 | **.148** | .456 |
| (1)+(2)+(5)+(4) | .444 | **.086** | .326 | .112 | .179 | .305 | .135 | .127 | .187 | .285 | .119 | .447 |
| AMIAA (Overall) | .764 | .742 | .794 | .715 | .760 | .812 | .699 | .723 | .667 | .703 | .710 | .792 |

#### State-of-the-Art Representation Models.

The absolute scores of CC+Wiki ft, Wiki ft, and m-bert are not directly comparable, because these models have different coverage. In particular, Multi-SimLex contains some out-of-vocabulary (OOV) words whose static ft embeddings are not available.17 On the other hand, m-bert has perfect coverage. A general comparison between CC+Wiki and Wiki ft vectors, however, supports the intuition that larger corpora (such as CC+Wiki) yield higher correlations. Another finding is that a single massively multilingual model such as m-bert cannot produce semantically rich word-level representations. Whether this actually happens because the training objective is different—or because the need to represent 100+ languages reduces its language-specific capacity—is investigated further below.

The overall results also clearly indicate that (i) there are differences in performance across different monolingual Multi-SimLex data sets, and (ii) unsupervised post-processing is universally useful, and can lead to huge improvements in correlation scores for many languages. In what follows, we also delve deeper into these analyses.

#### Impact of Unsupervised Post-Processing.

First, the results in Table 12 suggest that applying dimension-wise mean centering to the initial vector spaces has a positive impact on word similarity scores in all test languages and for all models, both static and contextualized (see the +mc rows in Table 12). Mimno and Thompson (2017) show that distributional word vectors have a tendency toward narrow clusters in the vector space (i.e., they occupy a narrow cone in the vector space and are therefore anisotropic (Mu, Bhat, and Viswanath 2018; Ethayarajh 2019)), and are prone to the undesired effect of hubness (Radovanović, Nanopoulos, and Ivanović 2010; Lazaridou, Dinu, and Baroni 2015).18 Applying dimension-wise mean centering has the effect of spreading the vectors across the hyperplane and mitigating the hubness issue, which consequently improves word-level similarity, as the reported results show. Previous work has already validated the importance of mean centering for clustering-based tasks (Suzuki et al. 2013), bilingual lexicon induction with crosslingual word embeddings (Artetxe, Labaka, and Agirre 2018a; Zhang et al. 2019; Vulić et al. 2019), and for modeling lexical semantic change (Schlechtweg et al. 2019). However, to the best of our knowledge, the results summarized in Table 12 are the first evidence that also confirms its importance for semantic similarity in a wide array of languages. In sum, as a general rule of thumb, we suggest always mean-centering representations for semantic tasks.

The results further indicate that additional post-processing methods such as abtt and uncovec on top of mean-centered vector spaces can lead to further gains in most languages. The gains are visible even for languages that start from high correlation scores: for instance, the score for cmn increases from 0.534 to 0.583 with CC+Wiki ft, from 0.315 to 0.526 with Wiki ft, and from 0.408 to 0.487 with m-bert. Similarly, for rus with CC+Wiki ft we can improve from 0.422 to 0.500, and for fra the scores improve from 0.578 to 0.613. There are additional similar cases reported in Table 12.

Overall, the unsupervised post-processing techniques seem universally useful across languages, but their efficacy and relative performance do vary across different languages. Note that we have not carefully fine-tuned the hyperparameters of the evaluated post-processing methods, so additional small improvements can be expected for some languages. The main finding, however, is that these post-processing techniques remain effective for semantic similarity computations beyond English, and are truly language independent. For instance, removing dominant latent (PCA-based) components from word vectors emphasizes semantic differences between different concepts, as only shared non-informative latent semantic knowledge is removed from the representations.

In summary, pretrained word embeddings do contain more information pertaining to semantic similarity than revealed in the initial vectors. This way, we have corroborated the hypotheses from prior work (Mu, Bhat, and Viswanath 2018; Artetxe et al. 2018) that were not previously empirically verified on other languages due to a shortage of evaluation data; this gap has now been filled with the introduction of the Multi-SimLex data sets. In all follow-up experiments, we always explicitly denote which post-processing configuration is used in evaluation.

#### POS-Specific Subsets.

We present the results for subsets of word pairs grouped by POS class in Table 13. Prior work based on English data showed that representations for nouns are typically of higher quality than those for the other POS classes (Schwartz, Reichart, and Rappoport 2015, 2016; Vulić et al. 2017b). We observe a similar trend in other languages as well. This pattern is consistent across different representation models and can be attributed to several reasons. First, verb representations need to express a rich range of syntactic and semantic behaviors rather than purely referential features (Gruber 1976; Levin 1993; Kipper et al. 2008). Second, low correlation scores on the adjective and adverb subsets in some languages (e.g., pol, cym, swa) might be due to their low frequency in monolingual texts, which yields unreliable representations. In general, the variance in performance across different word classes warrants further research in class-specific representation learning (Baker, Reichart, and Korhonen 2014; Vulić et al. 2017b). The scores further attest to the usefulness of unsupervised post-processing, as almost all class-specific correlation scores are improved by applying mean-centering and abtt. Finally, the results for m-bert and xlm-100 in Table 13 further confirm that massively multilingual pretraining cannot yield reasonable semantic representations for many languages: In fact, for some classes they display no correlation with human ratings at all.

Table 13
Spearman’s ρ correlation scores over the four POS classes represented in Multi-SimLex data sets. In addition to the word vectors considered earlier in Table 12, we also report scores for another contextualized model, xlm-100. The numbers in parentheses refer to the total number of POS-class pairs in the original eng data set and, consequently, in all other monolingual data sets. For the comparison with the average human performance, we refer the reader to the APIAA and AMIAA scores provided previously in Table 4 and Table 5, respectively.
| Configuration | POS class (pairs) | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fastText (CC+Wiki) ft:init | nouns (1,051) | .561 | .497 | .592 | .627 | .709 | .641 | .560 | .538 | .526 | .583 | .544 | .426 |
| | verbs (469) | .511 | .265 | .408 | .379 | .527 | .551 | .458 | .384 | .464 | .499 | .391 | .252 |
| | adj (245) | .448 | .338 | .564 | .401 | .546 | .616 | .467 | .284 | .349 | .401 | .344 | .288 |
| | adv (123) | .622 | .187 | .482 | .378 | .547 | .648 | .491 | .266 | .514 | .423 | .172 | .103 |
| fastText (CC+Wiki) ft:+abtt (−3) | nouns | .601 | .512 | .599 | .621 | .730 | .653 | .592 | .585 | .578 | .605 | .553 | .431 |
| | verbs | .583 | .305 | .454 | .379 | .575 | .602 | .520 | .390 | .475 | .526 | .381 | .314 |
| | adj | .526 | .372 | .601 | .427 | .592 | .646 | .483 | .316 | .409 | .411 | .402 | .312 |
| | adv | .675 | .150 | .504 | .397 | .546 | .695 | .491 | .230 | .495 | .416 | .223 | .081 |
| m-bert m-bert:+abtt (−3) | nouns | .517 | .091 | .446 | .191 | .210 | .364 | .191 | .188 | .266 | .418 | .142 | .539 |
| | verbs | .511 | .005 | .200 | .039 | .077 | .248 | .038 | .107 | .181 | .266 | .091 | .503 |
| | adj | .227 | .050 | .226 | .028 | .128 | .193 | .044 | .046 | .002 | .099 | .192 | .267 |
| | adv | .282 | .012 | .343 | .112 | .173 | .390 | .326 | .036 | .046 | .207 | .161 | .049 |
| xlm-100 xlm:+abtt (−3) | all | .498 | .096 | .270 | .118 | .203 | .234 | .195 | .106 | .170 | .289 | .130 | .506 |
| | nouns | .551 | .132 | .381 | .193 | .238 | .234 | .242 | .184 | .292 | .378 | .165 | .559 |
| | verbs | .544 | .038 | .169 | .006 | .190 | .132 | .136 | .073 | .095 | .243 | .047 | .570 |
| | adj | .356 | .140 | .256 | .081 | .179 | .185 | .150 | .046 | .022 | .100 | .220 | .291 |
| | adv | .284 | .017 | .040 | .086 | .043 | .027 | .221 | .014 | .022 | .315 | .095 | .156 |

#### Differences across Languages.

Naturally, the results from Tables 12 and 13 also reveal that there is variation in performance of both static word embeddings and pretrained encoders across different languages. Among other causes, the lowest absolute scores with ft are reported for languages with the fewest resources available to train monolingual word embeddings, such as Kiswahili, Welsh, and Estonian. The low performance on Welsh is especially indicative: Figure 1 shows that the ratings in the Welsh data set match up very well with the English ratings, but we cannot achieve the same level of correlation in Welsh with Welsh ft word embeddings. The difference in performance between two closely related languages, est (low-resource) and fin (high-resource), provides additional evidence in this respect.

The highest reported scores with m-bert and xlm-100 are obtained for Mandarin Chinese and Yue Chinese: This effectively points to the weaknesses of massively multilingual training with a joint subword vocabulary spanning 102 and 100 languages, respectively. Because of the difference in scripts, “language-specific” subwords for yue and cmn do not need to be shared across a vast number of languages and the quality of their representation remains unscathed. This effectively means that m-bert’s subword vocabulary contains plenty of cmn-specific and yue-specific subwords that are exploited by the encoder when producing m-bert-based representations. Simultaneously, higher scores with m-bert (and xlm in Table 13) are reported for resource-rich languages such as French, Spanish, and English, which are better represented in m-bert’s training data, while we observe large performance losses for lower-resource languages: These artifacts of massively multilingual training with m-bert and xlm, and the lower performance in low-resource languages, were further validated recently (Lauscher et al. 2020; Wu and Dredze 2020). We also observe lower absolute scores (and a larger number of OOVs) for languages with very rich and productive morphological systems such as the two Slavic languages (Polish and Russian) and Finnish. Because Polish and Russian are known to have large Wikipedias and Common Crawl data (Conneau et al. 2019) (e.g., their Wikipedias are in the top 10 largest Wikipedias worldwide), the problem with coverage can be attributed exactly to the proliferation of morphological forms in those languages.

Finally, although Table 12 does reveal that unsupervised post-processing is useful for all languages, it also demonstrates that peak scores are achieved with different post-processing configurations. This finding suggests that a more careful language-specific fine-tuning is indeed needed to refine word embeddings toward semantic similarity. We plan to inspect the relationship between post-processing techniques and linguistic properties in more depth in future work.
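For readers unfamiliar with the post-processing steps discussed here, the following is a minimal sketch of mean centering followed by removal of the top principal components, which is how the conclusion of this article glosses abtt ("elimination of top principal components"). The function names are hypothetical, and reading a configuration such as abtt (−3) as `k = 3` is our assumption; the exact configurations evaluated are those reported in the tables.

```python
import numpy as np

def mean_center(X):
    """Subtract the mean vector from every row of X (n_words x dim)."""
    return X - X.mean(axis=0, keepdims=True)

def abtt(X, k):
    """Mean-center X, then remove each vector's projection onto the
    top-k principal directions of the centered matrix."""
    Xc = mean_center(X)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]                      # shape (k, dim)
    return Xc - (Xc @ top.T) @ top

# Illustrative usage on random data (not real embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
X_post = abtt(X, k=3)
```

Because the number of removed components is a free parameter, it could be tuned per language, which is in line with the observation above that the best configuration differs across languages.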

#### Multilingual vs. Language-Specific Contextualized Embeddings.

Recent work has shown that—despite the usefulness of massively multilingual models such as m-bert and xlm-100 for zero-shot crosslingual transfer (Pires, Schlinger, and Garrette 2019; Wu and Dredze 2019)—stronger results in downstream tasks for a particular language can be achieved by pretraining language-specific models on language-specific data.

In this experiment, motivated by the low results of m-bert and xlm-100 (see again Table 13), we assess whether monolingual pretrained encoders can produce higher-quality word-level representations than multilingual models. Therefore, we evaluate language-specific bert and xlm models for a subset of the Multi-SimLex languages for which such models are currently available: Finnish (Virtanen et al. 2019) (bert-base architecture, uncased), French (Le et al. 2019) (the FlauBERT model based on xlm), English (bert-base, uncased), Mandarin Chinese (bert-base) (Devlin et al. 2019), and Spanish (bert-base, uncased). In addition, we also evaluate a series of pretrained encoders available for English: (i) bert-base, bert-large, and bert-large with whole word masking (wwm) from the original work on BERT (Devlin et al. 2019), (ii) monolingual “English-specific” xlm (Conneau and Lample 2019), and (iii) two models that use parameter reduction techniques to build more compact encoders: albert-b uses a configuration similar to bert-base, while albert-l is similar to bert-large, but with an 18× reduction in the number of parameters (Lan et al. 2020).19

From the results in Figure 5, it is clear that monolingual pretrained encoders yield much more reliable word-level representations. The gains are visible even for languages such as cmn, which showed reasonable performance with m-bert, and are substantial on all test languages. This further confirms the validity of language-specific pretraining in lieu of multilingual training, if sufficient monolingual data are available. Additional comparisons along these axes are available in related work (Vulić et al. 2020). Moreover, a comparison of pretrained English encoders in Figure 5b largely follows the intuition: The larger bert-large model yields slight improvements over bert-base, and we can improve a bit more by relying on word-level (i.e., lexical-level) masking. Finally, light-weight albert model variants are quite competitive with the original bert models, with only modest drops reported, and albert-l again outperforms albert-b. Overall, it is interesting to note that the scores obtained with monolingual pretrained encoders are on a par with or even outperform static ft word embeddings: This is a very intriguing finding per se as it shows that such subword-level models trained on large corpora can implicitly capture rich lexical semantic knowledge.

Figure 5

(a) Monolingual vs. multilingual. A performance comparison between monolingual pretrained language encoders and massively multilingual encoders. For four languages (cmn, eng, fin, spa), we report the scores with monolingual uncased bert-base architectures and multilingual uncased m-bert model, while for fra we report the results of the multilingual xlm-100 architecture and a monolingual French FlauBERT model (Le et al. 2019), which is based on the same architecture as xlm-100. (b) Pretrained ENG encoders. A comparison of various pretrained encoders available for English. All these models are post-processed via abtt (−3).

#### Similarity-Specialized Word Embeddings.

Conflating distinct lexico-semantic relations is a well-known property of distributional representations (Turney and Pantel 2010; Melamud et al. 2016). Semantic specialization fine-tunes distributional spaces to emphasize a particular lexico-semantic relation in the transformed space by injecting external lexical knowledge (Glavaš, Ponti, and Vulić 2019). Explicitly discerning between true semantic similarity (as captured in Multi-SimLex) and broad conceptual relatedness benefits a number of tasks, as discussed in §2.1.20 Because most languages lack dedicated lexical resources, however, one viable strategy to steer monolingual word vector spaces to emphasize semantic similarity is through crosslingual transfer of lexical knowledge, usually through a shared crosslingual word vector space (Ruder, Vulić, and Søgaard 2019). Therefore, we evaluate the effectiveness of specialization transfer methods using Multi-SimLex as our multilingual test bed.

We evaluate a current state-of-the-art crosslingual specialization transfer method with minimal requirements, put forth recently by Ponti et al. (2019c).21 In a nutshell, their li-postspec method is a multistep procedure that operates as follows. First, the knowledge about semantic similarity is extracted from WordNet in the form of triplets, that is, linguistic constraints (w1, w2, r), where w1 and w2 are two concepts, and r is a relation between them obtained from WordNet (e.g., synonymy or antonymy). The goal is to “attract” synonyms closer to each other in the transformed vector space as they reflect true semantic similarity, and “repel” antonyms further apart. In the second step, the linguistic constraints are translated from English to the target language via a shared crosslingual word vector space. To this end, following Ponti et al. (2019c) we rely on crosslingual word embeddings (CLWEs) (Joulin et al. 2018) available online, which are based on Wiki ft vectors.22 Following that, a constraint refinement step is applied in the target language which aims to eliminate the noise inserted during the translation process. This is done by training a relation classification tool: It is trained again on the English linguistic constraints and then used on the translated target language constraints, where the transfer is again enabled via a shared crosslingual word vector space.23 Finally, a state-of-the-art monolingual specialization procedure from Ponti et al. (2018b) injects the (now target language) linguistic constraints into the target language distributional space.
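The constraint translation step can be pictured with a small sketch. It assumes, purely for illustration, that each English word in a constraint is replaced by its nearest target-language neighbor (by cosine similarity) in the shared crosslingual space, and that constraints whose two words collapse to the same translation are dropped. The function names and the toy vectors are hypothetical; the actual li-postspec procedure is the one described by Ponti et al. (2019c).

```python
import numpy as np

def nearest_target_word(vec, tgt_words, tgt_matrix):
    """Return the target word whose vector has the highest cosine with vec."""
    norms = np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(vec)
    sims = (tgt_matrix @ vec) / norms
    return tgt_words[int(np.argmax(sims))]

def translate_constraints(constraints, src_vectors, tgt_words, tgt_matrix):
    """constraints: list of (w1, w2, relation) triplets in the source language."""
    translated = []
    for w1, w2, rel in constraints:
        # Assumption: constraints with an out-of-vocabulary word are skipped.
        if w1 not in src_vectors or w2 not in src_vectors:
            continue
        t1 = nearest_target_word(src_vectors[w1], tgt_words, tgt_matrix)
        t2 = nearest_target_word(src_vectors[w2], tgt_words, tgt_matrix)
        # Assumption: a constraint relating a word to itself is uninformative.
        if t1 != t2:
            translated.append((t1, t2, rel))
    return translated

# Toy shared space: English source vectors and French target vectors.
src_vectors = {
    "big": np.array([1.0, 0.0]),
    "large": np.array([0.7, 0.3]),
    "small": np.array([0.0, 1.0]),
}
tgt_words = ["grand", "vaste", "petit"]
tgt_matrix = np.array([[1.0, 0.0], [0.8, 0.3], [0.0, 1.0]])
constraints = [("big", "large", "synonymy"), ("big", "small", "antonymy")]
translated = translate_constraints(constraints, src_vectors, tgt_words, tgt_matrix)
```

Nearest-neighbor translation of this kind is noisy, which is what motivates the subsequent constraint refinement step described above.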

The scores are summarized in Table 14. Semantic specialization with li-postspec leads to substantial improvements in correlation scores for the majority of the target languages, demonstrating the importance of external semantic similarity knowledge for semantic similarity reasoning. However, we also observe deteriorated performance for the three target languages that can be considered the lowest-resource ones in our set: cym, swa, yue. We hypothesize that this occurs due to the inferior quality of the underlying monolingual Wikipedia word embeddings, which generates a chain of error accumulations. In particular, poor distributional word estimates compromise the alignment of the embedding spaces, which in turn results in increased translation noise, and reduced refinement ability of the relation classifier. On a high level, this “poor get poorer” observation again points to the fact that one of the primary causes of low performance of low-resource languages in semantic tasks is the sheer lack of even unlabeled data for distributional training. On the other hand, as we see from Table 14, typological dissimilarity between the source and the target does not deteriorate the effectiveness of semantic specialization. In fact, li-postspec does yield substantial gains also for the typologically distant targets such as heb, cmn, and est. The critical problem indeed seems to be insufficient raw data for monolingual distributional training.

Table 14
The impact of vector space specialization for semantic similarity. The scores are reported using the current state-of-the-art specialization transfer li-postspec method of Ponti et al. (2019c), relying on English as a resource-rich source language and the external lexical semantic knowledge from the English WordNet.
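| | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fastText (Wiki) | (429) | (282) | (6) | (343) | (345) | (73) | (62) | (354) | (343) | (57) | (379) | (677) |
| ft:init | .315 | .318 | – | .400 | .575 | .444 | .428 | .370 | .359 | .432 | .332 | .376 |
| li-postspec | .584 | .204 | – | .515 | .619 | .601 | .510 | .531 | .547 | .635 | .238 | .267 |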

## 8 Crosslingual Evaluation

Similar to monolingual evaluation in §7, we now evaluate several state-of-the-art crosslingual representation models on the suite of 66 automatically constructed crosslingual Multi-SimLex data sets. Again, note that evaluating a full range of crosslingual models available in the rich prior work on crosslingual representation learning is well beyond the scope of this article. We therefore focus our crosslingual analyses on several well-established and indicative state-of-the-art crosslingual models, again spanning both static and contextualized crosslingual word embeddings.
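The figure of 66 follows from pairing each of the 12 languages with every other one: 12 × 11 / 2 = 66. A quick check, using the language codes that appear in the table headers of this article:

```python
from itertools import combinations

# The 12 Multi-SimLex language codes, as listed in the table headers.
LANGS = ["cmn", "cym", "eng", "est", "fin", "fra",
         "heb", "pol", "rus", "spa", "swa", "yue"]

# Every unordered pair of distinct languages gives one crosslingual data set.
pairs = list(combinations(LANGS, 2))
print(len(pairs))
```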

### 8.1 Models in Comparison

#### Static Word Embeddings.

We rely on a state-of-the-art mapping-based method for the induction of CLWEs: vecmap (Artetxe, Labaka, and Agirre 2018b). The core idea behind such mapping-based or projection-based approaches is to learn a post hoc alignment of independently trained monolingual word embeddings (Ruder, Vulić, and Søgaard 2019). Such methods have gained popularity because of their conceptual simplicity and competitive performance coupled with reduced bilingual supervision requirements: They support CLWE induction with as few as several thousand word translation pairs as the bilingual supervision (Mikolov, Le, and Sutskever 2013; Xing et al. 2015; Upadhyay et al. 2016; Ruder, Søgaard, and Vulić 2019). More recent work has shown that CLWEs can be induced with even weaker supervision from small dictionaries spanning several hundred pairs (Vulić and Korhonen 2016; Vulić et al. 2019), identical strings (Smith et al. 2017), or even only shared numerals (Artetxe, Labaka, and Agirre 2017). In the extreme, fully unsupervised projection-based CLWEs extract such seed bilingual lexicons from scratch on the basis of monolingual data only (Conneau et al. 2018a; Artetxe, Labaka, and Agirre 2018b; Hoshen and Wolf 2018; Alvarez-Melis and Jaakkola 2018; Chen and Cardie 2018; Mohiuddin and Joty 2019, inter alia).
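To make the idea of a post hoc alignment concrete, the sketch below solves the orthogonal Procrustes problem, a standard way to learn a linear map from a seed dictionary. It is a simplified stand-in and not the vecmap implementation; the function name and the synthetic data are assumptions made for this example.

```python
import numpy as np

def procrustes(X_src, Y_tgt):
    """Orthogonal W minimizing ||X_src @ W - Y_tgt||_F.

    Row i of X_src and row i of Y_tgt hold the vectors of the i-th
    seed translation pair (source word, target word).
    """
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Synthetic check: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                  # 20 seed pairs, 5 dimensions
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # a random orthogonal matrix
Y = X @ Q
W = procrustes(X, Y)
```

On real embeddings the two spaces are only approximately related by a rotation, so the quality of the mapping depends on how clean and how large the seed dictionary is.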

Recent empirical studies (Glavaš et al. 2019; Vulić et al. 2019; Doval et al. 2019) have compared a variety of unsupervised and weakly supervised mapping-based CLWE methods, and vecmap emerged as the most robust and very competitive choice. Therefore, we focus on 1) its fully unsupervised variant (unsuper) in our comparisons. For several language pairs, we also report scores with two other vecmap model variants: 2) a supervised variant that learns a mapping based on an available seed lexicon (super), and 3) a supervised variant with self-learning (super+sl) that iteratively increases the seed lexicon and improves the mapping gradually. For a detailed description of these variants, we refer the reader to recent work (Artetxe, Labaka, and Agirre 2018b; Vulić et al. 2019). We again use CC+Wiki ft vectors as initial monolingual word vectors, except for yue where Wiki ft is used. The seed dictionaries of two different sizes (1k and 5k translation pairs) are based on PanLex (Kamholz, Pool, and Colowick 2014), and are taken directly from prior work (Vulić et al. 2019),24 or extracted from PanLex following the same procedure as in the prior work.
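The intuition behind self-learning can be sketched as a loop that alternates between learning a mapping from the current dictionary and inducing a new dictionary from the mapped space. The version below is deliberately simplified (plain Procrustes plus nearest-neighbor induction over all source words) and is not the vecmap procedure, for which we refer to the cited work; all names are hypothetical.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal map W minimizing ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def unit_rows(M):
    """Length-normalize rows so that dot products are cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def self_learning(X, Y, seed_pairs, n_iter=5):
    """X, Y: source/target embedding matrices; seed_pairs: (src_idx, tgt_idx)."""
    X, Y = unit_rows(X), unit_rows(Y)
    pairs = list(seed_pairs)
    for _ in range(n_iter):
        src = [i for i, _ in pairs]
        tgt = [j for _, j in pairs]
        # Step 1: learn the mapping from the current dictionary.
        W = procrustes(X[src], Y[tgt])
        # Step 2: induce a larger dictionary via nearest neighbors.
        sims = (X @ W) @ Y.T
        nearest = sims.argmax(axis=1)
        pairs = [(i, int(j)) for i, j in enumerate(nearest)]
    return W, pairs
```

Starting from a small seed, each round can add word pairs that the previous mapping already aligns well, which is consistent with the reported trend that self-learning matters more as the seed dictionary shrinks.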

#### Contextualized Crosslingual Word Embeddings.

We again evaluate the capacity of (massively) multilingual pretrained language models, m-bert and xlm-100, to reason over crosslingual lexical similarity. Implicitly, such an evaluation also evaluates “the intrinsic quality” of shared crosslingual word-level vector spaces induced by these methods, and their ability to boost crosslingual transfer between different language pairs. We rely on the same procedure of aggregating the models’ subword-level parameters into word-level representations, already described in §7.1.

As in monolingual settings, we can apply unsupervised post-processing steps such as abtt to both static and contextualized crosslingual word embeddings.

### 8.2 Results and Discussion

#### Main Results and Differences across Language Pairs.

A summary of the results on the 66 crosslingual Multi-SimLex data sets is provided in Table 15 and Figure 6a. The results confirm several interesting findings from our previous monolingual experiments (§7.2), and also corroborate several hypotheses and findings from prior work, now on a large sample of language pairs and for the task of crosslingual semantic similarity.

Table 15
Spearman’s ρ correlation scores on all 66 crosslingual data sets. The scores below the main diagonal are computed based on crosslingual word embeddings (CLWEs) induced by aligning CC+Wiki ft in all languages (except for yue where we use Wiki ft) in a fully unsupervised way (i.e., without any bilingual supervision). We rely on a standard CLWE mapping-based (i.e., alignment) approach: vecmap (Artetxe, Labaka, and Agirre 2018b). The scores above the main diagonal are computed by obtaining 768-dimensional word-level vectors from pretrained multilingual BERT (m-bert) following the procedure described in §7.1. For both fully unsupervised vecmap and m-bert, we report the results with unsupervised post-processing enabled: All 2 × 66 reported scores are obtained using the +abtt (−10) variant.
| | cmn | cym | eng | est | fin | fra | heb | pol | rus | spa | swa | yue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cmn | | .076 | .348 | .139 | .154 | .392 | .190 | .207 | .227 | .300 | .049 | .484 |
| cym | .041 | | .087 | .017 | .049 | .095 | .033 | .072 | .085 | .089 | .002 | .083 |
| eng | .565 | .004 | | .168 | .159 | .401 | .171 | .182 | .236 | .309 | .014 | .357 |
| est | .014 | .097 | .335 | | .143 | .161 | .100 | .113 | .083 | .134 | .025 | .124 |
| fin | .049 | .020 | .542 | .530 | | .195 | .077 | .110 | .111 | .157 | .029 | .167 |
| fra | .224 | .015 | .662 | .559 | .533 | | .191 | .229 | .297 | .382 | .038 | .382 |
| heb | .202 | .110 | .516 | .465 | .445 | .469 | | .095 | .154 | .181 | .038 | .185 |
| pol | .121 | .028 | .464 | .415 | .465 | .534 | .412 | | .139 | .183 | .013 | .205 |
| rus | .032 | .037 | .511 | .408 | .476 | .529 | .430 | .390 | | .248 | .037 | .226 |
| spa | .546 | .048 | .498 | .450 | .490 | .600 | .462 | .398 | .419 | | .055 | .313 |
| swa | −.01 | .116 | .029 | .006 | .013 | −.05 | .033 | .052 | .035 | .045 | | .043 |
| yue | .004 | .047 | .059 | .004 | .002 | .059 | .001 | .074 | .032 | .089 | −.02 | |

Figure 6

Further performance analyses of crosslingual Multi-SimLex data sets. (a) Spearman’s ρ correlation scores averaged over all 66 crosslingual Multi-SimLex data sets for two pretrained multilingual encoders (m-bert and xlm). The scores are obtained with different configurations that exclude (init) or enable unsupervised post-processing. (b) A comparison of various pretrained encoders available for the English-French language pair; see the main text for a short description of each benchmarked pretrained encoder.

First, we observe that the fully unsupervised vecmap model, despite being the most robust fully unsupervised method at present, fails to produce a meaningful crosslingual word vector space for a large number of language pairs (see the bottom triangle of Table 15): Many correlation scores are in fact no-correlation results, accentuating the problem of fully unsupervised crosslingual learning for typologically diverse languages and with smaller amounts of monolingual data (Vulić et al. 2019). The scores are particularly low across the board for lower-resource languages such as Welsh and Kiswahili. It also seems that the lack of monolingual data is a larger problem than typological dissimilarity between language pairs, as we do observe reasonably high correlation scores with vecmap for language pairs such as cmn-spa, heb-est, and rus-fin. However, typological differences (e.g., morphological richness) still play an important role as we observe very low scores when pairing cmn with morphologically rich languages such as fin, est, pol, and rus. Similar to prior work of Vulić et al. (2019) and Doval et al. (2019), given the fact that unsupervised vecmap is the most robust unsupervised CLWE method at present (Glavaš et al. 2019), our results again question the usefulness of fully unsupervised approaches for a large number of languages, and call for further developments in the area of unsupervised and weakly supervised crosslingual representation learning.

The scores of m-bert and xlm-10025 lead to similar conclusions as in the monolingual settings. Reasonable correlation scores are achieved only for a small subset of resource-rich language pairs (e.g., eng, fra, spa, cmn) that dominate the multilingual m-bert training. Interestingly, the scores indicate a much higher performance of language pairs where yue is one of the languages when we use m-bert instead of vecmap. This boils down again to the fact that yue, due to its specific language script, has a good representation of its words and subwords in the shared m-bert vocabulary. At the same time, a reliable vecmap mapping between yue and other languages cannot be found due to a small monolingual yue corpus. In cases when vecmap does not yield a degenerate crosslingual vector space starting from two monolingual ones, the final correlation scores seem substantially higher than the ones obtained by the single massively multilingual m-bert model.

Finally, the results in Figure 6a again verify the usefulness of unsupervised post-processing also in crosslingual settings. We observe improved performance with both m-bert and xlm-100 when mean centering (+mc) is applied, and further gains can be achieved by using abtt on the mean-centered vector spaces. A similar finding also holds for static crosslingual word embeddings,26 where applying abtt (−10) yields higher scores on 61 of 66 language pairs.

#### Fully Unsupervised vs. Weakly Supervised Crosslingual Embeddings.

The results in Table 15 indicate that fully unsupervised crosslingual learning fails for a large number of language pairs. However, recent work (Vulić et al. 2019) has noted that these sub-optimal non-alignment solutions with the unsuper model can be avoided by relying on (weak) crosslingual supervision spanning only several thousands or even hundreds of word translation pairs. Therefore, we examine 1) if we can further improve the results on crosslingual Multi-SimLex resorting to (at least some) crosslingual supervision for resource-rich language pairs; and 2) if such available word-level supervision can also be useful for a range of languages that displayed near-zero performance in Table 15. In other words, we test if recent “tricks of the trade” used in the rich literature on CLWE learning reflect in gains on crosslingual Multi-SimLex data sets.

First, we reassess the findings established on the bilingual lexicon induction task (Søgaard, Ruder, and Vulić 2018; Vulić et al. 2019): Using at least some crosslingual supervision is always beneficial compared to using no supervision at all. We report improvements over the unsuper model for all 10 language pairs in Table 16, even though the unsuper method initially produced strong correlation scores. The importance of self-learning increases with decreasing available seed dictionary size, and the +sl model always outperforms unsuper with 1k seed pairs; we observe the same patterns also with even smaller dictionary sizes than reported in Table 16 (250 and 500 seed pairs). Along the same line, the results in Table 17 indicate that at least some supervision is crucial for the success of static CLWEs on resource-leaner language pairs. We note substantial improvements on all language pairs; in fact, the vecmap model is able to learn a more reliable mapping starting from clean supervision. We again note large gains with self-learning.

Table 16
Results on a selection of crosslingual Multi-SimLex data sets where the fully unsupervised (unsuper) CLWE variant yields reasonable performance. We also show the results with supervised vecmap without self-learning (super) and with self-learning (+sl), with two seed dictionary sizes: 1k and 5k pairs; see §8.1 for more detail. Highest scores for each language pair are in bold.
| | cmn-eng | eng-fra | eng-spa | eng-rus | est-fin | est-heb | fin-heb | fra-spa | pol-rus | pol-spa |
|---|---|---|---|---|---|---|---|---|---|---|
| unsuper | .565 | .662 | .498 | .511 | .510 | .465 | .445 | .600 | .390 | .398 |
| super (1k) | .575 | .602 | .453 | .376 | .378 | .363 | .442 | .588 | .399 | .406 |
| +sl (1k) | .577 | .703 | .547 | .548 | **.591** | .513 | .488 | .639 | .439 | .456 |
| super (5k) | **.587** | .704 | .542 | .535 | .518 | .473 | .585 | .631 | **.455** | .463 |
| +sl (5k) | .581 | **.707** | **.548** | **.551** | .556 | **.525** | **.589** | **.645** | .432 | **.476** |

Table 17
Results on a selection of crosslingual Multi-SimLex data sets where the fully unsupervised (unsuper) CLWE variant fails to learn a coherent shared crosslingual space. See also the caption of Table 16.
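| | cmn-fin | cmn-rus | cmn-yue | cym-fin | cym-fra | cym-pol | fin-swa |
|---|---|---|---|---|---|---|---|
| unsuper | .049 | .032 | .004 | .020 | .015 | .028 | .013 |
| super (1k) | .410 | .388 | .372 | .384 | .475 | .326 | .206 |
| +sl (1k) | .590 | .537 | .458 | .471 | .578 | .380 | .264 |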

#### Multilingual vs. Bilingual Contextualized Embeddings.

Similar to the monolingual settings, we also inspect whether massively multilingual training in fact dilutes the knowledge necessary for crosslingual reasoning on a particular language pair. Therefore, we compare the 100-language xlm-100 model with i) a variant of the same model trained via masked language modeling on a smaller set of 17 languages (xlm-17); ii) a variant of the same model trained specifically for the particular language pair via masked language modeling (xlm-2); and iii) a variant of the bilingual xlm-2 model that also leverages bilingual knowledge from parallel sentences during joint training (xlm-2++, trained based on masked language modeling and translation language modeling objectives). We again use the pretrained models made available by Conneau and Lample (2019), and we refer to the original work for further technical details (e.g., regarding the core difference between pretraining objectives for xlm-2 and xlm-2++).

The results are summarized in Figure 6b, and they confirm the intuition that massively multilingual pretraining can damage performance even on resource-rich languages and language pairs. We observe a steep rise in performance when the multilingual model is trained on a much smaller set of languages (17 vs. 100), and further improvements can be achieved by training a dedicated bilingual model. Finally, leveraging bilingual parallel data seems to offer additional slight gains, but a tiny difference between xlm-2 and xlm-2++ also suggests that this rich bilingual information is not used in the optimal way within the xlm architecture for semantic similarity.

In summary, these results indicate that, in order to improve performance in crosslingual transfer tasks, more work should be invested into 1) pretraining dedicated language pair-specific models, and 2) creative ways of leveraging available crosslingual supervision (e.g., word translation pairs, parallel or comparable corpora) (Liu et al. 2019a; Wu et al. 2019; Cao, Kitaev, and Klein 2020) with pretraining paradigms such as bert and xlm. Using such crosslingual supervision could lead to similar benefits as indicated by the results obtained with static crosslingual word embeddings (see Table 16 and Table 17). We believe that Multi-SimLex can serve as a valuable means to track and guide future progress in this research area.

## 9 Conclusion and Future Work

We have presented Multi-SimLex, a resource containing human judgments on the semantic similarity of word pairs for 12 monolingual and 66 crosslingual data sets. The languages covered are typologically diverse and include also under-resourced ones, such as Welsh and Kiswahili. The resource covers an unprecedented number of 1,888 word pairs, carefully balanced according to their similarity score, frequency, concreteness, part-of-speech class, and lexical field. In addition to Multi-SimLex, we release the detailed protocol we followed to create this resource. We hope that our consistent guidelines will encourage researchers to translate and annotate Multi-SimLex-style data sets for additional languages. This can help create a hugely valuable, large-scale semantic resource for multilingual NLP research.

The core Multi-SimLex we release with this article already enables researchers to carry out novel linguistic analyses and establishes a benchmark for evaluating representation learning models. Based on our preliminary analyses, we found that speakers of closely related languages tend to express equivalent similarity judgments. In particular, geographical proximity seems to play a greater role than family membership in determining the similarity of judgments across languages. Moreover, we tested several state-of-the-art word embedding models, both static and contextualized ones, as well as several (supervised and unsupervised) post-processing techniques, on Multi-SimLex. These results provide challenging baselines for future work on multilingual representation learning. In addition, our results provide several important insights for research on both monolingual and crosslingual word representations:

1. Unsupervised post-processing techniques (mean centering, elimination of top principal components, adjusting similarity orders) are always beneficial independently of the language, although the combination leading to the best scores is language-specific and hence needs to be tuned.

2. Similarity rankings obtained from word embeddings for nouns are better aligned with human judgments than all the other part-of-speech classes considered here (verbs, adjectives, and, for the first time, adverbs). This confirms previous generalizations based on experiments on English.

3. The factor having the greatest impact on the quality of word representations is the availability of raw texts to train them in the first place, rather than language properties (such as family, geographical area, typological features).

4. Massively multilingual pretrained encoders such as m-bert (Devlin et al. 2019) and xlm-100 (Conneau and Lample 2019) fare quite poorly on our benchmark, whereas pretrained encoders dedicated to a single language are more competitive with static word embeddings such as fastText (Bojanowski et al. 2017). Moreover, for language-specific encoders, parameter reduction techniques reduce performance only marginally.

5. Techniques to inject clean lexical semantic knowledge from external resources into distributional word representations proved effective in emphasizing the relation of semantic similarity. In particular, methods capable of transferring such knowledge from resource-rich to resource-lean languages (Ponti et al. 2019c) increased the correlation with human judgments for most languages, except for those with limited unlabelled data.
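
As an illustration of the first insight, the following is a minimal sketch of two of the unsupervised post-processing steps named there: mean centering and elimination of the top principal components. It is not necessarily the exact procedure or configuration used in the experiments reported here; the function names, the random toy matrix, and the choice of two removed components are assumptions made only for this example.

```python
import numpy as np


def mean_center(vectors):
    """Subtract the mean vector so the embedding space is centred at the origin."""
    return vectors - vectors.mean(axis=0, keepdims=True)


def remove_top_components(vectors, n_components=1):
    """Mean-center, then project out the n_components directions of highest variance."""
    centered = mean_center(vectors)
    # Rows of vt are the principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]  # shape: (n_components, dim)
    return centered - centered @ top.T @ top


# Hypothetical toy "embeddings": 50 vectors of dimension 8 with a shared offset.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 8)) + 3.0
processed = remove_top_components(toy, n_components=2)
```

As noted in insight 1, the number of removed components (and whether this step helps at all in combination with the others) is language-specific, so in practice it would be treated as a hyperparameter to tune.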

Future work can expand our preliminary, yet large-scale study on the ability of pretrained encoders to reason over word-level semantic similarity in different languages. For instance, we have highlighted how sharing the same encoder parameters across multiple languages may harm performance. However, it remains unclear if, and to what extent, the input language embeddings present in xlm-100 but absent in m-bert help mitigate this issue. In addition, pretrained language embeddings can be obtained both from typological databases (Littell et al. 2017) and from neural architectures (Malaviya, Neubig, and Littell 2017). Plugging these embeddings into the encoders in lieu of embeddings trained end-to-end as suggested by prior work (Tsvetkov et al. 2016; Ammar et al. 2016; Ponti et al. 2019b) might extend the coverage to more resource-lean languages.

Another important follow-up analysis might involve the comparison of the performance of representation learning models on multilingual data sets for both word-level semantic similarity and sentence-level natural language understanding. In particular, Multi-SimLex fills a gap in available resources for multilingual NLP and might help understand how lexical and compositional semantics interact if put alongside existing resources such as XNLI (Conneau et al. 2018b) for natural language inference or PAWS-X (Yang et al. 2019) for crosslingual paraphrase identification. Further, the Multi-SimLex annotation could turn out to be a unique source of evidence to study the effects of polysemy in human judgments on semantic similarity: For equivalent word pairs in multiple languages, are the similarity scores affected by how many senses the two words (or multiword expressions) incorporate? Finally, similar to recent work on multilingual data set creation in NLP (see §3), Multi-SimLex has been created through translation from English, aiming to ensure as much comparability as possible between concept pairs in different languages; however, a very interesting path for future work is studying how different departure points (e.g., English vs. Hebrew vs. Mandarin as the source language) affect the obtained data, and if there are any differing or persistent translation artifacts (Artetxe, Labaka, and Agirre 2020). In the long run, how can we measure to what degree shared cultural, societal, and economic models shape and affect semantic similarity reasoning (Quinn and Holland 1987)?

Although Multi-SimLex makes a large leap toward more inclusive and larger-scale lexical semantic evaluation resources, other interesting research challenges related to data collection remain and are left for future research, for example, developing similar resources for non-written signed languages. Another clear direction is creating a dedicated resource that targets the similarity of multiword expressions mono- and crosslingually (e.g., measuring similarity between martial law and coup d’état).

In light of the success of initiatives like Universal Dependencies for multilingual treebanks, we hope that making Multi-SimLex and its guidelines available will encourage other researchers to expand our current sample of languages (e.g., Arabic has been added since the completion of this article). We particularly encourage the creation and submission of comparable Multi-SimLex data sets for under-resourced and typologically diverse languages in future work. To this end, we have made a Multi-SimLex community Web site available to facilitate easy creation, gathering, dissemination, and use of annotated data sets: https://multisimlex.com/.

## Acknowledgments

This work is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909). Thierry Poibeau is partly supported by a PRAIRIE 3IA Institute fellowship (“Investissements d’avenir” program, reference ANR-19-P3IA-0001).

## Notes

1

This lexical relation is, somewhat imprecisely, also termed true or pure semantic similarity (Hill, Reichart, and Korhonen 2015; Kiela, Hill, and Clark 2015); see the ensuing discussion in §2.1.

2

Note that there is also inherent intra-language variability that can affect concept categorization; e.g., Vejdemo (2018) studies the monolingual variability in the domain of colors. Our annotation protocols and models neither specifically cater to nor measure this fine-grained intra-language phenomenon.

3

More formally, colexification is the phenomenon whereby different meanings are expressed by the same word in a language (François 2008). For instance, the two senses that are distinguished in English as time and weather are co-lexified in Croatian: the word vrijeme is used in both cases.

4

There is a very small number of adjective and verb pairs extracted from CARD-660 and SEMEVAL-500 as well. For instance, the total number of verbs is 469, since we augment the original 222 SimLex-999 verb pairs with 244 SimVerb-3500 pairs and 3 SEMEVAL-500 pairs; and similarly for adjectives.

5

Unlike SEMEVAL-500 and CARD-660, we do not explicitly control for the equal representation of concept pairs across each similarity interval for several reasons: a) Multi-SimLex contains a substantially larger number of concept pairs, so it is possible to extract balanced samples from the full data; b) such balance, even if imposed on the English data set, would be distorted in all other monolingual and crosslingual data sets; c) balancing over similarity intervals arguably does not reflect a true distribution “in the wild” where most concepts are only loosely related or completely unrelated.

6

All translators and annotators are native speakers of the respective target language, who are either bilingual or proficient in English. They were reached through personal contacts of the authors or were recruited from the pool of international students at the University of Cambridge.

7

Frequency lists were obtained from fastText word vectors, which are sorted by frequency: https://fasttext.cc/docs/en/crawl-vectors.html.

8

For the extraction of these features, we used lang2vec: github.com/antonisa/lang2vec.

9

We have also trained standard word-level CBOW and skip-gram with negative sampling (SGNS) on full Wikipedia dumps for several languages, but our preliminary experiments have verified that they underperform fastText. This finding is consistent with other recent studies demonstrating the usefulness of subword-level information (Vania and Lopez 2017; Mikolov et al. 2018; Zhu, Vulić, and Korhonen 2019; Zhu et al. 2019). Therefore, we do not report the results with CBOW and SGNS for brevity.

12

We also tested another encoding method where we fed pairs instead of single words/concepts into the pretrained encoder. The rationale is that the other concept in the pair can be used as a disambiguation signal. However, this method consistently led to sub-par performance across all experimental runs.

15

In our preliminary experiments on several language pairs, we have also verified that this choice is superior to: a) using the output of only the embedding layer, and b) averaging over all hidden layers (i.e., H = 12 for the bert-base architecture). Likewise, using the special prepended ‘[CLS]’ token rather than the constituent sub-words to encode a concept also led to much worse performance across the board.

16

github.com/huggingface/transformers. The full list of currently supported pretrained encoders is available here: huggingface.co/models.

17

We acknowledge that it is possible to approximate word-level representations of OOVs with ft by summing the constituent n-gram embeddings as proposed by Bojanowski et al. (2017). However, we do not perform this step as the resulting embeddings are typically of much lower quality than non-OOV embeddings (Zhu, Vulić, and Korhonen 2019).

18

Hubness can be defined as the tendency of some points/vectors (i.e., “hubs”) to be nearest neighbors of many points in a high-dimensional (vector) space (Radovanović, Nanopoulos, and Ivanović 2010; Lazaridou, Dinu, and Baroni 2015; Conneau et al. 2018a).
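
To make this definition concrete, the sketch below counts, for each vector, how often it appears among the k nearest neighbors (by cosine similarity) of the other vectors; vectors with unusually high counts would be the "hubs". This is only an illustrative sketch: the function name `k_occurrence`, the value of k, and the random toy data are assumptions and are not part of the experimental setup.

```python
import numpy as np


def k_occurrence(vectors, k=5):
    """Count how often each vector is among the k nearest neighbours of the others."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T                      # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # a point is not its own neighbour
    neighbours = np.argsort(-sims, axis=1)[:, :k]
    return np.bincount(neighbours.ravel(), minlength=len(vectors))


# Hypothetical toy data: 200 random vectors in a 50-dimensional space.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 50))
counts = k_occurrence(points, k=5)
```

The counts always sum to the number of vectors times k, so a strongly skewed distribution of `counts` (a few very large values, many zeros) is one simple indicator of hubness.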

19

All models and their further specifications are available at the following link: https://huggingface.co/models.

20

For an overview of specialization methods for semantic similarity, we refer the interested reader to the recent tutorial (Glavaš, Ponti, and Vulić 2019).

21

We have also evaluated other specialization transfer methods (e.g., Glavaš and Vulić 2018; Ponti et al. 2018b), but they are consistently outperformed by the method of Ponti et al. (2019c).

22

https://fasttext.cc/docs/en/aligned-vectors.html; for target languages for which there are no pretrained CLWEs, we induce them following the same procedure of Joulin et al. (2018).

23

We again follow Ponti et al. (2019c) and use a state-of-the-art relation classifier (Glavaš and Vulić 2018). We refer the reader to the original work for additional technical details related to the classifier design.

25

The xlm-100 scores are not reported for brevity; they largely follow the patterns observed with m-bert. The aggregated scores between the two encoders are also very similar, as indicated by Figure 6a.

26

Note that vecmap does mean centering by default as one of its preprocessing steps prior to learning the mapping function (Artetxe, Labaka, and Agirre 2018b; Vulić et al. 2019).

## References

,
Oliver
,
Makarucha
,
Graham
Neubig
,
Steven
Bird
, and
Trevor
Cohn
.
2017
.
Cross-lingual word embeddings for low-resource language modeling
. In
Proceedings of EACL
, pages
937
947
.
Valencia
. DOI: https://doi.org/10.18653/v1/E17-1088
Agirre
,
Eneko
,
Enrique
Alfonseca
,
Keith
Hall
,
Jana
Kravalová
,
Marius
Pasca
, and
Aitor
Soroa
.
2009
.
A study on similarity and relatedness using distributional and WordNet-based approaches
. In
Proceedings of NAACL-HLT
, pages
19
27
.
Boulder, CO.
DOI: https://doi.org/10.3115/1620754.1620758
Aldarmaki
,
Hanan
and
Mona
Diab
.
2019
.
Context-aware cross-lingual mapping
. In
Proceedings of NAACL-HLT
, pages
3906
3911
.
Minneapolis, MN
. DOI: https://doi.org/10.18653/v1/N19-1391
Alvarez-Melis
,
David
and
Tommi
Jaakkola
.
2018
.
Gromov-Wasserstein alignment of word embedding spaces
. In
Proceedings of EMNLP
, pages
1881
1890
.
Brussels
. DOI: https://doi.org/10.18653/v1/D18-1214
Ammar
,
Waleed
,
George
Mulcaire
,
Miguel
Ballesteros
,
Chris
Dyer
, and
Noah
Smith
.
2016
.
Many languages, one parser
.
Transactions of the ACL
,
4
:
431
444
. DOI: https://doi.org/10.1162/tacl_a_00109
Artetxe
,
Mikel
,
Gorka
Labaka
, and
Eneko
Agirre
.
2017
.
Learning bilingual word embeddings with (almost) no bilingual data
. In
Proceedings of ACL
, pages
451
462
.
Vancouver
. DOI: https://doi.org/10.18653/v1/P17-1042
Artetxe
,
Mikel
,
Gorka
Labaka
, and
Eneko
Agirre
.
2018a
.
Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations
. In
Proceedings of AAAI
, pages
5012
5019
.
New Orleans, LA
.
Artetxe
,
Mikel
,
Gorka
Labaka
, and
Eneko
Agirre
.
2018b
.
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
. In
Proceedings of ACL
, pages
789
798
.
Melbourne
. DOI: https://doi.org/10.18653/v1/P18-1073
Artetxe
,
Mikel
,
Gorka
Labaka
, and
Eneko
Agirre
.
2020
.
Translation artifacts in cross-lingual transfer learning
. In
Proceedings of EMNLP
, volume
abs/2004.04721
.
Online
.
Artetxe
,
Mikel
,
Gorka
Labaka
,
Inigo
Lopez-Gazpio
, and
Eneko
Agirre
.
2018
.
Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation
. In
Proceedings of CoNLL
, pages
282
291
.
Brussels
. DOI: https://doi.org/10.18653/v1/K18-1028
Artetxe
,
Mikel
,
Sebastian
Ruder
, and
Dani
Yogatama
.
2019
.
On the cross-lingual transferability of monolingual representations
.
CoRR
,
abs/1910.11856
. DOI: https://doi.org/10.18653/v1/2020.acl-main.421
Baker
,
Collin F.
,
Charles J.
Fillmore
, and
John B.
Lowe
.
1998
.
The Berkeley FrameNet project
. In
Proceedings of ACL
, pages
86
90
.
Montreal
. DOI: https://doi.org/10.3115/980845.980860
Baker
,
Simon
,
Roi
Reichart
, and
Anna
Korhonen
.
2014
.
An unsupervised model for instance level subcategorization acquisition
. In
Proceedings of EMNLP
, pages
278
289
.
Doha
. DOI: https://doi.org/10.3115/v1/D14-1034, PMID: 25178667
Bapna
,
Ankur
and
Orhan
Firat
.
2019
.
Simple, scalable adaptation for neural machine translation
. In
Proceedings of EMNLP
, pages
1538
1548
.
Hong Kong
. DOI: https://doi.org/10.18653/v1/N19-1191
Baroni
,
Marco
,
Silvia
Bernardini
,
Ferraresi
, and
Eros
Zanchetta
.
2009
.
The WaCky wide web: A collection of very large linguistically processed web-crawled corpora
.
Language Resources and Evaluation
,
43
(
3
):
209
226
. DOI: https://doi.org/10.1007/s10579-009-9081-4
Baroni
,
Marco
and
Alessandro
Lenci
.
2010
.
Distributional memory: A general framework for corpus-based semantics
.
Computational Linguistics
,
36
(
4
):
673
721
. DOI: https://doi.org/10.1162/coli_a_00016
Barzegar
,
Siamak
,
Brian
Davis
,
Manel
Zarrouk
,
Siegfried
Handschuh
, and
André
Freitas
.
2018
.
SemR-11: A multi-lingual gold-standard for semantic similarity and relatedness for eleven languages
. In
Proceedings of LREC
, pages
3912
3916
.
Miyazaki
.
Bjerva
,
Johannes
and
Isabelle
Augenstein
.
2018
.
From phonology to syntax: Unsupervised linguistic typology at different levels with language embeddings
. In
Proceedings of NAACL-HLT
, pages
907
916
.
New Orleans, LA
. DOI: https://doi.org/10.18653/v1/N18-1083
Bjerva
,
Johannes
,
Robert
Östling
,
Maria Han
Veiga
,
Jörg
Tiedemann
, and
Isabelle
Augenstein
.
2019
.
What do language representations really represent?
Computational Linguistics
,
45
(
2
):
381
389
. DOI: https://doi.org/10.1162/coli_a_00351
Bojanowski
,
Piotr
,
Edouard
Grave
,
Armand
Joulin
, and
Tomas
Mikolov
.
2017
.
Enriching word vectors with subword information
.
Transactions of the ACL
,
5
:
135
146
. DOI: https://doi.org/10.1162/tacl_a_00051
Bro
,
Rasmus
and
Age K.
Smilde
.
2003
.
Centering and scaling in component analysis
.
Journal of Chemometrics
,
17
(
1
):
16
33
. DOI: https://doi.org/10.1002/cem.773
Bruni
,
Elia
,
Nam-Khanh
Tran
, and
Marco
Baroni
.
2014
.
Multimodal distributional semantics
.
Journal of Artificial Intelligence Research
,
49
:
1
47
. DOI: https://doi.org/10.1613/jair.4135
Budanitsky
,
Alexander
and
Graeme
Hirst
.
2006
.
Evaluating WordNet-based measures of lexical semantic relatedness
.
Computational Linguistics
,
32
(
1
):
13
47
. DOI: https://doi.org/10.1162/coli.2006.32.1.13
,
Jose
and
Roberto
Navigli
.
2017
.
BabelDomains: Large-scale domain labeling of lexical resources
. In
Proceedings of EACL
, pages
223
228
.
Valencia
. DOI: https://doi.org/10.18653/v1/E17-2036
,
Jose
,
Pilehvar
,
Nigel
Collier
, and
Roberto
Navigli
.
2017
.
Semeval-2017 task 2: Multilingual and cross-lingual semantic word similarity
. In
Proceedings of SEMEVAL
, pages
15
26
.
Vancouver
. DOI: https://doi.org/10.18653/v1/S17-2002
,
José
,
Pilehvar
, and
Roberto
Navigli
.
2015
.
A framework for the construction of monolingual and cross-lingual word similarity datasets
. In
Proceedings of ACL
, pages
1
7
.
Beijing
. DOI: https://doi.org/10.3115/v1/P15-2001
Cao
,
Steven
,
Nikita
Kitaev
, and
Dan
Klein
.
2020
.
Multilingual alignment of contextual word representations
. In
Proceedings of ICLR
.
Online
.
Chen
,
Danqi
and
Christopher D.
Manning
.
2014
.
A fast and accurate dependency parser using neural networks
. In
Proceedings of EMNLP
, pages
740
750
.
Doha
. DOI: https://doi.org/10.3115/v1/D14-1082
Chen
,
Xilun
and
Claire
Cardie
.
2018
.
Unsupervised multilingual word embeddings
. In
Proceedings of EMNLP
, pages
261
270
.
Brussels
. DOI: https://doi.org/10.18653/v1/D18-1024
Chen
,
Xinying
and
Kim
Gerdes
.
2017
.
Classifying languages by dependency structure. Typologies of delexicalized universal dependency treebanks
. In
Proceedings of the 4th International Conference on Dependency Linguistics (DepLing)
, pages
54
63
.
Pisa
.
Cimiano
,
Philipp
,
Andreas
Hotho
, and
Steffen
Staab
.
2005
.
Learning concept hierarchies from text corpora using formal concept analysis
.
Journal of Artificial Intelligence Research
,
24
:
305
339
. DOI: https://doi.org/10.1613/jair.1648
Clark
,
Jonathan H.
,
Eunsol
Choi
,
Michael
Collins
,
Dan
Garrette
,
Tom
Kwiatkowski
,
Vitaly
Nikolaev
, and
Jennimaria
Palomaki
.
2020
.
TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages
.
Transactions of the ACL
,
8
:
454
470
. DOI: https://doi.org/10.1162/tacl_a_00317
Collobert
,
Ronan
and
Jason
Weston
.
2008
.
A unified architecture for natural language processing: Deep neural networks with multitask learning
. In
Proceedings of ICML
, pages
160
167
.
Helsinki
. DOI: https://doi.org/10.1145/1390156.1390177
Collobert
,
Ronan
,
Jason
Weston
,
Léon
Bottou
,
Michael
Karlen
,
Koray
Kavukcuoglu
, and
Pavel P.
Kuksa
.
2011
.
Natural language processing (almost) from scratch
.
Journal of Machine Learning Research
,
12
:
2493
2537
.
Conneau
,
Alexis
,
Kartikay
Khandelwal
,
Naman
Goyal
,
Vishrav
Chaudhary
,
Guillaume
Wenzek
,
Francisco
Guzmán
,
Edouard
Grave
,
Myle
Ott
,
Luke
Zettlemoyer
, and
Veselin
Stoyanov
.
2019
.
Unsupervised cross-lingual representation learning at scale
.
CoRR
,
abs/1911.02116
. DOI: https://doi.org/10.18653/v1/2020.acl-main.747
Conneau
,
Alexis
and
Guillaume
Lample
.
2019
.
Cross-lingual language model pretraining
. In
Proceedings of NeurIPS
, pages
7057
7067
.
Vancouver
.
Conneau
,
Alexis
,
Guillaume
Lample
,
Marc’Aurelio
Ranzato
,
Ludovic
Denoyer
, and
Hervé
Jégou
.
2018a
.
Word translation without parallel data
. In
Proceedings of ICLR
.
Vancouver
.
Conneau
,
Alexis
,
Ruty
Rinott
,
Guillaume
Lample
,
Williams
,
Samuel
Bowman
,
Holger
Schwenk
, and
Veselin
Stoyanov
.
2018b
.
XNLI: Evaluating cross-lingual sentence representations
. In
Proceedings of EMNLP
, pages
2475
2485
.
Brussels
. DOI: https://doi.org/10.18653/v1/D18-1269
Coseriu
,
Eugenio
.
1967
.
Lexikalische solidaritäten
.
Poetica
.
1
:
293
303
.
Cruse
,
David Alan
.
1986
.
Lexical Semantics
,
Cambridge University Press
.
De Deyne
Simon
, and
Gert
Storms
.
2008
.
Word associations: Network and semantic properties
.
Behavior Research Methods
,
40
(
1
):
213
231
. DOI: https://doi.org/10.3758/BRM.40.1.213, PMID: 18411545
Devlin
,
Jacob
,
Ming-Wei
Chang
,
Kenton
Lee
, and
Kristina
Toutanova
.
2019
.
BERT: Pre-training of deep bidirectional transformers for language understanding
. In
Proceedings of NAACL-HLT
, pages
4171
4186
.
Minneapolis, MN
.
Doitch
,
Amichay
,
Ram
Yazdi
,
Tamir
Hazan
, and
Roi
Reichart
.
2019
.
Perturbation based learning for structured NLP tasks with application to dependency parsing
.
Transactions of the ACL
,
7
:
643
659
. DOI: https://doi.org/10.1162/tacl_a_00291
Doval
,
Yerai
,
Jose
,
Luis
Espinosa-Anke
, and
Steven
Schockaert
.
2018
.
Improving cross-lingual word embeddings by meeting in the middle
. In
Proceedings of EMNLP
, pages
294
304
.
Brussels
. DOI: https://doi.org/10.18653/v1/D18-1027
Doval
,
Yerai
,
Jose
,
Luis
Espinosa-Anke
, and
Steven
Schockaert
.
2019
.
On the robustness of unsupervised and semi-supervised cross-lingual word embedding learning
.
CoRR
,
abs/1908.07742
.
Dryer
,
Matthew S.
2013
.
Order of subject, object and verb
. In
Dryer
,
Matthew S.
and
Martin
Haspelmath
, editors,
The World Atlas of Language Structures Online
,
Max Planck Institute for Evolutionary Anthropology
,
Leipzig
, pages
330
333
.
Ercan
,
Gökhan
and
Olcay Taner
Yıldız
.
2018
.
AnlamVer: Semantic model evaluation dataset for Turkish - Word similarity and relatedness
. In
Proceedings of COLING
, pages
3819
3836
.
Santa Fe, NM
.
Ethayarajh
,
Kawin
.
2019
.
How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings
. In
Proceedings of EMNLP
, pages
55
65
.
Hong Kong
. DOI: https://doi.org/10.18653/v1/D19-1006
Evans
,
Nicholas
.
2011
.
Semantic Typology
. In
The Oxford Handbook of Linguistic Typology
.
Oxford University Press
, pages
504
533
. DOI: https://doi.org/10.1093/oxfordhb/9780199281251.013.0024
Faruqui
,
Manaal
,
Jesse
Dodge
,
Sujay Kumar
Jauhar
,
Chris
Dyer
,
Eduard
Hovy
, and
Noah A.
Smith
.
2015
.
Retrofitting word vectors to semantic lexicons
. In
Proceedings of NAACL-HLT
, pages
1606
1615
.
Denver, CO
. DOI: https://doi.org/10.3115/v1/N15-1184
Fellbaum
,
Christiane
.
1998
.
WordNet
,
MIT Press
. DOI: https://doi.org/10.7551/mitpress/7287.001.0001
Finkelstein
,
Lev
,
Evgeniy
Gabrilovich
,
Yossi
Matias
,
Ehud
Rivlin
,
Zach
Solan
,
Wolfman
, and
Eytan
Ruppin
.
2002
.
Placing search in context: The concept revisited
.
ACM Transactions on Information Systems
,
20
(
1
):
116
131
. DOI: https://doi.org/10.1145/503104.503110
Firth
,
John R
.
1957
.
A synopsis of linguistic theory, 1930-1955
.
Studies in Linguistic Analysis
.
Blackwell
, pages
1
32
.
François
,
Alexandre
.
2008
.
Semantic maps and the typology of colexification
. In
Martine
Vanhove
, editor,
From Polysemy to Semantic Change: Towards a Typology of Lexical Semantic Associations, Studies in Language Companion Series
,
106
.
John Benjamins Publishing Company
, pages
163
215
. DOI: https://doi.org/10.1075/slcs.106.09fra
Gerz
,
Daniela
,
Ivan
Vulić
,
Felix
Hill
,
Roi
Reichart
, and
Anna
Korhonen
.
2016
.
SimVerb-3500: A large-scale evaluation set of verb similarity
. In
Proceedings of EMNLP
, pages
2173
2182
.
Austin, TX
. DOI: https://doi.org/10.18653/v1/D16-1235
Glavaš
,
Goran
,
Edoardo Maria
Ponti
, and
Ivan
Vulić
.
2019
.
Semantic specialization of distributional word vectors
. In
Proceedings of EMNLP: Tutorial Abstracts
.
Hong Kong
. See https://www.aclweb.org/anthology/D19-2007.
Glavaš
,
Goran
and
Ivan
Vulić
.
2018
.
Discriminating between lexico-semantic relations with the specialization tensor model
. In
Proceedings of NAACL-HLT
, pages
181
187
.
New Orleans, LA
. DOI: https://doi.org/10.18653/v1/N18-2029
Glavaš
,
Goran
and
Ivan
Vulić
.
2018
.
Explicit retrofitting of distributional word vectors
. In
Proceedings of ACL
, pages
34
45
.
Melbourne
. DOI: https://doi.org/10.18653/v1/P18-1004
Glavaš
,
Goran
,
Robert
Litschko
,
Sebastian
Ruder
, and
Ivan
Vulić
.
2019
.
How to (properly) evaluate cross-lingual word embeddings: On strong baselines, comparative analyses, and some misconceptions
. In
Proceedings of ACL
, pages
710
721
.
Florence
. DOI: https://doi.org/10.18653/v1/P19-1070
Grave
,
Edouard
,
Piotr
Bojanowski
,
Prakhar
Gupta
,
Armand
Joulin
, and
Tomas
Mikolov
.
2018
.
Learning word vectors for 157 languages
. In
Proceedings of LREC
, pages
3483
3487
.
Miyazaki
.
Gruber
,
Jeffrey
.
1976
.
Lexical Structures in Syntax and Semantics
, volume
25
,
North-Holland
.
Harris
,
Zellig S.
1951
.
Methods in Structural Linguistics
,
University of Chicago Press
.
Hill
,
Felix
,
KyungHyun
Cho
,
Anna
Korhonen
, and
Yoshua
Bengio
.
2016
.
Learning to understand phrases by embedding the dictionary
.
Transactions of the ACL
,
4
:
17
30
. DOI: https://doi.org/10.1162/tacl_a_00080
Hill
,
Felix
,
Roi
Reichart
, and
Anna
Korhonen
.
2015
.
SimLex-999: Evaluating semantic models with (genuine) similarity estimation
.
Computational Linguistics
,
41
(
4
):
665
695
. DOI: https://doi.org/10.1162/tacl_a_00080
Hoshen
,
Yedid
and
Lior
Wolf
.
2018
.
. In
Proceedings of EMNLP
, pages
469
478
.
Belgium
. DOI: https://doi.org/10.18653/v1/D18-1043
Hu
,
Junjie
,
Sebastian
Ruder
,
Siddhant
,
Graham
Neubig
,
Orhan
Firat
, and
Melvin
Johnson
.
2020
.
XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization
. In
Proceedings of ICML
, pages
4411
4421
.
Online
.
Huang
,
Junjie
,
Fanchao
Qi
,
Chenghao
Yang
,
Zhiyuan
Liu
, and
Maosong
Sun
.
2019
.
COS960: A Chinese word similarity dataset of 960 word pairs
.
CoRR
,
abs/1906.00247
.
Joshi
,
Pratik
,
Sebastin
Santy
,
Amar
Budhiraja
,
Kalika
Bali
, and
Monojit
Choudhury
.
2020
.
The state and fate of linguistic diversity and inclusion in the NLP world
. In
Proceedings of ACL
, pages
6282
6293
.
Online
. DOI: https://doi.org/10.18653/v1/2020.acl-main.560
Joulin
,
Armand
,
Piotr
Bojanowski
,
Tomas
Mikolov
,
Hervé
Jégou
, and
Edouard
Grave
.
2018
.
Loss in translation: Learning bilingual word mapping with a retrieval criterion
. In
Proceedings of EMNLP
, pages
2979
2984
.
Brussels
. DOI: https://doi.org/10.18653/v1/D18-1330
Kamath
,
Aishwarya
,
Jonas
Pfeiffer
,
Edoardo Maria
Ponti
,
Goran
Glavaš
, and
Ivan
Vulić
.
2019
.
Specializing distributional vectors of all words for lexical entailment
. In
Proceedings of the 4th Workshop on Representation Learning for NLP
, pages
72
83
.
Florence
. DOI: https://doi.org/10.18653/v1/W19-4310
Kamholz
,
David
,
Jonathan
Pool
, and
Susan M.
Colowick
.
2014
.
PanLex: Building a resource for panlingual lexical translation
. In
Proceedings of LREC
, pages
3145
3150
.
Reykjavik
.
Kay
,
Paul
and
Luisa
Maffi
.
2013
.
Green and blue
. In
Dryer
,
Matthew S.
and
Martin
Haspelmath
, editors,
The World Atlas of Language Structures Online
,
Max Planck Institute for Evolutionary Anthropology
,
Leipzig
, pages
540
541
.
Kiela
,
Douwe
and
Stephen
Clark
.
2014
.
A systematic study of semantic vector space model parameters
. In
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)
, pages
21
30
.
Gothenburg
. DOI: https://doi.org/10.3115/v1/W14-1503
Kiela
,
Douwe
,
Felix
Hill
, and
Stephen
Clark
.
2015
.
Specializing word embeddings for similarity or relatedness
. In
Proceedings of EMNLP
, pages
2044
2048
.
Lisbon
. DOI: https://doi.org/10.18653/v1/D15-1242
Kim
,
Joo-Kyung
,
Marie-Catherine
de Marneffe
, and
Eric
Fosler-Lussier
.
2016
.
Adjusting word embeddings with semantic intensity orders
. In
Proceedings of the 1st Workshop on Representation Learning for NLP
, pages
62
69
.
Berlin
. DOI: https://doi.org/10.18653/v1/W16-1607
Kim
,
Joo-Kyung
,
Gokhan
Tur
,
Asli
Celikyilmaz
,
Bin
Cao
, and
Ye-Yi
Wang
.
2016
.
Intent detection using semantically enriched word embeddings
. In
SLT
, pages
414
419
.
San Diego, CA
. DOI: https://doi.org/10.1109/SLT.2016.7846297
Kipper
,
Karin
,
Anna
Korhonen
,
Neville
Ryant
, and
Martha
Palmer
.
2008
.
A large-scale classification of English verbs
.
Language Resources and Evaluation
,
42
(
1
):
21
40
. DOI: https://doi.org/10.1007/s10579-007-9048-2
Kipper
,
Karin
,
Benjamin
Snyder
, and
Martha
Palmer
.
2004
.
Extending a verb-lexicon using a semantically annotated corpus
. In
Proceedings of LREC
, pages
1557
1560
.
Lisbon
.
Kipper Schuler
,
Karin
,
.
2005
.
VerbNet: A broad-coverage, comprehensive verb lexicon
. Ph.D. thesis,
University of Pennsylvania
.
Kondratyuk
,
Dan
and
Milan
Straka
.
2019
.
75 languages, 1 model: Parsing Universal Dependencies universally
. In
Proceedings of EMNLP-IJCNLP
, pages
2779
2795
.
Hong Kong
. DOI: https://doi.org/10.18653/v1/D19-1279
Lan
,
Zhenzhong
,
Mingda
Chen
,
Sebastian
Goodman
,
Kevin
Gimpel
,
Piyush
Sharma
, and
Soricut
.
2020
.
ALBERT: A lite BERT for self-supervised learning of language representations
. In
Proceedings of ICLR
.
Online
.
Lauscher
,
Anne
,
Vinit
Ravishankar
,
Ivan
Vulić
, and
Goran
Glavaš
.
2020
.
From zero to hero: On the limitations of zero-shot cross-lingual transfer with multilingual transformers
. In
Proceedings of EMNLP
, pages
4483
4499
.
Online
.
Lauscher
,
Anne
,
Ivan
Vulić
,
Edoardo Maria
Ponti
,
Anna
Korhonen
, and
Goran
Glavaš
.
2019
.
Informing unsupervised pretraining with external linguistic knowledge
.
arXiv preprint arXiv:1909.02339
.
Lazaridou
,
Angeliki
,
Georgiana
Dinu
, and
Marco
Baroni
.
2015
.
Hubness and pollution: Delving into cross-space mapping for zero-shot learning
. In
Proceedings of ACL
, pages
270
280
.
Beijing
. DOI: https://doi.org/10.3115/v1/P15-1027
Le
,
Hang
,
Loïc
Vial
,
Jibril
Frej
,
Vincent
Segonne
,
Maximin
Coavoux
,
Benjamin
Lecouteux
,
Alexandre
Allauzen
,
Benoît
Crabbé
,
Laurent
Besacier
, and
Didier
Schwab
.
2019
.
FlauBERT: Unsupervised language model pre-training for French
.
CoRR
,
abs/1912.05372
.
Leviant
,
Ira
and
Roi
Reichart
.
2015
.
Separated by an un-common language: Towards judgment language informed vector space modeling
.
CoRR
,
abs/1508.00106
.
Levin
,
Beth
.
1993
.
English Verb Classes and Alternation, A Preliminary Investigation
.
The University of Chicago Press
.
Levy
,
Omer
and
Yoav
Goldberg
.
2014
.
Dependency-based word embeddings
. In
Proceedings of ACL
, pages
302
308
.
Baltimore, MD
. DOI: https://doi.org/10.3115/v1/P14-2050, PMID: 25270273
Lewis
,
Patrick S. H.
,
Barlas
Oguz
,
Ruty
Rinott
,
Sebastian
Riedel
, and
Holger
Schwenk
.
2019
.
MLQA: Evaluating cross-lingual extractive question answering
.
CoRR
,
abs/1910.07475
. DOI: https://doi.org/10.18653/v1/2020.acl-main.653
Liang
,
Yaobo
,
Nan
Duan
,
Yeyun
Gong
,
Ning
Wu
,
Fenfei
Guo
,
Weizhen
Qi
,
Ming
Gong
,
Linjun
Shou
,
Daxin
Jiang
,
Guihong
Cao
,
Xiaodong
Fan
,
Bruce
Zhang
,
Rahul
Agrawal
,
Edward
Cui
,
Sining
Wei
,
Taroon
Bharti
,
Ying
Qiao
,
Jiun-Hung
Chen
,
Winnie
Wu
,
Shuguang
Liu
,
Fan
Yang
,
Rangan
Majumder
, and
Ming
Zhou
.
2020
.
XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation
.
CoRR
,
abs/2004.01401
.
Lin, Yu Hsiang, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, and Graham Neubig. 2019. Choosing transfer languages for cross-lingual learning. In Proceedings of ACL, pages 3125–3135. Florence. DOI: https://doi.org/10.18653/v1/P19-1301
Littell, Patrick, David R. Mortensen, Ke Lin, Katherine Kairis, Carlisle Turner, and Lori Levin. 2017. Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of EACL, pages 8–14. Valencia. DOI: https://doi.org/10.18653/v1/E17-2002
Liu, Qianchu, Diana McCarthy, Ivan Vulić, and Anna Korhonen. 2019a. Investigating cross-lingual alignment methods for contextualized embeddings with token-level evaluation. In Proceedings of CoNLL, pages 33–43. Hong Kong. DOI: https://doi.org/10.18653/v1/K19-1004
Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
Lucas, Margery. 2000. Semantic priming without association: A meta-analytic review. Psychonomic Bulletin & Review, 7(4):618–630. DOI: https://doi.org/10.3758/BF03212999, PMID: 11206202
Luong, Thang, Richard Socher, and Christopher Manning. 2013. Better word representations with recursive neural networks for morphology. In Proceedings of CoNLL, pages 104–113. Sofia.
Lyons, John. 1977. Semantics, volume 2. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9781139165693
Majid, Asifa, Melissa Bowerman, Miriam van Staden, and James S. Boster. 2007. The semantic categories of cutting and breaking events: A cross-linguistic perspective. Cognitive Linguistics, 18(2):133–152. DOI: https://doi.org/10.1515/COG.2007.005
Malaviya, Chaitanya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology prediction. In Proceedings of EMNLP, pages 2529–2535. Copenhagen. DOI: https://doi.org/10.18653/v1/D17-1268
Mantel, Nathan. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research, 27(2 Part 1):209–220.
McKeown, Kathleen R., Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Judith L. Klavans, Ani Nenkova, Carl Sable, Barry Schiffman, and Sergey Sigelman. 2002. Tracking and summarizing news on a daily basis with Columbia’s newsblaster. In Proceedings of HLT, pages 280–285. San Diego, CA. DOI: https://doi.org/10.3115/1289189.1289212
Melamud, Oren, David McClosky, Siddharth Patwardhan, and Mohit Bansal. 2016. The role of context types and dimensionality in learning word embeddings. In Proceedings of NAACL-HLT, pages 1030–1040. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1118
Mikolov, Tomas, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of LREC, pages 52–55. Miyazaki.
Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NeurIPS, pages 3111–3119. South Lake Tahoe, NV.
Miller, George A. 1995. WordNet: A lexical database for English. Communications of the ACM, pages 39–41. DOI: https://doi.org/10.1145/219717.219748
Mimno, David and Laure Thompson. 2017. The strange geometry of skip-gram with negative sampling. In Proceedings of EMNLP, pages 2873–2878. Copenhagen. DOI: https://doi.org/10.18653/v1/D17-1308
Mohiuddin, Tasnim and Shafiq Joty. 2019. Revisiting adversarial autoencoder for unsupervised word translation with cycle consistency and improved training. In Proceedings of NAACL-HLT, pages 3857–3867. Minneapolis, MN. DOI: https://doi.org/10.18653/v1/N19-1386
Mrkšić, Nikola, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of NAACL-HLT, pages 142–148. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1018
Mrkšić, Nikola, Ivan Vulić, Diarmuid Ó Séaghdha, Ira Leviant, Roi Reichart, Milica Gašić, Anna Korhonen, and Steve Young. 2017. Semantic specialisation of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the ACL, 5:309–324. DOI: https://doi.org/10.1162/tacl_a_00063
Mu, Jiaqi, Suma Bhat, and Pramod Viswanath. 2018. All-but-the-top: Simple and effective postprocessing for word representations. In Proceedings of ICLR. Vancouver.
Mykowiecka, Agnieszka, Małgorzata Marciniak, and Piotr Rychlik. 2018. SimLex-999 for Polish. In Proceedings of LREC, pages 2398–2402. Miyazaki.
Nelson, Douglas L., Cathy L. McEvoy, and Thomas A. Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, 36(3):402–407. DOI: https://doi.org/10.3758/BF03195588, PMID: 15641430
Netisopakul, Ponrudee, Gerhard Wohlgenannt, and Aleksei Pulich. 2019. Word similarity datasets for Thai: Construction and evaluation. CoRR, abs/1904.04307. DOI: https://doi.org/10.1109/ACCESS.2019.2944151
Nivre, Joakim, Mitchell Abrams, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, et al. 2019. Universal Dependencies 2.4. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
Östling, Robert and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of EACL, pages 644–649. Valencia, Spain. DOI: https://doi.org/10.18653/v1/E17-2102
Pearson, Karl. 1901. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572. DOI: https://doi.org/10.1080/14786440109462720
Peters, Matthew, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237. New Orleans, LA. DOI: https://doi.org/10.18653/v1/N18-1202
Pilehvar, Mohammad Taher, Dimitri Kartsaklis, Victor Prokhorov, and Nigel Collier. 2018. Card-660: Cambridge rare word dataset - a reliable benchmark for infrequent word representation models. In Proceedings of EMNLP, pages 1391–1401. Brussels. DOI: https://doi.org/10.18653/v1/D18-1169
Pires, Telmo, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of ACL, pages 4996–5001. Florence. DOI: https://doi.org/10.18653/v1/P19-1493
Ponti, Edoardo Maria, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning. arXiv preprint arXiv:2005.00333. Online.
Ponti, Edoardo Maria, Helen O’Horan, Yevgeni Berzak, Ivan Vulić, Roi Reichart, Thierry Poibeau, Ekaterina Shutova, and Anna Korhonen. 2019a. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Computational Linguistics, 45(3):559–601. DOI: https://doi.org/10.1162/coli_a_00357
Ponti, Edoardo Maria, Roi Reichart, Anna Korhonen, and Ivan Vulić. 2018a. Isomorphic transfer of syntactic structures in cross-lingual NLP. In Proceedings of ACL, pages 1531–1542. Melbourne. DOI: https://doi.org/10.18653/v1/P18-1142
Ponti, Edoardo Maria, Ivan Vulić, Ryan Cotterell, Roi Reichart, and Anna Korhonen. 2019b. Towards zero-shot language modeling. In Proceedings of EMNLP-IJCNLP, pages 2893–2903. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1288
Ponti, Edoardo Maria, Ivan Vulić, Goran Glavaš, Nikola Mrkšić, and Anna Korhonen. 2018b. Adversarial propagation and zero-shot cross-lingual transfer of word vector specialization. In Proceedings of EMNLP, pages 282–293. Brussels. DOI: https://doi.org/10.18653/v1/D18-1026
Ponti, Edoardo Maria, Ivan Vulić, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019c. Cross-lingual semantic specialization via lexical relation induction. In Proceedings of EMNLP, pages 2206–2217. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1226
Quinn, Naomi and Dorothy Holland. 1987. Culture and cognition. Cultural Models in Language and Thought, pages 3–40. DOI: https://doi.org/10.1017/CBO9780511607660.002, PMID: 3552892
Radovanović, Miloš, Alexandros Nanopoulos, and Mirjana Ivanović. 2010. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11:2487–2531.
Rasooli, Mohammad Sadegh and Michael Collins. 2017. Cross-lingual syntactic transfer with limited resources. Transactions of the ACL, 5:279–293. DOI: https://doi.org/10.1162/tacl_a_00061
Ren, Liliang, Kaige Xie, Lu Chen, and Kai Yu. 2018. Towards universal dialogue state tracking. In Proceedings of EMNLP, pages 2780–2786. Brussels. DOI: https://doi.org/10.18653/v1/D18-1299
Roemmele, Melissa, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 2011 AAAI Spring Symposium Series. Palo Alto, CA.
Rotman, Guy and Roi Reichart. 2019. Deep contextualized self-training for low resource dependency parsing. Transactions of the ACL, 7:695–713. DOI: https://doi.org/10.1162/tacl_a_00294
Ruder, Sebastian, Anders Søgaard, and Ivan Vulić. 2019. Unsupervised cross-lingual representation learning. In Proceedings of ACL: Tutorial Abstracts, pages 31–38. Florence. DOI: https://doi.org/10.18653/v1/P19-4007
Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. 2019. A survey of cross-lingual embedding models. Journal of Artificial Intelligence Research, 65:569–631. DOI: https://doi.org/10.1613/jair.1.11640
Rzymski, Christoph, Tiago Tresoldi, Simon J. Greenhill, Mei-Shin Wu, Nathanael E. Schweikhard, Maria Koptjevskaja-Tamm, Volker Gast, Timotheus A. Bodt, Abbie Hantgan, Gereon A. Kaiping, et al. 2020. The database of cross-linguistic colexifications, reproducible analysis of cross-linguistic polysemies. Scientific Data, 7(1):1–12. DOI: https://doi.org/10.1038/s41597-019-0341-x, PMID: 31932593, PMCID: PMC6957499
Sahlgren, Magnus. 2006. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations Between Words in High-Dimensional Vector Spaces. Ph.D. thesis, Stockholm University.
Sakaizawa, Yuya and Mamoru Komachi. 2018. Construction of a Japanese word similarity dataset. In Proceedings of LREC, pages 948–951. Miyazaki.
Schlechtweg, Dominik, Anna Hätty, Marco Del Tredici, and Sabine Schulte im Walde. 2019. A wind of change: Detecting and evaluating lexical semantic change across times and domains. In Proceedings of ACL, pages 732–746. Florence. DOI: https://doi.org/10.18653/v1/P19-1072
Schuster, Mike and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In International Conference on Acoustics, Speech and Signal Processing, pages 5149–5152. Kyoto. DOI: https://doi.org/10.1109/ICASSP.2012.6289079
Schwartz, Roy, Roi Reichart, and Ari Rappoport. 2015. Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of CoNLL, pages 258–267. Beijing. DOI: https://doi.org/10.18653/v1/K15-1026
Schwartz, Roy, Roi Reichart, and Ari Rappoport. 2016. Symmetric patterns and coordinations: Fast and enhanced representations of verbs and adjectives. In Proceedings of NAACL-HLT, pages 499–505. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1060, PMCID: PMC4832855
Smith, Samuel L., David H. P. Turban, Steven Hamblin, and Nils Y. Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In Proceedings of ICLR (Conference Track). Toulon.
Snyder, Benjamin and Regina Barzilay. 2010. Climbing the tower of Babel: Unsupervised multilingual learning. In Proceedings of ICML, pages 29–36. Haifa.
Søgaard, Anders, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of ACL, pages 778–788. Melbourne. DOI: https://doi.org/10.18653/v1/P18-1072
Suzuki, Ikumi, Kazuo Hara, Masashi Shimbo, Marco Saerens, and Kenji Fukumizu. 2013. Centering similarity measures to reduce hubs. In Proceedings of EMNLP, pages 613–623. Seattle, WA.
Tang, Shuai, Mahta Mousavi, and Virginia R. de Sa. 2019. An empirical study on post-processing methods for word embeddings. CoRR, abs/1905.10971.
Trier, Jost. 1931. Der Deutsche Wortschatz im Sinnbezirk des Verstandes: Die Geschichte eines sprachlichen Feldes. 1. Von den Anfängen bis zum Beginn des 13. Jahrhunderts. Ph.D. thesis, University of Bonn.
Tsvetkov, Yulia, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, and Chris Dyer. 2016. Polyglot neural language models: A case study in cross-lingual phonetic representation learning. In Proceedings of NAACL-HLT, pages 1357–1366. San Diego, CA. DOI: https://doi.org/10.18653/v1/N16-1161
Turney, Peter D. 2012. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533–585. DOI: https://doi.org/10.1613/jair.3640
Turney, Peter D. and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188. DOI: https://doi.org/10.1613/jair.2934
Upadhyay, Shyam, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of ACL, pages 1661–1670. Berlin. DOI: https://doi.org/10.18653/v1/P16-1157
van den Berg, Robert A., Huub C. J. Hoefsloot, Johan A. Westerhuis, Age K. Smilde, and Mariët J. van der Werf. 2006. Centering, scaling, and transformations: Improving the biological information content of metabolomics data. BMC Genomics, 7(1):142. DOI: https://doi.org/10.1186/1471-2164-7-142, PMID: 16762068, PMCID: PMC1534033
Vania, Clara and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of ACL, pages 2016–2027. Vancouver. DOI: https://doi.org/10.18653/v1/P17-1184
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NeurIPS, pages 6000–6010. Long Beach, CA.
Vejdemo, Susanne. 2018. Lexical change often begins and ends in semantic peripheries: Evidence from color linguistics. Pragmatics & Cognition, 25(1):50–85. DOI: https://doi.org/10.1075/pc.00005.vej
Venekoski, Viljami and Jouko Vankka. 2017. Finnish resources for evaluating language model semantics. In Proceedings of NODALIDA, pages 231–236. Gothenburg.
Virtanen, Antti, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. CoRR, abs/1912.07076.
Vulić, Ivan, Daniela Gerz, Douwe Kiela, Felix Hill, and Anna Korhonen. 2017a. HyperLex: A large-scale evaluation of graded lexical entailment. Computational Linguistics, 43(4):781–835. DOI: https://doi.org/10.1162/COLI_a_00301
Vulić, Ivan, Goran Glavaš, Roi Reichart, and Anna Korhonen. 2019. Do we really need fully unsupervised cross-lingual embeddings? In Proceedings of EMNLP, pages 4407–4418. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1449
Vulić, Ivan, Douwe Kiela, and Anna Korhonen. 2017. Evaluation by association: A systematic study of quantitative word association evaluation. In Proceedings of EACL, pages 163–175. Valencia. DOI: https://doi.org/10.18653/v1/E17-1016
Vulić, Ivan and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of ACL, pages 247–257. Berlin. DOI: https://doi.org/10.18653/v1/P16-1024
Vulić, Ivan, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of EMNLP, pages 7222–7240. Online.
Vulić, Ivan, Simone Paolo Ponzetto, and Goran Glavaš. 2019. Multilingual and cross-lingual graded lexical entailment. In Proceedings of ACL, pages 4963–4974. Florence. DOI: https://doi.org/10.18653/v1/P19-1490
Vulić, Ivan, Roy Schwartz, Ari Rappoport, Roi Reichart, and Anna Korhonen. 2017b. Automatic selection of context configurations for improved class-specific word representations. In Proceedings of CoNLL, pages 112–122. Vancouver. DOI: https://doi.org/10.18653/v1/K17-1013
Wang, Zihan, Karthikeyan K, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In Proceedings of ICLR. Online.
Wieting, John, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the ACL, 3:345–358. DOI: https://doi.org/10.1162/tacl_a_00143
Williams, Adina, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122. New Orleans, LA. DOI: https://doi.org/10.18653/v1/N18-1101
Wolf, Thomas, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
Wu, Shijie, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. CoRR, abs/1911.01464.
Wu, Shijie and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In Proceedings of EMNLP, pages 833–844. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1077
Wu, Shijie and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130. Online. DOI: https://doi.org/10.18653/v1/2020.repl4nlp-1.16
Wu, Zhibiao and Martha Palmer. 1994. Verb semantics and lexical selection. In Proceedings of ACL, pages 133–138. Las Cruces, NM. DOI: https://doi.org/10.3115/981732.981751
Xing, Chao, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL-HLT, pages 1006–1011. Denver, CO. DOI: https://doi.org/10.3115/v1/N15-1104
Yang, Yinfei, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP, pages 3687–3692. Hong Kong. DOI: https://doi.org/10.18653/v1/D19-1382
Zeman, Daniel, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21. Brussels.
Zhang, Mozhi, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, and Jordan Boyd-Graber. 2019. Are girls neko or shōjo? Cross-lingual alignment of non-isomorphic embeddings with iterative normalization. In Proceedings of ACL, pages 3180–3189. Florence. DOI: https://doi.org/10.18653/v1/P19-1307
Zhang, Yuan, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL-HLT, pages 1298–1308. Minneapolis, MN.
Zhu, Yi, Benjamin Heinzerling, Ivan Vulić, Michael Strube, Roi Reichart, and Anna Korhonen. 2019. On the importance of subword information for morphological tasks in truly low-resource languages. In Proceedings of CoNLL, pages 216–226. Hong Kong. DOI: https://doi.org/10.18653/v1/K19-1021
Zhu, Yi, Ivan Vulić, and Anna Korhonen. 2019. A systematic study of leveraging subword information for learning word representations. In Proceedings of NAACL-HLT, pages 912–932. Minneapolis, MN. DOI: https://doi.org/10.18653/v1/N19-1097

## Author notes

All data are available at https://multisimlex.com/

Equal contribution.

This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits you to copy and redistribute in any medium or format, for non-commercial use only, provided that the original work is not remixed, transformed, or built upon, and that appropriate credit to the original source is given. For a full description of the license, please visit https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode.