Pre-trained language models (LMs) encode rich information about linguistic structure but their knowledge about lexical polysemy remains unclear. We propose a novel experimental setup for analyzing this knowledge in LMs specifically trained for different languages (English, French, Spanish, and Greek) and in multilingual BERT. We perform our analysis on datasets carefully designed to reflect different sense distributions, and control for parameters that are highly correlated with polysemy such as frequency and grammatical category. We demonstrate that BERT-derived representations reflect words’ polysemy level and their partitionability into senses. Polysemy-related information is more clearly present in English BERT embeddings, but models in other languages also manage to establish relevant distinctions between words at different polysemy levels. Our results contribute to a better understanding of the knowledge encoded in contextualized representations and open up new avenues for multilingual lexical semantics research.

Pre-trained contextual language models have advanced the state of the art in numerous natural language understanding tasks (Devlin et al., 2019; Peters et al., 2018). Their success has motivated a large number of studies exploring what these models actually learn about language (Voita et al., 2019a; Clark et al., 2019; Voita et al., 2019b; Tenney et al., 2019). The bulk of this interpretation work relies on probing tasks that serve to predict linguistic properties from the representations generated by the models (Linzen, 2018; Rogers et al., 2020). The focus was initially put on linguistic aspects pertaining to grammar and syntax (Linzen et al., 2016; Hewitt and Manning, 2019; Hewitt and Liang, 2019). The first probing tasks addressing semantic knowledge explored phenomena at the syntax-semantics interface, such as semantic role labeling and coreference (Tenney et al., 2019; Kovaleva et al., 2019), and the symbolic reasoning potential of LM representations (Talmor et al., 2020).

Lexical meaning was largely overlooked in early interpretation work, but is now attracting increasing attention. Pre-trained LMs have been shown to successfully leverage sense annotated data for disambiguation (Wiedemann et al., 2019; Reif et al., 2019). The interplay between word type and token-level information in the hidden representations of LSTM LMs has also been explored (Aina et al., 2019), as well as the similarity estimates that can be drawn from contextualized representations without directly addressing word meaning (Ethayarajh, 2019). In recent work, Vulić et al. (2020) probe BERT representations for lexical semantics, addressing out-of-context word similarity. Whether these models encode knowledge about lexical polysemy and sense distinctions is, however, still an open question. Our work aims to fill this gap.

We propose a methodology for exploring the knowledge about word senses encoded in contextualized representations. Our approach follows a rigorous experimental protocol appropriate for lexical semantic analysis, which involves the use of datasets carefully designed to reflect different sense distributions. This allows us to investigate the knowledge models acquire during training, and the influence of context variation on token representations. We account for the strong correlation between word frequency and number of senses (Zipf, 1945), and for the relationship between grammatical category and polysemy, by balancing the frequency and part of speech (PoS) distributions in our datasets and applying a frequency-based model to polysemy prediction.

Importantly, our investigation encompasses monolingual models in different languages (English, French, Spanish, and Greek) and multilingual BERT (mBERT). We demonstrate that BERT contextualized representations encode an impressive amount of knowledge about polysemy, and are able to distinguish monosemous (mono) from polysemous (poly) words in a variety of settings and configurations (cf. Figure 1). Moreover, we demonstrate that representations derived from contextual LMs encode knowledge about words’ polysemy acquired through pre-training, which is combined with information from new contexts of use (Sections 3–6). Additionally, we show that BERT representations can serve to determine how easy it is to partition a word’s semantic space into senses (Section 7).

Figure 1: BERT distinguishes monosemous (mono) from polysemous (poly) words in all layers. Representations for a poly word are obtained from sentences reflecting up to ten different senses (poly-bal), the same sense (poly-same), or natural occurrence in a corpus (poly-rand).

Our methodology can serve for the analysis of words and datasets from different topics, domains, and languages. Knowledge about words’ polysemy and sense partitionability has numerous practical implications: It can guide decisions towards a sense clustering or a per-instance approach in applications (Reisinger and Mooney, 2010; Neelakantan et al., 2014; Camacho-Collados and Pilehvar, 2018); point to words with stable semantics which can be safe cues for disambiguation in running text (Leacock et al., 1998; Agirre and Martinez, 2004; Loureiro and Camacho-Collados, 2020); determine the context size needed for disambiguation (e.g., in queries, chatbots); and help lexicographers define the number of entries a word should have in a resource, and plan the time and effort needed in semantic annotation tasks (McCarthy et al., 2016). It could also guide cross-lingual transfer, serving to identify polysemous words for which transfer might be harder. Finally, analyzing words’ semantic space can be highly useful for the study of lexical semantic change (Rosenfeld and Erk, 2018; Dubossarsky et al., 2019; Giulianelli et al., 2020; Schlechtweg et al., 2020). We make our code and datasets available to enable comparison across studies and to promote further research in these directions.1

The knowledge pre-trained contextual LMs encode about lexical semantics has only recently started being explored. Works by Reif et al. (2019) and Wiedemann et al. (2019) propose experiments using representations built from Wikipedia and the SemCor corpus (Miller et al., 1993), and show that BERT can organize word usages in the semantic space in a way that reflects the meaning distinctions present in the data. It is also shown that BERT can perform well in the word sense disambiguation (WSD) task by leveraging the sense-related information available in these resources. These works address the disambiguation capabilities of the model but do not show what BERT actually knows about words’ polysemy, which is the main axis of our work. In our experiments, sense annotations are not used to guide the models into establishing sense distinctions, but rather for creating controlled conditions that allow us to analyze BERT’s inherent knowledge of lexical polysemy.

Our proposed experimental setup is aimed at investigating the polysemy information encoded in the representations built at different layers of deep pre-trained LMs. Our approach relies on the similarity of contextualized representations, which amounts to word usage similarity (Usim) estimation, a classical task in lexical semantics (Erk et al., 2009; Huang et al., 2012; Erk et al., 2013). The Usim task involves predicting the similarity of word instances in context without use of sense annotations. BERT has been shown to be particularly good at this task (Garí Soler et al., 2019; Pilehvar and Camacho-Collados, 2019). Our experiments allow us to explore and understand what underlies this ability.

### 3.1 Dataset Creation

We build our English dataset using SemCor 3.0 (Miller et al., 1993), a corpus manually annotated with WordNet senses (Fellbaum, 1998). It is important to note that we do not use the annotations for training or evaluating any of the models. These only serve to control the composition of the sentence pools that are used for generating contextualized representations, and to analyze the results. We form sentence pools for monosemous (mono) and polysemous (poly) words that occur at least ten times in SemCor.3 For each mono word, we randomly sample ten of its instances in the corpus. For each poly word, we form three sentence pools of size ten reflecting different sense distributions:

• Balanced (poly-bal). We sample a sentence for each sense of the word in SemCor until a pool of ten sentences is formed.

• Random (poly-rand). We randomly sample ten poly word instances from SemCor. We expect this pool to be highly biased towards a specific sense due to the skewed frequency distribution of word senses (Kilgarriff, 2004; McCarthy et al., 2004). This configuration is closer to the expected natural occurrence of senses in a corpus; it thus serves to estimate the behaviour of the models in a real-world setting.

• Same sense (poly-same). We sample ten sentences illustrating only one sense of the poly word. Although the composition of this pool is similar to that of the mono pool (i.e., all instances describe the same sense), we call it poly-same because it describes one sense of a polysemous word.4 Specifically, we want to explore whether BERT representations derived from these instances can serve to distinguish mono from poly words.

The controlled composition of the poly sentence pools allows us to investigate the behavior of the models when they are exposed to instances of polysemous words describing the same or different senses. There are 1,765 poly words in SemCor with at least 10 sentences available.5 We randomly subsample 418 of these in order to balance the mono and poly classes. Our English dataset is composed of 836 mono and poly words, and their instances in 8,195 unique sentences. Table 1 shows a sample of the sentences in different pools. For French, Spanish, and Greek, we retrieve sentences from the Eurosense corpus (Delli Bovi et al., 2017), which contains texts from Europarl automatically annotated with BabelNet word senses (Navigli and Ponzetto, 2012).6 We extract sentences from the high-precision version7 of Eurosense, and create sentence pools in the same way as in English, balancing the number of monosemous and polysemous words (418). We determine the number of senses for a word as the number of its BabelNet senses that are mapped to a WordNet sense.8
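For concreteness, the pool construction can be sketched as follows; this is a minimal illustration assuming each word’s sense-annotated instances are given as (sentence, sense) pairs, and the function and variable names are ours rather than from the released code:

```python
import random
from collections import defaultdict

POOL_SIZE = 10

def build_pools(instances, seed=0):
    """instances: list of (sentence, sense) pairs, at least POOL_SIZE long."""
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for sent, sense in instances:
        by_sense[sense].append(sent)

    # poly-rand: ten instances sampled regardless of their sense.
    poly_rand = rng.sample([sent for sent, _ in instances], POOL_SIZE)

    # poly-same: ten sentences illustrating a single sense (if one
    # sense has enough instances).
    rich = [sents for sents in by_sense.values() if len(sents) >= POOL_SIZE]
    poly_same = rng.sample(rng.choice(rich), POOL_SIZE) if rich else None

    # poly-bal: one sentence per sense per round, until the pool is full.
    poly_bal, rnd = [], 0
    while len(poly_bal) < POOL_SIZE:
        for sents in by_sense.values():
            if rnd < len(sents) and len(poly_bal) < POOL_SIZE:
                poly_bal.append(sents[rnd])
        rnd += 1
    return poly_rand, poly_same, poly_bal
```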

Table 1: Example sentences for the monosemous noun hotel and the polysemous noun room.

| Setting | Word | Sense | Sentences |
| --- | --- | --- | --- |
| mono | hotel.n | inn | The walk ended, inevitably, right in front of his hotel building. |
| | | inn | Maybe he’s at the hotel |
| poly-same | room.n | chamber | The room vibrated as if a giant hand had rocked it. |
| | | chamber | (…) Tell her to come to Adam’s room (…) |
| poly-bal | room.n | chamber | (…) he left the room, walked down the hall (…) |
| | | space | It gives them room to play and plenty of fresh air. |
| | | opportunity | Even here there is room for some variation, for metal surfaces vary (…) |

### 3.2 Contextualized Word Representations

We experiment with representations generated by three English models: BERT (Devlin et al., 2019),9 ELMo (Peters et al., 2018), and context2vec (Melamud et al., 2016). BERT is a Transformer architecture (Vaswani et al., 2017) that is jointly trained for a masked LM and a next sentence prediction task. Masked LM involves a Cloze-style task, where the model needs to guess randomly masked words by jointly conditioning on their left and right context. We use the bert-base-uncased and bert-base-cased models, pre-trained on the BooksCorpus (Zhu et al., 2015) and English Wikipedia. ELMo is a bi-LSTM LM trained on Wikipedia and news crawl data from WMT 2008-2012. We use 1024-d representations from the 5.5B model.10 Context2vec is a neural model based on word2vec’s CBOW architecture (Mikolov et al., 2013) which learns embeddings of wide sentential contexts using a bi-LSTM. The model produces representations for words and their context. We use the context representations from a 600-d context2vec model pre-trained on the ukWaC corpus (Baroni et al., 2009).11

For French, Spanish, and Greek, we use BERT models specifically trained for each language: Flaubert (flaubert_base_uncased) (Le et al., 2020), BETO (bert-base-spanish-wwm-uncased) (Cañete et al., 2020), and Greek BERT (bert-base-greek-uncased-v1) (Koutsikakis et al., 2020). We also use the bert-base-multilingual-cased model for each of the four languages. mBERT was trained on Wikipedia data of 104 languages.12 All BERT models generate 768-d representations.
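For illustration, per-layer token representations can be extracted with the Huggingface transformers library roughly as below; the wordpiece averaging follows footnote 13, while the function name and example sentence are only illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def layer_representations(words, target_idx):
    """Return one vector per BERT layer for the word at target_idx,
    averaging wordpieces when the word is split (cf. footnote 13)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        # hidden_states: embedding output followed by one tensor per layer.
        hidden_states = model(**enc).hidden_states
    # Positions of the wordpieces belonging to the target word.
    positions = [i for i, wid in enumerate(enc.word_ids()) if wid == target_idx]
    return [layer[0, positions].mean(dim=0) for layer in hidden_states[1:]]

vecs = layer_representations("Maybe he 's at the hotel".split(), 5)
print(len(vecs), vecs[0].shape)  # 12 layers, one 768-d vector each
```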

### 3.3 The Self-Similarity Metric

All models produce representations that describe word meaning in specific contexts of use. For each instance i of a target word w in a sentence, we extract its representation from: (i) each of the 12 layers of a BERT model;13 (ii) each of the three ELMo layers; and (iii) context2vec. We calculate self-similarity (SelfSim) (Ethayarajh, 2019) for w in a sentence pool p and a layer l, by taking the average of the pairwise cosine similarities of the representations of its instances in l:
$SelfSim_l(w) = \frac{1}{|I|^2 - |I|} \sum_{i \in I} \sum_{\substack{j \in I \\ j \neq i}} \cos\left(x^l_{w_i}, x^l_{w_j}\right)$ (1)
In Equation (1), |I| is the number of instances for w (ten in our experiments); $x^l_{w_i}$ and $x^l_{w_j}$ are the representations for instances i and j of w in layer l. The SelfSim score is in the range [−1, 1]. We report the average SelfSim for all w’s in a pool p. We expect it to be higher for monosemous words and words with low polysemy than for highly polysemous words. We also expect the poly-same pool to have a higher average SelfSim score than the other poly pools, which contain instances of different senses.
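Equation (1) translates directly into code; a minimal numpy sketch, with placeholder instance vectors standing in for a word’s pool of ten representations at one layer:

```python
import numpy as np

def self_sim(vectors):
    """Average pairwise cosine similarity between a word's instance
    representations at one layer (Equation (1))."""
    X = np.asarray([v / np.linalg.norm(v) for v in vectors])
    sims = X @ X.T                      # all pairwise cosines, diagonal = 1
    n = len(X)
    # Sum of off-diagonal entries divided by |I|^2 - |I|.
    return (sims.sum() - np.trace(sims)) / (n**2 - n)

pool = [np.random.randn(768) for _ in range(10)]  # placeholder instance vectors
print(round(self_sim(pool), 3))
```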

Contextualization has a strong impact on SelfSim since it introduces variation in the token-level representations, making them more dissimilar. The SelfSim value for a word would be 1 with non-contextualized (or static) embeddings, as all its instances would be assigned the same vector. In contextual models, SelfSim is lower in layers where the impact of the context is stronger (Ethayarajh, 2019). It is, however, important to note that contextualization in BERT models is not monotonic, as shown by previous studies of the models’ internal workings (Voita et al., 2019a; Ethayarajh, 2019). Our experiments presented in the next section provide additional evidence in this respect.

### 3.4 Results and Discussion

#### 3.4.1 Mono-Poly in English

Figure 2 shows the average SelfSim value obtained for each sentence pool with representations produced by BERT models. The thin lines in the first plot illustrate the average SelfSim score calculated for mono and poly words using representations from each layer of the uncased English BERT model. We observe a clear distinction of words according to their polysemy: SelfSim is higher for mono than for poly words across all layers and sentence pools. BERT establishes a clear distinction even between the mono and poly-same pools, which contain instances of only one sense. This distinction is important; it suggests that BERT encodes information about a word’s monosemous or polysemous nature regardless of the sentences that are used to derive the contextualized representations. Specifically, BERT produces less similar representations for word instances in the poly-same pool compared to mono, reflecting that poly words can have different meanings.

Figure 2: Average SelfSim obtained with monolingual BERT models (top row) and mBERT (bottom row) across all layers (horizontal axis). In the first plot, thick lines correspond to the cased model.

We also observe a clear ordering of the three poly sentence pools: Average SelfSim is higher in poly-same, which only contains instances of one sense, followed by mid-range values in poly-rand, and gets its lowest values in the balanced setting (poly-bal). This is noteworthy given that poly-rand contains a mix of senses but with a stronger representation of w’s most frequent sense than poly-bal (71% vs. 47%).14

Our results demonstrate that BERT representations encode two types of lexical semantic knowledge: information about the polysemous nature of words acquired through pre-training (as reflected in the distinction between mono and poly-same), and information from the particular instances of a word used to create the contextualized representations (as shown by the finer-grained distinctions between different poly settings). BERT’s knowledge about polysemy can be due to differences in the types of context where words of different polysemy levels occur. We expect poly words to be seen in more varied contexts than mono words, reflecting their different senses. BERT encodes this variation with the LM objective through exposure to large amounts of data, and this is reflected in the representations. The same ordering pattern is observed with mBERT (lower part of Figure 2) and with ELMo (Figure 3(a)). With context2vec, average SelfSim is 0.40 in mono, 0.38 in poly-same, 0.37 in poly-rand, and 0.35 in poly-bal. This suggests that these models also have some inherent knowledge about lexical polysemy, but differences are less clearly marked than in BERT.

Figure 3: Comparison of average SelfSim obtained for mono and poly words using ELMo representations (a), and for words in different polysemy bands in the poly-rand sentence pool (b).

Using the cased model leads to an overall increase in SelfSim and to smaller differences between bands, as shown by the thick lines in the first plot of Figure 2. Our explanation for the lower distinction ability of the bert-base-cased model is that it encodes sparser information about words than the uncased model. It was trained on a more diverse set of strings, so many wordpieces (WPs) are present in both their capitalized and non-capitalized form in the vocabulary. In spite of that, it has a smaller vocabulary size (29K WPs) than the uncased model (30.5K). Also, a higher number of WPs correspond to word parts than in the uncased model (6,478 vs. 5,829).

We test the statistical significance of the mono/poly-rand distinction using unpaired two-sample t-tests when the normality assumption is met (as determined with Shapiro–Wilk tests). Otherwise, we run a Mann–Whitney U test, the non-parametric alternative to this t-test. In order to lower the probability of type I errors (false positives), which increases when performing multiple tests, we correct p-values using the Benjamini–Hochberg False Discovery Rate (FDR) adjustment (Benjamini and Hochberg, 1995). Our results show that differences are significant across all embedding types and layers (α = 0.01).
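The testing procedure can be reproduced along the following lines with scipy and statsmodels; the per-layer score arrays below are placeholders for the actual SelfSim values:

```python
import numpy as np
from scipy.stats import mannwhitneyu, shapiro, ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
mono = {l: rng.normal(0.55, 0.05, 418) for l in range(12)}  # placeholder scores
poly = {l: rng.normal(0.50, 0.05, 418) for l in range(12)}

def compare(a, b, alpha=0.05):
    """Unpaired t-test if both samples pass Shapiro-Wilk, else Mann-Whitney U."""
    if shapiro(a).pvalue > alpha and shapiro(b).pvalue > alpha:
        return ttest_ind(a, b).pvalue
    return mannwhitneyu(a, b, alternative="two-sided").pvalue

raw_p = [compare(mono[l], poly[l]) for l in range(12)]
# Benjamini-Hochberg FDR correction over the per-layer tests.
rejected, p_adj, _, _ = multipletests(raw_p, alpha=0.01, method="fdr_bh")
print(rejected.all())
```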

The decreasing trend in SelfSim observed for BERT in Figure 2, and the peak in layer 11, confirm the phases of context encoding and token reconstruction observed by Voita et al. (2019a).15 In earlier layers, context variation makes representations more dissimilar and SelfSim decreases. In the last layers, information about the input token is recovered for LM prediction and similarity scores are boosted. Our results show clear distinctions across all BERT and ELMo layers. This suggests that lexical information is spread throughout the layers of the models, and contributes new evidence to the discussion on the localization of semantic information inside the models (Rogers et al., 2020; Vulić et al., 2020).

#### 3.4.2 Mono-Poly in Other Languages

The top row of Figure 2 shows the average SelfSim obtained for French, Spanish, and Greek words using monolingual models. Flaubert, BETO, and Greek BERT representations clearly distinguish mono and poly words, but average SelfSim values for the different poly pools are much closer than in English. BETO seems to capture these fine-grained distinctions slightly better than the French and Greek models. The second row of the figure shows results obtained with mBERT representations. The highly similar average SelfSim values assigned to different poly pools show that this distinction is harder for mBERT than for the monolingual models.

Statistical tests show that the difference between SelfSim values in mono and poly-rand is significant in all layers of BETO, Flaubert, Greek BERT, and mBERT for Spanish and French.16 The magnitude of the difference in Greek BERT is, however, smaller compared to the other models (0.03 vs. 0.09 in BETO at the layers with the biggest difference in average SelfSim).

### 4.1 SelfSim-based Ranking

In this set of experiments, we explore the impact of words’ degree of polysemy on the representations. We control for this factor by grouping words into three polysemy bands as in McCarthy et al. (2016), which correspond to a specific number of senses (k): low: 2 ≤ k ≤ 3, mid: 4 ≤ k ≤ 6, high: k > 6. For English, the three bands are populated with a different number of words: low: 551, mid: 663, high: 551. In the other languages, we form bands containing 300 words each.17 In Figure 4, we compare mono words with words in each polysemy band in terms of their average SelfSim. Values for mono words are taken from Section 3. For poly words, we use representations from the poly-rand sentence pool, which better approximates natural occurrence in a corpus. For comparison, we report in Figure 5 results obtained in English using sentences from the poly-same and poly-bal pools.18

Figure 4: Average SelfSim obtained with monolingual BERT models (top row) and mBERT (bottom row) for mono and poly words in different polysemy bands. Representations are derived from sentences in the poly-rand pool.

Figure 5: Comparison of BERT average SelfSim for mono and poly words in different polysemy bands in the poly-bal and poly-same sentence pools.

In English, the pattern is clear in all plots: SelfSim is higher for mono than for poly words in any band, confirming that BERT is able to distinguish mono from poly words at different polysemy levels. SelfSim values for a band decrease as its k increases: Words in low get higher values than words in high. These results indicate that the meaning of highly polysemous words is more variable (lower SelfSim) than the meaning of words with fewer senses. As expected, scores are higher and inter-band similarities are closer in poly-same (cf. Figure 5(b)) compared with poly-rand and poly-bal, where distinctions are clearer. The observed differences confirm that BERT can predict the polysemy level of words, even from instances describing the same sense.

We observe similar patterns with ELMo (cf. Figure 3(b)) and context2vec representations in poly-rand,19 but smaller absolute inter-band differences. In poly-same, both models fail to correctly order the bands. Overall, our results highlight that BERT encodes higher quality knowledge about polysemy. We test the significance of the inter-band differences detected in poly-rand using the same approach as in Section 3.4.1. These are significant in all but a few20 layers of the models.

The bands are also correctly ranked in the other three languages but with smaller inter-band differences than in English, especially in Greek where clear distinctions are only made in a few middle layers. This variation across languages can be explained to some extent by the quality of the automatic EuroSense annotations, which has a direct impact on the quality of the sentence pools. Results of a manual evaluation conducted by Delli Bovi et al. (2017) showed that WSD precision is ten points higher in English (81.5) and Spanish (82.5) than in French (71.8). The Greek portion, however, has not been evaluated.

Plots in the second row of Figure 4 show results obtained using mBERT. Similarly to the previous experiment (Section 3.4), mBERT overall makes less clear distinctions than the monolingual models. The low and mid bands often get similar SelfSim values, which are close to mono in French and Greek. Still, inter-band differences are significant in most layers of mBERT and the monolingual French, Spanish, and Greek models.21

### 4.2 Anisotropy Analysis

In order to better understand the reasons behind the smaller inter-band differences observed with mBERT, we conduct an additional analysis of the models’ anisotropy. We create 2,183 random word pairs from the English mono, low, mid and high bands, and 1,318 pairs in each of the other languages.22 We calculate the cosine similarity between two random instances of the words in each pair and take the average over all pairs (RandSim). The plots in the left column of Figure 6 show the results. We observe a clear difference in the scores obtained by monolingual models (solid lines) and mBERT (dashed lines). Clearly, mBERT assigns higher similarities to random words, an indication that its semantic space is more anisotropic than the one built by monolingual models. High anisotropy means that representations occupy a narrow cone in the vector space, which results in lower quality similarity estimates and in the model’s limited potential to establish clear semantic distinctions.

Figure 6: The left plots show the similarity between random words in models for each language. Plots on the right side show the difference between the similarity of random words and SelfSim in poly-rand.
We also compare RandSim to the average SelfSim obtained for poly words in the poly-rand sentence pool (cf. Section 3.1). In a high-quality semantic space, we would expect SelfSim (between instances of the same word) to be much higher than RandSim. The right column of Figure 6 shows the difference between these two scores. diff in a layer l is calculated as in Equation (2):
$diff_l = AvgSelfSim_l(\text{poly-rand}) - RandSim_l$ (2)
We observe that the difference is smaller in the space built by mBERT, which is more anisotropic than the space built by monolingual models. This is particularly obvious in the upper layers of the model. This result confirms the lower quality of mBERT’s semantic space compared to monolingual models.
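Both quantities are straightforward to compute; a sketch with placeholder vectors, where rand_sim implements the RandSim average and the last lines apply Equation (2):

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rand_sim(pairs):
    """Average cosine between two random instances of randomly paired
    words at one layer; `pairs` holds (vec_a, vec_b) tuples."""
    return float(np.mean([cos(a, b) for a, b in pairs]))

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=768), rng.normal(size=768)) for _ in range(2183)]
avg_self_sim_poly_rand = 0.55                      # placeholder per-layer value
diff = avg_self_sim_poly_rand - rand_sim(pairs)    # Equation (2)
print(round(diff, 3))
```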

Finally, we believe that another factor behind the worse mBERT performance is that the multilingual WP vocabulary is mostly English-driven, resulting in arbitrary partitionings of words in the other languages. This word splitting procedure must have an impact on the quality of the lexical information in mBERT representations.

Given the strong correlation between word frequency and number of senses (Zipf, 1945), we explore the impact of frequency on BERT representations. Our goal is to determine the extent to which it influences the good mono/poly detection results obtained in Sections 3.4 and 4.1.

### 5.1 Dataset Composition

We perform this analysis in English using frequency information from Google Ngrams (Brants and Franz, 2006). For French, Spanish, and Greek, we use frequency counts gathered from the OSCAR corpus (Suárez et al., 2019). We split the words into four ranges (F) corresponding to the quartiles of frequencies in each dataset. Each range f in F contains the same number of words. We provide detailed information about the composition of the English dataset in Figure 7.23 Figure 7(a) shows that mono words are much less frequent than poly words. Figure 7(b) shows the distribution of different PoS categories in each band. Nouns are the prevalent category in all bands and verbs are less present among mono words (10.8%), as expected. Finally, adverbs are hardly represented in the high polysemy band (1.2% of all words).

Figure 7: Composition of the English word bands in terms of frequency (a) and grammatical category (b).

### 5.2 SelfSim by Frequency Range and PoS

We examine the average BERT SelfSim per frequency range in poly-rand. Due to space constraints, we only report detailed results for the English BERT model in Figure 8 (plot (a)). The clear ordering by range suggests that BERT can successfully distinguish words by their frequency, especially in the last layers. Plot (b) in Figure 8 shows the average SelfSim for words of each PoS category. Verbs have the lowest SelfSim which is not surprising given that they are highly polysemous (as shown in Figure 7(b)). We observe the same trend for monolingual models in the other three languages.

Figure 8: Average SelfSim for English words of different frequencies and part of speech categories with BERT representations.

### 5.3 Controlling for Frequency and PoS

We conduct an additional experiment where we control for the composition of the poly bands in terms of grammatical category and word frequency. We call these two settings pos-bal and freq-bal. We define $n_{pos}$, the smallest number of words of a specific PoS that can be found in a band. We form the pos-bal bands by subsampling from each band the same number of words ($n_{pos}$) of that PoS. For example, all pos-bal bands have $n_N$ nouns and $n_V$ verbs. We follow a similar procedure to balance the bands by frequency in the freq-bal setting. In this case, $n_f$ is the minimum number of words of a specific frequency range f that can be found in a band. We form the freq-bal dataset by subsampling from each band the same number of words ($n_f$) of a given range f in F.

Table 2 shows the distribution of words per PoS and frequency range in the pos-bal and freq-bal settings for each language. The table reads as follows: The English pos-bal bands contain 198 nouns, 45 verbs, 64 adjectives, and 7 adverbs; similarly for the other two languages. Greek is not included in this PoS-based analysis because all sense-annotated Greek words in EuroSense are nouns. In freq-bal, each English band contains 40 words that occur less than 7.1M times in Google Ngrams, and so on.

Table 2: Content of the polysemy bands in the pos-bal and freq-bal settings. All bands for a language contain the same number of words of a specific grammatical category or frequency range. M stands for a million and m for a thousand occurrences of a word in a corpus.

pos-bal

| | N | V | Adj | Adv |
| --- | --- | --- | --- | --- |
| en | 198 | 45 | 64 | 7 |
| fr | 171 | 32 | 29 | |
| es | 167 | 22 | 40 | |

freq-bal

| | range 1 | range 2 | range 3 | range 4 |
| --- | --- | --- | --- | --- |
| en – max. frequency | 7.1M | 20M | 49M | 682M |
| en – words per band | 40 | 99 | 62 | 39 |
| fr – max. frequency | 23m | 70m | 210m | 41M |
| fr – words per band | 17 | 43 | 67 | 38 |
| es – max. frequency | 64m | 233m | 793m | 59M |
| es – words per band | 12 | 39 | 58 | 48 |
| el – max. frequency | 14m | 40m | 111m | 1.9M |
| el – words per band | 13 | 41 | 70 | 42 |

We examine the average SelfSim values obtained for words in each band in poly-rand. Figure 9 shows the results for monolingual models. We observe that the mono and poly words in the pos-bal and freq-bal bands are ranked in the same way as in Figure 4. This shows that BERT’s polysemy predictions do not rely on frequency or part of speech. The only exception is Greek BERT, which cannot establish correct inter-band distinctions when the influence of frequency is neutralized in the freq-bal setting. A general observation that applies to all models is that although inter-band distinctions become less clear, the ordering of the bands is preserved. We observe the same trend with ELMo and context2vec.

Figure 9: Average SelfSim inside the poly bands balanced for frequency (freq-bal) and part of speech (pos-bal). SelfSim is calculated using representations generated by monolingual BERT models from sentences in each language-specific pool. We do not balance the Greek dataset for PoS because it only contains nouns.

Statistical tests show that all inter-band distinctions established by English BERT are still significant in most layers of the model.24 This is not the case for ELMo and context2vec, which can distinguish between mono and poly words but fail to establish significant distinctions between polysemy bands in the balanced settings. For French and Spanish, the statistical analysis shows that all distinctions in pos-bal are significant in at least one layer of the models. The same applies to the mono → poly distinction in freq-bal, but finer-grained distinctions disappear.25

Our finding that word instance similarity differs across polysemy bands suggests that this feature can be useful for classification. In this section, we probe the representations for polysemy using a classification experiment where we test their ability to predict whether a word is polysemous, and which poly band it falls in. We use the poly-rand sentence pools and a standard train/dev/test split (70%/15%/15%) of the data. For the mono/poly distinction (i.e., the data used in Section 3), this results in 584/126/126 words per set in each language. To guarantee a fair evaluation, we make sure there is no overlap between the lemmas in the three sets. We use two types of features: (i) the average SelfSim for a word; and (ii) all pairwise cosine similarities collected for its instances, which results in 45 features per word (pairCos). We train a binary logistic regression classifier for each type of representation and feature.

As explained in Section 4, the three poly bands (low, mid, and high) and mono contain a different number of words. For classification into polysemy bands, we balance each class by randomly subsampling words from each band. In total, we use 1,168 words for training, 252 for development, and 252 for testing (70%/15%/15%) in English. In the other languages, we use a split of 840/180/180 words. We train multi-class logistic regression classifiers with the two types of features, SelfSim and pairCos. We compare the results of the classifiers to a baseline that always predicts the same class, and to a frequency-based classifier which only uses the words’ log frequency in Google Ngrams, or in the OSCAR corpus, as a feature.
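A sketch of this setup, e.g., with scikit-learn; the feature matrices are placeholders with the shapes described above (45 pairwise cosines per word for pairCos, a single value for SelfSim or log frequency):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# pairCos features: 45 pairwise cosines per word (10 instances -> C(10,2) pairs).
X_train, y_train = rng.normal(size=(584, 45)), rng.integers(0, 2, 584)
X_test, y_test = rng.normal(size=(126, 45)), rng.integers(0, 2, 126)

# Binary mono/poly classifier; the multi-class band classifier is the
# same estimator fit on four labels (mono, low, mid, high).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy: {clf.score(X_test, y_test):.2f}")

# Frequency baseline: log frequency as the single feature (placeholder counts).
log_freq = np.log(rng.integers(1, 10**6, 584))[:, None]
freq_clf = LogisticRegression().fit(log_freq, y_train)
```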

Table 3 presents classification accuracy on the test set. We report results obtained with the best layer for each representation type and feature, as determined on the development sets. In English, the best accuracy is obtained by BERT in both the binary (0.79) and multi-class settings (0.49), followed by mBERT (0.77 and 0.46). Despite its simplicity, the frequency-based classifier obtains better results than context2vec and ELMo, and performs on par with mBERT in the binary setting. This shows that frequency information is highly relevant for the mono-poly distinction. All classifiers outperform the same-class baseline. These results are very encouraging, showing that BERT embeddings can be used to determine whether a word has multiple meanings, and provide a rough indication of its polysemy level. Results in the other three languages are not as high as those obtained in English, but most models give higher results than the frequency-based classifier.26

Table 3: Accuracy of binary (mono/poly) and multi-class (poly bands) classifiers using SelfSim and pairCos features on the test sets. Comparison to a baseline that always predicts the same class and to a classifier that only uses log frequency as a feature. Subscripts denote the layers used.

| | Model | mono/poly SelfSim | mono/poly pairCos | poly bands SelfSim | poly bands pairCos |
| --- | --- | --- | --- | --- | --- |
| en | BERT | 0.76₁₀ | 0.79₈ | 0.49₁₀ | 0.46₁₀ |
| | mBERT | 0.77₈ | 0.75₈ | 0.46₁₂ | 0.43₁₂ |
| | ELMo | 0.69₂ | 0.63₃ | 0.37₂ | 0.34₃ |
| | context2vec | 0.61 | 0.61 | 0.34 | 0.31 |
| | Frequency | 0.77 | | 0.41 | |
| fr | Flaubert | 0.58₇ | 0.55₆ | 0.29₈ | 0.27₉ |
| | mBERT | 0.66₉ | 0.64₉ | 0.38₇ | 0.38₈ |
| | Frequency | 0.61 | | 0.37 | |
| es | BETO | 0.70₉ | 0.66₇ | 0.42₆ | 0.48₅ |
| | mBERT | 0.69₁₁ | 0.64₇ | 0.38₉ | 0.43₇ |
| | Frequency | 0.67 | | 0.41 | |
| el | Greek BERT | 0.70₄ | 0.64₄ | 0.34₄ | 0.38₆ |
| | mBERT | 0.60₇ | 0.65₇ | 0.32₁₁ | 0.34₉ |
| | Frequency | 0.63 | | 0.35 | |
| | Baseline | 0.50 | | 0.25 | |

We have shown that representations from pre-trained LMs encode rich information about words’ degree of polysemy. They can successfully distinguish mono from poly lemmas, and predict the polysemy level of words. Our previous experiments involved a set of controlled settings representing different sense distributions and polysemy levels. In this section, we explore whether these representations can also point to the clusterability of poly words in an uncontrolled setting.

Instances of some poly words are easier to group into interpretable clusters than others. This is, for example, a simple task for the ambiguous noun rock which can express two clearly separate senses (stone and music), but harder for book, which might refer to the content or object senses of the word (e.g., I read a book vs. I bought a book). In what follows, we test the ability of contextualized representations to estimate how easy this task is for a specific word, that is, its partitionability into senses.

Following McCarthy et al. (2016), we use the clusterability metrics proposed by Ackerman and Ben-David (2009) to measure the ease of clustering word instances into senses. McCarthy et al. base their clustering on the similarity of manual meaning-preserving annotations (lexical substitutes and translations). Instances of different senses, such as: Put granola bars in a bowl vs. That’s not a very high bar, present no overlap in their in-context substitutes: {snack, biscuit, block, slab} vs. {pole, marker, hurdle, barrier, level, obstruction}. Semantically related instances, on the contrary, share a different number of substitutes depending on their proximity. The need for manual annotations, however, constrains the method’s applicability to specific datasets.

We propose to extend and scale up the McCarthy et al. (2016) clusterability approach using contextualized representations, in order to make it applicable to a larger vocabulary. These experiments are carried out in English due to the lack of evaluation data in other languages.

### 7.2 Data

We run our experiments on the usage similarity (Usim) dataset (Erk et al., 2013) for comparison with previous work. Usim contains ten instances for 56 target words of different PoS from the SemEval Lexical Substitution dataset (McCarthy and Navigli, 2007). Word instances are manually annotated with pairwise similarity scores on a scale from 1 (completely different) to 5 (same meaning).

We represent target word instances in Usim in two ways: using contextualized representations generated by BERT, context2vec, and ELMo (BERT-rep, c2v-rep, ELMo-rep);27 and using substitute-based representations with automatically generated substitutes. The substitute-based approach allows for a direct comparison with the method of McCarthy et al. (2016). They represent each instance i of a word w in Usim as a vector $\vec{i}$, where each substitute s assigned to w over all its instances ($i \in I$) becomes a dimension ($d_s$). For a given i, the value for each $d_s$ is the number of annotators who proposed substitute s. $d_s$ contains a zero entry if s was not proposed for i. We refer to this type of representation as Gold-sub. We generate our substitute-based representations with BERT using the simple “word similarity” approach in Zhou et al. (2019). For an instance i of word w in context C, we rank a set of candidate substitutes $S = \{s_1, s_2, \ldots, s_n\}$ based on the cosine similarity of the BERT representations for i and for each substitute $s_j \in S$ in the same context C. We use representations from the last layer of the model. As candidate substitutes, we use the unigram paraphrases of w in the Paraphrase Database (PPDB) XXL package (Ganitkevitch et al., 2013; Pavlick et al., 2015).28

For each instance i of w, we obtain a ranking R of all substitutes in S. We remove low-quality substitutes (i.e., noisy paraphrases or substitutes referring to a different sense of w) using the filtering approach proposed by Garí Soler et al. (2019). We check each pair of substitutes in subsequent positions in R, starting from the top; if a pair is unrelated in PPDB, all substitutes from that position onwards are discarded. The idea is that good quality substitutes should be both high-ranked and semantically related. We build vectors as in McCarthy et al. (2016), using the cosine similarity assigned by BERT to each substitute as a value. We call this representation BERT-sub.
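The ranking and filtering steps can be sketched as follows; `related` stands in for a lookup of the PPDB paraphrase relation, and all names are illustrative:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_substitutes(instance_vec, substitute_vecs):
    """Rank candidates by the cosine between the last-layer BERT vector
    of the instance and of each substitute placed in the same context."""
    scored = [(s, cos(instance_vec, v)) for s, v in substitute_vecs.items()]
    return sorted(scored, key=lambda sv: -sv[1])

def filter_ranking(ranking, related):
    """Cut the ranking at the first adjacent pair unrelated in PPDB;
    everything from that position onwards is discarded."""
    kept = ranking[:1]
    for (a, _), (b, sim_b) in zip(ranking, ranking[1:]):
        if not related(a, b):
            break
        kept.append((b, sim_b))
    return kept  # kept substitutes become dimensions, cosines the values
```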

### 7.3 Sense Clustering

The clusterability metrics that we use were initially proposed for estimating the quality of the optimal clustering that can be obtained from a dataset; the better the quality of this clustering, the higher the clusterability of the dataset it is derived from (Ackerman and Ben-David, 2009).

In order to estimate the clusterability of a word w, we thus need to first cluster its instances in the data. We use the k-means algorithm, which requires the number of clusters to be specified in advance; this number (a lemma’s sense count) is, of course, different for every lemma in our dataset. We define the optimal number of clusters k for a lemma in a data-driven manner using the Silhouette coefficient (sil) (Rousseeuw, 1987), without recourse to external resources.29 For a data point i, sil compares the intra-cluster distance (i.e., the average distance from i to every other data point in the same cluster) with the average distance of i to all points in its nearest cluster. The sil value for a clustering is obtained by averaging sil for all data points, and it ranges from −1 to 1. We cluster each type of representation for w using k-means with a range of k values (2 ≤ k ≤ 10), and retain the k of the clustering with the highest mean sil. Additionally, since the cosine similarity of BERT representations correlates well with usage similarity (Garí Soler et al., 2019), we experiment with Agglomerative Clustering with average linkage directly on the cosine distance matrix obtained with BERT representations (BERT-Agg). For comparison, we also use Agglomerative Clustering on the gold usage similarity scores from Usim, transformed into distances (Gold-Agg).
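The data-driven choice of k can be sketched with scikit-learn as below (recent versions; older releases name the precomputed-distance argument affinity rather than metric). X holds one word’s ten instance vectors:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def best_kmeans(X, k_range=range(2, 11)):
    """k-means for each k; keep the clustering with the highest mean sil."""
    candidates = [KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
                  for k in k_range if k < len(X)]   # sil needs k <= n - 1
    return max(candidates, key=lambda labels: silhouette_score(X, labels))

def best_agglomerative(dist, k_range=range(2, 11)):
    """Same selection on a precomputed cosine-distance matrix (BERT-Agg)."""
    candidates = [AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                          linkage="average").fit_predict(dist)
                  for k in k_range if k < len(dist)]
    return max(candidates,
               key=lambda labels: silhouette_score(dist, labels, metric="precomputed"))

X = np.random.randn(10, 768)   # placeholder instance vectors for one word
labels = best_kmeans(X)
```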

### 7.4 Clusterability Metrics

We use in our experiments the two best performing metrics from McCarthy et al. (2016): Variance Ratio (vr) (Zhang, 2001) and Separability (sep) (Ostrovsky et al., 2012). vr calculates the ratio of the within- and between-cluster variance for a given clustering solution. sep compares the loss of the optimal clusterings with k − 1 and k clusters, and its range is [0,1). We use k-means’ sum of squared distances of data points to their closest cluster center as the loss. Details about these two metrics are given in the Appendix.30 We also experiment with sil as a clusterability metric, as it can assess cluster validity. For vr and sil, a higher value indicates higher clusterability. The inverse applies to sep.

We calculate Spearman’s ρ correlation between the results of each clusterability metric and two gold standard measures derived from Usim: Uiaa and Umid. Uiaa is the inter-annotator agreement for a lemma in terms of average pairwise Spearman’s correlation between annotators’ judgments. Higher Uiaa values indicate higher clusterability, meaning that sense partitions are clearer and easier to agree upon. Umid is the proportion of mid-range judgments (between 2 and 4) assigned by annotators to all instances of a target word. It indicates how often usages do not have identical (5) or completely different (1) meaning. Therefore, higher Umid values indicate lower clusterability.
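Both gold measures can be derived from the raw Usim judgments as follows; the judgment matrix below is a placeholder with one row per annotator and one column per instance pair:

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def uiaa(judgments):
    """Average pairwise Spearman correlation between annotators."""
    rhos = [spearmanr(a, b).correlation for a, b in combinations(judgments, 2)]
    return float(np.mean(rhos))

def umid(judgments):
    """Proportion of mid-range judgments, taken here as scores in [2, 4]
    on the 1-5 scale (i.e., neither 1 nor 5)."""
    J = np.asarray(judgments)
    return float(np.mean((J >= 2) & (J <= 4)))

judgments = np.random.default_rng(0).integers(1, 6, size=(3, 45))  # 3 annotators
print(round(uiaa(judgments), 2), round(umid(judgments), 2))
```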

### 7.5 Results and Discussion

The clusterability results are given in Table 4. Agglomerative Clustering on the gold Usim similarity scores (Gold-Agg) gives the best results on the Uiaa evaluation in combination with the sil clusterability metric (ρ = 0.80). This is unsurprising, since Uiaa and Umid are derived from the same Usim scores. Among our automatically generated representations, the strongest correlation with Uiaa (0.69) is obtained with BERT-Agg and the sil clusterability metric. The sil metric also works well with BERT-rep, achieving the strongest correlation with Umid (−0.46). It thus constitutes a good alternative to the sep and vr metrics used in previous studies when combined with BERT representations.

Table 4: Spearman’s ρ correlation between automatic metrics and gold standard clusterability estimates. Significant correlations (where the null hypothesis ρ = 0 is rejected at α = 0.05) are marked with *. The arrows indicate the expected direction of correlation for each metric. Subscripts indicate the layer that achieved best performance. The two strongest correlations obtained with each gold standard measure are in boldface.

| Gold | Metric | BERT-rep | c2v-rep | ELMo-rep | BERT-sub | Gold-sub | BERT-Agg | Gold-Agg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Uiaa | sep ↘ | −0.48*₁₀ | −0.12 | −0.24₂ | −0.03 | −0.20 | −0.48*₁₁ | – |
| | vr ↗ | 0.17₁₂ | 0.14 | 0.19₂ | 0.09 | 0.34* | 0.33*₁₂ | – |
| | sil ↗ | 0.61*₁₁ | 0.06 | 0.21₂ | 0.10 | 0.32* | **0.69***₁₀ | **0.80** |
| Umid | sep ↗ | 0.43*₉ | −0.01 | 0.08₃ | 0.05 | 0.16 | 0.43*₉ | – |
| | vr ↘ | −0.24₉ | −0.08 | −0.15₃ | −0.15 | −0.24 | −0.32*₅ | – |
| | sil ↘ | **−0.46***₁₀ | 0.05 | −0.06₂ | −0.11 | −0.38* | −0.44*₈ | **0.48** |

Interestingly, the correlations obtained using raw BERT contextualized representations are much higher than the ones observed with McCarthy et al. (2016)’s representations that rely on manual substitutes (Gold-sub). These were in the range of 0.20–0.34 for Uiaa and 0.16–0.38 for Umid (in absolute value). The results demonstrate that BERT representations offer good estimates of the partitionability of words into senses, improving over substitute annotations. As expected, the substitution-based approach performs better with clean manual substitutes (Gold-sub) than with automatically generated ones (BERT-sub).

We present a per layer analysis of the correlations obtained with the best performing BERT representations (BERT-Agg) and the sil metric in Figure 10. We report the absolute values of the correlation coefficient for a more straightforward comparison. For Uiaa, the higher layers of the model make the best predictions: Correlations increase monotonically up to layer 10, and then they show a slight decrease. Umid prediction shows a more irregular pattern: It peaks at layers 3 and 8, and decreases again in the last layers.

Figure 10: Spearman’s ρ correlations between the gold standard Uiaa and Umid scores, and clusterability estimates obtained using Agglomerative Clustering on a cosine distance matrix of BERT representations.

We have shown that contextualized BERT representations encode rich information about lexical polysemy. Our experimental results suggest that this high-quality knowledge about words, which allows BERT to detect polysemy in different configurations and across all layers, is acquired during pre-training. Our findings hold for the English BERT as well as for BERT models in other languages, as shown by our experiments on French, Spanish, and Greek, and to a lesser extent for multilingual BERT. Moreover, English BERT representations can be used to obtain a good estimate of a word’s partitionability into senses. These results open up new avenues for research in multilingual semantic analysis, and we can consider various theoretical and application-related extensions of this work.

The polysemy and sense-related knowledge revealed by the models can serve to develop novel methodologies for improved cross-lingual alignment of embedding spaces and cross-lingual transfer, pointing to more polysemous (or less clusterable) words for which transfer might be harder. Predicting the polysemy level of words can also be useful for determining the context needed for acquiring representations that properly reflect the meaning of word instances in running text. From a more theoretical standpoint, we expect this work to be useful for studies on the organization of the semantic space in different languages and on lexical semantic change.

This work has been supported by the French National Research Agency under project ANR-16- CE33-0013. The work is also part of the FoTran project, funded by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771113). We thank the anonymous reviewers and the TACL Action Editor for their careful reading of our paper, their thorough reviews, and their helpful suggestions.

Variance Ratio. First, the variance of a cluster $y$ is calculated:

$\sigma^2(y) = \frac{1}{|y|} \sum_{i \in y} (y_i - \bar{y})^2$ (3)

where $\bar{y}$ denotes the centroid of cluster $y$. Then the within-cluster variance $W$ and the between-cluster variance $B$ of a clustering solution $C$ are calculated in the following way:

$W(C) = \sum_{j=1}^{k} p_j \, \sigma^2(x_j)$ (4)

$B(C) = \sum_{j=1}^{k} p_j (\bar{x}_j - \bar{x})^2$ (5)

where $x$ is the set of all data points, $p_j = \frac{|x_j|}{|x|}$, and $x_j$ are the data points in cluster $j$. Finally, the vr of a clustering $C$ is obtained as the ratio between $B(C)$ and $W(C)$:

$VR = \frac{B(C)}{W(C)}$ (6)

Separability (sep). Given an optimal clustering $C_k$ of the dataset $x$ with $k$ clusters, sep is defined as follows:

$SEP(x, k) = \frac{loss(C_k)}{loss(C_{k-1})}$ (7)
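For reference, a direct implementation of these definitions; variance_ratio follows Equations (3)–(6) over a labeled clustering, and sep takes the k-means losses (inertia) of the optimal clusterings with k and k − 1 clusters:

```python
import numpy as np

def variance_ratio(X, labels):
    """VR = B(C) / W(C), Equations (3)-(6)."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    overall_mean = X.mean(axis=0)
    W = B = 0.0
    for j in np.unique(labels):
        cluster = X[labels == j]
        centroid = cluster.mean(axis=0)
        p_j = len(cluster) / len(X)
        # sigma^2: mean squared distance of the cluster's points to its centroid.
        W += p_j * np.mean(np.sum((cluster - centroid) ** 2, axis=1))
        B += p_j * np.sum((centroid - overall_mean) ** 2)
    return B / W

def sep(loss_k, loss_k_minus_1):
    """Equation (7): ratio of the losses of the optimal clusterings with
    k and k-1 clusters; lower values indicate higher clusterability."""
    return loss_k / loss_k_minus_1
```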
1. Our code and data are available at https://github.com/ainagari/monopoly.

2. This issue affects all tested models and is particularly present in the last layers of GPT-2, resulting in highly similar representations even for random words.

3. We find the number of senses for a word of a specific part of speech (PoS) in WordNet 3.0, which we access through the NLTK interface (Bird et al., 2009).

4. The polysemous words are the same as in poly-bal and poly-rand.

5. We use sentences of up to 100 words.

6. BabelNet is a multilingual semantic network built from multiple lexicographic and encyclopedic resources, such as WordNet and Wikipedia.

7. The high-coverage version of Eurosense is larger than the high-precision one, but disambiguation is less accurate.

8. This filtering serves to exclude BabelNet senses that correspond to named entities and are not useful for our purposes (such as movie or album titles), and to run these experiments under similar conditions across languages.

9. We use Huggingface transformers (Wolf et al., 2020).

12. The mBERT model developers recommend using the cased version of the model rather than the uncased one, especially for languages with non-Latin alphabets, because it fixes normalization issues. More details about this model can be found here: https://github.com/google-research/bert/blob/master/multilingual.md.

13. We also tried different combinations of the last four layers, but this did not improve the results. When a word is split into multiple wordpieces (WPs), we obtain its representation by averaging the WPs.

14. Numbers are macro-averages for words in the pools.

15. They study the information flow in the Transformer, estimating the MI between representations at different layers.

16. In mBERT for Greek, the difference is significant in ten layers.

17. We only used 418 of these poly words in Section 3 in order to have balanced mono and poly pools.

18. We omit the plots for poly-bal and poly-same for the other models due to space constraints.

19. Average SelfSim values for context2vec in the poly-rand setting: low: 0.37, mid: 0.36, high: 0.36.

20. low → mid in ELMo’s third layer, and mid → high in context2vec and in BERT’s first layer.

21. With the exception of mono → low in mBERT for Greek, and low → mid in Flaubert and in mBERT for French.

22. 1,318 is the total number of words across bands in French, Spanish, and Greek.

23. The composition of each band is the same as in Sections 3 and 4.

24. Note that the sample size in this analysis is smaller compared to that used in Sections 3.4 and 4.1.

25. With a few exceptions: mono → low and mid → high are significant in all BETO layers.

26. The only exceptions are Greek mBERT in the multi-class setting, and Flaubert in both settings.

27. We do not use the first layer of ELMo in this experiment. It is character-based, so most representations of a lemma are identical and we cannot obtain meaningful clusters.

28. We use PPDB (http://www.paraphrase.org) to reduce variability in our substitute sets, compared to the ones that would be proposed by looking at the whole vocabulary.

29. We do not use McCarthy et al.’s graph-based approach because it is not compatible with all our representation types.

30. Note that the vr and sep metrics are not compatible with Gold-Agg, which relies on Usim similarity scores, because we need vectors for their calculation. For BERT-Agg, we calculate vr and sep using BERT embeddings.

References

Margareta Ackerman and Shai Ben-David. 2009. Clusterability: A theoretical study. Journal of Machine Learning Research, 5:1–8.

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In Proceedings of ICLR, Toulon, France.

Eneko Agirre and David Martinez. 2004. Unsupervised WSD based on automatically retrieved examples: The importance of bias. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 25–32, Barcelona, Spain. Association for Computational Linguistics.

Laura Aina, Kristina Gulordava, and Gemma Boleda. 2019. Putting words in context: LSTM language models and lexical ambiguity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3342–3348, Florence, Italy. Association for Computational Linguistics.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Journal of Language Resources and Evaluation, 43(3):209–226.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc., Beijing.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1. LDC2006T13. Linguistic Data Consortium, Philadelphia, PA.

Jose Camacho-Collados and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research, 63:743–788.

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In Proceedings of the ICLR 2020 Workshop on Practical ML for Developing Countries (PML4DC).

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the ACL 2019 Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia. Association for Computational Linguistics.

Claudio Delli Bovi, Jose Camacho-Collados, Alessandro Raganato, and Roberto Navigli. 2017. EuroSense: Automatic harvesting of multilingual sense annotations from parallel text. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 594–600, Vancouver, Canada. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, and Dominik Schlechtweg. 2019. Time-Out: Temporal referencing for robust modeling of lexical semantic change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 457–470, Florence, Italy. Association for Computational Linguistics.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2009. Investigations on word senses and word usages. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 10–18, Suntec, Singapore. Association for Computational Linguistics.

Katrin Erk, Diana McCarthy, and Nicholas Gaylord. 2013. Measuring word meaning in context. Computational Linguistics, 39(3):511–554.

Kawin Ethayarajh. 2019. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 55–65, Hong Kong, China. Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Language, Speech, and Communication. MIT Press, Cambridge, MA.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, Atlanta, Georgia. Association for Computational Linguistics.

Aina Garí Soler, Marianna Apidianaki, and Alexandre Allauzen. 2019. Word usage similarity estimation with sentence representations and automatic substitutes. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019), pages 9–21, Minneapolis, Minnesota. Association for Computational Linguistics.

Mario Giulianelli, Marco Del Tredici, and Raquel Fernández. 2020. Analysing lexical semantic change with contextualized word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3960–3973, Online. Association for Computational Linguistics.

John Hewitt and Percy Liang. 2019. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2733–2743, Hong Kong, China. Association for Computational Linguistics.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Eric Huang, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Jeju Island, Korea. Association for Computational Linguistics.

Alexander Jakubowski, Milica Gasic, and Marcus Zibrowius. 2020. Topology of word embeddings: Singularities reflect polysemy. In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, pages 103–113, Barcelona, Spain (Online). Association for Computational Linguistics.

Adam Kilgarriff. 2004. How dominant is the commonest sense of a word? In Text, Speech and Dialogue, Lecture Notes in Computer Science (vol. 3206), Petr Sojka, Ivan Kopeček, and Karel Pala (eds.), pages 103–112. Springer, Berlin, Heidelberg.

John Koutsikakis, Ilias Chalkidis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2020. GREEK-BERT: The Greeks visiting Sesame Street. In Proceedings of the 11th Hellenic Conference on Artificial Intelligence (SETN 2020), pages 110–117, Athens, Greece.

Olga Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4365–4374, Hong Kong, China. Association for Computational Linguistics.

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2479–2490, Marseille, France. European Language Resources Association.

Claudia Leacock, Martin Chodorow, and George A. Miller. 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, 24(1):147–165.

Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? arXiv preprint:1809.04179v2.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Daniel Loureiro and Jose Camacho-Collados. 2020. Don't neglect the obvious: On the role of unambiguous words in word sense disambiguation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3514–3520, Online. Association for Computational Linguistics.

Diana McCarthy, Marianna Apidianaki, and Katrin Erk. 2016. Word sense clustering and clusterability. Computational Linguistics, 42(2):245–275.

Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. 2004. Finding predominant word senses in untagged text. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 279–286, Barcelona, Spain.

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 Task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 48–53, Prague, Czech Republic. Association for Computational Linguistics.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 51–61, Berlin, Germany. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint:1301.3781v3.

George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. 1993. A semantic concordance. In Human Language Technology: Proceedings of a Workshop Held at Plainsboro, New Jersey.

Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1059–1069, Doha, Qatar. Association for Computational Linguistics.

Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. 2012. The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430, Beijing, China. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Mohammad Taher Pilehvar and Jose Camacho-Collados. 2019. WiC: The Word-in-Context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, Minneapolis, Minnesota. Association for Computational Linguistics.

Tiago Pimentel, Rowan Hall Maudslay, Damian Blasi, and Ryan Cotterell. 2020. Speakers fill lexical semantic gaps with context. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4004–4015, Online. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viegas, Andy Coenen, Adam Pearce, and Been Kim. 2019. Visualizing and measuring the geometry of BERT. In Advances in Neural Information Processing Systems, pages 8592–8600, Vancouver, Canada.

Joseph Reisinger and Raymond J. Mooney. 2010. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 109–117, Los Angeles, California. Association for Computational Linguistics.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Alex Rosenfeld and Katrin Erk. 2018. Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 474–484, New Orleans, Louisiana. Association for Computational Linguistics.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi. 2020. SemEval-2020 Task 1: Unsupervised lexical semantic change detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1–23, Barcelona (online). International Committee for Computational Linguistics.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In Proceedings of the 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Cardiff, UK. Leibniz-Institut für Deutsche Sprache.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics-On what language model pre-training captures. Transactions of the Association for Computational Linguistics, 8:743–758.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, Long Beach, California, USA.

Elena Voita, Rico Sennrich, and Ivan Titov. 2019a. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019b. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Ivan Vulić, Edoardo Maria Ponti, Robert Litschko, Goran Glavaš, and Anna Korhonen. 2020. Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7222–7240, Online. Association for Computational Linguistics.

Gregor Wiedemann, Steffen Remus, Avi Chawla, and Chris Biemann. 2019. Does BERT make any sense? Interpretable word sense disambiguation with contextualized embeddings. In Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers, pages 161–170, Erlangen, Germany. German Society for Computational Linguistics & Language Technology.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Bin Zhang. 2001. Dependence of clustering algorithm performance on clustered-ness of data. HP Labs Technical Report HPL-2001-91.

Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, and Ming Zhou. 2019. BERT-based lexical substitution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3368–3373, Florence, Italy. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV'15), pages 19–27, Santiago, Chile. IEEE Computer Society.

George Kingsley Zipf. 1945. The meaning-frequency relationship of words. Journal of General Psychology, 33(2):251–256.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode