Abstract
Nominal compounds such as red wine and nut case display a continuum of compositionality, with varying contributions from the components of the compound to its semantics. This article proposes a framework for compound compositionality prediction using distributional semantic models, evaluating to what extent they capture idiomaticity compared to human judgments. For evaluation, we introduce data sets containing human judgments in three languages: English, French, and Portuguese. The results obtained reveal a high agreement between the models and human predictions, suggesting that they are able to incorporate information about idiomaticity. We also present an in-depth evaluation of various factors that can affect prediction, such as model and corpus parameters and compositionality operations. General crosslingual analyses reveal the impact of morphological variation and corpus size in the ability of the model to predict compositionality, and of a uniform combination of the components for best results.
1. Introduction
It is a universally acknowledged assumption that the meaning of phrases, expressions, or sentences can be determined by the meanings of their parts and by the rules used to combine them. Part of the appeal of this principle of compositionality1 is that it implies that a meaning can be assigned even to a new sentence involving an unseen combination of familiar words (Goldberg 2015). Indeed, for natural language processing (NLP), this is an attractive way of linearly deriving the meaning of larger units from their components, performing the semantic interpretation of any text.
For representing the meaning of individual words and their combinations in computational systems, distributional semantic models (DSMs) have been widely used. DSMs are based on Harris’ distributional hypothesis that the meaning of a word can be inferred from the context in which it occurs (Harris 1954; Firth 1957). In DSMs, words are usually represented as vectors that, to some extent, capture cooccurrence patterns in corpora (Lin 1998; Landauer, Foltz, and Laham 1998; Mikolov et al. 2013; Baroni, Dinu, and Kruszewski 2014). Evaluation of DSMs has focused on obtaining accurate semantic representations for words, and state-of-the-art models are already capable of obtaining a high level of agreement with human judgments for predicting synonymy or similarity between words (Freitag et al. 2005; Camacho-Collados, Pilehvar, and Navigli 2015; Lapesa and Evert 2017) and for modeling syntactic and semantic analogies between word pairs (Mikolov, Yih, and Zweig 2013). These representations for individual words can also be combined to create representations for larger units such as phrases, sentences, and even whole documents, using simple additive and multiplicative vector operations (Mitchell and Lapata 2010; Reddy, McCarthy, and Manandhar 2011; Mikolov et al. 2013; Salehi, Cook, and Baldwin 2015), syntax-based lexical functions (Socher et al. 2012), or matrix and tensor operations (Baroni and Lenci 2010; Bride, Van de Cruys, and Asher 2015). However, it is not clear to what extent this approach is adequate in the case of idiomatic multiword expressions (MWEs). MWEs fall into a wide spectrum of compositionality; that is, some MWEs are more compositional (e.g., olive oil) while others are more idiomatic (Sag et al. 2002; Baldwin and Kim 2010). In the latter case, the meaning of the MWE may not be straightforwardly related to the meanings of its parts, creating a challenge for the principle of compositionality (e.g., snake oil as a product of questionable benefit, not necessarily an oil and certainly not extracted from snakes).
In this article, we discuss approaches for automatically detecting to what extent the meaning of an MWE can be directly computed from the meanings of its component words, represented using DSMs. We evaluate how accurately DSMs can model the semantics of MWEs with various levels of compositionality compared to human judgments. Since MWEs encompass a large amount of related but distinct phenomena, we focus exclusively on a subcategory of MWEs: nominal compounds. They represent an ideal case study for this work, thanks to their relatively homogeneous syntax (as opposed to other categories of MWEs such as verbal idioms) and their pervasiveness in language. We assume that models able to predict the compositionality of nominal compounds could be generalized to other MWE categories by addressing their variability in future work. Furthermore, to determine to what extent these approaches are also adequate cross-lingually, we evaluate them in three languages: English, French, and Portuguese.
Given that MWEs are frequent in languages (Sag et al. 2002), identifying idiomaticity and producing accurate semantic representations for compositional and idiomatic cases is of relevance to NLP tasks and applications that involve some form of semantic processing, including semantic parsing (Hwang et al. 2010; Jagfeld and van der Plas 2015), word sense disambiguation (Finlayson and Kulkarni 2011; Schneider et al. 2016), and machine translation (Ren et al. 2009; Carpuat and Diab 2010; Cap et al. 2015; Salehi et al. 2015). Moreover, the evaluation of DSMs on tasks involving MWEs, such as compositionality prediction, has the potential to drive their development towards new directions.
The main hypothesis of our work is that, if the meaning of a compositional nominal compound can be derived from a combination of its parts, this translates in DSMs as similar vectors for a compositional nominal compound and for the combination of the vectors of its parts using some vector operation, that we refer to as composition function. Conversely we can use the lack of similarity between the nominal compound vector representation and a combination of its parts to detect idiomaticity. Furthermore, we hypothesize that accuracy in predicting compositionality depends both on the characteristics of the DSMs used to represent expressions and their components and on the composition function adopted. Therefore, we have built 684 DSMs and performed an extensive evaluation, involving over 9,072 analyses, investigating various types of DSMs, their configurations, the corpora used to train them, and the composition function used to build vectors for expressions.2
This article is structured as follows. Section 2 presents related work on distributional semantics, compositionality prediction, and nominal compounds. Section 3 presents the data sets created for our evaluation. Section 4 describes the compositionality prediction framework, along with the composition functions which we evaluate. Section 5 specifies the experimental setup (corpora, DSMs, parameters, and evaluation measures). Section 6 presents the overall results of the evaluated models. Sections 7 and 8 evaluate the impact of DSM and corpus parameters, and of composition functions on compositionality prediction. Section 9 discusses system predictions through an error analysis. Section 10 summarizes our conclusions. Appendix A contains a glossary, Appendix B presents extra sanity-check experiments, Appendix C contains the questionnaire used for data collection, and Appendices D, E, and F list the compounds in the data sets.
2. Related Work
The literature on distributional semantics is extensive (Lin 1998; Turney and Pantel 2010; Baroni and Lenci 2010; Mohammad and Hirst 2012), so we provide only a brief introduction here, underlining their most relevant characteristics to our framework (Section 2.1). Then, we define compositionality prediction and discuss existing approaches, focusing on distributional techniques for multiword expressions (Section 2.2). Our framework is evaluated on nominal compounds, and we discuss their relevant properties (Section 2.3) along with existing data sets for evaluating compositionality prediction (Section 2.4).
2.1 Distributional Semantic Models
Distributional semantic models (DSMs) use context information to represent the meaning of lexical units as vectors. These vectors are built assuming the distributional hypothesis, whose central idea is that the meaning of a word can be learned based on the contexts where it appears—or, as popularized by Firth (1957), “you shall know a word by the company it keeps.”
Formally, a DSM attempts to encode the meaning of each target word wi of a vocabulary V as a vector of real numbers v(wi) in ℝ|V|. Each component of v(wi) is a function of the co-occurrence between wi and the other words in the vocabulary (its contexts wc). This function can be simply a co-occurrence count c(wi, wc), or some measure of the association between wi and each wc, such as pointwise mutual information (PMI, Church and Hanks [1990]; Lin [1999]) or positive PMI (PPMI, Baroni, Dinu, and Kruszewski [2014]; Levy, Goldberg, and Dagan [2015]).
In DSMs, co-occurrence can be defined as two words co-occurring in the same document, sentence, or sentence fragment in a corpus. Intrasentential models are often based on a sliding window; that is, a context word wc co-occurs within a certain window of W words around the target wi. Alternatively, co-occurrence can also be based on syntactic relations obtained from parsed corpora, where a context word wc appears within specific syntactic relations with wi (Lin 1998; Padó and Lapata 2007; Lapesa and Evert 2017).
The set of all vectors v(wi), ∀wi ∈ V can be represented as a sparse co-occurrence matrix V × V → ℝ. Given that most word pairs in this matrix co-occur rarely (if ever), a threshold on the number of co-occurrences is often applied to discard irrelevant pairs. Additionally, co-occurrence vectors can be transformed to have a significantly smaller number of dimensions, converting vectors in ℝ|V| into vectors in ℝd, with d ≪ |V|.3 Two solutions are commonly employed in the literature. The first one consists in using context thresholds, where all target–context pairs that do not belong to the top-d most relevant pairs are discarded (Salehi, Cook, and Baldwin 2014; Padró et al. 2014b). The second solution consists in applying a dimensionality reduction technique such as singular value decomposition on the co-occurrence matrix where only the d largest singular values are retained (Deerwester et al. 1990). Similar techniques focus on the factorization of the logarithm of the co-occurrence matrix (Pennington, Socher, and Manning 2014) and on alternative factorizations of the PPMI matrix (Salle, Villavicencio, and Idiart 2016).
Alternatively, DSMs can be constructed by training a neural network to predict target–context relationships. For instance, a network can be trained to predict a target word wi among all possible words in V given as input a window of surrounding context words. This is known as the continuous bag-of-words model. Conversely, the network can try to predict context words for a target word given as input, and this is known as the skip-gram model (Mikolov et al. 2013). In both cases, the network training procedure allows encoding in the hidden layer semantic information about words as a side effect of trying to solve the prediction task. The weight parameters that connect the unity representing wi with the d-dimensional hidden layer are taken as its vector representation v(wi).
There are a number of factors that may influence the ability of a DSM to accurately learn a semantic representation. These include characteristics of the training corpus such as size (Mikolov, Yih, and Zweig 2013) as well as frequency thresholds and filters (Ferret 2013; Padró et al. 2014b), genre (Lapesa and Evert 2014), preprocessing (Padó and Lapata 2003, 2007), and type of context (window vs. syntactic dependencies) (Agirre et al. 2009; Lapesa and Evert 2017). Characteristics of the model include the choice of association and similarity measures (Curran and Moens 2002), dimensionality reduction strategies (Van de Cruys et al. 2012), and the use of subsampling and negative sampling techniques (Mikolov, Yih, and Zweig 2013). However, the particular impact of these factors on the quality of the resulting DSM may be heterogeneous and depends on the task and model (Lapesa and Evert 2014). Because there is no consensus about a single optimal model that works for all tasks, we compare a variety of models (Section 5) to determine which are best suited for our compositionality prediction framework.
2.2 Compositionality Prediction
Before adopting the principle of compositionality to determine the meaning of a larger unit, such as a phrase or multiword expression (MWE), it is important to determine whether it is idiomatic or not.4 This problem, known as compositionality prediction, can be solved using methods that measure directly the extent to which an expression is constructed from a combination of its parts, or indirectly via language-dependent properties of MWEs linked to idiomaticity like the degree of determiner variability and morphological flexibility (Fazly, Cook, and Stevenson 2009; Tsvetkov and Wintner 2012; Salehi, Cook, and Baldwin 2015; Köper and Schulte im Walde 2016). In this article, we focus on direct prediction methods in order to evaluate the target languages under similar conditions. Nonetheless, this does not exclude the future integration of information used by indirect prediction methods, as a complement to the methods discussed here.
For direct prediction methods, three ingredients are necessary. First, we need vector representations of single-word meanings, such as those built using DSMs (Section 2.1). Second, we need a mathematical model of how the compositional meaning of a phrase is calculated from the meanings of its parts. Third, we need the compositionality measure itself, which estimates the similarity between the compositionally constructed meaning of a phrase and its observed meaning, derived from corpora. There are a number of alternatives for each of the ingredients, and throughout this article we call a specific choice of the three ingredients a compositionality prediction configuration.
Regarding the second ingredient, that is, the mathematical model of compositional meaning, the most natural choice is the additive model (Mitchell and Lapata 2008). In the additive model, the compositional meaning of a phrase w1w2 … wn is calculated as a linear combination of the word vectors of its components: ∑iβiv(wi), where v(wi) is a d-dimensional vector for each word wi, and the βi coefficients assign different weights to the representation of each word (Reddy, McCarthy, and Manandhar 2011; Schulte im Walde, Müller, and Roller 2013; Salehi, Cook, and Baldwin 2015). These weights can capture the asymmetric contribution of each of the components to the semantics of the whole phrase (Bannard, Baldwin, and Lascarides 2003; Reddy, McCarthy, and Manandhar 2011). For example, in flea market, it is the head (market) that has a clear contribution to the overall meaning, whereas in couch potato it is the modifier (couch).
The additive model can be generalized to use a matrix of multiplicative coefficients, which can be estimated through linear regression (Guevara 2011). This model can be further modified to learn polynomial projections of higher degree, with quadratic projections yielding particularly promising results (Yazdani, Farahmand, and Henderson 2015). These models come with the caveat of being supervised, requiring some amount of pre-annotated data in the target language. Because of these requirements, our study focuses on unsupervised compositionality prediction methods only, based exclusively on automatically POS-tagged and lemmatized monolingual corpora.
Alternatives to the additive model include the multiplicative model and its variants (Mitchell and Lapata 2008). However, results suggest that this representation is inferior to the one obtained through the additive model (Reddy, McCarthy, and Manandhar 2011; Salehi, Cook, and Baldwin 2015). Recent work on predicting intra-compound semantics also supports that additive models tend to yield better results than multiplicative models (Hartung et al. 2017).
The third ingredient is the measure of similarity between the compositionally constructed vector and its actual corpus-based representation. Cosine similarity is the most commonly used measure for compositionality prediction in the literature (Schone and Jurafsky 2001; Reddy, McCarthy, and Manandhar 2011; Schulte im Walde, Müller, and Roller 2013; Salehi, Cook, and Baldwin 2015). Alternatively, one can calculate the overlap between the distributional neighbors of the whole phrase and those of the component words (McCarthy, Keller, and Carroll 2003), or the number of single-word distributional neighbors of the whole phrase (Riedl and Biemann 2015).
2.3 Nominal Compounds
Instead of covering compositionality prediction for MWEs in general, we focus on a particular category of phenomena represented by nominal compounds. We define a nominal compound as a syntactically well-formed and conventionalized noun phrase containing two or more content words, whose head is a noun.5 They are conventionalized (or institutionalized) in the sense that their particular realization is statistically idiosyncratic, and their constituents cannot be replaced by synonyms (Sag et al. 2002; Baldwin and Kim 2010; Farahmand, Smith, and Nivre 2015). Their semantic interpretation may be straightforwardly compositional, with contributions from both elements (e.g., climate change), partly compositional, with contribution mainly from one of the elements (e.g., grandfather clock), or idiomatic (e.g., cloud nine) (Nakov 2013).
The syntactic realization of nominal compounds varies across languages. In English, they are often expressed as a sequence of two nouns, with the second noun as the syntactic head, modified by the first noun. This is the most frequently annotated POS-tag pattern in the MWE-annotated DiMSUM English corpus (Schneider et al. 2016). In French and Portuguese, they often assume the form of adjective–noun or noun–adjective pairs, where the adjective modifies the noun. Examples of such constructions include the adjective–noun compound FRpetite annonce (lit. small announcement ‘classified ad’) and the noun–adjective compound PTburaco negro (lit. hole black ‘black hole’).6 Additionally, compounds may also involve prepositions linking the modifier with the head, as in the case of FRcochon d’Inde (lit. pig of India ‘guinea pig’) and PTdente de leite (lit. tooth of milk ‘milk tooth’). Because prepositions are highly polysemous and their representation in DSMs is tricky, we do not include compounds containing prepositions in this article. Hence, we focus on 2-word nominal compounds of the form noun1–noun2 (in English), and noun–adjective and adjective–noun (in the three languages).
Regarding the meaning of nominal compounds, the implicit relation between the components of compositional compounds can be described in terms of free paraphrases involving verbs, such as flu virus as virus that causes/creates flu (Nakov 2008),7 or prepositions, such as olive oil as oil from olives (Lauer 1995). These implicit relations can often be seen explicitly in the equivalent expressions in other languages (e.g., FRhuile d’olive and PTazeite de oliva for ENolive oil).
Alternatively, the meaning of compositional nominal compounds can be described using a closed inventory of relations which make the role of the modifier explicit with respect to the head noun, including syntactic tags such as subject and object, and semantic tags such as instrument and location (Girju et al. 2005). The degree of compositionality of a nominal compound can also be represented using numerical scores (Section 2.4) to indicate to what extent the component words allow predicting the meaning of the whole (Reddy, McCarthy, and Manandhar 2011; Roller, Schulte im Walde, and Scheible 2013; Salehi et al. 2015). The latter is the representation that we adopted in this article.
2.4 Numerical Compositionality Data sets
The evaluation of compositionality prediction models can be performed extrinsically or intrinsically. In extrinsic evaluation, compositionality information can be used to decide how a compound should be treated in NLP systems such as machine translation or text simplification. For instance, for machine translation, idiomatic compounds need to be treated as atomic phrases, as current methods of morphological compound processing cannot be applied to them (Stymne, Cancedda, and Ahrenberg 2013; Cap et al. 2015).
Although potentially interesting, extrinsic evaluation is not straightforward, as results may be influenced both by the compositionality prediction model and by the strategy for integration of compositionality information into the NLP system. Therefore, most related work focuses on an intrinsic evaluation, where the compositionality scores produced by a model are compared to a gold standard, usually a data set where nominal compound semantics have been annotated manually. Intrinsic evaluation thus requires the existence of data sets where each nominal compound has one (or several) numerical scores associated with it, indicating its compositionality. Annotations can be provided by expert linguist annotators or by crowdsourcing, often requiring that several annotators judge the same compound to reduce the impact of subjectivity on the scores. Relevant compositionality data sets of this type are listed below, some of which were used in our experiments.
- •
Reddy, McCarthy, and Manandhar (2011) collected judgments for a set of 90 English noun–noun (e.g., zebra crossing) and adjective–noun (e.g., sacred cow) compounds, in terms of three numerical scores: the compositionality of the compound as a whole and the literal contribution of each of its parts individually, using a scale from 0 to 5. The data set was built through crowdsourcing, and the final scores are the average of 30 judgments per compound.This data set will be referred to as Reddy in our experiments.
- •
Farahmand, Smith, and Nivre (2015) collected judgments for 1,042 English noun–noun compounds. Each compound has binary judgments regarding non-compositionality and conventionalization given by four expert annotators (both native and non-native speakers). A hard threshold is applied so that compounds are considered as noncompositional if at least two annotators say so (Yazdani, Farahmand, and Henderson 2015), and the total compositionality score is given by the sum of the four binary judgments. This data set will be referred to as Farahmand in our experiments.
- •
Kruszewski and Baroni (2014) built the Norwegian Blue Parrot data set, containing judgments for modifier-head phrases in English. The judgments consider whether the phrase is (1) an instance of the concept denoted by the head (e.g., dead parrot and parrot) and (2) a member of the more general concept that includes the head (e.g., dead parrot and pet), along with typicality ratings, with 5,849 judgments in total.
- •
Roller, Schulte im Walde, and Scheible (2013) collected judgments for a set of 244 German noun–noun compounds, each compound with an average of around 30 judgments on a compositionality scale from 1 to 7, obtained through crowdsourcing. The resource was later enriched with feature norms (Roller and Schulte im Walde 2014).
- •
Schulte im Walde et al. (2016) collected judgments for a set of 868 German noun–noun compounds, including human judgments of compositionality on a scale of 1 to 6. Compounds are judged by multiple annotators, and the final compositionality score is the average across annotators. The data set is also annotated for in-corpus frequency, productivity, and ambiguity, and a subset of 180 compounds has been selected for balancing these variables. The annotations were performed by the authors, linguists, and through crowdsourcing. For the balanced subset of 180 compounds, compositionality annotations were performed by experts only, excluding the authors.
For a multilingual evaluation, in this work, we construct two data sets, one for French and one for Portuguese compounds, and extend the Reddy data set for English using the same protocol as Reddy, McCarthy, and Manandhar (2011).
3. Creation of a Multilingual Compositionality Data set
In Section 3.1, we describe the construction of data sets of 180 compounds for French (FR-comp) and Portuguese (PT-comp). For English, the complete data set contains 280 compounds, of which 190 are new and 90 come from the Reddy data set. We use 180 of these (EN-comp) for cross-lingual comparisons (90 from the original Reddy data set combined with 90 new ones from EN-comp90), and 100 new compounds as held-out data (EN-compExt), to evaluate the robustness of the results obtained (Section 6.3). These data sets containing compositionality scores for 2–word nominal compounds are used to evaluate our framework (Section 4), and we discuss their characteristics in Section 3.2.8
3.1 Data Collection
For each of the target languages, we collected, via crowdsourcing, a set of numerical scores corresponding to the level of compositionality of the target nominal compounds. We asked non-expert participants to judge each compound considering three sentences where the compound occurred. After reading the sentences, participants assess the degree to which the meaning of the compound is related to the meanings of its parts. This follows from the assumption that a fully compositional compound will have an interpretation whose meaning stems from both words (e.g., lime tree as a tree of limes), while a fully idiomatic compound will have a meaning that is unrelated to its components (e.g., nut case as an eccentric person).
Our work follows the protocol proposed by Reddy, McCarthy, and Manandhar (2011), where compositionality is explained in terms of the literality of the individual parts. This type of indirect annotation does not require expert linguistic knowledge, and still provides reliable data, as we show later. For each language, data collection involved four steps: compound selection, sentence selection, questionnaire design, and data aggregation.
Compound Selection.
For each data set, we manually selected nominal compounds from dictionaries, corpus searches, and by linguistic introspection, maintaining an equal proportion of compounds that are compositional, partly compositional, and idiomatic.9 We considered them to be compositional if their semantics are related to both components (e.g., benign tumor), partly compositional if their semantics are related to only one of the components (e.g., grandfather clock), and idiomatic if they are not directly related to either (e.g., old flame). This preclassification was used only to select a balanced set of compounds and was not shown to the participants nor used at any later stage. For all languages, all compounds are required to have a head that is unambiguously a noun, and additionally for French and Portuguese, all compounds have an adjective as modifier.
Sentence Selection.
Compounds may be polysemous (e.g., FRbras droit may mean most reliable helper or literally right arm). To avoid any potential sense uncertainty, each compound was presented to the participants with the same sense in three sentences. These sentences were manually selected from the WaC corpora: ukWaC (Baroni et al. 2009), frWaC, and brWaC (Boos, Prestes, and Villavicencio 2014), presented in detail in Section 5.
Questionnaire Design.
For each compound, after reading three sentences, participants are asked to:
- •
provide synonyms for the compound in these sentences. The synonyms are used as additional validation of the quality of the judgments: if unrelated words are provided, the answers are discarded.
- •
assess the contribution of the head noun to the meaning of the compound (e.g., is a busy bee always literally a bee?)
- •
assess the contribution of the modifier noun or adjective to the meaning of the compound (e.g., is a busy bee always literally busy?)
- •
assess the degree to which the compound can be seen as a combination of its parts (e.g., is a busy bee always literally a bee that is busy?)
Data Aggregation.
For English and French, we collected answers using Amazon Mechanical Turk (AMT), manually removing answers that were not from native speakers or where the synonyms provided were unrelated to the target compound sense. Because AMT has few Brazilian Portuguese native speakers, we developed an in-house web interface for the questionnaire, which was sent out to Portuguese-speaking NLP mailing lists.
For a given compound and question we calculate aggregated scores as the arithmetic averages of all answers across participants. We will refer to these averaged scores as the human compositionality score (hc)s. We average the answers to the three questions independently, generating three scores: hcH for the head noun, hcM for the modifier, and hcHM for the whole compound. In our framework, we try to predict hcHM automatically (Section 5). To assess the variability of the answers (Section 3.2.1), we also calculate the standard deviation across participants for each question (σH, σM, and σHM).
The list of compounds, their translations, glosses, and compositionality scores are given in Appendices D (EN-comp90 and EN-compExt), E (FR-comp), and F (PT-comp).10
3.2 Data set Analysis
In this section, we present different measures of agreement among participants (Section 3.2.1) and examine possible correlations between compositionality scores, familiarity, and conventionalization (Section 3.2.2) in the data sets created for this article.
3.2.1 Measuring Data set Quality.
To assess the quality of the collected human compositionality scores, we use standard deviation and inter-annotator agreement scores.
Standard Deviation (andPσ>1.5).
The standard deviation (σ) of the participants’ answers can be used as an indication of their agreement: for each compound and for each of the three questions, small σ values suggest greater agreement. In addition, if the instructions are clear, σ can also be seen as an indication of the level of difficulty of the task. In other words, all other things being equal, compounds with larger σ can be considered intrinsically harder to analyze by the participants. For each data set, we consider two aggregated metrics based on σ:
- •
— The average of σ in the data set.
- •
Pσ>1.5 — The proportion of compounds whose σ is higher than 1.5.
Table 1 presents the result of these metrics when applied to our in-house data sets, as well as to the original Reddy data set. The column indicates the average number of answers per compound, while the other six columns present the values of and Pσ>1.5 for compound (HM), head-only (H), and modifier-only (M) scores.
Data set . | . | . | . | . | PσHM>1.5 . | PσH>1.5 . | PσM>1.5 . |
---|---|---|---|---|---|---|---|
FR-comp | 14.9 | 1.15 | 1.08 | 1.21 | 22.78% | 24.44% | 30.56% |
PT-comp | 31.8 | 1.22 | 1.09 | 1.20 | 14.44% | 17.22% | 19.44% |
EN-comp90 | 18.8 | 1.17 | 1.05 | 1.18 | 18.89% | 16.67% | 27.78% |
EN-compExt | 22.6 | 1.21 | 1.27 | 1.16 | 17.00% | 29.00% | 18.00% |
Reddy | 28.4 | 0.99 | 0.94 | 0.89 | 5.56% | 11.11% | 8.89% |
Data set . | . | . | . | . | PσHM>1.5 . | PσH>1.5 . | PσM>1.5 . |
---|---|---|---|---|---|---|---|
FR-comp | 14.9 | 1.15 | 1.08 | 1.21 | 22.78% | 24.44% | 30.56% |
PT-comp | 31.8 | 1.22 | 1.09 | 1.20 | 14.44% | 17.22% | 19.44% |
EN-comp90 | 18.8 | 1.17 | 1.05 | 1.18 | 18.89% | 16.67% | 27.78% |
EN-compExt | 22.6 | 1.21 | 1.27 | 1.16 | 17.00% | 29.00% | 18.00% |
Reddy | 28.4 | 0.99 | 0.94 | 0.89 | 5.56% | 11.11% | 8.89% |
These values are below what would be expected for random decisions (rand ≃ 1.71, for the Likert scale). Although our data sets exhibit higher variability than Reddy, this may be partly due to the application of filters done by Reddy, McCarthy, and Manandhar (2011) to remove outliers.11 These values could also be due to the collection of fewer answers per compound for some of the data sets. However, there is no clear tendency in the variation of the standard deviation of the answers and the number of participants n. The values of are quite homogeneous, ranging from 1.05 for EN-comp90 (head) to 1.27 for EN-compExt (head). The low agreement for modifiers may be related to a greater variability in semantic relations between modifiers and compounds: these include material (e.g., brass ring), attribute (e.g., black cherry), and time (e.g., night owl).
Figure 1(a) shows standard deviation (σHM, σH, and σM) for each compound of FR-comp as a function of its average compound score hcHM.12 For all three languages, greater agreement was found for compounds at the extremes of the compositionality scale (fully compositional or fully idiomatic) for all scores. These findings can be partly explained by end-of-scale effects, that result in greater variability for the intermediate scores in the Likert scale (from 1 to 4) that correspond to the partly compositional cases. Hence, we expect that it will be easier to predict the compositionality of idiomatic/compositional compounds than of partly compositional ones.
Inter-Annotator Agreement (α).
To measure inter-annotator agreement of multiple participants, taking into account the distance between the ordinal ratings of the Likert scale, we adopt the α score (Artstein and Poesio 2008). The α score is more appropriate for ordinal data than traditional agreement scores for categorical data, such as Cohen’s and Fleiss’ κ (Cohen 1960; Fleiss and Cohen 1973). However, due to the use of crowdsourcing, most participants rated only a small number of compounds with very limited chance of overlap among them: the average number of answers per participant is 13.6 for EN-comp90, 10.2 for EN-compExt, 33.7 for FR-comp, and 53.5 for PT-comp. Because the α score assumes that each participant rates all the items, we focus on the answers provided by three of the participants, who rated the whole set of 180 compounds in PT-comp.
Using a linear distance schema between the answers,13 we obtain an agreement of α = .58 for head-only, α = .44 for modifier-only, and α = .44 for the whole compound. To further assess the difficulty of this task, we also calculate α for a single expert annotator, judging the same set of compounds after an interval of one month. The scores were α = .69 for the head and α = .59 for both the compound and for the modifier. The Spearman correlation between these two annotations performed by the same expert is ρ = 0.77 for hcHM. This can be seen as a qualitative upper bound for automatic compositionality prediction on PT-comp.
3.2.2 Compositionality, Familiarity, and Conventionalization.
Figure 1(b) shows the average scores (hcHM, hcH, and hcM) for the compounds ranked according to the average compound score hcHM. Although this figure is for FR-comp, similar patterns were found for the other data sets. For all three languages, the human compositionality scores provide additional confirmation that the data sets are balanced, with the compound scores (hcHM) being distributed linearly along the scale. Furthermore, we have calculated the average hcHM values separately for the compounds in each of the three compositionality classes used for compound selection: idiomatic, partly compositional and compositional (Section 3.1). These averages are, respectively, 1.0, 2.4, and 4.0 for EN-comp90; 1.1, 2.4, and 4.2 for EN-compExt; 1.3, 2.7, and 4.3 for FR-comp; and 1.3, 2.5, and 3.9 for PT-comp, indicating that our attempt to select a balanced number of compounds from each class is visible in the collected hcHM scores.
Additionally, the human scores also suggest an asymmetric impact of the non-literal parts over the compound: whenever participants judged an element of the compound as non-literal, the whole compound was also rated as idiomatic. Thus, most head and modifier scores (hcH and hcM) are close to or above the diagonal line in Figure 1(b). In other words, a component of the compound is seldom rated as less literal than the compositionality of the whole compound hcHM, although the opposite is more common.
To evaluate if it is possible to predict hcHM from the hcH and hcM, we calculate the arithmetic and geometric means between hcH and hcM for each compound. Figure 2 shows the linear regression of both measures for FR-comp. The goodness of fit is = .93 for the arithmetic mean, and = .96 for the geometric mean, confirming that they are good predictors of hcHM.14 Thus, we assume that hcHM summarizes hcH and hcM, and focus on predicting hcHM instead of hcH and hcM separately. These findings also inspired the pcarith and pcgeom compositionality prediction functions (Section 4).
To examine whether there is an effect of the familiarity of a compound on hc scores, in particular if more idiomatic compounds need to be more familiar, we also calculated the correlation between the compositionality score for a compound hcHM and its frequency in a corpus, as a proxy for familiarity. In this case we used the WaC corpora and calculated the frequencies based on the lemmas. The results, in Table 2, show a statistically significant positive Spearman correlation of ρ = 0.305 for EN-comp90, ρ = 0.384 for EN-compExt, and ρ = 0.598 for FR-comp, indicating that, contrary to our expectations, compounds that are more frequent tend to be assigned higher compositionality scores. However, frequency alone is not enough to predict compositionality, and further investigation is needed to determine if compositionality and frequency are also correlated with other factors.
Data set . | frequency . | PMI . |
---|---|---|
FR-comp | 0.598 (p < 10−18) | 0.164 (p > 0.01) |
PT-comp | 0.109 (p > 0.1) | 0.076 (p > 0.1) |
EN-comp90 | 0.305 (p < 10−2) | −0.024 (p > 0.1) |
EN-compExt | 0.384 (p < 10−5) | 0.138 (p > 0.1) |
Data set . | frequency . | PMI . |
---|---|---|
FR-comp | 0.598 (p < 10−18) | 0.164 (p > 0.01) |
PT-comp | 0.109 (p > 0.1) | 0.076 (p > 0.1) |
EN-comp90 | 0.305 (p < 10−2) | −0.024 (p > 0.1) |
EN-compExt | 0.384 (p < 10−5) | 0.138 (p > 0.1) |
We also analyzed the correlation between compositionality and conventionalization to determine if more idiomatic compounds correspond to more conventionalized ones. We use PMI (Church and Hanks 1990) as a measure of conventionalization, as it indicates the strength of association between the components (Farahmand, Smith, and Nivre 2015). We found no statistically significant correlation between compositionality and PMI.
4. Compositionality Prediction Framework
5. Experimental Setup
This section describes the common setup used for evaluating compositionality prediction, such as corpora (Section 5.1), DSMs (Section 5.2), and evaluation metrics (Section 5.3).
5.1 Corpora
In this work we used the lemmatized and POS-tagged versions of the WaC corpora not only for building DSMs, but also as sources of information about the target compounds for the analyses performed (e.g., in Sections 3.2.2, 9.1, and 9.2):
- •
for English, the ukWaC (Baroni et al. 2009), with 2.25 billion tokens, parsed with MaltParser (Nivre, Hall, and Nilsson 2006);
- •
for French, the frWaC with 1.61 billion tokens preprocessed with TreeTagger (Schmid 1995); and
- •
for Brazilian Portuguese, a combination of brWaC (Boos, Prestes, and Villavicencio 2014), Corpus Brasileiro,18 and all Wikipedia entries,19 with a total of 1.91 billion tokens, all parsed with PALAVRAS (Bick 2000).
For all compounds contained in our data sets, we transformed their occurrences into single tokens by joining their component words with an underscore (e.g., ENmonkey business → monkey_business and FRbelle-mère → belle_mère).20,21 To handle POS-tagging and lemmatization irregularities, we retagged the compounds’ components using the gold POS and lemma in our data sets (e.g., for ENsitting duck, sit/verb_duck/noun → sitting/adjective_duck/noun). We also simplified all POS tags using coarse-grained labels (e.g., verb instead of vvz). All forms are then lowercased (surface forms, lemmas, and POS tags); and noisy tokens, with special characters, numbers, or punctuation, are removed. Additionally, ligatures are normalized for French (e.g., œ → oe) and a spellchecker22 is applied to normalize words across English spelling variants (e.g., color → colour).
To evaluate the influence of preprocessing on compositionality prediction (Section 7.3), we generated four versions of each corpus, with different levels of linguistic information. We expect lemmatization to reduce data sparseness by merging morphologically inflected variants of the same lemma:
- 1.
surface+: the original raw corpus with no preprocessing, containing surface forms.
- 2.
surface: stopword removal, generating a corpus of surface forms of content words.
- 3.
lemmaPoS: stopword removal, lemmatization,23 and POS-tagging; generating a corpus of content words distinguished by POS tags, represented as lemma/POS-tag.
- 4.
lemma: stopword removal and lemmatization; generating a corpus containing only lemmas of content words.
5.2 DSMs
In this section, we describe the state-of-the-art DSMs used for compositionality prediction.
Positive Pointwise Mutual Information (PPMI).
In the models based on the PPMI matrix, the representation of a target word is a vector containing the PPMI association scores between the target and its contexts (Bullinaria and Levy 2012). The contexts are nouns and verbs, selected in a symmetric sliding window of W words to the left/right and weighted linearly according to their distance D to the target (Levy, Goldberg, and Dagan 2015).24 We consider three models that differ in how the contexts are selected:
- •
In PPMI–thresh, the vectors are |V|-dimensional but only the top d contexts with highest PPMI scores for each target word are kept, while the others are set to zero (Padró et al. 2014a).25
- •
In PPMI–TopK, the vectors are d-dimensional, and each of the d dimensions corresponds to a context word taken from a fixed list of k contexts, identical for all target words. We chose k as the 1,000 most frequent words in the corpus after removing the top 50 most frequent words (Salehi, Cook, and Baldwin 2015).
- •
In PPMI–SVD, singular value decomposition is used to factorize the PPMI matrix and reduce its dimensionality from |V| to d.26 We set the value of the context distribution smoothing factor to 0.75, and the negative sampling factor to 5 (Levy, Goldberg, and Dagan 2015). We use the default minimum word count threshold of 5.
Word2vec (w2v).
Word2vec27 relies on a neural network to predict target/context pairs (Mikolov et al. 2013). We use its two variants: continuous bag-of-words (w2v–cbow) and skip-gram (w2v–sg). We adopt the default configurations recommended in the documentation, except for: no hierarchical softmax, 25 negative samples, frequent-word down-sampling rate of 10−6, execution of 15 training iterations, and minimum word count threshold of 5.
Global Vectors (glove).
GloVe28 implements a factorization of the logarithm of the positional co-occurrence count matrix (Pennington, Socher, and Manning 2014). We adopt the default configurations from the documentation, except for: internal cutoff parameter xmax = 75 and processing of the corpus in 15 iterations. For the corpora versions lemma and lemmaPoS (Section 5.1), we use the minimum word count threshold of 5. For surface and surface+, due to the larger vocabulary sizes, we use thresholds of 15 and 20.29
Lexical Vectors (lexvec).
The LexVec model30 factorizes the PPMI matrix in a way that penalizes errors on frequent words (Salle, Villavicencio, and Idiart 2016). We adopt the default configurations in the documentation, except for: 25 negative samples, subsampling rate of 10−6, and processing of the corpus in 15 iterations. Due to the vocabulary sizes, we use a word count threshold of 10 for lemma and lemmaPoS, and 100 for surface and surface+.31
5.2.1 DSM Parameters.
In addition to model-specific parameters, the DSMs described above have some shared DSM parameters. We construct multiple DSM configurations by varying the values of these parameters. These combinations produce a total of 228 DSMs per language (see Table 3). In particular, we evaluate the influence of the following parameters on compositionality prediction:
- •
WindowSize: Number of context words to the left/right of the target word when searching for target-context co-occurrence pairs. The assumption is that larger windows are better for capturing semantic relations (Jurafsky and Martin 2009) and may be more suitable for compositionality prediction. We use window sizes of 1 + 1, 4 + 4, and 8 + 8.32
- •
Dimension: Number of dimensions of each vector. The underlying hypothesis is that, the higher the number of dimensions, the more accurate the representation of the context is going to be. We evaluate our framework with vectors of 250, 500, and 750 dimensions.
- •
WordForm: One of the four word-form and stopword removal variants used to represent a corpus, in Section 5.1: surface+, surface, lemma, and lemmaPoS. They represent different levels of specificity in the informational content of the tokens, and may have a language-dependent impact on the performance of compositionality prediction.
DSM . | Dimension . | WordForm . | WindowSize . |
---|---|---|---|
PPMI–TopK | d = 1000 | ||
PPMI–thresh | d = 250, d = 500, d = 750 | surface+, surface, lemma, lemmaPoS | W = 1 + 1, W = 4 + 4, W = 8 + 8 |
PPMI–SVD | |||
w2v–cbow | |||
w2v–sg | |||
glove | |||
lexvec |
DSM . | Dimension . | WordForm . | WindowSize . |
---|---|---|---|
PPMI–TopK | d = 1000 | ||
PPMI–thresh | d = 250, d = 500, d = 750 | surface+, surface, lemma, lemmaPoS | W = 1 + 1, W = 4 + 4, W = 8 + 8 |
PPMI–SVD | |||
w2v–cbow | |||
w2v–sg | |||
glove | |||
lexvec |
5.3 Evaluation Metrics
To evaluate a compositionality prediction configuration, we calculate Spearman’s ρ rank correlation between the predicted compositionality scores (pc)s and the human compositionality scores (hc)s for the compounds that appear in the evaluation data set. We mostly use the rank correlation instead of linear correlation (Pearson) because we are interested in the framework’s ability to order compounds from least to most compositional, regardless of the actual predicted values.
For English, besides the evaluation data sets presented in Section 3, we also use Reddy and Farahmand (see Section 2.4) to enable comparison with related work. For Farahmand, since it contains binary judgments33 instead of graded compositionality scores, results are reported using the best F1 (BF1) score, which is the highest F1 score found using the top n compounds classified as noncompositional, when n is varied (Yazdani, Farahmand, and Henderson 2015). For Reddy, we sometimes present Pearson scores to enable comparison with related work.
Because of the large number of compositionality prediction configurations evaluated, we only report the best performance for each configuration over all possible DSM parameter values. The generalization of these analyses is then ensured using cross-validation and held-out data. To determine whether the difference between two prediction results are statistically different, we use nonparametric Wilcoxon’s sign-rank test.
6. Overall Results
In this section, we present the overall results obtained on the Reddy, Farahmand, EN-comp, FR-comp, and PT-comp data sets, comparing all possible configurations (Section 6.1). To determine their robustness we also report evaluation for all languages using cross-validation (Section 6.2) and for English using the held-out data set EN-compExt (Section 6.3). All results reported in this section use the pcuniform function.
6.1 Distributional Semantic Models
Table 4 shows the highest overall values obtained for each DSM (columns) on each data set (rows). For English (Reddy, EN-comp, and Farahmand), the highest results for the compounds found in the corpus were obtained with w2v and PPMI–thresh, shown as the first value in each pair in Table 4. Not all compounds in the English data sets are present in our corpus. Therefore, we also report results adopting a fallback strategy (the second value). Because its impact depends on the data set, and the relative performance of the models is similar with or without it, for the remainder of the article we discuss only the results without fallback.34
Data set . | PPMI–SVD . | PPMI–TopK . | PPMI–thresh . | glove . | lexvec . | w2v–cbow . | w2v–sg . |
---|---|---|---|---|---|---|---|
Farahmand | .487/.424 | .435/.376 | .472/.404 | .400/.358 | .449/.431 | .512/.471 | .507/.468 |
Reddy (r) | .738/.726 | .732/.717 | .762/.768 | .783/.787 | .787/.787 | .803/.798 | .814/.814 |
Reddy (ρ) | .743/.743 | .706/.716 | .791/.803 | .754/.759 | .774/.773 | .796/.796 | .812/.812 |
EN-comp | .655/.666 | .624/.632 | .688/.704 | .638/.651 | .646/.658 | .716/.730 | .726/.741 |
FR-comp | .584 | .550 | .702 | .680 | .677 | .652 | .653 |
PT-comp | .530 | .519 | .602 | .555 | .570 | .588 | .586 |
Data set . | PPMI–SVD . | PPMI–TopK . | PPMI–thresh . | glove . | lexvec . | w2v–cbow . | w2v–sg . |
---|---|---|---|---|---|---|---|
Farahmand | .487/.424 | .435/.376 | .472/.404 | .400/.358 | .449/.431 | .512/.471 | .507/.468 |
Reddy (r) | .738/.726 | .732/.717 | .762/.768 | .783/.787 | .787/.787 | .803/.798 | .814/.814 |
Reddy (ρ) | .743/.743 | .706/.716 | .791/.803 | .754/.759 | .774/.773 | .796/.796 | .812/.812 |
EN-comp | .655/.666 | .624/.632 | .688/.704 | .638/.651 | .646/.658 | .716/.730 | .726/.741 |
FR-comp | .584 | .550 | .702 | .680 | .677 | .652 | .653 |
PT-comp | .530 | .519 | .602 | .555 | .570 | .588 | .586 |
The best w2v–cbow and w2v–sg configurations are not significantly different from each other, but both are different from PPMI–thresh (p < 0.05). In a direct comparison with related work, our best result for the Reddy data set (Spearman ρ = .812, Pearson r = .814) improves upon the best correlation reported by Reddy, McCarthy, and Manandhar (2011) (ρ = .714), and by Salehi, Cook, and Baldwin (2015) (r = .796). For Farahmand, these results are comparable to those reported by Yazdani, Farahmand, and Henderson (2015) (BF1 = .487), but our work adopts an unsupervised approach for compositionality prediction. For both FR-comp and PT-comp, the w2v models are outperformed by PPMI–thresh, whose predictions are significantly different from the predictions of other models (p < 0.05).
In short, these results suggest language-dependent trends for DSMs, by which w2v models perform better for the English data sets, and PPMI–thresh for French and Portuguese. While this may be due to the level of morphological inflection in these languages, it may also be due to differences in corpus size or to particular DSM parameters used in each case. In Section 7, we analyze the impact of individual DSM and corpus parameters to better understand this language dependency.
6.2 Cross-Validation
Table 4 reports the best configurations for the EN-comp, FR-comp, and PT-comp data sets. However, to determine whether the Spearman scores obtained are robust and generalizable, in this section we report evaluation using cross-validation. For each data set, we partition the 180 compounds into 5 folds of 36 compounds (f1, f2, …, f5). Then, for each fold fi, we exhaustively look for the best configuration (values of WindowSize, Dimension, and WordForm) for the union of the other folds (∪j≠ifj), and predict the 36 compositionality scores for fi using this configuration. The predicted scores for the 5 folds are then grouped into a single set of predictions, which is evaluated against the 180 human judgments.
The partition of compounds into folds is performed automatically, based on random shuffling.35 To avoid relying on a single arbitrary fold partition, we run cross-validation 10 times, with different fold partitions each time. This process generates 10 Spearman correlations, for which we calculate the average value and a 95% confidence interval.
We have calculated cross-validation scores for a wide range of configurations, focusing on the following DSMs: PPMI–thresh, w2v–cbow, and w2v–sg. Figure 4 presents the average Spearman correlations of cross-validation experiments compared with the best results reported in the previous section, referred to as oracle. In the top left panel the x-axis indicates the DSMs for each language using the best oracle configuration, Figure 4(a). In the other panels, it indicates the best oracle configuration for a specific DSM and a fixed parameter for a given language. We present only a sample of the results for fixed parameters, as they are stable across languages. Results are presented in ascending order of oracle Spearman correlation. For each oracle datapoint, the associated average Spearman from cross-validation is presented along with the 95% confidence interval.
The Spearman correlations obtained through cross-validation are comparable to the ones obtained by the oracle. Moreover, the results are quite stable: increasingly better configurations of oracle tend to be correlated with increasingly better cross-validation scores. Indeed, the Pearson r correlation between the 9 oracle points and the 9 cross-validation points in the top-left panel is 0.969, attesting to the correlation between cross-validation and oracle scores.
For PT-comp, the confidence intervals are quite wide, meaning that prediction quality is sensitive to the choice of compounds used to estimate the best configurations. Probably a larger data set would be required to stabilize cross-validation results. Nonetheless, the other two data sets seem representative enough, so that the small confidence intervals show that, even if we fix the value of a given parameter (e.g., d = 750), the results using cross-validation are stable and very similar to the oracle.
The confidence intervals overlapping with oracle data points also indicate that most cross-validation results are not statistically different from the oracle. This suggests that the highest-Spearman oracle configurations could be trusted as reasonable approximations of the best configurations for other data sets collected for the same language constructed using similar guidelines.
6.3 Evaluation on Held-Out Data
As an additional test of the robustness of the results obtained, we calculated the performance of the best models obtained for one of the data sets (EN-comp), on a separate held-out data set (EN-compExt). The latter contains 100 compounds balanced for compositionality, not included in EN-comp (that is, not used in any of the preceding experiments). The results obtained on EN-compExt are shown in Table 5. They are comparable and mostly better than those for the oracle and for cross-validation. As the items are different in the two data sets, a direct comparison of the results is not possible, but the equivalent performances confirm the robustness of the models and configurations for compositionality prediction. Moreover, these results are obtained in an unsupervised manner, as the compositionality scores are not used to train any of the models. The scores are used only for comparative purposes for determining the impact of various factors in the ability of these DSMs to predict compositionality.
DSM . | WordForm . | WindowSize . | Dimension . | ρ EN-comp . | ρ EN-compExt . |
---|---|---|---|---|---|
PPMI–SVD | surface | 1 + 1 | 250 | 0.655 | 0.692 |
PPMI–TopK | lemmaPoS | 8 + 8 | 1,000 | 0.624 | 0.680 |
PPMI–thresh | lemmaPoS | 8 + 8 | 750 | 0.688 | 0.675 |
glove | lemmaPoS | 8 + 8 | 500 | 0.637 | 0.670 |
lexvec | lemmaPoS | 8 + 8 | 250 | 0.646 | 0.685 |
w2v–cbow | surface+ | 1 + 1 | 750 | 0.716 | 0.731 |
w2v–sg | surface+ | 1 + 1 | 750 | 0.726 | 0.733 |
DSM . | WordForm . | WindowSize . | Dimension . | ρ EN-comp . | ρ EN-compExt . |
---|---|---|---|---|---|
PPMI–SVD | surface | 1 + 1 | 250 | 0.655 | 0.692 |
PPMI–TopK | lemmaPoS | 8 + 8 | 1,000 | 0.624 | 0.680 |
PPMI–thresh | lemmaPoS | 8 + 8 | 750 | 0.688 | 0.675 |
glove | lemmaPoS | 8 + 8 | 500 | 0.637 | 0.670 |
lexvec | lemmaPoS | 8 + 8 | 250 | 0.646 | 0.685 |
w2v–cbow | surface+ | 1 + 1 | 750 | 0.716 | 0.731 |
w2v–sg | surface+ | 1 + 1 | 750 | 0.726 | 0.733 |
7. Influence of DSM Parameters
In this section, we analyze the influence of DSM parameters on compositionality prediction. We consider different window sizes (Section 7.1), numbers of vector dimensions (Section 7.2), types of corpus preprocessing (Section 7.3), and corpus sizes. For each parameter, we analyze all possible values of other parameters. In other words, we report the best results obtained by fixing a value and considering all possible configurations of other parameters. Results reported in this section use the pcuniform function.
7.1 Window Size
DSMs build the representation of every word based on the frequency of other words that appear in its context. Our hypothesis is that larger window sizes result in higher scores, as the additional data allows a better representation of word-level semantics. However, as some of these models adopt different weight decays for larger windows,36 variation in their behavior related to window size is to be expected.
Contrary to our expectations, for the best models in each language, large windows did not lead to better compositionality prediction. Figure 5 shows the best results obtained for each window size.37 For English, w2v is the best model, and its performance does not seem to depend much on the size of the window, but with a small trend for smaller sizes to be better. For French and Portuguese, PPMI–thresh is only the best model for the minimal window size, and there is a large gap in performance for PPMI–thresh as window size increases, such that for larger windows it is outperformed by other models. To assess which of these differences are statistically significant, we have performed Wilcoxon’s sign-rank test on the two highest Spearman values for each DSM in each language. All differences are statistically significant (p < 0.05), with the exception of PPMI–SVD.
The appropriate choice of window size has been shown to be task-specific (Lapesa and Evert 2017), and the results above suggest that, for compositionality prediction, it depends also on the DSM used. Overall, the trend is for smaller windows to lead to better compositionality prediction.
7.2 Dimension
When creating corpus-derived vectors with a DSM, the question is whether additional dimensions can be informative in compositionality prediction. Our hypothesis is that the larger the number of dimensions, the more precise the representations, and the more accurate the compositionality prediction.
The results shown in Figure 6 for each of the comparable data sets confirm this trend in the case of the best DSMs: w2v and PPMI–thresh. Moreover, the effect of changing the vector dimensions for the best models seems to be consistent across these languages. The results for PPMI–SVD, lexvec, and glove are more varied, but they are never among the best models for compositionality prediction in any of the languages.38 All differences between the two highest Spearman correlations are statistically significant (p < 0.05), with the exception of PPMI–SVD for FR-comp, according to Wilcoxon’s sign-rank test.
7.3 Type of Preprocessing
In related work, DSMs are constructed from corpora with various levels of preprocessing (Bullinaria and Levy 2012; Mikolov et al. 2013; Pennington, Socher, and Manning 2014; Kiela and Clark 2014; Levy, Goldberg, and Dagan 2015; Salle, Villavicencio, and Idiart 2016). In this work, we compare four levels: WordForm = surface+, surface, lemmaPoS and lemma, described in Section 5.1, corresponding to decreasing amounts of information. Testing different varieties of corpus preprocessing allows us to explore the trade-off between informational content and the statistical significance related to data sparsity for compositionality prediction.
Figure 7 presents the impact of different types of corpus preprocessing on the quality of compositionality prediction. In EN-comp, all differences between the two highest Spearman values for each DSM were significant, according to Wilcoxon’s sign-rank test, except for PPMI–thresh, whereas in FR-comp and PT-comp they were significant only for PPMI–TopK and lexvec. However, note that the top two results are often both obtained on representations based on lemmas. If we compare the highest lemma-based result with the highest surface-based result for the same DSM, we find a statistically significant difference in every single case (p < 0.05).
When considering the results themselves, although the results for English are heterogeneous, for French and Portuguese, the lemma-based representations consistently allow a better prediction of compositionality scores. This may be explained by the fact that these two languages are morphologically richer than English, and lemma-based representations reduce the sparsity in the data, allowing more information to be gathered from the same amount of data. Moreover, adding POS information (lemmaPoS vs. lemma) does not seem to bring consistent improvements that are statistically significant. This suggests that words that share the same lemma are semantically close enough that any gains from disambiguation are masked by the sparsity of a higher vocabulary size. Finally, the impact of stopword removal is also inconclusive (surface vs. surface+), considering the best models for each language.
7.4 Corpus Size
If we assume that the bigger the corpus, the better the DSM, this could explain why the results for English are better than those for French and Portuguese, although it does not explain why Portuguese is behind French.39 In this section, we examine the impact of corpus size on prediction quality by incrementally increasing the amount of data used to generate the DSMs while monitoring the Spearman correlation (ρ) with the human annotations. We use only the best DSMs for these languages, PPMI–thresh and w2v–sg, with the configurations that produced highest Spearman scores for each full corpus.
As expected, the results in Figure 8 show a smooth, roughly monotonic increase of the ρ values with corpus size, for PPMI–thresh and w2v–sg for each language and data set.40 In all cases there is a clear saturation behavior, so that we can safely say that after one billion tokens, the quality of the predictions reaches a plateau and additional corpus fragments do not bring improvements. This suggests that differences in compositionality prediction performance for these languages cannot be totally explained by differences in corpus sizes.
8. Influence of Compositionality Prediction Function
Up to this point, the predicted compositionality scores for the compounds were calculated using a uniform function that assumes that each component contributes 50% to the meaning of the compound (pcuniform). However, this might not accurately capture a faithful representation of compounds whose meaning is more semantically related to one of the components (e.g., crocodile tears, which is semantically closer to the head tears; and night owl, which is semantically closer to the modifier night). As this may have an impact on the success of compositionality prediction, in this section we evaluate how different compositionality prediction functions model these compounds. In particular, we proposed pcmaxsim, (Section 4) for dynamically determining weights that assign maximal similarity between the compound and each of its components. We have also proposed pcgeom, which favors idiomatic readings through the geometric mean of the similarities between a compound and its components. Our hypotheses are that pcmaxsim will be better correlated with human scores for compositional and partly compositional compounds, while pcgeom can better capture the semantics of idiomatic ones (Section 8.1).
First, to verify whether other prediction functions improve results obtained for the best pcuniform configurations reported up to now, we have evaluated every strategy on all DSM configurations. Table 6 shows that the functions that combine both components (columns pcuniform to pcarith) generate better compositionality predictions than functions that ignore one of the individual components (columns pchead and pcmod). There is some variation among the combined scores, with the best score indicated in bold. Every best score is statistically different from all other scores in its row (p < 0.05). The results for pcarith and pcuniform are very similar, reflecting their similar formulations.41
Data set . | pcuniform . | pcmaxsim . | pcgeom . | pcarith . | pchead . | pcmod . |
---|---|---|---|---|---|---|
EN-comp | .726 | .730 | .677 | .718 | .555 | .677 |
FR-comp | .702 | .693 | .699 | .703 | .617 | .645 |
PT-comp | .602 | .590 | .580 | .598 | .558 | .486 |
Data set . | pcuniform . | pcmaxsim . | pcgeom . | pcarith . | pchead . | pcmod . |
---|---|---|---|---|---|---|
EN-comp | .726 | .730 | .677 | .718 | .555 | .677 |
FR-comp | .702 | .693 | .699 | .703 | .617 | .645 |
PT-comp | .602 | .590 | .580 | .598 | .558 | .486 |
Here we focus on the issue of adjusting β in the compositionally constructed vector; that is, we consider the use of pcmaxsim instead of pcuniform. This score seems to be beneficial in the case of English (EN-comp), but not in the case of French or Portuguese. Table 7 presents the best pcmaxsim model for each data set, along with the average weights assigned to head and modifier for every compound in the data set. Before analyzing the results in Table 7, we have to verify whether the data sets are balanced for the influence of each component to the meaning of the whole, or if there is any bias towards heads/modifiers. The influence of the head, estimated as the average of hcH/(hcH + hcM) over all compounds of a data set, is 0.50 for EN-comp, 0.52 for FR-comp, and 0.52 for PT-comp. This indicates that the data sets are balanced in terms of the influence of each component, and neither head nor modifier predominates as more compositional or idiomatic than the other.
Data set . | DSM . | ρmaxsim . | (head) . | (mod.) . |
---|---|---|---|---|
EN-comp | w2v–sg | .730 | .55 | .45 |
FR-comp | PPMI–thresh | .693 | .68 | .32 |
PT-comp | w2v–sg | .590 | .68 | .32 |
Data set . | DSM . | ρmaxsim . | (head) . | (mod.) . |
---|---|---|---|---|
EN-comp | w2v–sg | .730 | .55 | .45 |
FR-comp | PPMI–thresh | .693 | .68 | .32 |
PT-comp | w2v–sg | .590 | .68 | .32 |
As for the average β weights in pcmaxsim, while the weights that maximize compositionality are fairly similar for EN-comp, they strongly favor the head for both FR-comp and PT-comp. This may be explained by the fact that, for the latter, the modifiers are all adjectives, while EN-comp has mostly nouns as modifiers. Surprisingly, this seemingly more realistic weighting of the compound components for French and Portuguese does not reflect in better compositionality scores, and does not correspond to the average influence of modifiers in these data sets, estimated as 0.48 on average. One possible explanation could be that, in these cases, the adjectives may be contributing to some specific more idiomatic meaning that is not found in isolated occurrences of the adjective itself, such as FRbeau (lit. beautiful), which is used in the translation of most in-law family members, such as FRbeau-frère (lit. beautiful-brother ‘brother-in-law’). In the next section, we investigate which compounds are affected the most by these different scores.
8.1 Rank Improvement Analysis
We perform a cross-lingual analysis, grouping the hcHM scores of the EN-comp, FR-comp, and PT-comp into a unique data set (henceforth ALL-comp), containing 540 compounds. Figure 9 presents the values of rank improvement for the best PPMI–thresh and w2v–sg configurations, ranked according to hcHM (rkhuman): compounds that are better predicted by pcmaxsim have positive rank movements (above the 0 line).42 The density of movement on either side of the 0 (no movement) line appears to be similar for both models with pcmaxsim performing as well as pcuniform.
Figure 9 also marks the outlier compounds with the highest improvements (numbers from 1 to 8) and those with the lowest improvements (letters from A to H), and Table 8 shows their improvement scores. In the case of these outliers, the adjustment seems to be more beneficial to compositional compounds than to idiomatic cases. This is confirmed by a linear regression of the movement of the 8 +8 outliers as a function of the compositionality scores hcHM, where we obtain a positive coefficient of r = 0.73 and r = 0.72 for PPMI–thresh and w2v–sg, respectively. There are more outlier compounds for Portuguese and French (particularly the former), suggesting that pcmaxsim has a stronger impact on those languages than on English. Moreover, some compounds had a similar improvement under both DSMs, with, for example, high improvement for PTcaixa forte literally box strong ‘safe’ and low improvement for PTcoração partido ‘broken heart’. In addition, pcmaxsim also affected some equivalent compounds in different languages, as in the case of PTcaixa forte and FRcoffre fort. Overall, pcmaxsim does not present a considerable impact on the predictions, obtaining an average improvement of = +0.41 across all compounds in ALL-comp.
improvmaxsim for PPMI–thresh . | |||
---|---|---|---|
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +90 | 2.82 | FRpremier plan ‘foreground’ (lit. first plan) |
2 | +88 | 2.90 | FRmatiére premiére ‘raw material’ (lit. matter primary) |
3 | +86 | 2.89 | PTamigo oculto ‘secret Santa’ (lit. friend hidden) |
4 | +67 | 1.92 | FRpremiére dame ‘first lady’ (lit. first lady) |
5 | +63 | 3.19 | PTcaixa forte ‘safe, vault’ (lit. box strong) |
6 | +58 | 3.14 | PTprato feito ‘blue-plate special’ (lit. plate ready-made) |
7 | +53 | 2.90 | FRidée reçue ‘popular belief’ (lit. idea received) |
8 | +48 | 3.00 | FRmarée noire ‘oil spill’ (lit. tide black) |
H | −42 | 1.52 | PTalta costura ‘haute couture’ (lit. high sewing) |
G | −44 | 2.84 | ENhalf sister |
F | −44 | 0.54 | ENmelting pot |
E | −46 | 1.29 | FRberger allemand ‘German shepherd’ (lit. shepherd German) |
D | −52 | 2.87 | PTmar aberto ‘open sea’ (lit. sea open) |
C | −55 | 1.43 | PTfebre amarela ‘yellow fever’ (lit. fever yellow) |
B | −81 | 0.79 | PTlivro aberto ‘open book’ (lit. book open) |
A | −83 | 1.06 | PTcoração partido ‘broken heart’ (lit. heart broken) |
improvmaxsim for w2v–sg . | |||
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +138 | 3.58 | PTcerca viva ‘hedge’ (lit. fence living) |
2 | +126 | 3.67 | FRcoffre fort ‘safe, vault’ (lit. chest/box strong) |
3 | +116 | 3.19 | PTcaixa forte ‘safe, vault’ (lit. chest/box strong) |
4 | +107 | 2.03 | PTgolpe baixo ‘low blow’ (lit. punch low) |
5 | +100 | 3.97 | PTprimeira necessidade ‘first necessity’ (lit. first necessity) |
6 | +95 | 4.11 | ENrole model |
7 | +79 | 4.47 | FRbonne pratique ‘good practice’ (lit. good practice) |
8 | +69 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
H | −68 | 0.40 | FRbras droit ‘most important helper/assistant’ (lit. arm right) |
G | −70 | 1.52 | PTalta costura ‘haute couture’ (lit. high sewing) |
F | −71 | 3.66 | PTcarne vermelha ‘red meat’ (lit. meat red) |
E | −82 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
D | −85 | 1.10 | PTmesa redonda ‘round table’ (lit. table round) |
C | −86 | 2.84 | ENhalf sister |
B | −109 | 1.43 | PTfebre amarela ‘yellow fever’ (lit. fever yellow) |
A | −128 | 1.06 | PTcoração partido ‘broken heart’ (lit. heart broken) |
improvmaxsim for PPMI–thresh . | |||
---|---|---|---|
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +90 | 2.82 | FRpremier plan ‘foreground’ (lit. first plan) |
2 | +88 | 2.90 | FRmatiére premiére ‘raw material’ (lit. matter primary) |
3 | +86 | 2.89 | PTamigo oculto ‘secret Santa’ (lit. friend hidden) |
4 | +67 | 1.92 | FRpremiére dame ‘first lady’ (lit. first lady) |
5 | +63 | 3.19 | PTcaixa forte ‘safe, vault’ (lit. box strong) |
6 | +58 | 3.14 | PTprato feito ‘blue-plate special’ (lit. plate ready-made) |
7 | +53 | 2.90 | FRidée reçue ‘popular belief’ (lit. idea received) |
8 | +48 | 3.00 | FRmarée noire ‘oil spill’ (lit. tide black) |
H | −42 | 1.52 | PTalta costura ‘haute couture’ (lit. high sewing) |
G | −44 | 2.84 | ENhalf sister |
F | −44 | 0.54 | ENmelting pot |
E | −46 | 1.29 | FRberger allemand ‘German shepherd’ (lit. shepherd German) |
D | −52 | 2.87 | PTmar aberto ‘open sea’ (lit. sea open) |
C | −55 | 1.43 | PTfebre amarela ‘yellow fever’ (lit. fever yellow) |
B | −81 | 0.79 | PTlivro aberto ‘open book’ (lit. book open) |
A | −83 | 1.06 | PTcoração partido ‘broken heart’ (lit. heart broken) |
improvmaxsim for w2v–sg . | |||
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +138 | 3.58 | PTcerca viva ‘hedge’ (lit. fence living) |
2 | +126 | 3.67 | FRcoffre fort ‘safe, vault’ (lit. chest/box strong) |
3 | +116 | 3.19 | PTcaixa forte ‘safe, vault’ (lit. chest/box strong) |
4 | +107 | 2.03 | PTgolpe baixo ‘low blow’ (lit. punch low) |
5 | +100 | 3.97 | PTprimeira necessidade ‘first necessity’ (lit. first necessity) |
6 | +95 | 4.11 | ENrole model |
7 | +79 | 4.47 | FRbonne pratique ‘good practice’ (lit. good practice) |
8 | +69 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
H | −68 | 0.40 | FRbras droit ‘most important helper/assistant’ (lit. arm right) |
G | −70 | 1.52 | PTalta costura ‘haute couture’ (lit. high sewing) |
F | −71 | 3.66 | PTcarne vermelha ‘red meat’ (lit. meat red) |
E | −82 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
D | −85 | 1.10 | PTmesa redonda ‘round table’ (lit. table round) |
C | −86 | 2.84 | ENhalf sister |
B | −109 | 1.43 | PTfebre amarela ‘yellow fever’ (lit. fever yellow) |
A | −128 | 1.06 | PTcoração partido ‘broken heart’ (lit. heart broken) |
Figure 10 shows the same analysis for f = geom, showing the improvement score of pcgeom over pcuniform. We hypothesized that pcgeom should more accurately represent idiomatic compounds. From the previous sections, we know that pcgeom has lower performance than pcuniform when used to estimate the compositionality of the entire data sets (cf. Table 6). This is confirmed by an average score of = −7.87. As in Figure 9, Figure 10 shows a random distribution of improvements. However, the outliers have the opposite pattern, indicating that large reclassifications due to pcgeom tend to favor idiomatic instead of compositional compounds. The linear regression of the movement of the outliers as a function of the compositionality scores results in r = −0.73 and r = −0.82 for PPMI–thresh and w2v–sg, respectively. These confirm our hypothesis for the behavior of pcgeom.
Table 9 lists the outlier compounds indicated in Figure 10 along with their improvement values. Here again, the majority of the outliers belong to PT-comp. Some of the compounds that were found as outliers in pcmaxsim re-appear as outliers for pcgeom with inverted polarity in the improvement score, such as the ranks predicted by PPMI–thresh for PTprato feito literally plate made ‘blue-plate special’ (improvmaxsim = +58, improvgeom = −234) and by w2v–sg for FRbras droit literally arm right ‘assistant’ (improvmaxsim = −68, improvgeom = +228). This suggests that, as future work, we should consider combining both approaches into a single prediction that decides which score to use for each compound as a function of pcuniform.
improvgeom for PPMI–thresh . | |||
---|---|---|---|
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +157 | 1.31 | ENsnail mail |
2 | +110 | 3.43 | FRguerre civile ‘civil war’ (lit. war civil) |
3 | +109 | 2.83 | FRdisque dur ‘hard drive’ (lit. disk hard) |
4 | +104 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
5 | +93 | 2.63 | PTônibus executivo ‘minibus’ (lit. bus executive) |
6 | +85 | 3.32 | ENsearch engine |
7 | +82 | 2.62 | PTcarro forte ‘armored car’ (lit. car strong) |
8 | +79 | 1.18 | ENnoble gas |
H | −190 | 2.44 | PTar condicionado ‘air conditioning’ (lit. air conditioned) |
G | −202 | 3.67 | FRcoffre fort ‘safe, vault’ (lit. chest/box strong) |
F | −202 | 3.57 | FRbon sens ‘common sense’ (lit. good sense) |
E | −234 | 3.14 | PTprato feito ‘blue-plate special’ (lit. plate ready-made) |
D | −292 | 3.64 | FRbaie vitrée ‘open glass window’ (lit. opening glassy) |
C | −327 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
B | −370 | 4.08 | PTvinho tinto ‘red wine’ (lit. wine dark-red) |
A | −376 | 1.69 | PTcircuito integrado ‘short circuit’ (lit. short circuit) |
improvgeom for w2v–sg . | |||
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +228 | 0.40 | FRbras droit ‘most important helper/assistant’ (lit. arm right) |
2 | +158 | 1.40 | PTlua nova ‘new moon’ (lit. moon new) |
3 | +127 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
4 | +104 | 0.10 | PTpé direito ‘ceiling height’ (lit. foot right) |
5 | +89 | 1.24 | ENcarpet bombing |
6 | +75 | 1.60 | PTlista negra ‘black list’ (lit. list black) |
7 | +73 | 0.65 | PTarma branca ‘cold weapon’ (lit. weapon white) |
8 | +72 | 3.32 | ENsearch engine |
H | −151 | 2.76 | PTdisco rígido ‘hard drive’ (lit. disk rigid) |
G | −169 | 4.63 | ENsubway system |
F | −190 | 2.62 | PTcarro forte ‘armored car’ (lit. car strong) |
E | −238 | 2.83 | FRdisque dur ‘hard drive’ (lit. disk hard) |
D | −256 | 2.84 | ENhalf sister |
C | −260 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
B | −266 | 4.47 | FRbonne pratique ‘good practice’ (lit. good practice) |
A | −370 | 4.25 | ENend user |
improvgeom for PPMI–thresh . | |||
---|---|---|---|
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +157 | 1.31 | ENsnail mail |
2 | +110 | 3.43 | FRguerre civile ‘civil war’ (lit. war civil) |
3 | +109 | 2.83 | FRdisque dur ‘hard drive’ (lit. disk hard) |
4 | +104 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
5 | +93 | 2.63 | PTônibus executivo ‘minibus’ (lit. bus executive) |
6 | +85 | 3.32 | ENsearch engine |
7 | +82 | 2.62 | PTcarro forte ‘armored car’ (lit. car strong) |
8 | +79 | 1.18 | ENnoble gas |
H | −190 | 2.44 | PTar condicionado ‘air conditioning’ (lit. air conditioned) |
G | −202 | 3.67 | FRcoffre fort ‘safe, vault’ (lit. chest/box strong) |
F | −202 | 3.57 | FRbon sens ‘common sense’ (lit. good sense) |
E | −234 | 3.14 | PTprato feito ‘blue-plate special’ (lit. plate ready-made) |
D | −292 | 3.64 | FRbaie vitrée ‘open glass window’ (lit. opening glassy) |
C | −327 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
B | −370 | 4.08 | PTvinho tinto ‘red wine’ (lit. wine dark-red) |
A | −376 | 1.69 | PTcircuito integrado ‘short circuit’ (lit. short circuit) |
improvgeom for w2v–sg . | |||
ID . | improv . | hcHM . | Compound ‘translation’ (gloss) . |
1 | +228 | 0.40 | FRbras droit ‘most important helper/assistant’ (lit. arm right) |
2 | +158 | 1.40 | PTlua nova ‘new moon’ (lit. moon new) |
3 | +127 | 1.35 | PTalto mar ‘high seas’ (lit. high sea) |
4 | +104 | 0.10 | PTpé direito ‘ceiling height’ (lit. foot right) |
5 | +89 | 1.24 | ENcarpet bombing |
6 | +75 | 1.60 | PTlista negra ‘black list’ (lit. list black) |
7 | +73 | 0.65 | PTarma branca ‘cold weapon’ (lit. weapon white) |
8 | +72 | 3.32 | ENsearch engine |
H | −151 | 2.76 | PTdisco rígido ‘hard drive’ (lit. disk rigid) |
G | −169 | 4.63 | ENsubway system |
F | −190 | 2.62 | PTcarro forte ‘armored car’ (lit. car strong) |
E | −238 | 2.83 | FRdisque dur ‘hard drive’ (lit. disk hard) |
D | −256 | 2.84 | ENhalf sister |
C | −260 | 3.64 | PTcarta aberta ‘open letter’ (lit. letter open) |
B | −266 | 4.47 | FRbonne pratique ‘good practice’ (lit. good practice) |
A | −370 | 4.25 | ENend user |
9. Characterization of the Predicted Compositionality
In the previous sections, we examined the performance of the compositionality prediction framework in terms of the correlation between automatic predictions and human judgments across languages. We now investigate the relation between predicted scores and other variables that may have an impact on results, such as familiarity (Section 9.1) and conventionalization (Section 9.2). We also compare the predicted compositionality scores with trends previously found in human scores (Section 9.3). The experiments focus on the ALL-comp data set, which groups the predicted scores from the best configurations on EN-comp, FR-comp, and PT-comp (cf. Table 4).
9.1 Predicted Compositionality and Familiarity
Results from Section 3.2.2 show that the familiarity of compounds measured as frequency in large corpora is associated with the compositionality scores assigned by humans. We would like to know whether this correlation also holds true to system predictions: Are the most frequent compounds being predicted as more compositional? As expected, the rank correlation between frequency and pcuniform shows medium to strong correlation (see Table 10, column ρ[pc, freq]), though the level of correlation is somewhat DSM-dependent, are in line with the correlation observed between frequency and human scores, and with the high correlation between predicted and human scores.
DSM . | ρ(pc, freq) . | ρ(diff, freq) . | ρ(pc, PMI) . | ρ(diff, PMI) . |
---|---|---|---|---|
PPMI–SVD | 0.36 | 0.17 | 0.28 | * |
PPMI–thresh | 0.46 | 0.22 | 0.13 | * |
glove | 0.68 | −0.19 | 0.26 | * |
lexvec | 0.54 | * | 0.26 | −0.12 |
PPMI–TopK | 0.28 | 0.15 | * | * |
w2v–cbow | 0.51 | * | 0.17 | * |
w2v–sg | 0.50 | * | 0.17 | * |
DSM . | ρ(pc, freq) . | ρ(diff, freq) . | ρ(pc, PMI) . | ρ(diff, PMI) . |
---|---|---|---|---|
PPMI–SVD | 0.36 | 0.17 | 0.28 | * |
PPMI–thresh | 0.46 | 0.22 | 0.13 | * |
glove | 0.68 | −0.19 | 0.26 | * |
lexvec | 0.54 | * | 0.26 | −0.12 |
PPMI–TopK | 0.28 | 0.15 | * | * |
w2v–cbow | 0.51 | * | 0.17 | * |
w2v–sg | 0.50 | * | 0.17 | * |
Another hypothesis we test is whether frequent compounds are easier to model. A first intuition would be that this hypothesis is true, as a higher number of occurrences is associated with a larger amount of data, from which more representative vectors can be built. To test this hypothesis, we define a compound’s difficulty as the difference between the predicted score and the normalized human score, diff = |pc − (hcHM/5)|, where high values indicate a compound whose compositionality is harder to predict.43
We found a weak (though statistically significant) correlation between frequency and difficulty for some of the DSMs (Table 10, column ρ[diff, freq]). They are mostly positive, indicating that frequency is correlated with difficulty, which is a surprising result, as it implies that the compositionality of rarer compounds was mildly easier to predict for these systems, disproving the hypothesis above. These results either point to an overall lack of correlation between frequency and difficulty, or indicate mild DSM-specific behavior, which should be investigated in further research.
9.2 Predicted Compositionality and Conventionalization
PMI is not only a well-known estimator of the level of conventionalization of a multiword expression (Church and Hanks 1990; Evert 2004; Farahmand, Smith, and Nivre 2015), but it is also used in some DSMs as a way to estimate the strength of association between target and context words. To assess if what our models are implicitly measuring is the association between the component words of a compound rather than compositionality, we now examine the correlation between compositionality scores and PMI.
We found only a weak but statistically significant correlation between predicted compositionality and PMI (Table 10, column ρ[pc, PMI]), which suggests that these DSMs preserve some information regarding conventionalization. However, given that no significant correlation between PMI and human compositionality scores was found (Section 3.2.2) and as DSM predictions are strongly correlated to human predictions, these results indicate that our models capture more than conventionalization. They may also be a feature of this particular set of compounds, as even the compositional cases are also conventional to some extent (e.g., white/?yellow wine). Therefore, further investigation of possible links between idiomaticity and conventionalization is needed.
We also calculated the correlation between PMI and the human–prediction difference (diff), to determine if DSMs build less precise vectors for less conventionalized compounds (approximated as those with lower PMI). However, no statistically significant correlation was found for most DSMs (Table 10, column ρ[diff, PMI]).
9.3 Range-Based Analysis of Predicted Compositionality
Spearman correlation assesses the performance of a given configuration by providing a single numerical value. This facilitates the comparison between configurations, but it hides the internal distribution of predictions. By splitting the data sets into ranges, we obtain a more fine-grained view of possible patterns linked to compositionality prediction.
To determine if the compounds that humans agree more on are also more accurately predicted, we divided ALL-comp into three equally sized subsets, according to the standard deviation among human annotators (low, mid-range, and high values of standard deviation, σHM). As high standard deviation indicates disagreement among annotators, it may be an indicator of the difficulty of the annotation. Table 11 presents the best DSMs, according to Spearman’s ρ evaluated separately on each of the subsets. Indeed, for the compounds that had low σHM, the Spearman values were the highest (between 0.73 and 0.75), while for those with high σHM, the Spearman correlation with human judgments was the lowest (between 0.35 and 0.43). These results confirm that higher scores are achieved for the compounds for which humans agree more, and suggest that part of the difficulty of this task for automatic systems is also related to difficulties for humans.
DSM . | full data set . | Ranges of σHM . | Ranges of hcHM . | ||||
---|---|---|---|---|---|---|---|
low . | mid . | high . | low . | mid . | high . | ||
PPMI–thresh | 0.66 | 0.75 | 0.58 | 0.40 | 0.29 | 0.24 | 0.37 |
glove | 0.63 | 0.73 | 0.54 | 0.35 | 0.27 | 0.26 | 0.35 |
lexvec | 0.64 | 0.73 | 0.54 | 0.36 | 0.18 | 0.20 | 0.37 |
w2v–sg | 0.66 | 0.73 | 0.58 | 0.43 | 0.16 | 0.24 | 0.32 |
DSM . | full data set . | Ranges of σHM . | Ranges of hcHM . | ||||
---|---|---|---|---|---|---|---|
low . | mid . | high . | low . | mid . | high . | ||
PPMI–thresh | 0.66 | 0.75 | 0.58 | 0.40 | 0.29 | 0.24 | 0.37 |
glove | 0.63 | 0.73 | 0.54 | 0.35 | 0.27 | 0.26 | 0.35 |
lexvec | 0.64 | 0.73 | 0.54 | 0.36 | 0.18 | 0.20 | 0.37 |
w2v–sg | 0.66 | 0.73 | 0.58 | 0.43 | 0.16 | 0.24 | 0.32 |
To determine if compositional compounds would be more precisely predicted than idiomatic compounds, we divide ALL-comp into three equally sized subsets based on the level of human compositionality scores (low, mid-range, and high values of hcHM). Table 11 presents the correlation obtained on each subset for the best configuration of each DSM. The more idiomatic compounds have the lowest Spearman values (from 0.16 to 0.29) while the more compositional have the highest ones (from 0.32 to 0.37). These results confirm that the predictions are better for compositional than for idiomatic compounds. Moreover, these scores are much lower than those from the full data set (from 0.63 to 0.66), suggesting that it may be harder to make fine-grained distinctions (e.g., between two compositional compounds like access road and subway system) than to make inter-range distinctions (e.g., between idiomatic and compositional compounds like ivory tower and access road). However, further investigation would be needed to verify this hypothesis.
10. Conclusions
We proposed a framework for compositionality prediction of multiword expressions, focusing on nominal compounds and using DSMs for meaning representation. We investigated how accurately DSMs capture idiomaticity compared to human judgments and examined the impact of several variables in the accuracy of the predictions. In order to determine how language dependent the results are, we evaluated the compositionality prediction framework in English, French, and Portuguese, using data sets containing human-rated compositionality scores, some of which were specifically constructed as part of this work.44 Using these data sets, we presented a large-scale evaluation involving 228 DSMs for each language, and we evaluated more than 9,000 framework configurations to determine the impact of possible factors that may influence compositionality prediction.
Our experiments confirmed that our framework is able to capture idiomaticity accurately, obtaining a strong correlation with human judgments for all three languages. Comparing the performance of different DSMs, the particular choice of DSM had a noticeable impact on the results, with differences over 0.10 Spearman ρ points for all languages. For the comparable data sets (EN-comp, FR-comp, and PT-comp), the best models were w2v and PPMI–thresh.45 Results differed according to language: although for English w2v were the best models, for French and Portuguese, PPMI–thresh outperformed the other models. Moreover, the results for the three languages varied considerably, with those for English outperforming by 0.10 and 0.20 Spearman ρ points those for French and Portuguese, respectively. The latter are morphologically richer than the former, and a closer examination of the type of preprocessing adopted for best results reveals that both languages benefit from less sparse representations resulting from lemmatization and stopword removal, while for English no preprocessing was particularly beneficial.
Although corpus size is often assumed to play a fundamental role in the quality of DSMs, so that the bigger the corpus the better the results, prediction quality stabilized at around one billion tokens for all languages. This may reflect the point where the minimum frequency was reached for producing reliable representations for all compounds in these data sets, even the rare cases, and larger corpora did not lead to better predictions. Moreover, for the best models in each language, DSMs with more dimensions resulted in more accurate predictions confirming our hypothesis. We also found a trend for small window sizes leading to better results for the best models in all three languages, contrary to our hypothesis. A typically good configuration used vectors of 750 dimensions built from minimal context windows of one word to each side of the target.
DSMs were also robust regarding the choice of compositionality prediction function, with a uniform combination of the head and modifier producing the best results for all languages. Other functions like pcmaxsim and pcgeom, which modify these scores to account for different contributions of each component, produced at best similar results.
A deeper analysis of the predicted compositionality scores revealed that, similarly to human-rated scores, familiarity measured as frequency was positively correlated with predicted compositionality. In the case of conventionalization measured as PMI, no correlation was found with human-rated scores and only a mild correlation was found with some predicted scores, suggesting that our models capture more than compound conventionalization, as they have a strong agreement with human scores. Intra-compound standard deviation on human scores was also found to be related to predicted scores, indicating that DSMs have difficulties on those compounds that humans also found difficult. Moreover, predictions were found to be more accurate for compositional compounds.
Although there are many questions that still need to be solved regarding compositionality, we believe that the results presented here advance significantly its understanding and computational modeling. Furthermore, the proposed framework opens important avenues of research that are ready to be pursued. First, the role of morphological inflection could be clarified by extending this investigation to even more inflected languages, such as Turkish. Moreover, other categories of MWEs such as verb +noun expressions should be evaluated to determine the interplay between compositionality prediction and syntactic flexibility of MWEs. The ultimate test would be to use predicted compositionality scores in downstream applications and tasks involving some degree of semantic processing, ranging from MWE identification to parsing, and word-sense disambiguation. In particular, it would be interesting to predict compositionality in context, in order to distinguish idiomatic from literal usages in sentences.
Appendix A. Glossary
- composition function
is a function that takes as input a sequence of vectors v(wi) to v(wj) and outputs a compositionally constructed vector v⊕(wi … wj) representing the compositional meaning of the sequence, where ⊕ indicates the function used to compose the vectors. Example: . 1, 23
- compositionality prediction configuration
is the combination of a particular DSM configuration with a given compositionality prediction function, fully specifying how a predicted compositionality score is calculated for a given word sequence wi … wj. 1
- compositionality prediction framework
is the set of all possible compositionality prediction configurations available. 1, 22
- compositionality prediction function
is a function that takes as input corpus-based vectors for a sequence of words v(wi … wj) and for the individual words composing that sequence v(wi) … v(wj), and outputs a predicted compositionality score, usually proportional to the similarity between the corpus-based vector v(wi … wj) and a compositionally constructed vector v(wi) to v(wj) derived from v(wi) … v(wj) using a composition function. Example: maxsim. 1
- compositionally constructed vector
is the output of a composition function, that is, a vector v⊕(wi … wj) derived from the individual words’ corpus-derived vectors v(wi) to v(wj). 1, 23
- corpus-derived vector
is the output of a DSM for a given element wi of the vocabulary V, that is, a corpus-derived D-dimensional real-numbered vector v(wi) that represents the meaning of wi. A corpus-derived vector of a word sequence v(wi … wj) is built by treating it as a single token in the corpus. 1, 22
- distributional semantic model (DSM)
is a function that takes as input a vocabulary V and a (large) corpus, and outputs a corpus-derived vector v(wi) for each element wi of V based on the distributional profile of wi’s occurrences in the corpus. The vocabulary V can be automatically derived from the input corpus. Example: w2v–cbow. 1
- DSM configuration
is a set of DSM parameters and their values, fully specifying how corpus-derived vectors are built from a given corpus. Example: w2v–cbow using lemmaPoS.W8.d750. 1, 28
- DSM parameter
is a variable in a DSM whose value influences the way corpus-derived vectors are built from a corpus. Example: WordForm. 1, 28
- human compositionality score (hc)
is a real value representing the compositionality assigned by human annotators to a word sequence wi … wj. The correlation between predicted compositionality (pc) and human compositionality (hc) scores is used to evaluate a compositionality prediction configuration. When subscripted, indicates the question used to obtain the score. Example: hcH. 1, 17, 29
- predicted compositionality score (pc)
is the output of a compositionality prediction function, that is, a real value representing the predicted compositionality of a word sequence wi … wj. The correlation between predicted compositionality (pc) and human compositionality (hc) scores is used to evaluate a compositionality prediction configuration. When subscripted, indicates the compositionality prediction function used to obtain the score. Example: pcuniform. 1, 29
Appendix B. Sanity Checks
The number of possible DSM configurations grows exponentially with the number of internal variables in a DSM, forestalling the possibility of an exhaustive search for every possible parameter. We have evaluated in this article the set of variables that are most often manually tuned in the literature, but a reasonable question would be whether these results can be further improved through the modification of some other often-ignored model-specific parameters. We thus perform some sanity checks through a local search of such parameters around the highest-Spearman configuration of each DSM.
B.1 Number of Iterations
Some of the DSMs in consideration on this paper are iterative: they re-read and reprocess the same corpus multiple times. For those DSMs, we present the results of running their best configuration, but using a higher number of iterations. This higher number of iterations is inspired by the models found in parts of the literature, where, for example, the number of glove iterations can be as high as 50 (Salle, Villavicencio, and Idiart 2016) or even 100 (Pennington, Socher, and Manning 2014). The intuition is that most models will lose some information (due to their probabilistic sampling), which could be regained at the cost of a higher number of iterations.
Table 12 presents a comparison between the baseline ρ for 15 iterations and the ρ obtained when 100 iterations are performed. For all DSMs, we see that the increase in the number of iterations does not improve the quality of the vectors, with the relatively small number of 15 iterations yielding better results. This may suggest that a small number of iterations can already sample enough distributional information, with further iterations accruing additional noise from low-frequency words. The extra number of iterations could also be responsible for overfitting of the DSM to represent particularities of the corpus, which would reduce the quality of the underlying vectors. Given the extra cost of running more iterations,46 we refrained from building further models with as many iterations in the rest of the article.
Model (FR-comp) . | ρbase . | ρiter=100 . | Difference (%) . |
---|---|---|---|
w2v–cbow | .660 | .640 | (−2.0) |
w2v–sg | .672 | .636 | (−3.7) |
glove | .680 | .677 | (−0.3) |
lexvec | .677 | .671 | (−0.6) |
Model (Reddy) . | ρbase . | ρiter=100 . | Difference (%) . |
w2v–cbow | .809 | .766 | (−4.3) |
w2v–sg | .821 | .777 | (−4.4) |
glove | .764 | .746 | (−1.8) |
lexvec | .774 | .757 | (−1.7) |
Model (PT-comp) . | ρbase . | ρiter=100 . | Difference (%) . |
w2v–cbow | .588 | .558 | (−3.0) |
w2v–sg | .586 | .551 | (−3.6) |
glove | .555 | .464 | (−9.1) |
lexvec | .570 | .561 | (−0.9) |
Model (FR-comp) . | ρbase . | ρiter=100 . | Difference (%) . |
---|---|---|---|
w2v–cbow | .660 | .640 | (−2.0) |
w2v–sg | .672 | .636 | (−3.7) |
glove | .680 | .677 | (−0.3) |
lexvec | .677 | .671 | (−0.6) |
Model (Reddy) . | ρbase . | ρiter=100 . | Difference (%) . |
w2v–cbow | .809 | .766 | (−4.3) |
w2v–sg | .821 | .777 | (−4.4) |
glove | .764 | .746 | (−1.8) |
lexvec | .774 | .757 | (−1.7) |
Model (PT-comp) . | ρbase . | ρiter=100 . | Difference (%) . |
w2v–cbow | .588 | .558 | (−3.0) |
w2v–sg | .586 | .551 | (−3.6) |
glove | .555 | .464 | (−9.1) |
lexvec | .570 | .561 | (−0.9) |
B.2 Minimum Count Threshold
Minimum-count thresholds are often neglected in the literature, where a default configuration of 0, 1, or 5 being presumably used by most authors. An exception to this trend is the threshold of 100 occurrences used by Levy, Goldberg, and Dagan (2015), whose toolkit we use in PPMI–SVD. No explicit justification has been found for this higher word-count threshold. A reasonable hypothesis would be that higher thresholds improve the quality of the data, as it filters rare words more aggressively.
Table 13 presents the result from the highest-Spearman configurations along with the results for an identical configuration with a higher occurrence threshold of 50.47 The results unanimously agree that a higher threshold does not contribute to the removal of any extra noise. In particular, for PPMI–SVD, it seems to discard enough useful information to considerably reduce the quality of the compositionality prediction measure. The results strongly contradict the default configuration used for PPMI–SVD, suggesting that a lower word-count threshold might yield better results for this task.
Model (FR-comp) . | ρbase . | ρmincount=50 . | Difference (%) . |
---|---|---|---|
w2v–cbow | .660 | .610 | (−5.0) |
w2v–sg | .672 | .613 | (−5.9) |
glove | .680 | .673 | (−0.7) |
PPMI–SVD | .584 | .258 | (−32.6) |
lexvec | .677 | .653 | (−2.4) |
Model (Reddy) . | ρbase . | ρmincount=50 . | Difference (%) . |
w2v–cbow | .809 | .778 | (−3.1) |
w2v–sg | .821 | .776 | (−4.5) |
glove | .764 | .672 | (−9.2) |
PPMI–SVD | .743 | .515 | (−22.8) |
lexvec | .774 | .738 | (−3.6) |
Model (PT-comp) . | ρbase . | ρmincount=50 . | Difference (%) . |
w2v–cbow | .588 | .580 | (−0.8) |
w2v–sg | .586 | .575 | (−1.1) |
glove | .555 | .540 | (−1.5) |
PPMI–SVD | .530 | .418 | (−11.1) |
lexvec | .570 | .566 | (−0.4) |
Model (FR-comp) . | ρbase . | ρmincount=50 . | Difference (%) . |
---|---|---|---|
w2v–cbow | .660 | .610 | (−5.0) |
w2v–sg | .672 | .613 | (−5.9) |
glove | .680 | .673 | (−0.7) |
PPMI–SVD | .584 | .258 | (−32.6) |
lexvec | .677 | .653 | (−2.4) |
Model (Reddy) . | ρbase . | ρmincount=50 . | Difference (%) . |
w2v–cbow | .809 | .778 | (−3.1) |
w2v–sg | .821 | .776 | (−4.5) |
glove | .764 | .672 | (−9.2) |
PPMI–SVD | .743 | .515 | (−22.8) |
lexvec | .774 | .738 | (−3.6) |
Model (PT-comp) . | ρbase . | ρmincount=50 . | Difference (%) . |
w2v–cbow | .588 | .580 | (−0.8) |
w2v–sg | .586 | .575 | (−1.1) |
glove | .555 | .540 | (−1.5) |
PPMI–SVD | .530 | .418 | (−11.1) |
lexvec | .570 | .566 | (−0.4) |
B.3 Windows of Size 2 + 2
For many models, the best window size found was either WindowSize = 1 + 1 or WindowSize = 4 + 4 (see Section 7.1). It is possible that a higher score could be obtained by a configuration in between. While a full exhaustive search would be the ideal solution, an initial approximation of the best 2 + 2 configuration could be obtained by running the experiments on the highest Spearman configurations, with the window size replaced by 2 + 2.
Results shown in Table 14 for a window size of 2 + 2 are consistently worse than the base model, indicating that the optimal configuration is likely the one that was obtained with window size of 1 + 1 or 4 + 4. This is further confirmed by the fact that most DSMs had the best configuration with window size of 1 + 1 or 8 + 8, with few cases of 4 + 4 as best model, which suggests that the quality of most configurations in the space of models is either monotonically increasing or decreasing with regards to these window sizes, favoring thus the configurations with more extreme WindowSize parameters.
Model (FR-comp) . | ρbase . | ρwin=2+2 . | Difference (%) . |
---|---|---|---|
PPMI–SVD | .584 | .397 | (−18.7) |
PPMI–thresh | .702 | .678 | (−2.4) |
glove | .680 | .657 | (−2.3) |
lexvec | .677 | .671 | (−0.6) |
w2v–cbow | .660 | .644 | (−1.6) |
w2v–sg | .672 | .639 | (−3.3) |
Model (Reddy) . | ρbase . | ρwin=2+2 . | Difference (%) . |
PPMI–SVD | .743 | .583 | (−16.0) |
lexvec | .774 | .757 | (−1.7) |
w2v–cbow | .809 | .777 | (−3.2) |
w2v–sg | .821 | .784 | (−3.7) |
Model (PT-comp) . | ρbase . | ρwin=2+2 . | Difference (%) . |
PPMI–SVD | .530 | .446 | (−8.4) |
PPMI–thresh | .602 | .561 | (−4.1) |
lexvec | .570 | .564 | (−0.6) |
Model (FR-comp) . | ρbase . | ρwin=2+2 . | Difference (%) . |
---|---|---|---|
PPMI–SVD | .584 | .397 | (−18.7) |
PPMI–thresh | .702 | .678 | (−2.4) |
glove | .680 | .657 | (−2.3) |
lexvec | .677 | .671 | (−0.6) |
w2v–cbow | .660 | .644 | (−1.6) |
w2v–sg | .672 | .639 | (−3.3) |
Model (Reddy) . | ρbase . | ρwin=2+2 . | Difference (%) . |
PPMI–SVD | .743 | .583 | (−16.0) |
lexvec | .774 | .757 | (−1.7) |
w2v–cbow | .809 | .777 | (−3.2) |
w2v–sg | .821 | .784 | (−3.7) |
Model (PT-comp) . | ρbase . | ρwin=2+2 . | Difference (%) . |
PPMI–SVD | .530 | .446 | (−8.4) |
PPMI–thresh | .602 | .561 | (−4.1) |
lexvec | .570 | .564 | (−0.6) |
B.4 Higher Number of Dimensions
As seen in Section 7.2, some DSMs obtain better results when moving from 250 to 500 dimensions, and this trend continues when moving to 750 dimensions. This behavior is notably stronger for PPMI–thresh, which suggests that an even higher number of dimensions could have better predictive power.
Table 15 presents the result of running PPMI–thresh for increasing values of of the Dimension parameter. The baseline configuration (indicated as ★ in Table 15) was the highest-scoring configuration found in Section 7.2: lemmaPoS.W1.d750 for PT-comp and FR-comp, and surface.W8.d750 for Reddy. As seen in Section 7.2, results for 250 and 500 dimensions have lower scores than the results for 750 dimensions. Results for 1,000 dimensions were mixed: they are slightly worse for FR-comp and EN-comp, and slightly better for PT-comp. Increasing the number of dimensions generates models that are progressively worse. These results suggest that the maximum vector quality is achieved between 750 and 1,000 dimensions.
Model (FR-comp) . | ρdim=X . | Difference (%) . |
---|---|---|
dim = 250 | .671 | (−3.1) |
dim = 500 | .695 | (−0.7) |
dim = 750 | .702★ | (0.0) |
dim = 1,000 | .694 | (−0.8) |
dim = 2,000 | .645 | (−5.8) |
dim = 5,000 | .636 | (−6.7) |
dim = 30,000 | .552 | (−15.1) |
dim = 999,999 | .539 | (−16.3) |
Model (Reddy) . | ρdim=X . | Difference (%) . |
dim = 250 | .764 | (−2.7) |
dim = 500 | .782 | (−1.0) |
dim = 750 | .791★ | (0.0) |
dim = 1,000 | .784 | (−0.7) |
dim = 2,000 | .760 | (−3.1) |
dim = 5,000 | .744 | (−4.7) |
dim = 30,000 | .700 | (−9.1) |
dim = 999,999 | .566 | (−22.5) |
Model (PT-comp) . | ρdim=X . | Difference (%) . |
dim = 250 | .543 | (−5.9) |
dim = 500 | .546 | (−5.6) |
dim = 750 | .602★ | (0.0) |
dim = 1,000 | .609 | (+0.7) |
dim = 2,000 | .601 | (−0.1) |
dim = 5,000 | .505 | (−9.7) |
dim = 30,000 | .532 | (−7.0) |
dim = 999,999 | .500 | (−10.2) |
Model (FR-comp) . | ρdim=X . | Difference (%) . |
---|---|---|
dim = 250 | .671 | (−3.1) |
dim = 500 | .695 | (−0.7) |
dim = 750 | .702★ | (0.0) |
dim = 1,000 | .694 | (−0.8) |
dim = 2,000 | .645 | (−5.8) |
dim = 5,000 | .636 | (−6.7) |
dim = 30,000 | .552 | (−15.1) |
dim = 999,999 | .539 | (−16.3) |
Model (Reddy) . | ρdim=X . | Difference (%) . |
dim = 250 | .764 | (−2.7) |
dim = 500 | .782 | (−1.0) |
dim = 750 | .791★ | (0.0) |
dim = 1,000 | .784 | (−0.7) |
dim = 2,000 | .760 | (−3.1) |
dim = 5,000 | .744 | (−4.7) |
dim = 30,000 | .700 | (−9.1) |
dim = 999,999 | .566 | (−22.5) |
Model (PT-comp) . | ρdim=X . | Difference (%) . |
dim = 250 | .543 | (−5.9) |
dim = 500 | .546 | (−5.6) |
dim = 750 | .602★ | (0.0) |
dim = 1,000 | .609 | (+0.7) |
dim = 2,000 | .601 | (−0.1) |
dim = 5,000 | .505 | (−9.7) |
dim = 30,000 | .532 | (−7.0) |
dim = 999,999 | .500 | (−10.2) |
B.5 Random Initialization
The word vectors generated by the glove and w2v models have some level of non-determinism caused by random initialization and random sampling techniques. A reasonable concern would be whether the results presented for different parameter variations are close enough to the scores obtained by an average model. To assess the variability of these models, we evaluated three different runs of every DSM configuration (the original execution ρ1, used elsewhere in this article, along with two other executions ρ2 and ρ3) for glove, w2v–cbow, and w2v–sg. We then calculate the average ρavg of these three executions for every model.
Table 16 reports the highest-Spearman configurations of ρavg for the Reddy and EN-comp data sets. When comparing ρavg to the results of the original execution ρ1, we see that the variability in the different executions of the same configuration is minimal. This is further confirmed by the low sample standard deviation48 obtained from the scores of the three executions. Given the high stability of these models, results in the rest of the article were calculated and reported as ρ1 for all data sets.
Data set . | DSM . | configuration . | ρ1 . | ρ2 . | ρ3 . | ρavg . | stddev . |
---|---|---|---|---|---|---|---|
Reddy | glove | lemmaPoS.W8.d250 | .759 | .760 | .753 | .757 | .004 |
w2v–cbow | surface.W1.d500 | .796 | .807 | .799 | .801 | .006 | |
w2v–sg | surface.W1.d750 | .812 | .788 | .812 | .804 | .014 | |
EN-comp | glove | lemmaPoS.W8.d500 | .651 | .646 | .650 | .649 | .003 |
w2v–cbow | surface+.W1.d750 | .730 | .732 | .728 | .730 | .002 | |
w2v–sg | surface+.W1.d750 | .741 | .732 | .721 | .731 | .010 |
Data set . | DSM . | configuration . | ρ1 . | ρ2 . | ρ3 . | ρavg . | stddev . |
---|---|---|---|---|---|---|---|
Reddy | glove | lemmaPoS.W8.d250 | .759 | .760 | .753 | .757 | .004 |
w2v–cbow | surface.W1.d500 | .796 | .807 | .799 | .801 | .006 | |
w2v–sg | surface.W1.d750 | .812 | .788 | .812 | .804 | .014 | |
EN-comp | glove | lemmaPoS.W8.d500 | .651 | .646 | .650 | .649 | .003 |
w2v–cbow | surface+.W1.d750 | .730 | .732 | .728 | .730 | .002 | |
w2v–sg | surface+.W1.d750 | .741 | .732 | .721 | .731 | .010 |
B.6 Data Filtering
Along with the verification of parameters, we also evaluate whether data set variations could yield better results. In particular, we consider the use of filtering techniques, which are used in the literature as a method of guaranteeing data set quality. As per Roller, Schulte im Walde, and Scheible (2013), we consider two strategies of data removal: (1) removing individual outlier compositionality judgments through z-score filtering; and (2) removing all annotations from outlier human judges. A compositionality judgment is considered an outlier if it stands at more than z standard deviations away from the mean; a human judge is deemed an outlier if its Spearman correlation to the average of the others ρoth is lower than a given threshold R.49 These methods allow us to remove accidentally erroneous annotations, as well as annotators whose response deviated too much from the mean (in particular, spammers and non-native speakers).
Table 17 presents the evaluation of raw and filtered data sets regarding two quality measures: The average of the standard deviations for all NCs (); and the proportion of NCs in the data set whose standard deviation is higher than 1.5 (Pσ>1.5), as per Reddy, McCarthy, and Manandhar (2011). The results suggest that filtering techniques can improve the overall quality of the data sets, as seen in the reduction of the proportion of NCs with high standard deviation, as well as in the reduction of the average standard deviation itself. We additionally present the data retention rate (DRR), which is the proportion of NCs that remained in the data set after filtering. While the DRR does indicate a reduction in the amount of data, this reduction may be considered acceptable in light of the improvement suggested by the quality measures.
Data set . | . | Pσ>1.5 . | DRR . | ||
---|---|---|---|---|---|
raw . | filtered . | raw . | filtered . | ||
FR-comp | 1.15 | 0.94 | 22.78% | 13.89% | 87.34% |
PT-comp | 1.22 | 1.00 | 14.44% | 6.11% | 87.81% |
EN-comp90 | 1.17 | 0.87 | 18.89% | 3.33% | 83.61% |
Reddy | 0.99 | — | 5.56% | — | — |
Data set . | . | Pσ>1.5 . | DRR . | ||
---|---|---|---|---|---|
raw . | filtered . | raw . | filtered . | ||
FR-comp | 1.15 | 0.94 | 22.78% | 13.89% | 87.34% |
PT-comp | 1.22 | 1.00 | 14.44% | 6.11% | 87.81% |
EN-comp90 | 1.17 | 0.87 | 18.89% | 3.33% | 83.61% |
Reddy | 0.99 | — | 5.56% | — | — |
On a more detailed analysis, we have verified that the improvement in these quality measures is heavily tied to the use of z-score filtering, with similar results obtained when it is considered alone. The application of R-filtering by itself, on the other hand, did not show any noticeable improvement in the quality measures for reasonable amounts of DRR. This is the opposite from what was found by Roller, Schulte im Walde, and Scheible (2013) on their German data set, where only R-filtering was found to improve results under these quality measures. We present our findings in more detail in Ramisch, Cordeiro, and Villavicencio (2016).
We then consider whether filtering can have an impact on the performance of predicted compositionality scores. For each of the 228 model configurations that were constructed for each language, we launched an evaluation on the filtered EN-comp90, FR-comp, and PT-comp data sets (using z-score filtering only, as it was responsible for most of the improvement in quality measures). Overall, no improvement was observed in the results of the prediction (values of Spearman ρ) when we compare raw and filtered data sets. Looking more specifically at the best configurations for each DSM (see Table 18), we can see that most results do not significantly change when the evaluation is performed on the raw or filtered data sets. This suggests that the amount of judgments collected for each compound greatly offsets any irregularity caused by outliers, making the use of filtering techniques superfluous.
Data set . | EN-comp90 . | FR-comp . | PT-comp . | |||
---|---|---|---|---|---|---|
raw . | filtered . | raw . | filtered . | raw . | filtered . | |
PPMI –SVD | .604 | .601 | .584 | .579 | .530 | .526 |
PPMI–TopK | .564 | .571 | .550 | .545 | .519 | .516 |
PPMI–thresh | .602 | .607 | .702 | .700 | .602 | .601 |
glove | .538 | .544 | .680 | .676 | .555 | .552 |
lexvec | .567 | .572 | .677 | .676 | .570 | .568 |
w2v–cbow | .669 | .665 | .651 | .651 | .588 | .587 |
w2v–sg | .665 | .661 | .653 | .654 | .586 | .584 |
Data set . | EN-comp90 . | FR-comp . | PT-comp . | |||
---|---|---|---|---|---|---|
raw . | filtered . | raw . | filtered . | raw . | filtered . | |
PPMI –SVD | .604 | .601 | .584 | .579 | .530 | .526 |
PPMI–TopK | .564 | .571 | .550 | .545 | .519 | .516 |
PPMI–thresh | .602 | .607 | .702 | .700 | .602 | .601 |
glove | .538 | .544 | .680 | .676 | .555 | .552 |
lexvec | .567 | .572 | .677 | .676 | .570 | .568 |
w2v–cbow | .669 | .665 | .651 | .651 | .588 | .587 |
w2v–sg | .665 | .661 | .653 | .654 | .586 | .584 |
Appendix C. Questionnaire
The questionnaire was structured in five subtasks, presented to the annotators through these instructions:
- 1.
Read the compound itself.
- 2.
Read 3 sentences containing the compound.
- 3.
Provide 2 to 3 synonym expressions for the target compound seen in the sentences, preferably involving one of the words in the compound. We ask annotators to prioritize short expressions, with 1 to 3 words each, and to try to include the MWE components in their reply (eliciting a paraphrase).
- 4.
Using a Likert scale from 0 to 5, judge how much of the meaning of the compound comes from the modifier and the head separately. Figure 11 shows an example for the judgment of the head.
- 5.
Using a Likert scale from 0 to 5, judge how much of the meaning of the compound comes from its components.
We require answers in an even-numbered scale (there are 6 possibilities between 0 and 5), as otherwise the participants could be biased toward the middle score. In order to help participants visualize the meaning of their reply, whenever their mouse hovers over a particular score, we present a guiding tooltip, as can be seen in Figure 11.
The order of subtasks has also been taken into account. During a pilot test, we found that presenting the multiple-choice questions (subtasks 4–5) before asking for synonyms (subtask 3) yielded lower agreement, as users were often less self-consistent in the multiple-choice questions (e.g., replying “non-compositional” for subtask 4 but “compositional” for subtask 5), even if they carefully selected their synonyms in response to subtask 3.
The request for synonyms before the multiple-choice questions prompts the participants to focus on the meaning of the compound. These synonyms can then also be taken into account when considering the semantic contribution of each element of the compound—we leave this for future work.
Appendix D. List of English Compounds
We present below the 90 nominal compounds in EN-comp90 and the 100 nominal compounds in EN-compExt, along with their human-rated compositionality scores. We refer to Reddy, McCarthy, and Manandhar (2011) for the other 90 compounds belonging to Reddy which, together with the former two sets, represent 280 nominal compounds in total.
D.1 Compounds in EN-comp90
Compounds . | hcHM . |
---|---|
ancient history | 1.95 |
armchair critic | 1.33 |
baby buggy | 3.94 |
bad hat | 0.62 |
benign tumour | 4.69 |
big fish | 0.85 |
birth rate | 4.60 |
black cherry | 3.11 |
bow tie | 4.25 |
brain teaser | 2.65 |
busy bee | 0.88 |
carpet bombing | 1.24 |
cellular phone | 3.78 |
close call | 1.59 |
closed book | 0.68 |
computer program | 4.50 |
con artist | 2.10 |
cooking stove | 4.68 |
cotton candy | 1.79 |
critical review | 4.06 |
dead end | 1.32 |
dirty money | 2.21 |
dirty word | 2.48 |
disc jockey | 1.25 |
divine service | 3.11 |
dry land | 3.95 |
dry wall | 3.33 |
dust storm | 3.85 |
eager beaver | 0.36 |
economic aid | 4.33 |
elbow grease | 0.56 |
elbow room | 0.61 |
entrance hall | 4.17 |
eternal rest | 3.25 |
fish story | 1.68 |
flower child | 0.50 |
food market | 3.82 |
foot soldier | 1.95 |
front man | 1.64 |
goose egg | 0.48 |
grey matter | 2.39 |
guinea pig | 0.45 |
half sister | 2.84 |
half wit | 1.16 |
health check | 4.17 |
high life | 1.67 |
inner circle | 1.56 |
inner product | 3.00 |
insane asylum | 3.95 |
insurance company | 5.00 |
insurance policy | 4.15 |
iron collar | 3.88 |
labour union | 4.76 |
life belt | 2.84 |
life vest | 3.44 |
lime tree | 4.61 |
loan shark | 1.00 |
loose woman | 2.53 |
mail service | 4.69 |
market place | 3.00 |
mental disorder | 4.89 |
middle school | 3.84 |
milk tooth | 1.43 |
mother tongue | 0.59 |
narrow escape | 1.75 |
net income | 2.94 |
news agency | 4.39 |
noble gas | 1.18 |
nut case | 0.44 |
old flame | 0.58 |
old hat | 0.35 |
old timer | 0.89 |
phone book | 4.25 |
pillow slip | 3.70 |
pocket book | 1.42 |
prison guard | 4.89 |
prison term | 4.79 |
private eye | 0.82 |
record book | 3.70 |
research lab | 4.75 |
sex bomb | 0.53 |
silver lining | 0.35 |
sound judgement | 3.39 |
sparkling water | 3.14 |
street girl | 3.16 |
subway system | 4.63 |
tennis elbow | 2.50 |
top dog | 1.05 |
wet blanket | 0.21 |
word painting | 1.62 |
Compounds . | hcHM . |
---|---|
ancient history | 1.95 |
armchair critic | 1.33 |
baby buggy | 3.94 |
bad hat | 0.62 |
benign tumour | 4.69 |
big fish | 0.85 |
birth rate | 4.60 |
black cherry | 3.11 |
bow tie | 4.25 |
brain teaser | 2.65 |
busy bee | 0.88 |
carpet bombing | 1.24 |
cellular phone | 3.78 |
close call | 1.59 |
closed book | 0.68 |
computer program | 4.50 |
con artist | 2.10 |
cooking stove | 4.68 |
cotton candy | 1.79 |
critical review | 4.06 |
dead end | 1.32 |
dirty money | 2.21 |
dirty word | 2.48 |
disc jockey | 1.25 |
divine service | 3.11 |
dry land | 3.95 |
dry wall | 3.33 |
dust storm | 3.85 |
eager beaver | 0.36 |
economic aid | 4.33 |
elbow grease | 0.56 |
elbow room | 0.61 |
entrance hall | 4.17 |
eternal rest | 3.25 |
fish story | 1.68 |
flower child | 0.50 |
food market | 3.82 |
foot soldier | 1.95 |
front man | 1.64 |
goose egg | 0.48 |
grey matter | 2.39 |
guinea pig | 0.45 |
half sister | 2.84 |
half wit | 1.16 |
health check | 4.17 |
high life | 1.67 |
inner circle | 1.56 |
inner product | 3.00 |
insane asylum | 3.95 |
insurance company | 5.00 |
insurance policy | 4.15 |
iron collar | 3.88 |
labour union | 4.76 |
life belt | 2.84 |
life vest | 3.44 |
lime tree | 4.61 |
loan shark | 1.00 |
loose woman | 2.53 |
mail service | 4.69 |
market place | 3.00 |
mental disorder | 4.89 |
middle school | 3.84 |
milk tooth | 1.43 |
mother tongue | 0.59 |
narrow escape | 1.75 |
net income | 2.94 |
news agency | 4.39 |
noble gas | 1.18 |
nut case | 0.44 |
old flame | 0.58 |
old hat | 0.35 |
old timer | 0.89 |
phone book | 4.25 |
pillow slip | 3.70 |
pocket book | 1.42 |
prison guard | 4.89 |
prison term | 4.79 |
private eye | 0.82 |
record book | 3.70 |
research lab | 4.75 |
sex bomb | 0.53 |
silver lining | 0.35 |
sound judgement | 3.39 |
sparkling water | 3.14 |
street girl | 3.16 |
subway system | 4.63 |
tennis elbow | 2.50 |
top dog | 1.05 |
wet blanket | 0.21 |
word painting | 1.62 |
D.2 Compounds in EN-compExt
Compounds . | hcHM . |
---|---|
academy award | 3.52 |
arcade game | 3.80 |
baby blues | 2.88 |
backroom boy | 1.48 |
bad apple | 1.13 |
banana republic | 0.86 |
bankruptcy proceeding | 4.78 |
basket case | 0.42 |
beauty sleep | 2.96 |
best man | 3.12 |
big cheese | 0.36 |
big picture | 1.48 |
big wig | 0.60 |
biological clock | 2.42 |
black box | 1.29 |
black operation | 1.39 |
blind alley | 1.14 |
blood bath | 1.38 |
blue blood | 0.58 |
blue print | 1.04 |
box office | 0.88 |
brain drain | 2.08 |
bull market | 1.23 |
cable car | 2.68 |
calendar month | 4.23 |
civil marriage | 3.13 |
cocoa butter | 3.23 |
computer expert | 4.46 |
contact lenses | 3.64 |
copy cat | 0.74 |
crime rate | 4.39 |
damp squib | 0.95 |
dark horse | 0.65 |
day shift | 4.54 |
disability insurance | 4.45 |
double cross | 1.14 |
double dutch | 0.29 |
double whammy | 2.48 |
dream ticket | 1.32 |
dutch courage | 1.00 |
fair play | 2.59 |
fairy tale | 1.68 |
fall guy | 1.36 |
field work | 2.10 |
football season | 4.04 |
fresh water | 4.20 |
freudian slip | 2.35 |
ghost town | 1.50 |
glass ceiling | 0.81 |
grass root | 0.86 |
hard drive | 2.17 |
hard shoulder | 1.52 |
head hunter | 1.50 |
health care | 4.47 |
heavy cross | 1.17 |
hen party | 1.05 |
home run | 2.86 |
honey trap | 1.22 |
hot potato | 0.56 |
incubation period | 3.92 |
information age | 3.40 |
injury time | 3.20 |
insider trading | 3.88 |
jet lag | 2.67 |
job fair | 3.50 |
leap year | 2.38 |
love song | 4.58 |
low profile | 2.10 |
marketing consultant | 4.00 |
medical procedure | 4.83 |
music festival | 4.58 |
music journalist | 4.54 |
noise complaint | 4.52 |
pain killer | 2.17 |
peace conference | 4.46 |
peace talk | 4.13 |
pipe dream | 0.91 |
poison pill | 0.96 |
radioactive material | 4.61 |
radioactive waste | 4.58 |
rainy season | 4.23 |
rice paper | 4.00 |
shelf life | 1.30 |
skin tone | 3.88 |
smoke screen | 1.11 |
social insurance | 2.83 |
speed trap | 3.71 |
stag night | 1.44 |
sugar daddy | 0.44 |
tear gas | 3.27 |
time difference | 4.41 |
traffic control | 3.69 |
traffic jam | 3.62 |
travel guide | 4.38 |
wedding anniversary | 4.86 |
wedding day | 4.94 |
white noise | 1.17 |
white spirit | 1.31 |
winter solstice | 4.55 |
world conference | 3.96 |
Compounds . | hcHM . |
---|---|
academy award | 3.52 |
arcade game | 3.80 |
baby blues | 2.88 |
backroom boy | 1.48 |
bad apple | 1.13 |
banana republic | 0.86 |
bankruptcy proceeding | 4.78 |
basket case | 0.42 |
beauty sleep | 2.96 |
best man | 3.12 |
big cheese | 0.36 |
big picture | 1.48 |
big wig | 0.60 |
biological clock | 2.42 |
black box | 1.29 |
black operation | 1.39 |
blind alley | 1.14 |
blood bath | 1.38 |
blue blood | 0.58 |
blue print | 1.04 |
box office | 0.88 |
brain drain | 2.08 |
bull market | 1.23 |
cable car | 2.68 |
calendar month | 4.23 |
civil marriage | 3.13 |
cocoa butter | 3.23 |
computer expert | 4.46 |
contact lenses | 3.64 |
copy cat | 0.74 |
crime rate | 4.39 |
damp squib | 0.95 |
dark horse | 0.65 |
day shift | 4.54 |
disability insurance | 4.45 |
double cross | 1.14 |
double dutch | 0.29 |
double whammy | 2.48 |
dream ticket | 1.32 |
dutch courage | 1.00 |
fair play | 2.59 |
fairy tale | 1.68 |
fall guy | 1.36 |
field work | 2.10 |
football season | 4.04 |
fresh water | 4.20 |
freudian slip | 2.35 |
ghost town | 1.50 |
glass ceiling | 0.81 |
grass root | 0.86 |
hard drive | 2.17 |
hard shoulder | 1.52 |
head hunter | 1.50 |
health care | 4.47 |
heavy cross | 1.17 |
hen party | 1.05 |
home run | 2.86 |
honey trap | 1.22 |
hot potato | 0.56 |
incubation period | 3.92 |
information age | 3.40 |
injury time | 3.20 |
insider trading | 3.88 |
jet lag | 2.67 |
job fair | 3.50 |
leap year | 2.38 |
love song | 4.58 |
low profile | 2.10 |
marketing consultant | 4.00 |
medical procedure | 4.83 |
music festival | 4.58 |
music journalist | 4.54 |
noise complaint | 4.52 |
pain killer | 2.17 |
peace conference | 4.46 |
peace talk | 4.13 |
pipe dream | 0.91 |
poison pill | 0.96 |
radioactive material | 4.61 |
radioactive waste | 4.58 |
rainy season | 4.23 |
rice paper | 4.00 |
shelf life | 1.30 |
skin tone | 3.88 |
smoke screen | 1.11 |
social insurance | 2.83 |
speed trap | 3.71 |
stag night | 1.44 |
sugar daddy | 0.44 |
tear gas | 3.27 |
time difference | 4.41 |
traffic control | 3.69 |
traffic jam | 3.62 |
travel guide | 4.38 |
wedding anniversary | 4.86 |
wedding day | 4.94 |
white noise | 1.17 |
white spirit | 1.31 |
winter solstice | 4.55 |
world conference | 3.96 |
Appendix E. List of French Compounds
We present below the 180 nominal compounds in FR-comp, along with their human-rated compositionality scores.
Compounds . | hcHM . | Translation (Gloss) . |
---|---|---|
activité physique | 4.93 | ‘physical activity’ (lit. activity physical) |
année scolaire | 3.60 | ‘school year’ (lit. year scholar) |
art contemporain | 4.60 | ‘contemporary art’ (lit. art contemporary) |
baie vitrée | 3.64 | ‘open glass window’ (lit. opening glassy) |
bas côté | 1.31 | ‘aisle/roadside’ (lit. low side) |
beau frère | 0.67 | ‘brother-in-law’ (lit. beautiful brother) |
beau père | 1.18 | ‘father-in-law’ (lit. beautiful father) |
belle mère | 0.80 | ‘mother-in-law’ (lit. beautiful mother) |
berger allemand | 1.29 | ‘German shepherd’ (lit. shepherd German) |
bon sens | 3.57 | ‘common sense’ (lit. good sense) |
bon vent | 0.87 | ‘good luck’ (lit. good/fair wind) |
bon vivant | 2.57 | ‘bon vivant’ (lit. good living) |
bonne humeur | 4.53 | ‘good mood’ (lit. good mood) |
bonne poire | 0.42 | ‘sucker, soft touch’ (lit. good pear) |
bonne pratique | 4.47 | ‘good practice’ (lit. good practice) |
bouc émissaire | 0.23 | ‘scapegoat’ (lit. goat emissary) |
bras cassé | 0.57 | ‘lame duck’ (lit. arm broken) |
bras droit | 0.40 | ‘most important helper/assistant’ (lit. arm right) |
brebis galeuse | 0.55 | ‘black sheep’ (lit. sheep scabby) |
carte blanche | 0.20 | ‘carte blanche’ (lit. card white) |
carte bleue | 1.94 | ‘bank card’ (lit. card blue) |
carte grise | 3.08 | ‘vehicle registration’ (lit. card grey) |
carte vitale | 1.70 | ‘healthcare card’ (lit. card vital) |
carton plein | 0.78 | ‘clean sweep’ (lit. cardboard full) |
casque bleu | 1.85 | ‘UN peacekeeper’ (lit. helmet blue) |
centre commercial | 3.93 | ‘shopping center’ (lit. center commercial) |
cercle vicieux | 2.15 | ‘vicious circle’ (lit. circle vicious) |
cerf volant | 0.64 | ‘kite’ (lit. deer flying) |
chambre froide | 4.27 | ‘cold chamber’ (lit. chamber cold) |
changement climatique | 4.79 | ‘climate change’ (lit. change climatic) |
chapeau bas | 0.64 | ‘bravo’ (lit. hat low) |
charge sociale | 3.00 | ‘social security contribution’ (lit. charge social) |
chauve souris | 0.33 | ‘bat’ (lit. bald mouse) |
chute libre | 3.64 | ‘free fall’ (lit. fall free) |
club privé | 4.58 | ‘private club (sexual connotation)’ (lit. club private) |
coffre fort | 3.67 | ‘safe, vault’ (lit. chest/box strong) |
communauté urbaine | 4.57 | ‘urban community’ (lit. community urban) |
conseil municipal | 4.00 | ‘city council’ (lit. council municipal) |
coup dur | 2.40 | ‘problem, difficulty’ (lit. blow hard) |
coup franc | 1.71 | ‘free kick (soccer)’ (lit. blow free/frank) |
courrier électronique | 4.57 | ‘e-mail’ (lit. mail electronic) |
court circuit | 1.69 | ‘short circuit’ (lit. short circuit) |
court métrage | 2.36 | ‘short film’ (lit. short length) |
crème fraîche | 3.73 | ‘French sour cream’ (lit. cream fresh) |
crème glacée | 4.75 | ‘ice cream’ (lit. cream icy) |
dernier cri | 0.67 | ‘something trendy’ (lit. last scream) |
dernier mot | 3.09 | ‘final say’ (lit. last word) |
directeur général | 3.87 | ‘chief executive officer’ (lit. director general) |
disque dur | 2.83 | ‘hard drive’ (lit. disk hard) |
douche froide | 1.18 | ‘damper/frustration’ (lit. shower cold) |
droit fondamental | 4.27 | ‘fundamental right’ (lit. right fundamental) |
développement économique | 4.46 | ‘economic development’ (lit. development economic) |
eau chaude | 5.00 | ‘hot water’ (lit. water hot) |
eau douce | 2.33 | ‘fresh water’ (lit. water soft/sweet) |
eau minérale | 4.00 | ‘mineral water’ (lit. water mineral) |
eau potable | 5.00 | ‘drinking water’ (lit. water potable) |
eau vive | 3.44 | ‘jellyfish’ (lit. water lively) |
eau forte | 0.90 | ‘etching’ (lit. water strong) |
eaux usées | 4.54 | ‘sewage’ (lit. waters used) |
effet spécial | 3.67 | ‘special effect’ (lit. effect special) |
expérience professionnelle | 4.86 | ‘professional experience’ (lit. experience professional) |
fait divers | 3.69 | ‘news story’ (lit. fact diverse) |
famille nombreuse | 4.90 | ‘large family’ (lit. family numerous) |
faux ami | 1.25 | ‘false friend’ (lit. false friend) |
faux cul | 0.31 | ‘hypocrite’ (lit. false arse) |
faux pas | 1.82 | ‘blunder’ (lit. false step) |
faux semblant | 3.57 | ‘false pretence’ (lit. false appearance) |
feu rouge | 2.60 | ‘red traffic light’ (lit. fire red) |
feu vert | 0.71 | ‘green light, permission’ (lit. fire green) |
fil conducteur | 1.25 | ‘underlying theme’ (lit. thread conductor) |
fleur bleue | 0.45 | ‘sentimental’ (lit. flower blue) |
foie gras | 4.54 | ‘foie gras’ (lit. liver fatty) |
fou rire | 2.33 | ‘giggle’ (lit. crazy laughter) |
grand air | 1.33 | ‘outdoors’ (lit. big air) |
grand jour | 1.07 | ‘broad daylight’ (lit. big day) |
grand saut | 2.17 | ‘move forward’ (lit. big leap) |
grand écran | 3.14 | ‘silver screen’ (lit. big screen) |
grande entreprise | 4.54 | ‘big company’ (lit. big company) |
grande surface | 3.14 | ‘department store’ (lit. big surface) |
grippe aviaire | 3.58 | ‘avian flu’ (lit. flu avian) |
gros mot | 1.40 | ‘swearword’ (lit. large/fat word) |
gros plan | 1.87 | ‘close-up’ (lit. large/fat plan) |
guerre civile | 3.43 | ‘civil war’ (lit. war civil) |
haut parleur | 1.83 | ‘loudspeaker’ (lit. loud/high speaker) |
haute mer | 2.54 | ‘high seas’ (lit. high sea) |
haute montagne | 4.13 | ‘high mountains’ (lit. high mountain) |
heure supplémentaire | 4.00 | ‘overtime hour’ (lit. hour extra) |
huile essentielle | 2.25 | ‘essential oil’ (lit. oil essential) |
idée reçue | 2.90 | ‘popular belief’ (lit. idea received) |
insertion professionnelle | 4.27 | ‘professional integration, employability’ (lit. insertion professional) |
intérêt général | 4.36 | ‘general interest’ (lit. interest general) |
jeune fille | 4.64 | ‘young girl, maiden’ (lit. young girl) |
journal officiel | 4.50 | ‘official gazette’ (lit. newspaper official) |
langue française | 4.85 | ‘French language’ (lit. language French) |
marée noire | 3.00 | ‘oil spill’ (lit. tide black) |
match nul | 2.46 | ‘draw, stalemate’ (lit. match null) |
matière grasse | 5.00 | ‘fat’ (lit. matter greasy) |
matière grise | 2.15 | ‘grey matter’ (lit. matter grey) |
matière première | 2.90 | ‘raw material’ (lit. matter primary) |
mauvaise foi | 2.38 | ‘bad faith’ (lit. bad faith) |
mauvaise langue | 2.21 | ‘gossiper’ (lit. bad tongue) |
montagnes russes | 1.08 | ‘roller coaster’ (lit. mountains Russian) |
monument historique | 4.79 | ‘historical monument’ (lit. monument historical) |
mort né | 3.23 | ‘stillborn’ (lit. dead born) |
nouveau monde | 2.73 | ‘New World, Americas’ (lit. new world) |
nuit blanche | 1.07 | ‘sleepless night’ (lit. night white) |
numéro vert | 1.50 | ‘toll-free number’ (lit. number green) |
ordure ménagère | 4.20 | ‘household waste’ (lit. garbage domestic) |
organisation syndicale | 4.90 | ‘trade union’ (lit. organisation of-trade-union) |
pages jaunes | 3.00 | ‘yellow pages’ (lit. pages yellow) |
parachute doré | 0.50 | ‘golden parachute’ (lit. parachute golden) |
parc naturel | 4.33 | ‘nature park’ (lit. park natural) |
parti politique | 4.88 | ‘political party’ (lit. party political) |
parti pris | 2.69 | ‘bias’ (lit. party taken) |
partie fine | 0.80 | ‘orgy’ (lit. party fine/delicate) |
petit ami | 0.86 | ‘boyfriend’ (lit. small friend) |
petit beurre | 1.64 | ‘butter biscuit’ (lit. small butter) |
petit déjeuner | 2.27 | ‘breakfast’ (lit. small lunch) |
petit joueur | 1.00 | ‘amateur’ (lit. small player) |
petit pois | 4.14 | ‘pea’ (lit. small pea) |
petit salé | 1.15 | ‘salted pork’ (lit. small salty) |
petit écran | 2.50 | ‘television’ (lit. small screen) |
petit enfant | 2.79 | ‘grandchild’ (lit. small child) |
petit four | 0.92 | ‘type of pastry’ (lit. small oven) |
petit nègre | 0.50 | ‘pidgin or ‘badly spoken’ French’ (lit. little black-person) |
petite annonce | 2.69 | ‘classified ad’ (lit. small announcement) |
petite nature | 0.47 | ‘sensitive/fragile person’ (lit. small nature) |
pied noir | 0.13 | ‘French expats from Algeria’ (lit. foot black) |
pièce montée | 2.47 | ‘tiered cake’ (lit. piece assembled) |
pleine lune | 3.54 | ‘full moon’ (lit.full moon) |
poids lourd | 2.08 | ‘truck’ (lit. weight heavy) |
point faible | 2.46 | ‘weak point’ (lit. point weak) |
point mort | 1.00 | ‘standstill’ (lit. point dead) |
pot pourri | 0.40 | ‘medley’ (lit. pot/jar rotten) |
poule mouillée | 0.00 | ‘coward’ (lit. chicken wet) |
poupée russe | 3.75 | ‘Russian nesting doll’ (lit. doll Russian) |
premier ministre | 3.67 | ‘prime minister’ (lit. first minister) |
premier plan | 2.82 | ‘foreground’ (lit. first plan) |
première dame | 1.92 | ‘first lady’ (lit. first lady) |
prince charmant | 2.00 | ‘prince charming’ (lit. prince charming) |
prévision météorologique | 4.70 | ‘weather forecast’ (lit. forecast meteorological) |
recherche scientifique | 4.92 | ‘scientific research’ (lit. research scientific) |
ressources humaines | 3.91 | ‘human resources’ (lit. resources human) |
rond point | 3.18 | ‘roundabout’ (lit. round point) |
roulette russe | 0.87 | ‘Russian roulette’ (lit. roulette Russian) |
réchauffement climatique | 4.40 | ‘global warming’ (lit. warming climatic) |
région parisienne | 4.43 | ‘Paris region’ (lit. region Parisian) |
réseau social | 4.09 | ‘social network’ (lit. network social) |
sang froid | 0.47 | ‘cold blood, self-control’ (lit. blood cold) |
second degré | 1.40 | ‘irony, tongue-in-cheek’ (lit. second degree) |
second rôle | 3.64 | ‘supporting role’ (lit. second role) |
septième ciel | 0.21 | ‘cloud nine’ (lit. seventh heaven) |
service public | 4.71 | ‘public service’ (lit. service public) |
site officiel | 4.85 | ‘official website’ (lit. website official) |
soirée privée | 4.53 | ‘private party’ (lit. party private) |
sucre roux | 4.31 | ‘brown sugar’ (lit. sugar ginger-colored) |
sécurité routière | 4.55 | ‘road safety’ (lit. safety of-road) |
sécurité sociale | 3.67 | ‘social security’ (lit. security social) |
table basse | 4.79 | ‘coffee table’ (lit. table low) |
table ronde | 1.46 | ‘round table, discussion’ (lit. table round) |
tapis rouge | 3.31 | ‘red carpet, luxurious welcoming’ (lit. carpet red) |
temps fort | 1.87 | ‘key moment, highlight’ (lit. time strong) |
temps mort | 2.07 | ‘wasted time, idleness’ (lit. time dead) |
temps partiel | 3.62 | ‘part-time (work)’ (lit. time partial) |
temps plein | 3.08 | ‘full-time (work)’ (lit. time full) |
temps réel | 3.00 | ‘real time’ (lit. time real) |
travaux publics | 4.09 | ‘public works’ (lit. works public) |
trou noir | 2.58 | ‘black hole’ (lit. hole black) |
trou normand | 0.78 | ‘palate cleanser’ (lit. hole Norman) |
téléphone arabe | 0.23 | ‘Chinese whispers’ (lit. telephone Arabic) |
téléphone portable | 5.00 | ‘cellphone’ (lit. telephone portable) |
valeur sûre | 3.64 | ‘safe bet’ (lit. value safe/sure) |
vie associative | 4.00 | ‘community life’ (lit. life associative) |
vie quotidienne | 4.31 | ‘everyday life’ (lit. life daily) |
vieille fille | 2.42 | ‘spinster’ (lit. old girl/maid) |
vin blanc | 3.80 | ‘white wine’ (lit. wine white) |
vin rouge | 4.69 | ‘red wine’ (lit. wine red) |
yeux rouges | 4.36 | ‘red eyes’ (lit. eyes red) |
école primaire | 3.92 | ‘primary school’ (lit. school primary) |
étoile filante | 3.20 | ‘shooting star’ (lit. star slipping) |
Compounds . | hcHM . | Translation (Gloss) . |
---|---|---|
activité physique | 4.93 | ‘physical activity’ (lit. activity physical) |
année scolaire | 3.60 | ‘school year’ (lit. year scholar) |
art contemporain | 4.60 | ‘contemporary art’ (lit. art contemporary) |
baie vitrée | 3.64 | ‘open glass window’ (lit. opening glassy) |
bas côté | 1.31 | ‘aisle/roadside’ (lit. low side) |
beau frère | 0.67 | ‘brother-in-law’ (lit. beautiful brother) |
beau père | 1.18 | ‘father-in-law’ (lit. beautiful father) |
belle mère | 0.80 | ‘mother-in-law’ (lit. beautiful mother) |
berger allemand | 1.29 | ‘German shepherd’ (lit. shepherd German) |
bon sens | 3.57 | ‘common sense’ (lit. good sense) |
bon vent | 0.87 | ‘good luck’ (lit. good/fair wind) |
bon vivant | 2.57 | ‘bon vivant’ (lit. good living) |
bonne humeur | 4.53 | ‘good mood’ (lit. good mood) |
bonne poire | 0.42 | ‘sucker, soft touch’ (lit. good pear) |
bonne pratique | 4.47 | ‘good practice’ (lit. good practice) |
bouc émissaire | 0.23 | ‘scapegoat’ (lit. goat emissary) |
bras cassé | 0.57 | ‘lame duck’ (lit. arm broken) |
bras droit | 0.40 | ‘most important helper/assistant’ (lit. arm right) |
brebis galeuse | 0.55 | ‘black sheep’ (lit. sheep scabby) |
carte blanche | 0.20 | ‘carte blanche’ (lit. card white) |
carte bleue | 1.94 | ‘bank card’ (lit. card blue) |
carte grise | 3.08 | ‘vehicle registration’ (lit. card grey) |
carte vitale | 1.70 | ‘healthcare card’ (lit. card vital) |
carton plein | 0.78 | ‘clean sweep’ (lit. cardboard full) |
casque bleu | 1.85 | ‘UN peacekeeper’ (lit. helmet blue) |
centre commercial | 3.93 | ‘shopping center’ (lit. center commercial) |
cercle vicieux | 2.15 | ‘vicious circle’ (lit. circle vicious) |
cerf volant | 0.64 | ‘kite’ (lit. deer flying) |
chambre froide | 4.27 | ‘cold chamber’ (lit. chamber cold) |
changement climatique | 4.79 | ‘climate change’ (lit. change climatic) |
chapeau bas | 0.64 | ‘bravo’ (lit. hat low) |
charge sociale | 3.00 | ‘social security contribution’ (lit. charge social) |
chauve souris | 0.33 | ‘bat’ (lit. bald mouse) |
chute libre | 3.64 | ‘free fall’ (lit. fall free) |
club privé | 4.58 | ‘private club (sexual connotation)’ (lit. club private) |
coffre fort | 3.67 | ‘safe, vault’ (lit. chest/box strong) |
communauté urbaine | 4.57 | ‘urban community’ (lit. community urban) |
conseil municipal | 4.00 | ‘city council’ (lit. council municipal) |
coup dur | 2.40 | ‘problem, difficulty’ (lit. blow hard) |
coup franc | 1.71 | ‘free kick (soccer)’ (lit. blow free/frank) |
courrier électronique | 4.57 | ‘e-mail’ (lit. mail electronic) |
court circuit | 1.69 | ‘short circuit’ (lit. short circuit) |
court métrage | 2.36 | ‘short film’ (lit. short length) |
crème fraîche | 3.73 | ‘French sour cream’ (lit. cream fresh) |
crème glacée | 4.75 | ‘ice cream’ (lit. cream icy) |
dernier cri | 0.67 | ‘something trendy’ (lit. last scream) |
dernier mot | 3.09 | ‘final say’ (lit. last word) |
directeur général | 3.87 | ‘chief executive officer’ (lit. director general) |
disque dur | 2.83 | ‘hard drive’ (lit. disk hard) |
douche froide | 1.18 | ‘damper/frustration’ (lit. shower cold) |
droit fondamental | 4.27 | ‘fundamental right’ (lit. right fundamental) |
développement économique | 4.46 | ‘economic development’ (lit. development economic) |
eau chaude | 5.00 | ‘hot water’ (lit. water hot) |
eau douce | 2.33 | ‘fresh water’ (lit. water soft/sweet) |
eau minérale | 4.00 | ‘mineral water’ (lit. water mineral) |
eau potable | 5.00 | ‘drinking water’ (lit. water potable) |
eau vive | 3.44 | ‘jellyfish’ (lit. water lively) |
eau forte | 0.90 | ‘etching’ (lit. water strong) |
eaux usées | 4.54 | ‘sewage’ (lit. waters used) |
effet spécial | 3.67 | ‘special effect’ (lit. effect special) |
expérience professionnelle | 4.86 | ‘professional experience’ (lit. experience professional) |
fait divers | 3.69 | ‘news story’ (lit. fact diverse) |
famille nombreuse | 4.90 | ‘large family’ (lit. family numerous) |
faux ami | 1.25 | ‘false friend’ (lit. false friend) |
faux cul | 0.31 | ‘hypocrite’ (lit. false arse) |
faux pas | 1.82 | ‘blunder’ (lit. false step) |
faux semblant | 3.57 | ‘false pretence’ (lit. false appearance) |
feu rouge | 2.60 | ‘red traffic light’ (lit. fire red) |
feu vert | 0.71 | ‘green light, permission’ (lit. fire green) |
fil conducteur | 1.25 | ‘underlying theme’ (lit. thread conductor) |
fleur bleue | 0.45 | ‘sentimental’ (lit. flower blue) |
foie gras | 4.54 | ‘foie gras’ (lit. liver fatty) |
fou rire | 2.33 | ‘giggle’ (lit. crazy laughter) |
grand air | 1.33 | ‘outdoors’ (lit. big air) |
grand jour | 1.07 | ‘broad daylight’ (lit. big day) |
grand saut | 2.17 | ‘move forward’ (lit. big leap) |
grand écran | 3.14 | ‘silver screen’ (lit. big screen) |
grande entreprise | 4.54 | ‘big company’ (lit. big company) |
grande surface | 3.14 | ‘department store’ (lit. big surface) |
grippe aviaire | 3.58 | ‘avian flu’ (lit. flu avian) |
gros mot | 1.40 | ‘swearword’ (lit. large/fat word) |
gros plan | 1.87 | ‘close-up’ (lit. large/fat plan) |
guerre civile | 3.43 | ‘civil war’ (lit. war civil) |
haut parleur | 1.83 | ‘loudspeaker’ (lit. loud/high speaker) |
haute mer | 2.54 | ‘high seas’ (lit. high sea) |
haute montagne | 4.13 | ‘high mountains’ (lit. high mountain) |
heure supplémentaire | 4.00 | ‘overtime hour’ (lit. hour extra) |
huile essentielle | 2.25 | ‘essential oil’ (lit. oil essential) |
idée reçue | 2.90 | ‘popular belief’ (lit. idea received) |
insertion professionnelle | 4.27 | ‘professional integration, employability’ (lit. insertion professional) |
intérêt général | 4.36 | ‘general interest’ (lit. interest general) |
jeune fille | 4.64 | ‘young girl, maiden’ (lit. young girl) |
journal officiel | 4.50 | ‘official gazette’ (lit. newspaper official) |
langue française | 4.85 | ‘French language’ (lit. language French) |
marée noire | 3.00 | ‘oil spill’ (lit. tide black) |
match nul | 2.46 | ‘draw, stalemate’ (lit. match null) |
matière grasse | 5.00 | ‘fat’ (lit. matter greasy) |
matière grise | 2.15 | ‘grey matter’ (lit. matter grey) |
matière première | 2.90 | ‘raw material’ (lit. matter primary) |
mauvaise foi | 2.38 | ‘bad faith’ (lit. bad faith) |
mauvaise langue | 2.21 | ‘gossiper’ (lit. bad tongue) |
montagnes russes | 1.08 | ‘roller coaster’ (lit. mountains Russian) |
monument historique | 4.79 | ‘historical monument’ (lit. monument historical) |
mort né | 3.23 | ‘stillborn’ (lit. dead born) |
nouveau monde | 2.73 | ‘New World, Americas’ (lit. new world) |
nuit blanche | 1.07 | ‘sleepless night’ (lit. night white) |
numéro vert | 1.50 | ‘toll-free number’ (lit. number green) |
ordure ménagère | 4.20 | ‘household waste’ (lit. garbage domestic) |
organisation syndicale | 4.90 | ‘trade union’ (lit. organisation of-trade-union) |
pages jaunes | 3.00 | ‘yellow pages’ (lit. pages yellow) |
parachute doré | 0.50 | ‘golden parachute’ (lit. parachute golden) |
parc naturel | 4.33 | ‘nature park’ (lit. park natural) |
parti politique | 4.88 | ‘political party’ (lit. party political) |
parti pris | 2.69 | ‘bias’ (lit. party taken) |
partie fine | 0.80 | ‘orgy’ (lit. party fine/delicate) |
petit ami | 0.86 | ‘boyfriend’ (lit. small friend) |
petit beurre | 1.64 | ‘butter biscuit’ (lit. small butter) |
petit déjeuner | 2.27 | ‘breakfast’ (lit. small lunch) |
petit joueur | 1.00 | ‘amateur’ (lit. small player) |
petit pois | 4.14 | ‘pea’ (lit. small pea) |
petit salé | 1.15 | ‘salted pork’ (lit. small salty) |
petit écran | 2.50 | ‘television’ (lit. small screen) |
petit enfant | 2.79 | ‘grandchild’ (lit. small child) |
petit four | 0.92 | ‘type of pastry’ (lit. small oven) |
petit nègre | 0.50 | ‘pidgin or ‘badly spoken’ French’ (lit. little black-person) |
petite annonce | 2.69 | ‘classified ad’ (lit. small announcement) |
petite nature | 0.47 | ‘sensitive/fragile person’ (lit. small nature) |
pied noir | 0.13 | ‘French expats from Algeria’ (lit. foot black) |
pièce montée | 2.47 | ‘tiered cake’ (lit. piece assembled) |
pleine lune | 3.54 | ‘full moon’ (lit.full moon) |
poids lourd | 2.08 | ‘truck’ (lit. weight heavy) |
point faible | 2.46 | ‘weak point’ (lit. point weak) |
point mort | 1.00 | ‘standstill’ (lit. point dead) |
pot pourri | 0.40 | ‘medley’ (lit. pot/jar rotten) |
poule mouillée | 0.00 | ‘coward’ (lit. chicken wet) |
poupée russe | 3.75 | ‘Russian nesting doll’ (lit. doll Russian) |
premier ministre | 3.67 | ‘prime minister’ (lit. first minister) |
premier plan | 2.82 | ‘foreground’ (lit. first plan) |
première dame | 1.92 | ‘first lady’ (lit. first lady) |
prince charmant | 2.00 | ‘prince charming’ (lit. prince charming) |
prévision météorologique | 4.70 | ‘weather forecast’ (lit. forecast meteorological) |
recherche scientifique | 4.92 | ‘scientific research’ (lit. research scientific) |
ressources humaines | 3.91 | ‘human resources’ (lit. resources human) |
rond point | 3.18 | ‘roundabout’ (lit. round point) |
roulette russe | 0.87 | ‘Russian roulette’ (lit. roulette Russian) |
réchauffement climatique | 4.40 | ‘global warming’ (lit. warming climatic) |
région parisienne | 4.43 | ‘Paris region’ (lit. region Parisian) |
réseau social | 4.09 | ‘social network’ (lit. network social) |
sang froid | 0.47 | ‘cold blood, self-control’ (lit. blood cold) |
second degré | 1.40 | ‘irony, tongue-in-cheek’ (lit. second degree) |
second rôle | 3.64 | ‘supporting role’ (lit. second role) |
septième ciel | 0.21 | ‘cloud nine’ (lit. seventh heaven) |
service public | 4.71 | ‘public service’ (lit. service public) |
site officiel | 4.85 | ‘official website’ (lit. website official) |
soirée privée | 4.53 | ‘private party’ (lit. party private) |
sucre roux | 4.31 | ‘brown sugar’ (lit. sugar ginger-colored) |
sécurité routière | 4.55 | ‘road safety’ (lit. safety of-road) |
sécurité sociale | 3.67 | ‘social security’ (lit. security social) |
table basse | 4.79 | ‘coffee table’ (lit. table low) |
table ronde | 1.46 | ‘round table, discussion’ (lit. table round) |
tapis rouge | 3.31 | ‘red carpet, luxurious welcoming’ (lit. carpet red) |
temps fort | 1.87 | ‘key moment, highlight’ (lit. time strong) |
temps mort | 2.07 | ‘wasted time, idleness’ (lit. time dead) |
temps partiel | 3.62 | ‘part-time (work)’ (lit. time partial) |
temps plein | 3.08 | ‘full-time (work)’ (lit. time full) |
temps réel | 3.00 | ‘real time’ (lit. time real) |
travaux publics | 4.09 | ‘public works’ (lit. works public) |
trou noir | 2.58 | ‘black hole’ (lit. hole black) |
trou normand | 0.78 | ‘palate cleanser’ (lit. hole Norman) |
téléphone arabe | 0.23 | ‘Chinese whispers’ (lit. telephone Arabic) |
téléphone portable | 5.00 | ‘cellphone’ (lit. telephone portable) |
valeur sûre | 3.64 | ‘safe bet’ (lit. value safe/sure) |
vie associative | 4.00 | ‘community life’ (lit. life associative) |
vie quotidienne | 4.31 | ‘everyday life’ (lit. life daily) |
vieille fille | 2.42 | ‘spinster’ (lit. old girl/maid) |
vin blanc | 3.80 | ‘white wine’ (lit. wine white) |
vin rouge | 4.69 | ‘red wine’ (lit. wine red) |
yeux rouges | 4.36 | ‘red eyes’ (lit. eyes red) |
école primaire | 3.92 | ‘primary school’ (lit. school primary) |
étoile filante | 3.20 | ‘shooting star’ (lit. star slipping) |
Appendix F. List of Portuguese Compounds
We present below the 180 nominal compounds in PT-comp, along with their human-rated compositionality scores.
Compounds . | hcHM . | Translation (Gloss) . |
---|---|---|
abalo sísmico | 4.42 | ‘earthquake’ (lit. shock seismic) |
acampamento militar | 4.82 | ‘military camp’ (lit. camp military) |
agente secreto | 4.58 | ‘secret agent’ (lit. agent secret) |
alarme falso | 3.24 | ‘false alarm’ (lit. alarm false) |
algodão doce | 1.28 | ‘cotton candy’ (lit. cotton sweet) |
alta temporada | 2.04 | ‘high season’ (lit. high season) |
alta costura | 1.52 | ‘haute couture’ (lit. high sewing) |
alto mar | 1.35 | ‘high seas’ (lit. high sea) |
alto falante | 0.88 | ‘loudspeaker’ (lit. loud/high speaker) |
amigo oculto | 2.89 | ‘secret Santa’ (lit. friend hidden) |
amigo secreto | 3.11 | ‘secret Santa’ (lit. friend secret) |
amor próprio | 3.91 | ‘self-esteem’ (lit. love own) |
ano novo | 4.29 | ‘new year’ (lit. year new) |
ar condicionado | 2.44 | ‘air conditioning’ (lit. air conditioned) |
ar livre | 1.95 | ‘open air’ (lit. air free) |
arma branca | 0.65 | ‘cold weapon’ (lit. weapon white) |
ato falho | 3.50 | ‘Freudian slip’ (lit. act faulty) |
banho turco | 2.19 | ‘Turkish bath’ (lit. bath Turkish) |
batata doce | 4.24 | ‘sweet potato’ (lit. potato sweet) |
bebida alcoólica | 5.00 | ‘alcoholic drink’ (lit. drink alcoholic) |
bode expiatório | 0.47 | ‘scapegoat’ (lit. goat expiatory) |
braço direito | 0.57 | ‘right arm’ (lit. arm right) |
buraco negro | 2.88 | ‘black hole’ (lit. hole black/dark) |
café colonial | 2.70 | ‘afternoon tea’ (lit. breakfast colonial) |
caixa forte | 3.19 | ‘safe, vault’ (lit. box strong) |
caixa preta | 0.94 | ‘black box’ (lit. box black) |
caixeiro viajante | 3.43 | ‘traveling salesman’ (lit. clerk traveling) |
carne branca | 2.85 | ‘white meat’ (lit. meat white) |
carne vermelha | 3.66 | ‘red meat’ (lit. meat red) |
carro forte | 2.62 | ‘armored car’ (lit. car strong) |
carta aberta | 3.64 | ‘open letter’ (lit. letter open) |
centro comercial | 3.68 | ‘shopping mall’ (lit. center commercial) |
centro espírita | 3.43 | ‘Spiritualist center’ (lit. center spiritualist) |
cerca viva | 3.58 | ‘hedge’ (lit. fence living) |
cheiro verde | 0.67 | ‘parsley’ (lit. smell green) |
circuito integrado | 4.52 | ‘integrated circuit’ (lit. circuit integrated) |
classe executiva | 2.67 | ‘business class’ (lit. class executive) |
coluna social | 2.45 | ‘gossip column’ (lit. column social) |
colégio militar | 4.88 | ‘military high-school’ (lit. high-school military) |
comida caseira | 4.11 | ‘homemade food’ (lit. food homemade) |
companhia aérea | 3.11 | ‘airline’ (lit. company aerial) |
conta corrente | 2.71 | ‘checking account’ (lit. account current) |
coração partido | 1.06 | ‘broken heart’ (lit. heart broken) |
corda bamba | 1.31 | ‘tightrope, bad situation’ (lit. rope wobbly) |
cordas vocais | 2.32 | ‘vocal chords’ (lit. chords vocal) |
curto circuito | 1.96 | ‘short circuit’ (lit. short circuit) |
câmara fria | 4.65 | ‘cold chamber’ (lit. chamber cold) |
céu aberto | 1.68 | ‘outdoors, open air’ (lit. sky open) |
círculo vicioso | 2.17 | ‘vicious circle’ (lit. circle vicious) |
círculo virtuoso | 2.39 | ‘virtuous circle’ (lit. circle virtuous) |
deputado federal | 4.92 | ‘federal deputy’ (lit. deputy federal) |
desfile militar | 4.93 | ‘military parade’ (lit. parade military) |
direitos humanos | 3.86 | ‘human rights’ (lit. rights human) |
disco rígido | 2.76 | ‘hard drive’ (lit. disk rigid) |
disco voador | 2.94 | ‘flying saucer’ (lit. disk flying) |
efeitos especiais | 3.37 | ‘special effects’ (lit. effects special) |
elefante branco | 0.16 | ‘white elephant’ (lit. elephant white) |
escada rolante | 3.85 | ‘escalator’ (lit. stair rolling) |
estrela cadente | 2.52 | ‘shooting star’ (lit. star falling) |
exame clínico | 4.75 | ‘clinical examination’ (lit. examination clinical) |
exames laboratoriais | 4.90 | ‘laboratory tests’ (lit. examinations laboratory) |
farinha integral | 4.72 | ‘wholemeal flour’ (lit. flour integral) |
febre amarela | 1.43 | ‘yellow fever’ (lit. fever yellow) |
ficha limpa | 2.97 | ‘clean criminal records’ (lit. file clean) |
fila indiana | 1.17 | ‘single file’ (lit. queue Indian) |
fio condutor | 1.58 | ‘underlying theme’ (lit. thread conductor) |
força bruta | 3.33 | ‘brute force’ (lit. force brute) |
gatos pingados | 0.00 | ‘a few people’ (lit. cats dropped) |
gelo seco | 2.33 | ‘dry ice’ (lit. ice dry) |
golpe baixo | 2.03 | ‘low blow’ (lit. punch low) |
governo federal | 4.97 | ‘federal government’ (lit. government federal) |
gripe aviária | 3.11 | ‘avian flu’ (lit. flu avian) |
gripe suína | 2.48 | ‘swine flu’ (lit. flu swine) |
guarda florestal | 4.16 | ‘forest ranger’ (lit. guard forest) |
jogo duro | 1.13 | ‘rough play’ (lit. game hard) |
juízo final | 3.60 | ‘doomsday’ (lit. judgement final) |
leite integral | 4.67 | ‘whole milk’ (lit. milk integral) |
lista negra | 1.60 | ‘black list’ (lit. list black) |
livre-docente | 2.63 | ‘professor’ (lit. free lecturer) |
livro aberto | 0.79 | ‘open book’ (lit. book open) |
longa data | 1.63 | ‘longtime’ (lit. date long) |
longa-metragem | 0.96 | ‘feature film’ (lit. long length/footage) |
lua cheia | 3.52 | ‘full moon’ (lit. moon full) |
lua nova | 1.40 | ‘new moon’ (lit. moon new) |
lugar comum | 1.52 | ‘cliché’ (lit. place common) |
magia negra | 1.72 | ‘black magic’ (lit. magic black) |
mar aberto | 2.87 | ‘open sea’ (lit. sea open) |
maré alta | 4.03 | ‘high tide’ (lit. tide high) |
maré baixa | 4.18 | ‘low tide’ (lit. tide low) |
massa cinzenta | 1.69 | ‘grey matter’ (lit. mass grey) |
mau contato | 2.84 | ‘faulty contact’ (lit. bad contact) |
mau humor | 4.29 | ‘bad mood’ (lit. bad humour) |
mau olhado | 1.97 | ‘evil eye’ (lit. bad glance) |
mercado negro | 1.06 | ‘black market’ (lit. black market) |
mesa redonda | 1.10 | ‘round table’ (lit. table round) |
montanha russa | 0.31 | ‘roller coaster’ (lit. mountain Russian) |
má fé | 1.62 | ‘bad faith’ (lit. bad faith) |
máquina virtual | 3.76 | ‘virtual machine’ (lit. machine virtual) |
mão fechada | 1.06 | ‘stingy’ (lit. hand closed) |
navio negreiro | 3.52 | ‘slave ship’ (lit. ship black-slave) |
novo mundo | 2.29 | ‘new world’ (lit. new world) |
novo rico | 3.62 | ‘new rich, new money’ (lit. new rich) |
nó cego | 0.74 | ‘difficult situation’ (lit. knot blind) |
núcleo atômico | 4.93 | ‘atomic nucleus’ (lit. nucleus atomic) |
olho gordo | 0.28 | ‘evil eye’ (lit. eye fat) |
olho mágico | 0.27 | ‘peephole’ (lit. eye magic) |
olho nu | 2.15 | ‘naked eye’ (lit. eye naked) |
ovelha negra | 0.45 | ‘black sheep’ (lit. sheep black) |
papel higiênico | 4.27 | ‘toilet paper’ (lit. paper hygienic) |
paraíso fiscal | 1.47 | ‘tax haven’ (lit. paradise fiscal) |
pastor alemão | 0.90 | ‘German shepherd’ (lit. shepherd German) |
pau mandado | 0.30 | ‘subservient, stooge’ (lit. stick ordered) |
pavio curto | 0.80 | ‘short-tempered’ (lit. fuse short) |
pente fino | 0.53 | ‘careful research’ (lit. comb thin) |
peso morto | 0.90 | ‘dead weight’ (lit. weight dead) |
planta baixa | 0.74 | ‘floor plan’ (lit. plant short) |
ponto cego | 1.92 | ‘blind spot’ (lit. point blind) |
ponto forte | 1.51 | ‘strong point’ (lit. point strong) |
ponto fraco | 2.27 | ‘weak point’ (lit. point weak) |
poção mágica | 3.29 | ‘magic potion’ (lit. potion magic) |
prato feito | 3.14 | ‘blue-plate special’ (lit. plate ready-made) |
primeira infância | 3.70 | ‘early childhood’ (lit. first infancy) |
primeira-mão | 0.71 | ‘first hand’ (lit. first hand) |
primeira necessidade | 3.97 | ‘first necessity’ (lit. first necessity) |
primeira-dama | 1.52 | ‘first lady’ (lit. first dame) |
primeiro-ministro | 2.87 | ‘first minister’ (lit. first minister) |
primeiro plano | 2.00 | ‘forefront’ (lit. first plan) |
processo seletivo | 4.78 | ‘selection process’ (lit. process selective) |
pronto socorro | 2.76 | ‘first-aid posts’ (lit. ready aid) |
príncipe encantado | 1.72 | ‘prince charming’ (lit. prince enchanted) |
puro sangue | 1.55 | ‘pure blood’ (lit. pure blood) |
pão-duro | 0.12 | ‘stingy’ (lit. bread hard) |
pé quente | 0.09 | ‘lucky’ (lit. foot hot) |
pé-direito | 0.10 | ‘ceiling height’ (lit. foot right) |
pé frio | 0.23 | ‘unlucky’ (lit. foot cold) |
pólo aquático | 2.87 | ‘water polo’ (lit. aquatic pole/polo) |
quadro negro | 2.94 | ‘blackboard’ (lit. board black) |
queda livre | 3.48 | ‘free fall’ (lit. fall free) |
quinta categoria | 1.00 | ‘second-rate’ (lit. fifth category) |
rede social | 3.27 | ‘social network’ (lit. network social) |
regime político | 4.00 | ‘political system’ (lit. regime political) |
relógio analógico | 4.92 | ‘analog clock’ (lit. clock analog) |
relógio biológico | 2.12 | ‘biological clock’ (lit. clock biological) |
reta final | 1.12 | ‘final stretch’ (lit. straight line final) |
roda gigante | 4.20 | ‘Ferris wheel’ (lit. wheel giant) |
roleta russa | 0.29 | ‘Russian roulette’ (lit. roulette Russian) |
saia justa | 0.37 | ‘tight spot’ (lit. skirt tight) |
sala cirúrgica | 4.47 | ‘operating room’ (lit. room surgical) |
salão paroquial | 4.52 | ‘parish hall’ (lit. hall parish) |
sangue azul | 0.15 | ‘blue-blooded’ (lit. blood blue) |
sangue frio | 0.52 | ‘cold-blooded’ (lit. blood cold) |
sangue quente | 0.87 | ‘hot-blooded’ (lit. blood hot) |
secretária eletrônica | 2.52 | ‘answering machine’ (lit. secretary electronic) |
segundas intenções | 2.11 | ‘ulterior motives’ (lit. second intentions) |
segundo plano | 1.55 | ‘aside, in the background’ (lit. second plan) |
sentença judicial | 4.67 | ‘court ruling’ (lit. sentence judicial) |
sexto sentido | 1.40 | ‘sixth sense’ (lit. sixth sense) |
sinal verde | 1.39 | ‘green lights’ (lit. signal green) |
sistema político | 4.36 | ‘political system’ (lit. system political) |
sétima arte | 2.19 | ‘seventh art’ (lit. seventh art) |
tapete vermelho | 3.76 | ‘red carpet’ (lit. carpet red) |
tartaruga marinha | 5.00 | ‘sea turtle’ (lit. turtle marine) |
tela plana | 4.96 | ‘flat screen TV’ (lit. screen flat) |
tempo real | 2.81 | ‘real time’ (lit. time real) |
terceira idade | 1.70 | ‘elder’ (lit. third age) |
terceira pessoa | 2.00 | ‘third person’ (lit. third person) |
tiro livre | 1.58 | ‘free kick (soccer)’ (lit. shot free) |
trabalho braçal | 3.55 | ‘manual labor’ (lit. work arm) |
trabalho escravo | 4.24 | ‘slave work’ (lit. work slave) |
vaca louca | 1.23 | ‘mad cow’ (lit. cow crazy/mad) |
vinho branco | 3.40 | ‘white wine’ (lit. wine white) |
vinho tinto | 4.08 | ‘red wine’ (lit. wine dark-red) |
vista grossa | 0.50 | ‘turn a blind eye’ (lit. vision thick) |
viva voz | 1.70 | ‘aloud’ (lit. live voice) |
voto secreto | 4.82 | ‘secret ballot’ (lit. vote secret) |
vôo doméstico | 3.41 | ‘domestic flight’ (lit. flight domestic) |
vôo internacional | 4.96 | ‘international flight’ (lit. flight international) |
água doce | 1.45 | ‘fresh water’ (lit. water sweet) |
água mineral | 4.21 | ‘mineral water’ (lit. water mineral) |
ônibus executivo | 2.63 | ‘minibus’ (lit. bus executive) |
Compounds . | hcHM . | Translation (Gloss) . |
---|---|---|
abalo sísmico | 4.42 | ‘earthquake’ (lit. shock seismic) |
acampamento militar | 4.82 | ‘military camp’ (lit. camp military) |
agente secreto | 4.58 | ‘secret agent’ (lit. agent secret) |
alarme falso | 3.24 | ‘false alarm’ (lit. alarm false) |
algodão doce | 1.28 | ‘cotton candy’ (lit. cotton sweet) |
alta temporada | 2.04 | ‘high season’ (lit. high season) |
alta costura | 1.52 | ‘haute couture’ (lit. high sewing) |
alto mar | 1.35 | ‘high seas’ (lit. high sea) |
alto falante | 0.88 | ‘loudspeaker’ (lit. loud/high speaker) |
amigo oculto | 2.89 | ‘secret Santa’ (lit. friend hidden) |
amigo secreto | 3.11 | ‘secret Santa’ (lit. friend secret) |
amor próprio | 3.91 | ‘self-esteem’ (lit. love own) |
ano novo | 4.29 | ‘new year’ (lit. year new) |
ar condicionado | 2.44 | ‘air conditioning’ (lit. air conditioned) |
ar livre | 1.95 | ‘open air’ (lit. air free) |
arma branca | 0.65 | ‘cold weapon’ (lit. weapon white) |
ato falho | 3.50 | ‘Freudian slip’ (lit. act faulty) |
banho turco | 2.19 | ‘Turkish bath’ (lit. bath Turkish) |
batata doce | 4.24 | ‘sweet potato’ (lit. potato sweet) |
bebida alcoólica | 5.00 | ‘alcoholic drink’ (lit. drink alcoholic) |
bode expiatório | 0.47 | ‘scapegoat’ (lit. goat expiatory) |
braço direito | 0.57 | ‘right arm’ (lit. arm right) |
buraco negro | 2.88 | ‘black hole’ (lit. hole black/dark) |
café colonial | 2.70 | ‘afternoon tea’ (lit. breakfast colonial) |
caixa forte | 3.19 | ‘safe, vault’ (lit. box strong) |
caixa preta | 0.94 | ‘black box’ (lit. box black) |
caixeiro viajante | 3.43 | ‘traveling salesman’ (lit. clerk traveling) |
carne branca | 2.85 | ‘white meat’ (lit. meat white) |
carne vermelha | 3.66 | ‘red meat’ (lit. meat red) |
carro forte | 2.62 | ‘armored car’ (lit. car strong) |
carta aberta | 3.64 | ‘open letter’ (lit. letter open) |
centro comercial | 3.68 | ‘shopping mall’ (lit. center commercial) |
centro espírita | 3.43 | ‘Spiritualist center’ (lit. center spiritualist) |
cerca viva | 3.58 | ‘hedge’ (lit. fence living) |
cheiro verde | 0.67 | ‘parsley’ (lit. smell green) |
circuito integrado | 4.52 | ‘integrated circuit’ (lit. circuit integrated) |
classe executiva | 2.67 | ‘business class’ (lit. class executive) |
coluna social | 2.45 | ‘gossip column’ (lit. column social) |
colégio militar | 4.88 | ‘military high-school’ (lit. high-school military) |
comida caseira | 4.11 | ‘homemade food’ (lit. food homemade) |
companhia aérea | 3.11 | ‘airline’ (lit. company aerial) |
conta corrente | 2.71 | ‘checking account’ (lit. account current) |
coração partido | 1.06 | ‘broken heart’ (lit. heart broken) |
corda bamba | 1.31 | ‘tightrope, bad situation’ (lit. rope wobbly) |
cordas vocais | 2.32 | ‘vocal chords’ (lit. chords vocal) |
curto circuito | 1.96 | ‘short circuit’ (lit. short circuit) |
câmara fria | 4.65 | ‘cold chamber’ (lit. chamber cold) |
céu aberto | 1.68 | ‘outdoors, open air’ (lit. sky open) |
círculo vicioso | 2.17 | ‘vicious circle’ (lit. circle vicious) |
círculo virtuoso | 2.39 | ‘virtuous circle’ (lit. circle virtuous) |
deputado federal | 4.92 | ‘federal deputy’ (lit. deputy federal) |
desfile militar | 4.93 | ‘military parade’ (lit. parade military) |
direitos humanos | 3.86 | ‘human rights’ (lit. rights human) |
disco rígido | 2.76 | ‘hard drive’ (lit. disk rigid) |
disco voador | 2.94 | ‘flying saucer’ (lit. disk flying) |
efeitos especiais | 3.37 | ‘special effects’ (lit. effects special) |
elefante branco | 0.16 | ‘white elephant’ (lit. elephant white) |
escada rolante | 3.85 | ‘escalator’ (lit. stair rolling) |
estrela cadente | 2.52 | ‘shooting star’ (lit. star falling) |
exame clínico | 4.75 | ‘clinical examination’ (lit. examination clinical) |
exames laboratoriais | 4.90 | ‘laboratory tests’ (lit. examinations laboratory) |
farinha integral | 4.72 | ‘wholemeal flour’ (lit. flour integral) |
febre amarela | 1.43 | ‘yellow fever’ (lit. fever yellow) |
ficha limpa | 2.97 | ‘clean criminal records’ (lit. file clean) |
fila indiana | 1.17 | ‘single file’ (lit. queue Indian) |
fio condutor | 1.58 | ‘underlying theme’ (lit. thread conductor) |
força bruta | 3.33 | ‘brute force’ (lit. force brute) |
gatos pingados | 0.00 | ‘a few people’ (lit. cats dropped) |
gelo seco | 2.33 | ‘dry ice’ (lit. ice dry) |
golpe baixo | 2.03 | ‘low blow’ (lit. punch low) |
governo federal | 4.97 | ‘federal government’ (lit. government federal) |
gripe aviária | 3.11 | ‘avian flu’ (lit. flu avian) |
gripe suína | 2.48 | ‘swine flu’ (lit. flu swine) |
guarda florestal | 4.16 | ‘forest ranger’ (lit. guard forest) |
jogo duro | 1.13 | ‘rough play’ (lit. game hard) |
juízo final | 3.60 | ‘doomsday’ (lit. judgement final) |
leite integral | 4.67 | ‘whole milk’ (lit. milk integral) |
lista negra | 1.60 | ‘black list’ (lit. list black) |
livre-docente | 2.63 | ‘professor’ (lit. free lecturer) |
livro aberto | 0.79 | ‘open book’ (lit. book open) |
longa data | 1.63 | ‘longtime’ (lit. date long) |
longa-metragem | 0.96 | ‘feature film’ (lit. long length/footage) |
lua cheia | 3.52 | ‘full moon’ (lit. moon full) |
lua nova | 1.40 | ‘new moon’ (lit. moon new) |
lugar comum | 1.52 | ‘cliché’ (lit. place common) |
magia negra | 1.72 | ‘black magic’ (lit. magic black) |
mar aberto | 2.87 | ‘open sea’ (lit. sea open) |
maré alta | 4.03 | ‘high tide’ (lit. tide high) |
maré baixa | 4.18 | ‘low tide’ (lit. tide low) |
massa cinzenta | 1.69 | ‘grey matter’ (lit. mass grey) |
mau contato | 2.84 | ‘faulty contact’ (lit. bad contact) |
mau humor | 4.29 | ‘bad mood’ (lit. bad humour) |
mau olhado | 1.97 | ‘evil eye’ (lit. bad glance) |
mercado negro | 1.06 | ‘black market’ (lit. black market) |
mesa redonda | 1.10 | ‘round table’ (lit. table round) |
montanha russa | 0.31 | ‘roller coaster’ (lit. mountain Russian) |
má fé | 1.62 | ‘bad faith’ (lit. bad faith) |
máquina virtual | 3.76 | ‘virtual machine’ (lit. machine virtual) |
mão fechada | 1.06 | ‘stingy’ (lit. hand closed) |
navio negreiro | 3.52 | ‘slave ship’ (lit. ship black-slave) |
novo mundo | 2.29 | ‘new world’ (lit. new world) |
novo rico | 3.62 | ‘new rich, new money’ (lit. new rich) |
nó cego | 0.74 | ‘difficult situation’ (lit. knot blind) |
núcleo atômico | 4.93 | ‘atomic nucleus’ (lit. nucleus atomic) |
olho gordo | 0.28 | ‘evil eye’ (lit. eye fat) |
olho mágico | 0.27 | ‘peephole’ (lit. eye magic) |
olho nu | 2.15 | ‘naked eye’ (lit. eye naked) |
ovelha negra | 0.45 | ‘black sheep’ (lit. sheep black) |
papel higiênico | 4.27 | ‘toilet paper’ (lit. paper hygienic) |
paraíso fiscal | 1.47 | ‘tax haven’ (lit. paradise fiscal) |
pastor alemão | 0.90 | ‘German shepherd’ (lit. shepherd German) |
pau mandado | 0.30 | ‘subservient, stooge’ (lit. stick ordered) |
pavio curto | 0.80 | ‘short-tempered’ (lit. fuse short) |
pente fino | 0.53 | ‘careful research’ (lit. comb thin) |
peso morto | 0.90 | ‘dead weight’ (lit. weight dead) |
planta baixa | 0.74 | ‘floor plan’ (lit. plant short) |
ponto cego | 1.92 | ‘blind spot’ (lit. point blind) |
ponto forte | 1.51 | ‘strong point’ (lit. point strong) |
ponto fraco | 2.27 | ‘weak point’ (lit. point weak) |
poção mágica | 3.29 | ‘magic potion’ (lit. potion magic) |
prato feito | 3.14 | ‘blue-plate special’ (lit. plate ready-made) |
primeira infância | 3.70 | ‘early childhood’ (lit. first infancy) |
primeira-mão | 0.71 | ‘first hand’ (lit. first hand) |
primeira necessidade | 3.97 | ‘first necessity’ (lit. first necessity) |
primeira-dama | 1.52 | ‘first lady’ (lit. first dame) |
primeiro-ministro | 2.87 | ‘first minister’ (lit. first minister) |
primeiro plano | 2.00 | ‘forefront’ (lit. first plan) |
processo seletivo | 4.78 | ‘selection process’ (lit. process selective) |
pronto socorro | 2.76 | ‘first-aid posts’ (lit. ready aid) |
príncipe encantado | 1.72 | ‘prince charming’ (lit. prince enchanted) |
puro sangue | 1.55 | ‘pure blood’ (lit. pure blood) |
pão-duro | 0.12 | ‘stingy’ (lit. bread hard) |
pé quente | 0.09 | ‘lucky’ (lit. foot hot) |
pé-direito | 0.10 | ‘ceiling height’ (lit. foot right) |
pé frio | 0.23 | ‘unlucky’ (lit. foot cold) |
pólo aquático | 2.87 | ‘water polo’ (lit. aquatic pole/polo) |
quadro negro | 2.94 | ‘blackboard’ (lit. board black) |
queda livre | 3.48 | ‘free fall’ (lit. fall free) |
quinta categoria | 1.00 | ‘second-rate’ (lit. fifth category) |
rede social | 3.27 | ‘social network’ (lit. network social) |
regime político | 4.00 | ‘political system’ (lit. regime political) |
relógio analógico | 4.92 | ‘analog clock’ (lit. clock analog) |
relógio biológico | 2.12 | ‘biological clock’ (lit. clock biological) |
reta final | 1.12 | ‘final stretch’ (lit. straight line final) |
roda gigante | 4.20 | ‘Ferris wheel’ (lit. wheel giant) |
roleta russa | 0.29 | ‘Russian roulette’ (lit. roulette Russian) |
saia justa | 0.37 | ‘tight spot’ (lit. skirt tight) |
sala cirúrgica | 4.47 | ‘operating room’ (lit. room surgical) |
salão paroquial | 4.52 | ‘parish hall’ (lit. hall parish) |
sangue azul | 0.15 | ‘blue-blooded’ (lit. blood blue) |
sangue frio | 0.52 | ‘cold-blooded’ (lit. blood cold) |
sangue quente | 0.87 | ‘hot-blooded’ (lit. blood hot) |
secretária eletrônica | 2.52 | ‘answering machine’ (lit. secretary electronic) |
segundas intenções | 2.11 | ‘ulterior motives’ (lit. second intentions) |
segundo plano | 1.55 | ‘aside, in the background’ (lit. second plan) |
sentença judicial | 4.67 | ‘court ruling’ (lit. sentence judicial) |
sexto sentido | 1.40 | ‘sixth sense’ (lit. sixth sense) |
sinal verde | 1.39 | ‘green lights’ (lit. signal green) |
sistema político | 4.36 | ‘political system’ (lit. system political) |
sétima arte | 2.19 | ‘seventh art’ (lit. seventh art) |
tapete vermelho | 3.76 | ‘red carpet’ (lit. carpet red) |
tartaruga marinha | 5.00 | ‘sea turtle’ (lit. turtle marine) |
tela plana | 4.96 | ‘flat screen TV’ (lit. screen flat) |
tempo real | 2.81 | ‘real time’ (lit. time real) |
terceira idade | 1.70 | ‘elder’ (lit. third age) |
terceira pessoa | 2.00 | ‘third person’ (lit. third person) |
tiro livre | 1.58 | ‘free kick (soccer)’ (lit. shot free) |
trabalho braçal | 3.55 | ‘manual labor’ (lit. work arm) |
trabalho escravo | 4.24 | ‘slave work’ (lit. work slave) |
vaca louca | 1.23 | ‘mad cow’ (lit. cow crazy/mad) |
vinho branco | 3.40 | ‘white wine’ (lit. wine white) |
vinho tinto | 4.08 | ‘red wine’ (lit. wine dark-red) |
vista grossa | 0.50 | ‘turn a blind eye’ (lit. vision thick) |
viva voz | 1.70 | ‘aloud’ (lit. live voice) |
voto secreto | 4.82 | ‘secret ballot’ (lit. vote secret) |
vôo doméstico | 3.41 | ‘domestic flight’ (lit. flight domestic) |
vôo internacional | 4.96 | ‘international flight’ (lit. flight international) |
água doce | 1.45 | ‘fresh water’ (lit. water sweet) |
água mineral | 4.21 | ‘mineral water’ (lit. water mineral) |
ônibus executivo | 2.63 | ‘minibus’ (lit. bus executive) |
Acknowledgments
This work has been partly funded by projects PARSEME (Cost Action IC1207), PARSEME-FR (ANR-14-CERA-0001), AIM-WEST (FAPERGS-INRIA 1706-2551/ 13-7), CNPq (312114/2015-0, 423843/2016-8) “Simplificação Textual de Expressões Complexas,” sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian federal law No. 8.248/91. We would like to thank the anonymous reviewers who provided numerous helpful suggestions, Alexis Nasr for reviewing earlier versions of this article, Rodrigo Wilkens and Leonardo Zilio for contributing to the data set creation, and all anonymous annotators who judged the compositionality of compounds.
Notes
Attributed to Frege (1892/1960).
This article significantly extends and updates previous publications:
- 1.
We consolidate the description of the data sets introduced in Ramisch et al. (2016) and Ramisch, Cordeiro, and Villavicencio (2016) by adding details about data collection, filtering, and results of a thorough analysis studying the correlation between compositionality and related variables.
- 2.
We extend the compositionality prediction framework described in Cordeiro, Ramisch, and Villavicencio (2016) by adding and evaluating new composition functions and DSMs.
- 3.
We extend the evaluation reported in Cordeiro et al. (2016) not only by adding Portuguese, but also by evaluating additional parameters: corpus size, composition functions, and new DSMs.
After dimensionality reduction, nowadays word vectors are often called word embeddings.
The task of determining whether a phrase is compositional is closely related to MWE discovery (Constant et al. 2017), which aims to automatically extract MWE lists from corpora.
The terms noun compound and compound noun are usually reserved for nominal compounds formed by sequences of nouns only, typical of Germanic languages but not frequent in Romance languages.
In this article, examples are preceded by their language codes: EN for English, FR for French, and PT for Brazilian Portuguese. In the absence of a language code, English is implied.
For English, only EN-comp90 and EN-compExt (90 and 100 new compounds, respectively) are considered. Reddy (included in EN-comp) is analyzed in Reddy, McCarthy, and Manandhar (2011).
We have not attempted to select compounds that are translations of each other, as a compound in a given language may be realized differently in the other languages.
Freely available at: http://pageperso.lis-lab.fr/∼carlos.ramisch/?page=downloads/compounds.
Participants with negative correlation with the mean, and answers farther than ±1.5 from the mean.
Only FR-comp is shown as the other data sets display similar patterns.
A disagreement between answers a and b is weighted |a − b|.
and are .91 and .96 in PT-comp, .90 and .96 in EN-comp90, and .92 and .95 in EN-compExt.
Implemented as feat_compositionality.py in the mwetoolkit: http://mwetoolkit.sf.net.
Except when explicitly indicated, the term vector refers to corpus-derived vectors output by DSMs.
In practice, for the special case of two words, we do not need to perform parameter search for β, which has a closed form obtained by solving the equation pcβ(w1w2) = 0: β = .
Wikipedia articles downloaded on June 2016.
Hyphenated compounds are also re-tokenized with an underscore separator.
Therefore, in Section 5.2, the terms target/context words may actually refer to compounds.
In the lemmatized corpora, the lemmas of proper names are replaced by placeholders.
In previous work adjectives and adverbs were also included as contexts, but the results obtained with only verbs and nouns were better (Padró et al. 2014a).
Vectors still have |V| dimensions but we use d as a shortcut to represent the fact that we only retain the most relevant target-context pairs for each target word.
Thresholds were selected so as to not use more than 128 GB of RAM.
This is in line with the authors’ threshold suggestions (Salle, Villavicencio, and Idiart 2016).
A compound is considered as noncompositional if at least 2 out of 4 annotators annotate it as noncompositional.
This refers to 5 out of 180 in EN-comp and 129 out of 1,042 in Farahmand. For these, the fallback strategy assigns the average compositionality score (Salehi, Cook, and Baldwin 2015). Although fallback produces slightly better results for EN-comp, it does the opposite for Farahmand, which contains a larger proportion of missing compounds (2.8% vs. 12.4%).
We have also considered separating folds so as to be balanced regarding their compositionality scores. The results were similar to the ones reported here.
For PPMI–SVD with WindowSize = 8 + 8, a context word at distance D from its target word is weighted . For glove, the decay happens much faster, with a weight of , which allows the model to look farther away without being affected by potential noise introduced by distant contexts.
Henceforth, we omit results for EN-comp90 and Reddy, as they are included in EN-comp.
For PPMI–SVD and lexvec, this behavior might be related to the fact that both methods perform a factorization of the PPMI matrix.
As the characteristics of Farahmand are different from the other data sets, in this analysis we only use the other more comparable data sets.
For PPMI–thresh, eight different samplings of corpus fragments were performed (for a total of 800 DSMs per language), with each y-axis data point presenting the average and standard deviation of the ρ obtained from those samplings. For w2v–sg, since it is much more time-consuming, a single sampling was used, and thus only one execution was performed for each datapoint (for a total of 100 DSMs per language).
The Pearson correlations (averaged across 7 DSMs) between pcarith and pcuniform are r = .972 for EN-comp, r = .991 for FR-comp, and r = .969 for PT-comp, confirming their similar results.
We focus on one representative of PPMI-based DSMs and one representative of word-embedding models. Similar results were observed for the best configurations of other DSMs.
We linearly normalize predicted scores to be between 0 and 1. However, given that negative scores are rare in practice, unreported correlation with non-normalized pc are similar to the ones reported.
The resulting data sets and framework implementation are freely available to the community.
As Farahmand is considerably different from the other data sets, a direct comparison is not possible.
The running time grows linearly with the number of iterations.
The threshold used for ρbase depends on the DSM, and is described in Section 5.2.
The low standard deviation is not a unique property of high-ranking configurations: The average of deviations for all models was .004 for EN-comp and .006 for Reddy.
The judgment threshold we adopted was z = 2.2 for EN-comp90, z = 2.2 for PT-comp, and z = 2.5 for FR-comp. The human judge threshold was R = 0.5.