Abstract
Idiomatic expressions are an integral part of natural language and are constantly being added to languages. Owing to their non-compositionality and their ability to take on a figurative or literal meaning depending on the sentential context, they have been a classical challenge for NLP systems. To address this challenge, we study the task of detecting whether a sentence has an idiomatic expression and localizing it when it occurs in a figurative sense. Prior research for this task has studied specific classes of idiomatic expressions offering limited views of their generalizability to new idioms. We propose a multi-stage neural architecture with attention flow as a solution. The network effectively fuses contextual and lexical information at different levels using word and sub-word representations. Empirical evaluations on three of the largest benchmark datasets with idiomatic expressions of varied syntactic patterns and degrees of non-compositionality show that our proposed model achieves new state-of-the-art results. A salient feature of the model is its ability to identify idioms unseen during training with gains from 1.4% to 30.8% over competitive baselines on the largest dataset.
1 Introduction
Idiomatic expressions (IEs) are a special class of multi-word expressions (MWEs) that typically occur as collocations and exhibit semantic non-compositionality (a.k.a. semantic idiomaticity), where the meaning of the expression is not derivable from its parts (Baldwin and Kim, 2010). In terms of occurrence, IEs are individually rare but collectively frequent, and they are constantly added to natural language across different genres (Moon et al., 1998). Additionally, they are known to enhance fluency and to convey ideas succinctly when used in everyday language (Baldwin and Kim, 2010; Moon et al., 1998).
Classically regarded as a “pain in the neck” for idiom-unaware NLP applications (Sag et al., 2002), these phrases are challenging for reasons including their non-compositionality (semantic idiomaticity) and their ability to take on a figurative or literal meaning depending on the context (semantic ambiguity), as shown by the example in Table 1. Borrowing the terminology from Haagsma et al. (2020), we call these phrases potentially idiomatic expressions (PIEs) to account for the contextual semantic ambiguity. Indeed, prior work has identified the challenges that PIEs pose to many NLP applications, such as machine translation (Fadaee et al., 2018; Salton et al., 2014), paraphrase generation (Ganitkevitch et al., 2013), and sentiment analysis (Liu et al., 2017; Biddle et al., 2020). Accordingly, making applications idiom-aware, either by identifying idioms before or during the task, has been found to be effective (Korkontzelos and Manandhar, 2010; Nivre and Nilsson, 2004; Nasr et al., 2015). This study proposes a novel architecture that detects the presence of a PIE and localizes its span in a given sentence, returning the phrase if it is used figuratively (i.e., used as an IE) and an empty string otherwise, indicating that the phrase is used literally (see Table 1). Such a network can serve as a preprocessing step for broad-coverage downstream NLP applications because we consider the ability to detect IEs to be a first step towards their accurate processing. This is the idiomatic expression identification problem, which is the MWE identification problem defined by Baldwin and Kim (2010) limited to MWEs with semantic idiomaticity.
Table 1: Example input sentences containing the PIE behind her back and the expected output: the phrase itself when used figuratively and the empty string when used literally.

| Input | Output |
|---|---|
| Tom said many bad things about Jane behind her back. (Figurative) | behind her back |
| He took one from an armchair and put it behind her back. (Literal) | 〈CLS〉 〈SEP〉 (i.e., the empty string) |
Despite being well-studied in the current literature as idiom type and token classification (e.g., Fazly et al., 2009; Feldman and Peng, 2013; Salton et al., 2016; Taslimipoor et al., 2018; Peng et al., 2014; Liu and Hwa, 2019), previous methods are limited for various reasons. They rely on knowing the PIEs being classified and hence their exact positions, or focus on specific syntactic patterns (e.g., verb-noun compounds or verbal MWEs), thereby calling into question their use in more realistic scenarios with unseen PIEs (a likely event, given the prolific nature of PIEs). Additionally, without a cross-type (type-aware) evaluation, where the PIE types from the train and test splits are segregated (Fothergill and Baldwin, 2012; Taslimipoor et al., 2018), the true generalizability of these methods to unseen idioms cannot be inferred. For instance, a model could be classifying by memorizing known PIEs or their tendencies to occur exclusively as figurative or literal expressions.
In contrast, this study aims to identify IEs in general (i.e., without posing constraints on the PIE type) in a more realistic setting where new idioms may occur, by proposing the iDentifier of Idiomatic expressions via Semantic Compatibility (DISC) that performs detection and localization jointly. The novelty is that we perform the task without an explicit mention of the identity or the position of the PIE. As a result, the task is more challenging than the previously explored idiom token classification.
An effective solution to this task calls for the ability to relate the meaning of a PIE’s component words with each other (e.g., Baldwin, 2005; McCarthy et al., 2007) as well as with the context (Liu and Hwa, 2019). This aligns with the widely upheld psycholinguistic findings on human processing of a phrase’s figurative meaning in comparison with its literal interpretation (Bobrow and Bell, 1973). Toward this end, we rely on the contextualized representation of a PIE (accounting both for its internal and contextual properties), hypothesizing that a figurative expression’s contextualized representation should be different from that of its literal counterpart. We refer to this as its semantic compatibility (SC)—if a PIE is semantically compatible with its context, then it is literal; if not, it is figurative. The idea of SC also captures the distinction between literal word combinations and idioms, in terms of the semantics encoded by both (Jaeger, 1999) and the related property of selectional preference (Wilks, 1975)—the tendency for a word to semantically select or constrain which other words may appear in its association (Katz and Fodor, 1963), a notion successfully used for processing metaphors (Shutova et al., 2013) and word sense disambiguation (Stevenson and Wilks, 2001). We capture SC by effectively fusing information from the input tokens’ contextualized and literal word representations to then localize the span of a PIE used figuratively. Here we leverage the idea of attention flow previously studied in a machine comprehension setting (Seo et al., 2017).
Our main contributions in this work are:
A novel IE identification model, DISC, that uses attention flow to fuse lexical semantic information at different levels and discern the SC of a PIE. Taking only a sentence as input and using only word and POS representations, it simultaneously performs detection and localization of the PIEs used figuratively. To the best of our knowledge, this is the first such study on this task.
Realistic evaluation:
We include two novel aspects in our evaluation methodology. First, we consider a new and stringent performance measure for subsequence identification; the identification is successful if and only if every word in the exact IE subsequence is identified. Second, we consider type-aware evaluation so as to highlight a model’s generalizability to unseen PIEs regardless of syntactic pattern.
Competitive performance:
Using benchmark datasets with a variety of PIEs, we show that DISC¹ compares favorably with strong baselines on PIEs seen during training. Particularly noteworthy is its identification accuracy on unseen PIEs, which is 1.4% to 11.1% higher than the best baseline.
2 Related Work
We provide a unified view of the diverse terminologies and tasks studied in prior works that define the scope of our study.
MWEs, IEs and Metaphors.
We first introduce the relation between the three related concepts, namely, MWE, IE, and metaphor, in order to present a clearer picture of the scope of our work. According to Baldwin and Kim (2010) and Constant et al. (2017), MWEs (e.g., bus driver and good morning) satisfy the properties of outstanding collocation and contain multiple words. IEs are a special type of MWE that also exhibit non-compositionality at the semantic level. This has generally been considered to be the key distinguishing property between idioms (IEs) and MWEs in general, although the boundary between IEs and non-idiom MWEs is not clearly defined (Baldwin and Kim, 2010; Fadaee et al., 2018; Liu et al., 2017; Biddle et al., 2020). Metaphors are a form of figurative speech used to make an implicit comparison at an attribute level between two things seemingly unrelated on the surface. By definition, certain MWEs and IEs use metaphorical figuration (e.g., couch potato and behind the scenes). However, not all metaphors are IEs because metaphors are not required to possess any of the properties of IEs—that is, the components of a metaphor need not co-occur frequently (metaphors can be uniquely created by anyone), metaphors can be direct and plain comparisons and thus are not semantically non-compositional, and they need not have multiple words (e.g., titanium in the sentence “I am titanium”).
PIE and MWE Processing.
Current literature considers idiom type classification and idiom token classification (Cook et al., 2008; Liu and Hwa, 2019; Liu, 2019) as two idiom-related tasks. Idiom type classification decides if a phrase could be used as an idiom without specifically considering its context. Several works (e.g., Fazly and Stevenson, 2006; Shutova et al., 2010) have studied the distinguishing properties of idioms from other literal phrases, especially that of non-compositionality (Westerståhl, 2002; Tabossi et al., 2008, 2009; Reddy et al., 2011; Cordeiro et al., 2016).
In contrast, idiom token classification (Fazly et al., 2009; Feldman and Peng, 2013; Peng and Feldman, 2016; Salton et al., 2016; Taslimipoor et al., 2018; Peng et al., 2014; Liu and Hwa, 2019) determines whether a given PIE is used literally or figuratively in a sentence. Prior work has used per-idiom classifiers that are too non-scalable to be practical (Liu and Hwa, 2017), required the position of the PIEs in the sentence (e.g., Liu and Hwa, 2019), or focused only on specific PIE patterns, such as verb-noun compounds (Taslimipoor et al., 2018). Overall, available research for this task only disambiguates a given phrase. In contrast, we do not assume any knowledge of the PIE being detected; given a sentence, we detect whether there is a PIE and disambiguate its use.
PIEs being special types of MWEs, our task is related to MWE extraction and MWE identification (Baldwin and Kim, 2010). As with idioms, MWE extraction takes a text corpus as input and produces a list of new MWEs (e.g., Fazly et al., 2009; Evert and Krenn, 2001; Pearce, 2001; Schone and Jurafsky, 2001). MWE identification takes a text corpus as input and locates all occurrences of MWEs in the text at the token level, differentiating between their figurative and literal use (Baldwin, 2005; Katz and Giesbrecht, 2006; Hashimoto et al., 2006; Blunsom, 2007; Sporleder and Li, 2009; Fazly et al., 2009; Savary et al., 2017); the identified MWEs may or may not be known beforehand. Constant et al. (2017) group main MWE-related tasks into MWE discovery and MWE identification: MWE discovery is identical to MWE extraction, while the MWE identification here, different from Baldwin and Kim’s definition, identifies only known MWEs. Our task is identical to Baldwin and Kim’s (2010) MWE identification and Savary et al.’s (2017) verbal MWE identification while focusing only on PIEs, and we aim to both detect the presence of PIEs and localize IE positions (boundaries), regardless of whether the PIEs were previously seen or not. Besides, like idiom type classification and MWE extraction, our approach also works for identifying new idiomatic expressions.
Approaches to MWE identification fall into two broad types. (1) A tree-based approach that first constructs a syntactic tree of the sentence and then traverses a selective set of candidate subsequences (at a node) to identify idioms (Liu et al., 2017). However, since the construction of a syntactic tree is itself affected by the presence of idioms (Nasr et al., 2015; Green et al., 2013), the nodes may not correspond to an entire idiomatic expression, which in turn can affect even a perfect classifier’s ability to identify idioms precisely. (2) Framing the problem as a sequence labeling problem for token-level idiomatic/literal labeling, similar to prior work (Jang et al., 2015; Mao et al., 2019; Gong et al., 2020; Kumar and Sharma, 2020; Su et al., 2020) on metaphor detection that labels each token as a metaphor or a non-metaphor, and to Schneider and Smith’s (2015) approach to MWE identification, which tags tokens from an MWE with the same supersense tag. This tagging approach provides finer control over subsequence extraction, is unrestricted by factors that could impact a tree-based approach, and does not require the traversal of all possible subsequences in search of the candidate phrases. Our approach is similar to this in spirit but focused on PIEs. In particular, Schneider and Smith (2015) aim to tag all MWEs while making no distinction for the non-compositional phrases, whereas our work aims to only identify IEs from sentences containing PIEs.
Semantic Compatibility.
Exploiting SC for processing idioms has been considered in rather restricted settings, where the identity of a PIE (and hence its position) is known. For instance, Liu and Hwa (2019) used SC to classify a given phrase in its context as literal/idiomatic. A corpus of annotated phrases was used to train a linear classification layer to discriminate between phrases’ contextualized and literal embeddings. Peng and Feldman (2016) directly check the compatibility between the word embeddings of a PIE with the embeddings of its context words to perform the literal/idiomatic classification. Jang et al. (2015) used SC and the global discourse context to detect the figurative use of a small list of candidate metaphor words. Gong et al. (2017) treated the phrase’s respective context as vector spaces and modeled the distance of the phrase from the vector space as an index of SC. We extend these prior efforts to identify both the presence and the position of an IE using only a sentence as input without knowing the PIE.
3 Method
In line with studies on MWE identification mentioned above, we frame the identification of idiomatic subsequences as a token-level tagging problem, where we perform literal/idiomatic classification for every token in the sentence. A simple post-processing step finally extracts the PIE subsequence used in the idiomatic sense.
Task Definition.
Given an input sentence $S = w_1, w_2, \ldots, w_L$, where $w_i$ for $i \in [1, L]$ are the tokenized units and $L$ is the number of tokens in $S$, the task is to label each token $w_i$ with a label $c_i \in \{\text{idiomatic}, \text{literal}\}$ so that the final output is a sequence of classifications $C = c_1, c_2, \ldots, c_L$. A prediction is correct if either the phrase $w_{i:j}$ in $S$ is idiomatic and the corresponding $c_{i:j}$ are classified into the ‘idiomatic’ class while the remaining tokens are classified into the ‘literal’ class, or the phrase $w_{i:j}$ in $S$ is literal and all of $c_{1:L}$ are classified into the ‘literal’ class.
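To make the tagging formulation and the post-processing step concrete, the following is a minimal sketch (in Python; not the released DISC code) of how a predicted label sequence is turned into the final output: the tokens tagged idiomatic are returned as the identified IE, and an empty string is returned otherwise.

```python
def extract_idiom_span(tokens, labels):
    """Return the tokens tagged 'idiomatic' as a phrase, or '' if none.

    tokens: list of word strings w_1..w_L
    labels: list of predicted classes c_1..c_L ('idiomatic' or 'literal')
    """
    span = [w for w, c in zip(tokens, labels) if c == "idiomatic"]
    return " ".join(span) if span else ""

# Hypothetical prediction for the figurative sentence in Table 1:
tokens = "Tom said many bad things about Jane behind her back .".split()
labels = ["literal"] * 7 + ["idiomatic"] * 3 + ["literal"]
print(extract_idiom_span(tokens, labels))  # -> "behind her back"
```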
Overview of Proposed Approach.
The overall workflow and model architecture of DISC are illustrated in Figure 1. The model can be roughly divided into three distinct phases: (1) the embedding phase, (2) the attention phase, and (3) the prediction phase. In the embedding phase, the input sequence S is tokenized and both the contextualized and static word embeddings are generated and supplemented with character-level information. Furthermore, POS tag embeddings of the input tokens are generated to provide syntactic information. In the attention phase, an attention flow layer combines the POS tag embeddings with the static word embeddings, yielding an enhanced literal representation for every word. Then, a second attention flow layer fuses the contextualized and the enriched literal representations by attending to the rich features of each token in the tokenized input sequence. Finally, the prediction phase further encodes the sequence of feature vectors and performs token-level literal/idiomatic classification to produce the predicted sequence C.
Embedding Phase.
Here the input sentence $S$ is tokenized in two ways—one for the pre-trained language model and the other for the pre-trained static word embedding layer—resulting in two tokenized sequences $T_c$ and $T_s$, such that $|T_c| = M$ and $|T_s| = N$. Since the two tokenizers are not necessarily the same, $N$ and $M$ may be unequal.
Next, $T_c$ is fed to a pre-trained language model to produce a sequence of contextualized word embeddings, $E_{con} \in \mathbb{R}^{M \times D_{con}}$, where $D_{con}$ is the embedding vector dimension. A pre-trained word embedding layer takes $T_s$ to produce a sequence of static word embeddings, $E_s \in \mathbb{R}^{N \times D_s}$, where $D_s$ is the embedding vector dimension. The contextualized embeddings capture the semantic content of the phrases within the specific context, while the static word embeddings capture the compositional meaning of the phrases; together they allow the model to check SC.
Additionally, informed by the finding that character-level information alleviates the problem of morphological variability in idiom detection (Liu et al., 2017), character sequences are generated from $T_s$, and their character-level embeddings, $E_{char} \in \mathbb{R}^{N \times D_{char}}$, are obtained using a 1-D Convolutional Neural Network (CNN) followed by a max-pooling layer over the maximum width of the tokens, $W_t$. Then, $E_{char}$ and $E_s$ are combined via a two-layer highway network (Srivastava et al., 2015), which yields the character-enriched static embeddings $E_{sc}$.
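The following is an illustrative PyTorch sketch of this character-level component, assuming the sizes reported under Implementation Details ($D_{char} = 64$, a two-layer highway network); module and parameter names are ours, not the authors’.

```python
import torch
import torch.nn as nn

class CharCNNHighway(nn.Module):
    """Sketch: character-level features via a 1-D CNN + max-pool, fused with
    static word embeddings through a two-layer highway network
    (Srivastava et al., 2015). D_char = 64 follows the paper; the rest is an
    illustrative choice."""

    def __init__(self, n_chars, d_char=64, d_word=300, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_char, padding_idx=0)
        self.conv = nn.Conv1d(d_char, d_char, kernel_size=kernel, padding=1)
        d = d_word + d_char
        # two highway layers: y = g * relu(W x) + (1 - g) * x
        self.transform = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])
        self.gate = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])

    def forward(self, char_ids, word_emb):
        # char_ids: (N_tokens, W_t) character ids per token
        # word_emb: (N_tokens, d_word) static (GloVe) embeddings
        c = self.char_emb(char_ids).transpose(1, 2)      # (N, d_char, W_t)
        c = torch.relu(self.conv(c)).max(dim=2).values   # (N, d_char) after max-pool
        x = torch.cat([word_emb, c], dim=-1)             # (N, d_word + d_char)
        for t, g in zip(self.transform, self.gate):
            gate = torch.sigmoid(g(x))
            x = gate * torch.relu(t(x)) + (1 - gate) * x
        return x
```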
Lastly, to capture shallow syntactic information, a POS tagger generates a sequence of POS tags for $T_s$, and a simple linear embedding layer produces a sequence of POS tag embeddings, $E_{pos} \in \mathbb{R}^{N \times D_{pos}}$, where $D_{pos}$ is the POS embedding vector dimension.
In effect, the embedding layer encodes four levels of information: character-level, phrase-internal and implicit context (static word embedding), phrase-external and explicit context (contextual embedding), and shallow syntactic information (POS tag).
To perform an initial feature extraction from the raw embeddings and to unify the different embedding vector dimensions, we apply a Bidirectional LSTM (BiLSTM) layer to each embedding sequence, resulting in $\tilde{E}_{con} \in \mathbb{R}^{M \times D_{embed}}$, $\tilde{E}_{sc} \in \mathbb{R}^{N \times D_{embed}}$, and $\tilde{E}_{pos} \in \mathbb{R}^{N \times D_{embed}}$, where $D_{embed}/2$ is the hidden dimension of the BiLSTM layers.
Attention Phase.
The attention phase mainly consists of two attention flow layers. In its native application (i.e., reading comprehension), the attention flow layer linked and fused information from the context word sequence and the query word sequence (Seo et al., 2017), producing query-aware vector representations of the context words while propagating the word embeddings from the previous layer. Analogously, for our task, the attention flow layer fuses information from two embedding sequences encoding different kinds of information. More specifically, given two sequences $S_a \in \mathbb{R}^{L \times D}$ and $S_b \in \mathbb{R}^{K \times D}$ of lengths $L$ and $K$, the attention flow layer computes a similarity matrix $H \in \mathbb{R}^{L \times K}$ using $H_{ij} = W_0^{\top} [S_a^i; S_b^j; S_a^i \circ S_b^j]$, where $H_{ij}$ scores the compatibility between the $i$-th token in $S_a$ and the $j$-th token in $S_b$, $W_0 \in \mathbb{R}^{3D}$ is a trainable weight vector, $S_a^i$ and $S_b^j$ are the embeddings of the $i$-th token in $S_a$ and the $j$-th token in $S_b$, $[;]$ is vector concatenation, and $\circ$ is the Hadamard product. Next, attentions are computed in both directions, $S_a$-to-$S_b$ and $S_b$-to-$S_a$. The $S_a$-to-$S_b$ attended representation is computed as $\tilde{S}_b^i = \sum_j a_{ij} S_b^j$, where $a_i = \mathrm{softmax}(H_{i,:}) \in \mathbb{R}^K$. The $S_b$-to-$S_a$ attended representation is computed as $\tilde{S}_a = \sum_i b_i S_a^i$, where $b_i = \mathrm{softmax}_i\big(\max_j H_{ij}\big)$, $b \in \mathbb{R}^L$, and the resulting vector is tiled $L$ times across the sequence. Finally, the attention flow layer outputs a combined vector $U \in \mathbb{R}^{8D \times L}$, whose $i$-th column concatenates $S_a^i$, $\tilde{S}_b^i$, $S_a^i \circ \tilde{S}_b^i$, and $S_a^i \circ \tilde{S}_a$.
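Below is a hedged PyTorch sketch of one such attention flow layer, following the bidirectional attention of Seo et al. (2017) as described above; variable names and the exact packing of the output are illustrative and may differ from the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFlow(nn.Module):
    """Sketch of an attention flow layer (Seo et al., 2017) fusing two
    embedding sequences S_a (L x D) and S_b (K x D)."""

    def __init__(self, d):
        super().__init__()
        # W_0 scores the concatenation [s_a; s_b; s_a * s_b]
        self.w0 = nn.Linear(3 * d, 1, bias=False)

    def forward(self, s_a, s_b):
        L, D = s_a.shape
        K, _ = s_b.shape
        # similarity matrix H (L x K)
        a_exp = s_a.unsqueeze(1).expand(L, K, D)
        b_exp = s_b.unsqueeze(0).expand(L, K, D)
        H = self.w0(torch.cat([a_exp, b_exp, a_exp * b_exp], dim=-1)).squeeze(-1)
        # S_a-to-S_b attention: each token of S_a attends over S_b
        a2b = F.softmax(H, dim=1) @ s_b                    # (L, D)
        # S_b-to-S_a attention: row-wise max scores, then tile over the sequence
        b = F.softmax(H.max(dim=1).values, dim=0)          # (L,)
        b2a = (b.unsqueeze(0) @ s_a).expand(L, D)          # (L, D), tiled
        # combined per-token output (illustrative packing of the four pieces)
        return torch.cat([s_a, a2b, s_a * a2b, s_a * b2a], dim=-1)
```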
The two attention flow layers serve different purposes. The first one fuses the static word embeddings and the POS tag embeddings resulting in token representations that encode information from a given word’s POS and that of its neighbors. The POS information is useful because different idioms often follow common syntactic structures (e.g., verb-noun idioms), which can be used to recognize idioms unseen in the training data based on their similarity in syntactic structures (and thus aid generalizability). In all, the first attention flow layer yields enriched static embeddings that more effectively capture the literal representation of the input sequence. The second attention flow layer combines the contextualized and literal embeddings so that the resulting representation encodes the SC between the literal and contextualized representations of the PIEs. This is informed by prior findings that the SC between the static and the contextualized representation of a phrase is a good indicator of its idiomatic usage (Liu and Hwa, 2019). In addition, this attention flow layer permits working with contextualized and static embedding sequences of differing lengths using model-appropriate tokenizers for the pre-trained language model and the word embedding layer without having to explicitly map the tokens from the different tokenizers.
Prediction Phase.
The prediction phase consists of a single BiLSTM layer and a linear layer. The BiLSTM layer further processes and encodes the rich representations from the attention phase. The linear layer that follows uses a log softmax function to predict the probability of each token over the five target classes: idiomatic, literal, start, end, and padding. This architecture is inspired by the RNN-HG model of Mao et al. (2019), with the difference that our BiLSTM has only one layer. During training, the token-level negative log-likelihood loss is computed and backpropagated to update the model parameters.
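A minimal PyTorch sketch of this prediction head follows; the class inventory and BiLSTM hidden size are taken from the paper, while the remaining details (e.g., batching and padding handling) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictionHead(nn.Module):
    """Sketch of the prediction phase: a single BiLSTM over the fused
    representations, then a linear layer with log-softmax over the five
    target classes. Sizes follow the paper; everything else is illustrative."""

    CLASSES = ["idiomatic", "literal", "start", "end", "padding"]

    def __init__(self, d_in, d_hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(d_in, d_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, len(self.CLASSES))

    def forward(self, fused):                  # fused: (batch, L, d_in)
        h, _ = self.encoder(fused)             # (batch, L, 2 * d_hidden)
        return F.log_softmax(self.classifier(h), dim=-1)

# Training step sketch (token-level negative log-likelihood):
#   log_probs: (batch, L, 5), labels: (batch, L) with class indices
#   loss = F.nll_loss(log_probs.transpose(1, 2), labels,
#                     ignore_index=PredictionHead.CLASSES.index("padding"))
```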
Implementation Details.
In our implementation, the tokenizer for the language model uses the WordPiece algorithm (Schuster and Nakajima, 2012) prominently used in BERT (Devlin et al., 2019), whereas tokenization for the static word embedding layer uses Python’s Natural Language Toolkit (NLTK) (Loper and Bird, 2002).
The pre-trained language model is the uncased base BERT from Huggingface’s Transformers package (Wolf et al., 2020) with an embedding dimension of Dcon = 768. The pre-trained word embedding layer is the cased Common Crawl version of GloVe, which has a vocabulary of 2.2M words and embedding vectors of dimension Ds = 300 (Pennington et al., 2014). Both the BERT and GloVe models are frozen during training. We use NLTK’s POS tagger for the POS tags.
For the character embedding layer, the input embedding dimension is 64 and the number of CNN output channels is Dchar = 64. The highway network has two layers. The POS tag embedding is of dimension Dpos = 64. All the BiLSTM layers have a hidden dimension of 256, and thus Dembed = 512.
4 Experiments
Datasets.
We use the following three of the largest available datasets of idiomatic expressions to evaluate the proposed model alongside other baselines. MAGPIE (Haagsma et al., 2020): MAGPIE is a recent corpus of PIEs in English and the largest to date. It consists of 1,756 PIEs across different syntactic patterns along with the sentences in which they occur (56,622 annotated data instances with an average of 32.24 instances per PIE), where the sentences are drawn from a diverse set of genres, such as news and science, collected from resources such as the British National Corpus (BNC) (BNC Consortium, 2007). For our experiments, we only considered the complete sentences of up to 50 words in length that contain the unambiguously labelled PIEs (as indicated by the perfect confidence score).
| Dataset | Split | Train size (pct. idiomatic) | Test size (pct. idiomatic) | # idioms (Train) | # idioms (Test) | Avg. idiom occ. (Train) | Avg. idiom occ. (Test) | Std. idiom occ. (Train) | Std. idiom occ. (Test) |
|---|---|---|---|---|---|---|---|---|---|
| MAGPIE | Random | 32,162 (76.63%) | 4,030 (76.48%) | 1,675 | 1,072 | 19.2 | 3.76 | 24.82 | 3.65 |
| MAGPIE | Type-aware | 32,155 (77.90%) | 4,050 (70.54%) | 1,411 | 168 | 22.79 | 24.11 | 29.96 | 32.05 |
| SemEval5B | Random | 1,420 (50.56%) | 357 (50.70%) | 10 | 10 | 142 | 35.7 | 51.25 | 12.69 |
| SemEval5B | Type-aware | 1,111 (58.74%) | 341 (58.65%) | 31 | 9 | 35.81 | 37.89 | 28.84 | 30.12 |
| VNC | Random | 2,285 (79.52%) | 254 (70.47%) | 53 | 50 | 43.11 | 5.08 | 25.89 | 2.93 |
| VNC | Type-aware | 2,191 (79.69%) | 348 (71.84%) | 47 | 6 | 46.62 | 58 | 27.99 | 27.77 |
SemEval5B (Korkontzelos et al., 2013): This set has 60 PIEs unrestricted by syntactic pattern appearing in 4,350 sentences from the ukWaC corpus (Baroni et al., 2009). As in MAGPIE, we only consider the sentences with the annotated phrases.
VNC (Cook et al., 2008): The Verb Noun Combinations (VNC) dataset is a popular benchmark that contains 53 expert-curated PIE types, all verb-noun combinations, and around 2,500 sentences containing them in either a figurative or literal sense—all extracted from the BNC. Because VNC does not mark the location of the idiom, we manually labeled the idiom positions.
Together, the datasets account for a wide variety of PIEs, making this the largest available study on a wide variety of PIE categories.
Baseline Models.
We use the following six baseline models for our experiments. We note that because our method is similar to idiom type classification only in its end goal and not in setting, we exclude SOTA models for idiom classification from this comparison, but include the more recent MWE extraction methods.
Gazetteer is a naïve baseline that looks up a PIE in a lexicon. In our experiments, to make the Gazetteer method independent of the algorithm and lexicon, we report the theoretical performance upper bound for any Gazetteer-based algorithm as follows. We assume that the Gazetteer perfectly detects the idiom boundaries in sentences and, in turn, predicts all PIEs to be idiomatic. We point out that since the idiomatic class is the most frequent class in all of our benchmark datasets, this also serves as the majority-class baseline for sentence-level, binary idiomatic/literal classification; that is, it predicts every sentence in a dataset to be idiomatic for binary idiom detection.
BERT-LSTM (the “BERT” baseline in Table 3) has a simple architecture that combines the pre-trained BERT and a linear layer to perform a binary classification at each token and was used in Kurfalı and Östling (2020) for disambiguating PIEs.
Seq2Seq has an encoder-decoder structure and is commonly used in sequence tagging tasks (Filippova et al., 2015; Malmi et al., 2019; Dong et al., 2019). It first uses the pre-trained BERT to generate contextualized embeddings and then sends them to a BiLSTM encoder-decoder model to tag each token as literal/idiomatic. Although not commonly used in idiom processing tasks, the encoder-decoder framework serves as a simple yet effective baseline for our tagging based idiom identification.
BERT-BiLSTM-CRF (Huang et al., 2015) is an established model for sequence tagging (and the state-of-the-art for named entity recognition in different languages [Huang et al., 2015; Hu and Verberne, 2020]), which uses a BiLSTM to encode the sequence information and then performs sequence tagging with a conditional random field (CRF).
RNN-MHCA (Mao et al., 2019) is a recent state-of-the-art model for metaphor detection on the benchmark VUA dataset that uses GloVe and ELMo embeddings with multi-head contextual attention.

IlliniMET (Gong et al., 2020) is a metaphor detection model that combines RoBERTa representations with additional linguistic features for token-level classification.
Experimental Setup.
For a fair comparison across the models, we use a pre-trained BERT model in place of the linear embedding layers, ELMo, and RoBERTa models, respectively, in the last three baselines. The pre-trained BERT model is also frozen for all the baseline models and DISC. Owing to the lack of a fine-tuning strategy that fits all baselines, we leave exploring improved performance via end-to-end BERT fine-tuning to future work.
In order to test the models’ ability to identify unseen idioms, each dataset was split into train and test sets in two ways: random and type-aware. In the random split, the sentences are randomly divided and the same PIE can appear in both sets, whereas in the type-aware split, the idiom types in the test set and the train set do not overlap. For MAGPIE and SemEval5B, we use their respective random/type-aware train/test splits. For VNC, to create the type-aware split, we randomly split the idiom types by a 90/10 ratio, leaving 47 idiom types in the train set and 6 in the test set (a minimal sketch of this procedure is shown below). For every dataset split, we trained every model for 600 epochs with a batch size of 64 and an initial learning rate of 1e-4, using the Adam optimizer.
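A minimal sketch of the type-aware split procedure; the data format and function name are illustrative assumptions, not the authors’ code.

```python
import random

def type_aware_split(instances, train_ratio=0.9, seed=0):
    """Split by idiom type so that train and test idiom types do not overlap.

    `instances` is a list of (sentence, idiom_type, label) tuples
    (an assumed, illustrative data format)."""
    types = sorted({idiom for _, idiom, _ in instances})
    random.Random(seed).shuffle(types)
    cut = int(len(types) * train_ratio)
    train_types = set(types[:cut])
    train = [x for x in instances if x[1] in train_types]
    test = [x for x in instances if x[1] not in train_types]
    return train, test
```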
The checkpoints with the best test set performance during training are reported in the result tables. For models with BiLSTMs, we used the same specifications as in our model (a hidden dimension of 256 and a single layer), except for BERT-BiLSTM-CRF, where we used a stacked two-layer BiLSTM. For the linear layers, we set a dropout rate of 0.2 during training. For Seq2Seq, we used a teacher forcing ratio of 0.7 during training and brute-force search during inference. The same pre-trained BERT model from Huggingface’s Transformers package was used as a frozen embedding layer in all models. All other hyperparameters were kept at their default values.
All training and testing were done on a single machine with an Intel Core i9-9900K processor and a single NVIDIA GeForce RTX 2080 Ti graphics card.
Evaluation Metrics.
We use two metrics to evaluate the performance of the models. (1) Classification F1 score (F1) measures the binary idiom detection performance at the sequence level, with the presence of an idiom being the positive class. (2) Sequence accuracy (SA) computes the idiom identification performance at the sentence level, where a sequence is considered correctly classified if and only if all its tokens are tagged correctly. We point out that performance in terms of F1 is essentially analogous to performance on the idiom token classification task (see Section 2), the primary difference being whether the idiom is specified or not. Because SA is stricter than F1, we regard it as the more relevant metric for idiom detection and span localization; we therefore consider SA the primary evaluation metric, with F1 providing an additional performance reference.
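For concreteness, the two metrics can be computed as in the following sketch (assuming per-sentence lists of gold and predicted token tags; this is an illustration, not the evaluation script used in the paper).

```python
from sklearn.metrics import f1_score

def sequence_accuracy(pred_tags, gold_tags):
    """SA: a sequence counts as correct only if every token label matches.
    pred_tags/gold_tags: lists of per-sentence tag lists."""
    correct = sum(p == g for p, g in zip(pred_tags, gold_tags))
    return correct / len(gold_tags)

def detection_f1(pred_tags, gold_tags):
    """Classification F1 at the sentence level: a sentence is positive if
    any of its tokens is tagged idiomatic."""
    pred = [int(any(t == "idiomatic" for t in seq)) for seq in pred_tags]
    gold = [int(any(t == "idiomatic" for t in seq)) for seq in gold_tags]
    return f1_score(gold, pred)
```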
5 Results and Analyses
IE Identification Performance.
A comparative evaluation of the models on the MAGPIE, SemEval5B, and VNC datasets is shown in Table 3.
Table 3: IE identification performance (classification F1 and sequence accuracy, SA) on the MAGPIE, SemEval5B, and VNC datasets under the random and type-aware splits.

| Data Split | Model | MAGPIE F1 | MAGPIE SA | SemEval5B F1 | SemEval5B SA | VNC F1 | VNC SA |
|---|---|---|---|---|---|---|---|
| Random | Gazetteer | 86.67 | 76.47 | 67.29 | 50.70 | 82.68 | 70.47 |
| Random | BERT | 87.16 | 37.10 | 92.51 | 76.47 | 93.09 | 50.00 |
| Random | Seq2Seq | 92.70 | 83.21 | 94.41 | *94.12 | 95.21 | 86.61 |
| Random | BERT-BiLSTM-CRF | 94.22 | *87.71 | 93.29 | 92.44 | 95.45 | 85.03 |
| Random | RNN-MHCA | 95.51 | *86.82 | *94.94 | 93.56 | *96.15 | 91.33 |
| Random | IlliniMET | 86.54 | 37.97 | 92.59 | 78.15 | 93.55 | 59.45 |
| Random | DISC | 95.02 | *87.47 | *95.80 | *95.23 | *96.97 | 93.31 |
| Type-aware | Gazetteer | 82.73 | 0.00 | 73.94 | 0.00 | 83.61 | 0.00 |
| Type-aware | BERT | 86.27 | 39.70 | 73.37 | 35.19 | 86.85 | 50.86 |
| Type-aware | Seq2Seq | 83.81 | 63.42 | 50.35 | 44.28 | 88.80 | 73.56 |
| Type-aware | BERT-BiLSTM-CRF | 80.47 | 61.78 | 57.82 | 44.57 | 83.30 | 65.52 |
| Type-aware | RNN-MHCA | 86.34 | 61.42 | 56.25 | 42.23 | *88.74 | 79.02 |
| Type-aware | IlliniMET | 83.58 | 39.68 | 69.49 | 41.94 | 87.97 | 54.60 |
| Type-aware | DISC | 87.78 | 70.47 | 58.82 | 55.71 | *89.02 | 80.46 |
Overall, DISC is the best performing model, outperforming all the baselines. Specifically, DISC and RNN-MHCA show competitive results in all random-split settings; however, DISC has stronger performance in the type-aware settings, indicating that the SC check enables DISC to recognize the non-compositionality of idioms, permitting it to generalize better to idioms unseen in the training set. Therefore, while RNN-MHCA might be as good as DISC when it comes to identifying (and potentially memorizing) known idioms, DISC is more capable of identifying unseen idioms since it better leverages the SC property of idioms in addition to memorization.
In the random setting, DISC performs on par with RNN-MHCA and BERT-BiLSTM-CRF in terms of F1 and SA for MAGPIE while outperforming all baselines on the other datasets. It is notable that even with the ability to perfectly localize PIEs, Gazetteer has a low SA compared to the other top-performing models due to its inability to use the context to determine whether the PIE is used idiomatically. In the type-aware setting, the F1 of DISC is comparable to that of RNN-MHCA and BERT-BiLSTM-CRF. However, in terms of SA, DISC outperforms all models across all datasets. We also observe that for all datasets, achieving high F1 scores is much easier than achieving high SA. This is especially salient in the MAGPIE type-aware split, where all the models achieve similar F1s, whereas DISC outperforms the others in terms of SA by margins ranging from 7% to 30.8% absolute points. Moreover, Gazetteer is unable to perform PIE localization at all in this setting on account of being limited to the instances available in an idiom lexicon.
For the MAGPIE random split, it is notable that all the models (including Gazetteer with its majority-class prediction) achieve at least 86% F1. For the MAGPIE type-aware split, DISC is decisively the best performing model with absolute gains of at least 7.1% in SA and at least 1.4% in F1. For the SemEval5B type-aware split, DISC is the best performing model in terms of SA with gains of at least 11.1%. Note that in terms of F1, although BERT outperforms DISC by 14.6%, Gazetteer outperforms all methods. We believe this is due to a combination of factors: the small number of training instances (only 1,111), the small number of idioms (only 31 unique idioms in the train set), and the distributional dissimilarity between the train and test sets with respect to the semantic and syntactic properties of the PIEs in SemEval5B. For example, unlike VNC, where both the train and test idioms are verb-noun constructions, SemEval5B idioms have more diverse syntactic structures, yet SemEval5B has fewer training instances and fewer idioms overall. However, DISC outperforms BERT by 20.5% in SA, which shows that DISC has the best idiom identification ability. For VNC, DISC and RNN-MHCA perform competitively on all evaluation metrics in both the random and type-aware settings; in terms of SA, DISC has a >1% gain over RNN-MHCA in both.
Cross-domain Idiom Identification Performance across Datasets.
To check the cross-domain performance of the best performing models (DISC and RNN-MHCA), we train them on the MAGPIE train set (as it contains the largest number of instances) and test on the VNC and the SemEval5B test sets in a random-split setting. As shown in Table 4, both models show a performance drop due to the change of the sentence source between SemEval5B and MAGPIE, and the small overlap between VNC and MAGPIE (only 4 common idioms). RNN-MHCA obtains F1 scores that are 3.6% and 1.44% higher than that of DISC on SemEval5B and VNC respectively, indicating its better ability to detect PIEs in this cross-domain setting. However, DISC is able to detect and locate the idioms more precisely, yielding SA gains of 7.8% and 2.81% over those of RNN-MHCA on SemEval5B and VNC, respectively. We argue that this demonstrates DISC’s ability to identify idioms with a higher precision and that DISC’s gain in SA outweighs its loss in F1, given that the gain is generally higher than the loss and SA is a more reliable measure of identification performance.
Table 4: Cross-domain performance of the models trained on the MAGPIE train set and tested on the SemEval5B and VNC test sets (random split).

| Tgt. Domain | Model | F1 | SA |
|---|---|---|---|
| SemEval5B | RNN-MHCA | 81.35 | 54.72 |
| SemEval5B | DISC | 77.70 | 61.80 |
| VNC | RNN-MHCA | 85.01 | 69.74 |
| VNC | DISC | 83.57 | 72.55 |
We now evaluate the performance with specific attention to one property of PIEs that makes them challenging for NLP applications—syntactic flexibility (fixedness) (Constant et al., 2017).
Effect of Idiom Fixedness.
We analyze the idiom identification performance with respect to idiom fixedness levels. Following the definitions given by Sag et al. (2002) for lexicalized phrases, we categorized idioms into three fixedness levels: (1) fixed (e.g., with respect to)—fully lexicalized with no morphosyntactic variation or internal modification, (2) semi-fixed (e.g., keep up with)—permitting limited lexical variation (kept up with), such as inflection and determiner selection, while adhering to strict constraints on word order and composition, and (3) syntactically flexible (e.g., serve someone right)—largely retaining basic word order while permitting a wide range of syntactic variability, such that the internal words of the idioms are subject to change. The authors (both near-native English speakers, one with a linguistics background) manually labeled the PIEs in the MAGPIE test set into these three levels by first independently labeling 35 per level. Seeing that the agreement was 91%, the remaining idioms were labeled by one researcher. We note that the highest level is occupied by verbal MWEs (VMWEs), which are characterized by complex structures, discontinuities, variability, and ambiguity (Savary et al., 2017).
We use this labeled set to compute the DISC performance for each fixedness level in terms of classification F1 and SA. As shown in Table 5, although the fixed idioms obtain the best performance as expected, the performance difference between semi-fixed and syntactically flexible idioms suggests that DISC can reliably detect idioms from different fixedness levels.
Error Analysis.
Next, we analyze DISC’s performance and its errors on the MAGPIE dataset to gain further insights into DISC’s idiom identification abilities and its shortcomings.
A closer inspection of the results showed that 65.9% of the 1,071 idiom types from the MAGPIE random split test set have perfect average SA (i.e., 100%), indicating that DISC successfully learned to recognize the SC of the vast majority of the idiom types from the training set.
In order to gain insight into DISC’s reliance on memorizing instances of known PIEs to identify them, we analyze the relationship between the average SA and the number of training samples on a per-PIE basis in the MAGPIE random split using Pearson correlation. A correlation of 0.1857 with p < 0.05 indicates a weak relationship between the number of training instances and the performance. This, taken together with the strong type-aware performance, suggests that DISC’s identification ability relies on more than just memorizing known PIE instances.
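A sketch of this correlation analysis follows; the data structures and names are illustrative assumptions, not the analysis script used for the paper.

```python
from collections import Counter, defaultdict
from scipy.stats import pearsonr

def per_idiom_correlation(train_idioms, test_results):
    """Correlate each PIE's training frequency with its average test SA.

    train_idioms: list of PIE types, one entry per training instance
    test_results: list of (pie_type, is_sequence_correct) pairs
    """
    train_counts = Counter(train_idioms)
    per_type = defaultdict(list)
    for pie, correct in test_results:
        per_type[pie].append(float(correct))
    types = [t for t in per_type if t in train_counts]
    x = [train_counts[t] for t in types]                     # training frequency
    y = [sum(per_type[t]) / len(per_type[t]) for t in types] # average SA per PIE
    r, p = pearsonr(x, y)
    return r, p
```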
Visualizing the attention matrices (the matrix H described in Section 3) for a sample of instances showed that the model attends either to only the correct idiom (in relatively shorter sentences) or to many phrases in a sentence (in longer sentences or those with literal phrases). In the sentence but they’d had a thorough look through his life and just to be sure and hit the jackpot entirely by chance, the underlined phrases are those with high attention, and hit the jackpot was correctly selected as the output.² In some instances of incorrect prediction, that is, incompletely identified IE tokens or wrongly predicting the sentence to be literal, we found that the model still attended to the correct phrase. We hypothesize that the two attention flow layers have a hierarchical relation in their functions: the first attention flow layer, using static word embeddings and their POS tags, identifies candidate phrases that could have idiomatic meanings, and the second attention flow layer, by checking for SC, identifies the idiomatic expression’s span if it exists. Hence, accurate span prediction requires the model to (1) attend to the right tokens, (2) generate/extract meaningful token representations (from the attention phase), and then (3) correctly classify the tokens. Given that the model attends to the phrases correctly, future studies should improve upon the prediction phase using models that more efficiently leverage the features for improved token classification.
Moreover, we present case studies of the wrongly predicted instances from the MAGPIE type-aware split. Toward this, we randomly sample 25% of the instances incorrectly predicted by DISC (300 instances) and group them into six case types: (1) alternative, (2) partial, (3) meaningful, (4) literal, (5) missing, and (6) other. Examples are shown in Table 6 and discussed below.
Table 6: Error case types with example sentences (containing the ground-truth PIE) and DISC’s predictions, from the MAGPIE type-aware split.

| Case # | Error Type (Pct.) | Sentence with PIE | Prediction |
|---|---|---|---|
| 1 | Alternative (9.7%) | But an on-the-ball whisky shop could make a killing with its special ec-label malt scotch at £27.70 a bottle. | make a killing |
| 2 | Partial (29.3%) | Dragons can lie for dark centuries brooding over their treasures, bedding down on frozen flames that will never see the light of day. | of day |
| 3 | Meaningful (4.3%) | Given a method, we can avoid mistaken ideas which, confirmed by the authority of the past, have taken deep root, like weeds in men’s minds. | weeds in men’s minds |
| 4 | Literal (8.0%) | If you must jump out of the loop, you should use until true to “pop” the stack. | out of the loop |
| 5 | Missing (42.3%) | We have friends in high places, they said. | Empty string |
| 6 | Other (6.3%) | With the chips down, we had to dig down. | With down |
Case 1 is the “alternative” case, a common mis-identification where DISC identifies only one of the IEs when multiple IEs are present; the model detects an alternative IE to the one originally labeled as the ground truth. Strictly speaking, this is not a limitation of our method but rather an artifact of the available datasets: all the datasets used in our experiments label at most one PIE per sentence even when there may be more than one. Case 2 is the “partial” case, another common error where only a portion of the idiom span is recognized, that is, the boundary of the entire idiom is not precisely localized. Case 3 is the “meaningful” case, in which DISC identifies a figurative expression other than the ground truth idiom (and in this sense relates to Case 1). As an example, when the ground truth is taken deep root, DISC identifies weeds in men’s minds, which is clearly used metaphorically and so could have been an acceptable answer. Since the idioms are unknown to DISC at test time, we argue that, as in Case 1, the identification is still meaningful, although the detected phrases are not exactly the same as the ground truth. Case 4 is the “literal” case, in which DISC identifies a PIE that is actually used in the literal sense. Case 5, the “missing” case, is the opposite of Case 4: DISC fails to recognize the presence of an idiom entirely and returns an empty string. Case 6 is the final error type, “other,” in which DISC returns words or phrases that are neither meaningful nor figurative, nor part of any PIE.
After categorizing the 300 incorrect instances according to the above definitions, we found that 42.3% of them are of the “missing” case and around 43.4% are samples with partially correct predictions or meaningful alternative predictions. Their detailed breakdown is listed in Table 6. Tackling these erroneous cases will be a fruitful future endeavor.
6 Conclusion and Future Work
In this work, we studied how a neural architecture that fuses multiple levels of syntactic and semantic information of words can effectively perform idiomatic expression identification. Compared to competitive baselines, the proposed model yielded state-of-the-art performance on PIEs that varied with respect to syntactic patterns, degree of compositionality and syntactic flexibility. A salient feature of the model is its ability to generalize to PIEs unseen in the training data.
Although the exploration in this work is limited to IEs, we made no idiom-specific assumptions in the model. Future directions should extend the study to nested and syntactically flexible PIEs (verbal MWEs) and other figurative/literal constructions such as metaphors—categories that were not sufficiently represented in the datasets considered in this study. Other concrete research directions include performing the task in cross- and multi-lingual settings.
Acknowledgments
We thank the anonymous reviewers for their comments on earlier drafts that significantly helped improve this manuscript. This work was supported by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR)— a research collaboration as part of the IBM AI Horizons Network.
Notes
1. The implementation of DISC is available at https://github.com/zzeng13/DISC.
2. Owing to space constraints we were unable to present detailed illustrations of attention matrices to make our point.