It’s not Rocket Science: Interpreting Figurative Language in Narratives

Figurative language is ubiquitous in English. Yet, the vast majority of NLP research focuses on literal language. Existing text representations by design rely on compositionality, while figurative language is often non-compositional. In this paper, we study the interpretation of two types of non-compositional figurative language: idioms and similes. We collected datasets of fictional narratives containing a figurative expression along with crowd-sourced plausible and implausible continuations relying on the correct interpretation of the expression. We then trained models to choose or generate the plausible continuation. Our experiments show that models based solely on pre-trained language models perform substantially worse than humans on these tasks. We additionally propose knowledge-enhanced models, adopting human strategies for interpreting figurative language: inferring meaning from the context and relying on the constituent words' literal meanings. The knowledge-enhanced models improve performance on both the discriminative and generative tasks, further narrowing the gap with human performance.


Introduction
Figurative language is a medium for making language expressive, communicating abstract ideas otherwise difficult to visualize, and provoking emotions (Roberts and Kreuz, 1994; Fussell and Moss, 1998). Despite the ubiquity of figurative language across various forms of speech and writing, the vast majority of NLP research focuses primarily on literal language. Figurative language is often more challenging due to its implicit nature and is seen as ''a bottleneck in automatic text understanding'' (Shutova, 2011).

* Work done at the Allen Institute for AI.
In recent years, transformer-based language models (LMs) have achieved substantial performance gains across various NLP tasks; however, they still struggle with figurative language. In particular, one of the challenges is that figurative expressions are often non-compositional, that is, the phrase meaning deviates from the literal meanings of its constituents. For instance, the idiom ''chicken feed'' in Figure 1 denotes ''a ridiculously small sum of money'' rather than ''food for poultry''. By design, transformer-based LMs compute a word representation as a function of the representation of its context. LM-based phrase representations encode the meanings of the constituent words but hardly capture any meaning that is introduced by the composition itself (Yu and Ettinger, 2020). Even though LMs may recognize when a word is used non-literally, and potentially attend to it less, they still struggle to represent the implied, non-literal meaning of such phrases (Shwartz and Dagan, 2019).
While LMs potentially memorize familiar idioms, we can expect them to further struggle with similes, which are often created ad hoc (Carston and Wearing, 2011). For example, in Figure 1, the person is compared to ''a high mountain lake without a wind stirring it'' to imply calmness. Many such figurative expressions compose in a non-trivial way, and introduce implicit meaning that requires multiple reasoning steps to interpret.
In this paper we work on interpreting idioms and similes in narratives, where they are especially abundant. Existing work on narrative understanding focuses on literal stories, testing models on their ability to answer questions about a narrative (Kočiský et al., 2018) or continue an incomplete narrative (Story Cloze Test; Mostafazadeh et al., 2016). We follow the latter setup. We extracted short narratives from the Toronto Book corpus (Zhu et al., 2015), each containing a figurative expression, and crowdsourced plausible and implausible continuations that rely on correct interpretation of the figurative expression. We defined two tasks: a discriminative setting, where the goal is to choose the plausible continuation among two candidates, and a generative setting, where the goal is to generate a plausible continuation that is coherent with the narrative and complies with the meaning of the figurative expression.

Figure 1: Example narratives from our datasets, containing an idiom (top) or a simile (bottom), along with human-written plausible and implausible continuations.
We report the performance of an extensive number of state-of-the-art LMs on both tasks, in zero-shot, few-shot, and supervised settings. Our results show that pre-trained LMs including GPT-3 (Brown et al., 2020) perform poorly in the zero-shot and few-shot settings. While the supervised model's performance is closer to humans, the gap is still substantial: In the discriminative tasks, the gap from human performance was 10 and 14.6 points in accuracy for idioms and similes, respectively. In the generative tasks, there was a striking 24 and 28 points difference in human evaluation of the plausibility of generated continuations.
To further close this gap, we developed knowledge-enhanced models inspired by two human strategies for interpreting unknown idioms, as studied by Cooper (1999) and discussed in Shwartz and Dagan (2019). The first strategy is to infer the expression's meaning from its context, for which we incorporate event-centered inferences from ParaCOMET (Gabriel et al., 2021b). The second relies on the literal meanings of the constituent words, using concept-centered knowledge from COMET-ConceptNet (Hwang et al., 2021). Additionally, since humans often interpret similes through a literal property of the vehicle (the object of comparison), we use concept-centered knowledge for similes as well. The knowledge-enhanced models consistently outperformed the other models on both datasets and settings, with a substantial gap on the generative tasks.
Furthermore, different strategies were favored for each type: the generative context model performed well on idioms, in line with Cooper's findings, while the literal model was favored for similes, which are by design based on a constituent's literal attribute (e.g., calm lake). The knowledge-enhanced models still leave room for improvement on our dataset. We hope that future work will use additional techniques inspired by the properties of figurative language and human processing of it. Our code and data are available at https://github.com/tuhinjubcse/FigurativeNarrativeBenchmark and our leaderboard is available at https://leaderboard.allenai.org/idiom-simile/.

Idioms
Idioms are figurative expressions with a non-literal meaning. For instance, ''break a leg'' is a good luck greeting before a performance and shouldn't be taken literally as wishing someone to injure themselves. Idioms are typically non-compositional (i.e., the meaning of an idiom is not derived from the meanings of its constituents) and fixed (i.e., allowing little variance in syntax and lexical choice). Idiomatic expressions include proverbs (''actions speak louder than words''), clichés (''what goes around comes around''), euphemisms (''rest in peace''), and more.
Prior work on idioms largely focused on identifying the idiomaticity of a multi-word expression. This is a classification task, defined either at the token level (is the phrase idiomatic within a given context?) or the type level (may the phrase be idiomatic in some context?) (Fazly et al., 2009; Li and Sporleder, 2009; Verma and Vuppuluri, 2015; Peng and Feldman, 2016; Salton et al., 2016; Liu and Hwa, 2017). Compared to identification, the interpretation of idioms has been less explored. Approaches for representing idiomatic expressions include substituting idioms with literal paraphrases (Liu and Hwa, 2016; Zhou et al., 2021), representing them as a single token, or learning to compose them at the character level rather than the word level.
With the rising popularity of pre-trained LMs, several recent papers studied their capacity to accurately represent idioms. Shwartz and Dagan (2019) found that while LMs excelled at detecting non-literal word usage (e.g., ''flea'' in ''flea market''), the representation of idiomatic expressions was of lower quality than that of literal ones. Yu and Ettinger (2020) showed that LMs encode the words that appear in a given text, but capture little information regarding phrase meaning. Finally, Garcia et al. (2021) studied the compositionality of noun compounds in English and Portuguese, and found that LM-based models did not perform well on detecting compositionality, and represented idiomaticity differently from humans.

Similes
A simile is a figure of speech that compares two things, usually with the intent to make the description more emphatic or vivid and to spark the reader's imagination (Paul et al., 1970). Similes may either be explicit, namely, specifying the topic, vehicle, and similarity property, as in ''The house was cold like Antarctica'' (where the topic is ''house'', the vehicle is ''Antarctica'', and the property of comparison is ''cold''), or implicit, namely, omitting the property, as in ''the house was like Antarctica'' (Section 3.2). Most work in NLP has focused on simile detection, that is, distinguishing literal from figurative comparisons. Earlier work relied on semantic and syntactic characteristics, namely, higher semantic similarity between the topic and the vehicle in literal comparisons than in figurative ones (Niculae and Danescu-Niculescu-Mizil, 2014; Qadir et al., 2015; Mpouli, 2017), and dictionary definitions (Qadir et al., 2016), while more recent work is based on neural methods (Liu et al., 2018; Zeng et al., 2020). Simile interpretation has focused on inferring the implicit property (Qadir et al., 2016). In other lines of work, Chakrabarty et al. (2020b) and Zhang et al. (2021) proposed methods for generating similes from their literal counterparts, while Chakrabarty et al. (2021a) showed that state-of-the-art NLI models fail on pragmatic inferences involving similes.

Human Processing of Figurative Language
The ways in which humans process figurative language may inspire computational work on figurative language interpretation. Cooper (1999) studied how L2 English speakers interpret unfamiliar English idioms. He found that the leading strategy was to infer the meaning from the given context, which led to successful interpretation 57% of the time, followed by relying on the literal meaning of the constituent words (22% success rate). For example, a participant asked to interpret ''robbing the cradle'' in the context ''Robert knew that he was robbing the cradle by dating a sixteen-year-old girl'' used the literal meaning of ''cradle'' to associate the meaning with babies and, indirectly, with young age, and along with the context inferred that it meant to ''date a very young person''. Asl (2013) repeated the same experiment with stories, and concluded that longer contexts improved people's ability to interpret unknown idioms. Novel similes and metaphors, in turn, are interpreted through shared literal attributes between the topic and the vehicle (e.g., ''Antarctica is cold; can a house also be cold?'') (Wolff and Gentner, 2000; Carston and Wearing, 2011).

Narrative Understanding
Early computational work on narrative understanding extracted chains of subevents and their participants from narratives (Chambers and Jurafsky, 2009). An alternative task is machine reading comprehension, that is, answering multiple-choice questions based on a narrative, such as MCTest (Richardson et al., 2013) and NarrativeQA (Kočiský et al., 2018).
The most commonly used benchmark for narrative understanding today is ROCStories (Mostafazadeh et al., 2016), a collection of 50k five-sentence commonsense stories pertaining to everyday life. The story cloze task requires models to identify the plausible continuation sentence among two candidate continuations in its discriminative form, or to generate a plausible sentence in its generative form. Since the release of this dataset, many computational approaches for the task have been developed (Chaturvedi et al., 2017; Schwartz et al., 2017b; Cai et al., 2017; Srinivasan et al., 2018; Li et al., 2019; Cui et al., 2020; Brown et al., 2020, inter alia). In this paper, we follow the story cloze benchmark setup, and collect benchmarks particularly aimed at testing the comprehension of figurative language in narratives.

Commonsense Knowledge Models
Many language tasks require relying on implicit commonsense knowledge that is never mentioned explicitly because it is assumed to be known by everyone. To that end, commonsense knowledge bases (KBs) record such facts. Notably, ConceptNet (Speer et al., 2017) is a large-scale concept-centric KB, while ATOMIC (Sap et al., 2019) contains event-centric knowledge about causes, effects, and the mental states of the participants. To overcome the sparsity of KBs, knowledge models such as COMET (Bosselut et al., 2019; Hwang et al., 2021) fine-tuned an LM on structured KB triplets. COMET is capable of providing inferences for new events or concepts. ParaCOMET (Gabriel et al., 2021a) is an extension of ATOMIC-COMET that works at the paragraph level and generates discourse-aware commonsense knowledge. Recently, several works have used such commonsense knowledge models for improved natural language understanding or generation (e.g., Bhagavatula et al., 2020). In our work, we use the knowledge models COMET (Hwang et al., 2021) and ParaCOMET (Gabriel et al., 2021a) to provide more information about, respectively, the literal meaning of the constituent words and the narrative context, both of which are useful for inferring the figurative expression's meaning.

Idioms                    | Similes
any port in a storm       | like a psychic whirlpool
been there, done that     | like a moth-eaten curtain
slap on the wrist         | like a first date
no time like the present  | like a train barreling out of control
lay a finger on           | like a sodden landscape of melting snow
walk the plank            | like a Bunsen burner flame
curry favour              | like a moldy old basement
not to be sneezed at      | like a street-bought Rolex
no peace for the wicked   | like an endless string of rosary beads

Table 1: Examples of idioms and similes present in the narratives in our datasets.

Data
We build datasets aimed at testing the understanding of figurative language in narratives, focusing on idioms (Section 3.1) and similes (Section 3.2). We posit that a model that truly understands the meaning of a figurative expression, as humans do, should be able to infer or decide what happens next in the context of a narrative. Thus, we construct our datasets in the form of the story cloze test.

Idioms
We compile a list of idioms, automatically find narratives containing these idioms, and then elicit plausible and implausible continuations from crowd workers, as follows.
Collecting Idioms. We compile a list of 554 English idioms along with their definitions from online idiom lexicons. Table 1 presents a sample of the collected idioms.
Collecting Narratives. We use the Toronto Book corpus (Zhu et al., 2015), a collection of 11,038 indie ebooks extracted from smashwords.com. We extracted sentences from the corpus containing an idiom from our list, and prepended the 4 preceding sentences to create a narrative. We manually discarded paragraphs that did not form a coherent narrative. We extracted 1,455 narratives with an average length of 80 words, spanning 554 distinct idioms.
Collecting Continuations. We collected plausible and implausible continuations for each narrative. We used Amazon Mechanical Turk to recruit 117 workers. We provided these workers with the narrative along with the idiom definition, and instructed them to write plausible and implausible continuations that are pertinent to the context and depend on the correct interpretation of the idiom, but that don't explicitly give away the meaning of the idiom. We collected continuations from 3 to 4 workers for each narrative. The average plausible continuation contained 12 words, while the average implausible continuation contained 11 words.
To ensure the quality of annotations, we required that workers have an acceptance rate of at least 99% for 10,000 prior HITs (Amazon Mechanical Turk tasks), and pass a qualification test. We then manually inspected the annotations to identify workers who performed poorly in the initial batches, disqualified them from further working on the task, and discarded their annotations.
Our automatic approach for collecting narratives does not account for expressions that may be used figuratively in some contexts but literally in others. For example, the idiom ''run a mile'' (i.e., avoiding something in any way possible) may also be used literally to denote running a distance of one mile. To avoid including literal usages, we instructed the workers to flag such examples, which we discard from the dataset. We further manually verified all the collected data. Overall, we removed 12 such narratives.
The final idiom dataset contains 5,101 <narrative, continuation> tuples, exemplified in the top part of Figure 1. We split the examples into train (3,204), validation (355), and test (1,542) sets. To test models' ability to generalize to unseen idioms, we split the data such that there is no overlap in idioms between train and test.
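The expression-disjoint split can be sketched as follows (a minimal illustration; the function and field names are ours, not the paper's):

```python
import random

def split_by_expression(examples, train_frac=0.7, seed=0):
    """Split <narrative, continuation> examples so that no figurative
    expression appears in more than one split: tuples sharing an idiom
    (or simile) always land in the same set."""
    by_expr = {}
    for ex in examples:
        by_expr.setdefault(ex["expression"], []).append(ex)
    expressions = sorted(by_expr)
    random.Random(seed).shuffle(expressions)
    n_train = int(train_frac * len(expressions))
    train = [ex for e in expressions[:n_train] for ex in by_expr[e]]
    test = [ex for e in expressions[n_train:] for ex in by_expr[e]]
    return train, test
```

Splitting by expression rather than by example is what makes the test idioms genuinely unseen at training time.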

Similes
A simile is a figure of speech that usually consists of a topic and a vehicle (typically noun phrases) that are compared along a certain property using comparators such as ''like'' or ''as'' (Hanks, 2013; Niculae and Danescu-Niculescu-Mizil, 2014). The property may be mentioned (explicit simile) or hidden and left for the reader to infer (implicit simile). We focus on implicit similes, which are less trivial to interpret than their explicit counterparts (Qadir et al., 2016), and test a model's ability to recover the implicit property.
Collecting Similes. Because there are no reliable methods for automatically detecting implicit similes, we first identify explicit similes based on syntactic cues, and then deterministically convert them to implicit similes. We look for sentences in the Toronto Book corpus containing one of the syntactic structures ''as ADJ/ADV as'' or ''ADJ/ADV like'' as a heuristic for identifying explicit similes. We additionally add the constraint of the vehicle being a noun phrase to avoid examples like ''I worked as hard as him''. We remove the adjectival property to convert the simile to implicit, as demonstrated below:

Explicit: He feels calm, like a high mountain lake without a wind stirring it. / He feels as calm as a high mountain lake without a wind stirring it.
Implicit: He feels like a high mountain lake without a wind stirring it.
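A minimal sketch of the explicit-to-implicit conversion. A real pipeline would verify with a POS tagger that the dropped word is an adjective or adverb and that the vehicle is a noun phrase; the regexes below accept any single word and are for illustration only:

```python
import re

def to_implicit(sentence):
    """Drop the stated property from an explicit simile, covering the
    two surface patterns 'as X as <vehicle>' and 'X(,) like <vehicle>'."""
    # "as calm as a lake" -> "like a lake"
    out = re.sub(r"\bas\s+\w+\s+as\b", "like", sentence)
    if out != sentence:
        return out
    # "calm, like a lake" -> "like a lake"
    return re.sub(r"\b\w+,?\s+like\b", "like", sentence)
```

For example, `to_implicit("He feels as calm as a high mountain lake.")` yields the implicit form with the property removed.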
We collected 520 similes along with their associated properties. We asked workers to flag any expression that was not a simile, and manually verified all the collected data. Table 1 presents a sample of the collected similes. Many of the similes are original, such as ''like a street-bought Rolex'', which implies that the subject is fake or cheap.
Collecting Narratives. Once we identified the explicit simile and converted it to its implicit form, we similarly prepend the 4 previous sentences to form narratives. The average length of the narrative was 80 words.
Collecting Continuations. We repeat the same crowdsourcing setup as for idioms, providing the explicit simile property as the definition. Each narrative was annotated by 10 workers. The average length of continuations was identical to the idiom dataset (12 for plausible and 11 for implausible).
The simile dataset contains 4,996 <narrative, continuation> tuples, exemplified in the bottom part of Figure 1. We split the examples into train (3,100), validation (376), and test (1,520) sets with no simile overlaps between the different sets.

Discriminative Task
The first task we derive from our dataset is discriminative in nature, in the setup of the story cloze task. Given a narrative N and two candidate continuations {C1, C2}, the goal is to choose which of the continuations is more plausible.

Methods
For both idioms and similes, we report the performance of several zero-shot, few-shot, and supervised methods as outlined below. Most of our experiments were implemented using the transformers package (Wolf et al., 2020).
Zero-shot. The first type of zero-shot models is based on the standard language model score as a proxy for plausibility. We use GPT-2 XL (Radford et al., 2019) and GPT-3 (Brown et al., 2020) to compute the normalized log-likelihood score of each continuation given the narrative, predicting the continuation with the highest probability: argmax_i P_LM(C_i | N).
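The scoring rule can be sketched as follows, assuming the LM's per-position logits for the continuation tokens have already been extracted (the function names are ours):

```python
import numpy as np

def normalized_loglik(logits, target_ids):
    """Average log-probability of the continuation tokens.
    logits: array of shape (len(target_ids), vocab_size), where row t
    holds the LM's logits for predicting target_ids[t], conditioned on
    the narrative and the preceding continuation tokens."""
    # numerically stable log-softmax over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    logprobs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    picked = logprobs[np.arange(len(target_ids)), target_ids]
    # length normalization: mean rather than sum, so longer
    # continuations are not penalized
    return picked.mean()

def choose(logits_per_cont, ids_per_cont):
    """argmax_i P_LM(C_i | N) with length-normalized scores."""
    scores = [normalized_loglik(l, i)
              for l, i in zip(logits_per_cont, ids_per_cont)]
    return int(np.argmax(scores))
```

The mean (rather than summed) log-probability is what "normalized" refers to here.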
We also use UnifiedQA (Khashabi et al., 2020), a T5-3B model (Raffel et al., 2020) trained on 20 QA datasets in diverse formats. We don't fine-tune it on our dataset, but instead use it in a zero-shot manner, with the assumption that the model's familiarity with the QA format and with the narrative domain through training on the NarrativeQA dataset (Kočiský et al., 2018) would be useful. To cast our task as a QA problem, we format the input such that the question is ''Which is more plausible between the two based on the context?''.
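A sketch of that input construction (the lower-casing and the literal `\n` field separator follow UnifiedQA's published input format; the function name is ours):

```python
def unifiedqa_input(narrative, cont1, cont2):
    """Cast the task as multiple-choice QA in UnifiedQA's input format:
    lower-cased text, with a literal '\\n' separating the question,
    the lettered options, and the context."""
    question = "Which is more plausible between the two based on the context?"
    return f"{question} \\n (a) {cont1} (b) {cont2} \\n {narrative}".lower()
```

The model then generates the text of the chosen option, which is matched back to one of the two continuations.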
Few-shot. Language models like GPT-3 have shown impressive performance after being prompted with a small number of labelled examples. A prompting example in which the correct continuation is the first is given in the following format: ''Q: N (1) C1 (2) C2 A: (1)''.
We provided the model with as many prompting examples as possible within the GPT-3 API limit of 2,048 tokens, which is 6 examples. The test examples are provided without the answer and the model is expected to generate (1) or (2).
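The prompt-packing step can be sketched as follows (the function names and the word-count token proxy are our own simplifications; the real setup counts GPT-3 tokens against the 2,048-token limit):

```python
def build_fewshot_prompt(train_examples, test_example, limit=2048):
    """Concatenate as many 'Q: ... A: (i)' demonstrations as fit within
    an approximate token budget, then append the unanswered test example."""
    def fmt(ex):
        return f"Q: {ex['narrative']} (1) {ex['c1']} (2) {ex['c2']}"

    blocks, used = [], 0
    for ex in train_examples:
        block = fmt(ex) + f"\nA: ({ex['label']})"
        cost = len(block.split())  # crude proxy for the GPT-3 token count
        if used + cost > limit:
            break
        blocks.append(block)
        used += cost
    blocks.append(fmt(test_example) + "\nA:")
    return "\n\n".join(blocks)
```

The model is then expected to continue the final "A:" with "(1)" or "(2)".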
We also use the recently proposed Pattern Exploiting Training model (PET; Schick and Schütze, 2021). PET reformulates the task as a cloze question and fine-tunes smaller masked LMs to solve it using a few training examples. We use the following input pattern: ''N. C1. You are ___'' for idioms and ''N. C1. That was ___'' for similes, where the blank is the masked token. PET predicts the masked token and maps it to the label inventory using the verbalizer {''right'', ''wrong''} for idioms and {''expected'', ''unexpected''} for similes, respectively mapping them to {TRUE, FALSE}. We provide each model 100 training examples, train it for 3 epochs, and select the model that yields the best validation accuracy.
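A sketch of the pattern-verbalizer pairs (the `<mask>` spelling follows RoBERTa's mask token, and placing it at the pattern's blank is our assumption):

```python
# Pattern and verbalizer per task: the masked LM fills <mask>, and the
# verbalizer maps its predicted word to a TRUE/FALSE plausibility label.
PATTERNS = {
    "idiom":  ("{n} {c} You are <mask>", {"right": True, "wrong": False}),
    "simile": ("{n} {c} That was <mask>", {"expected": True, "unexpected": False}),
}

def pet_input(task, narrative, continuation):
    """Build the cloze-style PET input and return it with its verbalizer."""
    pattern, verbalizer = PATTERNS[task]
    return pattern.format(n=narrative, c=continuation), verbalizer
```

At inference time, the label whose verbalizer word gets the higher masked-token probability wins.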
Supervised. We fine-tune RoBERTa-large (Liu et al., 2019) as a multiple-choice model. For a given instance, we feed each combination of the narrative and a continuation separately into the model, in the format N </s> C_i. We pool the representation of the start token to get a single vector representing each continuation, and feed it into a classifier that predicts the continuation score. The model predicts the continuation with the higher score. We fine-tune the model for 10 epochs with a learning rate of 1e−5 and a batch size of 8, and save the best checkpoint based on validation accuracy.
Knowledge-Enhanced. Inspired by how humans process figurative language, we develop RoBERTa-based models enhanced with commonsense knowledge. We develop two models: the first obtains additional knowledge to better understand the narrative (context), while the second seeks knowledge pertaining to the literal meaning of the constituents of the figurative expression (Section 2.3). In both cases, in addition to the narrative and candidate continuations, the model is also provided with a set of inferences {Inf_1, . . . , Inf_n} that follow from the narrative, as detailed below and demonstrated in Figure 2.
The literal model uses the COMET model (Hwang et al., 2021), a BART-based language model trained to complete incomplete tuples from ConceptNet. As opposed to extracting knowledge from ConceptNet directly, COMET can generate inferences on demand for any textual input. For an idiom, we retrieve knowledge pertaining to the content words among its constituents, focusing on the following relations: UsedFor, Desires, HasProperty, MadeUpOf, AtLocation, and CapableOf. For each content word, we extract the top 2 inferences for each relation using beam search. For example, given the idiom ''run the gauntlet'', we obtain inferences for ''run'' and ''gauntlet''. We convert the inferences to natural language format based on the templates in Guan et al. (2019). Given the nature of the simile task, we focus solely on the vehicle's HasProperty relation and obtain the top 12 inferences. For example, given the simile ''like a psychic whirlpool'', we obtain inferences for the phrase ''psychic whirlpool''.
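The triplet-to-text conversion can be sketched with simple templates (the exact wording below is our approximation of the Guan et al. (2019) templates, except CapableOf, which matches the example inference quoted in the analysis section):

```python
# One natural-language template per ConceptNet relation used by the
# literal model; {h} is the head (content word), {t} the generated tail.
TEMPLATES = {
    "UsedFor":     "{h} is used for {t}",
    "Desires":     "{h} desires {t}",
    "HasProperty": "{h} has the property {t}",
    "MadeUpOf":    "{h} is made up of {t}",
    "AtLocation":  "{h} is located at {t}",
    "CapableOf":   "{h} is capable of {t}",
}

def verbalize(head, relation, tail):
    """Turn a generated COMET triplet into a natural-language inference."""
    return TEMPLATES[relation].format(h=head, t=tail)
```

Each verbalized inference then becomes one Inf_j string fed to the model.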
The context model is enhanced with knowledge from ParaCOMET (Gabriel et al., 2021a), trained on ATOMIC. We feed into ParaCOMET all but the last sentence from the narrative, excluding the sentence containing the figurative expression. We generate inferences along ATOMIC dimensions pertaining to the narrator (PersonX), namely: xIntent, xNeed, xAttr, xWant, xEffect, and xReact. Again, we extract the top 2 inferences for every relation using beam search.
In both models, as demonstrated in Figure 3, the input format X_{i,j} for continuation C_i and inference Inf_j is: Inf_j </s> N </s> C_i.
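A minimal sketch of the per-inference input construction and inference-summed scoring, with `scorer` standing in for the model plus its scoring head (function names are ours):

```python
def continuation_inputs(inferences, narrative, continuation, sep="</s>"):
    """Build one 'Inf_j </s> N </s> C_i' input string per inference."""
    return [f"{inf} {sep} {narrative} {sep} {continuation}"
            for inf in inferences]

def choose_continuation(scorer, inferences, narrative, continuations):
    """Score each continuation as the sum of per-inference scores,
    then pick the continuation with the highest total."""
    totals = [
        sum(scorer(x) for x in continuation_inputs(inferences, narrative, c))
        for c in continuations
    ]
    return max(range(len(totals)), key=totals.__getitem__)
```

In the real model, `scorer` encodes each string with RoBERTa and applies the dropout-plus-linear head.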
We compute the score of each of these statements separately, and sum the scores across inferences to obtain a continuation score: score(C_i) = Σ_j scorer(X_{i,j}), where scorer is a dropout layer with dropout probability of 0.1 followed by a linear classifier. Finally, the model predicts the continuation with the higher score. We fine-tune the context and literal models for 10 epochs with a learning rate of 1e−5 and an effective batch size of 16 for idioms and 64 for similes, and save the best checkpoint based on validation accuracy.

Results

Table 2 shows the performance of all models on the discriminative tasks. For both similes and idioms, supervised models perform substantially better than few-shot and zero-shot models, but still leave a gap of several points of accuracy behind human performance. Human performance is the average accuracy of two native English speakers on the task. We did not provide them with the idiom definition, and we assume they were familiar with the more common idioms. The models performed somewhat better on idioms than on similes, possibly due to the LMs' familiarity with some common idioms as opposed to the novel similes. Among the zero-shot models, GPT-2 performed worse than GPT-3 and UnifiedQA, each of which performed best on one of the tasks. In particular, UnifiedQA performed well on idioms, likely thanks to its familiarity with the QA format and with the narrative domain.

PET outperformed few-shot GPT-3 by a large margin: 12 points in accuracy for idioms and 3.5 points for similes. We conjecture that this is attributed to the different number of training examples: 6 for GPT-3 vs. 100 for PET. The small number of examples used to prompt GPT-3 is a result of the API limit on the number of tokens (2,048) as well as the setup in which all prompting examples are concatenated as a single input.
Overall, few-shot models performed worse than zero-shot models on both datasets. We conjecture that this is due to two advantages of the zero-shot models. First, the GPT-2 and GPT-3 models performed better than the majority baseline thanks to the similarity between the task (determining which continuation is more plausible) and the language model objective (guessing the next word). Second, the UnifiedQA model performed particularly well thanks to its relevant training. At the same time, both few-shot models had to learn a new task from just a few examples.
The supervised models leave some room for improvement, and the knowledge-enhanced models narrow the gap for idioms. For similes, we see a minor drop with the context model and nearly comparable performance with the literal model.

Annotation Artifacts. Human-elicited texts often contain stylistic attributes (e.g., sentiment, lexical choice) that make it easy for models to distinguish correct from incorrect answers without solving the actual task (Schwartz et al., 2017a; Cai et al., 2017; Gururangan et al., 2018; Poliak et al., 2018). Following previous work, we trained a continuation-only baseline: a RoBERTa-based supervised model trained only on the candidate continuations, without the narrative. The results in Table 2 (-narrative) show that the performance is above the majority baseline, indicating the existence of some bias. However, this baseline still performs substantially worse than the supervised baseline that has access to the full input, with a gap of 17 points for idioms and 12 points for similes, indicating that this bias alone is not enough to solve the task.

Analysis
The knowledge-enhanced models provide various types of inferences corresponding to different relations in ConceptNet and ATOMIC. We are interested in understanding the source of improvements from the knowledge-enhanced models over the supervised baseline, by identifying the relations that were more helpful than others. To that end, we analyze the test examples that were incorrectly predicted by the supervised baseline but correctly predicted by each of the knowledge-enhanced models. We split the examples such that every example consists of a single inference, and feed the following input into the model to predict the plausible continuation: Inf </s> N </s> C. We focus on the idiom dataset, since for the literal model for similes the only relation used was HasProperty, and the context model performed slightly worse than the baseline. Table 3 shows the percentage of successful test set predictions for each relation type. The relations in the context model perform similarly, with the best relation, xReact, performing as well as all of the relations combined (Table 2). In the literal model, it seems that the combination of all relations is beneficial, whereas the best relation, CapableOf, performs slightly worse than the full model. For a narrative snippet ''Since Dominic isn't up for grabs anymore, I figure that I will concentrate on something else, Carmen declares'', the inference ''grabs is capable of hold on to'' was compliant with the meaning of ''up for grabs'' (available or obtainable), and led to the correct prediction of the plausible continuation ''The good news is that there are many other available bachelors out there''. Conversely, the inference corresponding to the Desires relation was ''grab desires making money'', which was irrelevant and led to an incorrect prediction.

Generative Task
In the generative task, given a narrative N, the goal is to generate a plausible next sentence that is coherent with the context and consistent with the meaning of the figurative expression. Each instance consists of a reference plausible continuation C.

Methods
We similarly experiment with zero-shot, few-shot, and supervised models.
Zero-shot. We use standard LMs, GPT-2 XL and GPT-3, to generate the next sentence following the narrative. We let the models generate up to 20 tokens, stopping when an end-of-sentence token was generated. Following preliminary experiments, for GPT-2 XL and the rest of the models we use top-k sampling (Fan et al., 2018) as the decoding strategy with k = 5 and a softmax temperature of 0.7, while for GPT-3 we use the method provided in the API, which is nucleus sampling (Holtzman et al., 2020) with a cumulative probability of p = 0.9.
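A minimal numpy sketch of top-k sampling with temperature, using the decoding hyperparameters above (the function name is ours):

```python
import numpy as np

def top_k_sample(logits, k=5, temperature=0.7, rng=None):
    """Sample the next token from the k highest-probability candidates,
    after temperature scaling of the logits."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(logits)[-k:]  # indices of the k largest logits
    # renormalize over the top-k candidates only (stable softmax)
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

A lower temperature sharpens the distribution over the k candidates, making decoding more conservative.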
Few-shot. We prompt GPT-3 with 4 training examples of the form ''Q: N A: C'' followed by each individual test example, and decode the answer.
Supervised. We fine-tune GPT-2 XL with a language model objective for 3 epochs with a batch size of 2. We also trained T5 large (Raffel et al., 2020) and BART large (Lewis et al., 2020) as encoder-decoder models. Both were trained for 5 epochs for idioms and 20 epochs for similes, with an effective batch size of 64. For each model, we kept the best checkpoint based on the validation set perplexity, and used top-k decoding with k = 5 and a temperature of 0.7.
Knowledge-Enhanced. We followed the same intuition and inferences we used for the knowledge-enhanced discriminative models (Section 4.1). We fine-tune the models for one epoch, as the effective data size is multiplied by the number of inferences per sample. The overall architecture of the generative knowledge-enhanced model is depicted in Figure 4. The models are based on GPT-2 XL and trained with a language model objective to predict the next sentence given the narrative and a single inference. The input format for inference Inf_j is: Inf_j <sep1> N <sep2>, where <sep1> and <sep2> are special tokens, and the expected output is the plausible continuation C. During inference, we combine the generations from all inferences pertaining to a given narrative. Inspired by prior work that ensembles logits from multiple LMs, we ensemble the logits predicted for multiple input prompts using the same model. A standard decoding process gets at each time step an input prompt text x_{<t} of length t − 1. The prompt is encoded and the model outputs the logits for the next (t-th) token, denoted by z_t ∈ ℝ^{|V|}, where V is the vocabulary. To obtain a discrete next token, z_t is normalized and exponentiated to resemble a probability distribution over the vocabulary: P(X_t | x_{<t}) = softmax(z_t), and the next token x_t is sampled from P(X_t | x_{<t}). This token is then appended to the prompt, and the process continues iteratively until a predefined length is reached or an end-of-sentence token has been generated.
Our decoding process differs in that at time step t, we compute the logits {z_t^j}_{j=1}^{12} corresponding to the prompts derived from each of the inferences: Inf_j <sep1> N <sep2> for j = 1 ... 12. We sum the logit vectors to obtain z_t = Σ_{j=1}^{12} z_t^j, from which we decode the next token as usual.

Table 4: Model performance on the generative tasks in terms of automatic metrics. R-L denotes Rouge-L and B-S denotes BERT-Score.
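A single step of this ensemble decoding can be sketched as follows, assuming a toy vocabulary; `ensemble_decode_step` is a hypothetical stand-in for a step that would receive the 12 next-token logit vectors from the model, one per inference-conditioned prompt.

```python
import numpy as np

def ensemble_decode_step(logits_per_prompt, rng=None):
    """Sum the next-token logits produced for each inference-conditioned
    prompt (z_t = sum_j z_t^j), apply a softmax to the summed vector,
    and sample the next token. Sketch only; a real model would supply
    |V|-dimensional logits."""
    rng = rng or np.random.default_rng(0)
    z_t = np.sum(np.asarray(logits_per_prompt, dtype=np.float64), axis=0)
    probs = np.exp(z_t - z_t.max())          # softmax(z_t), numerically stable
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# 12 prompts over a toy 3-word vocabulary; the middle token dominates.
token = ensemble_decode_step([[0.0, 10.0, 0.0]] * 12)
```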

Results
Automatic Evaluation. Table 4 shows the performance of all the models on the generative tasks in terms of automatic metrics. We report the performance of the recall-oriented n-gram overlap metric Rouge-L (Lin, 2004), typically used for summarization tasks, and the similarity-based BERT-Score (Zhang et al., 2019). We use the latest implementation to date, which replaces BERT with deberta-large-mnli, a DeBERTa model (He et al., 2021) fine-tuned on MNLI (Williams et al., 2018). In terms of automatic evaluation, the best-performing knowledge-enhanced model (context for idioms and literal for similes) performs similarly to the GPT-2 XL supervised baseline, with a slight preference for the baseline on idioms and for the knowledge-enhanced model on similes. Both types of supervised models outperform the zero-shot and few-shot models.
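As a reference point for the n-gram overlap metric, a minimal sketch of Rouge-L (the F1 variant, based on the longest common subsequence of reference and candidate tokens) might look like the following; real evaluations use a packaged implementation, and `rouge_l_f1` is an illustrative helper only.

```python
def rouge_l_f1(reference, candidate):
    """Rouge-L F1 (Lin, 2004): longest common subsequence of the two
    token sequences, turned into precision/recall against candidate
    and reference lengths."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref):
        for j, c in enumerate(cand):
            dp[i + 1][j + 1] = dp[i][j] + 1 if r == c else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```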
Human Evaluation. Although automatic metrics provide an estimate of relative model performance, they have often been found to correlate poorly with human judgments (Novikova et al., 2017; Krishna et al., 2021). To account for this, we also performed a human evaluation of the generated texts for a sample of the test narratives. The human judgments were collected using Amazon Mechanical Turk. Workers were shown a narrative, the meaning of the idiom (or the property of the simile), and a list of 3 generated continuations, one each from the supervised GPT-2 model, the context model, and the literal model. We performed two types of evaluations. In the absolute evaluation, we randomly sampled 50 narratives for each task and asked workers to determine, for each of the generated continuations along with the human references, whether it is plausible or not. In the comparative evaluation, we randomly sampled 100 narratives for idioms and 75 for similes, and presented the workers with a randomly shuffled list of continuations, asking them to choose the most plausible one (or indicate that ''neither of the generations were good'' or ''all are equally good'').
In both evaluations, workers were instructed to consider whether the generation is sensible and coherent, follows the narrative, and is consistent with the meaning of the figurative expression. Each example was judged by 3 workers, and judgments were aggregated using majority voting. The inter-annotator agreement was moderate, with Krippendorff's α = 0.68 and α = 0.63 for the absolute and comparative evaluations, respectively (Krippendorff, 2011).
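The majority-voting aggregation over the 3 judgments per example can be sketched in a few lines; `majority_vote` is an illustrative helper, not code from our evaluation pipeline.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among one example's worker
    judgments; with 3 workers and binary labels there are no ties."""
    return Counter(labels).most_common(1)[0][0]

# e.g., three plausibility judgments for one generated continuation
label = majority_vote(["plausible", "plausible", "implausible"])
```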
In both the absolute and comparative evaluations, Table 5 shows that for each of the tasks, a knowledge-enhanced model outperformed the baseline GPT-2 model. What makes a more compelling case is that the context model was favored for idioms while the literal model was favored for similes, in line with prior theoretical grounding on these figurative language types. Figure 5 shows examples generated by the baseline and the best model for each task. We note that 80% of the human-written continuations for idioms and 88% of those in the simile task were judged as plausible. Based on our analysis, the gap from 100% may be explained by the ambiguity of the narratives, which leaves room for subjective interpretation.

Figure 5: Narratives ending in an idiom (top) or a simile (bottom) with the continuations generated by the baseline GPT-2 model and a knowledge-enhanced model, as preferred by human judges.

Error Analysis
We analyze the continuations labeled as implausible by the annotators for the best model in each task: context for idioms and literal for similes. We found the following error categories, with their prevalence detailed in Table 7 and examples in Table 6:

Cat.  Literal (Simile)  Context (Idioms)
1     50                72
2     33                14
3     17                14

Table 7: Percentage of errors per category.

1 Misinterpreting the figurative expression: The continuation reflects an incorrect, often literal, interpretation of the expression. For instance, in the first row of Table 6, the model interprets the expression literally instead of understanding its actual meaning, ''closest of friends''.
2 Inconsistent with the narrative: The continuation is inconsistent with or contradicts the flow of the narrative. For instance, the narrative in the second row of Table 6 states that ''the owners who are humans are standing'', while the continuation states that they are jumping. The model further predicts that the humans are barking, instead of the dogs. In general, across multiple examples we found that models tend to confuse the various characters in the narrative.
3 Spelling or grammar errors: Some generations contained spelling mistakes or introduced grammar errors, such as starting with a punctuation mark or containing extra blank spaces. Although we instructed the crowdsourcing workers to ignore such errors, they may have affected the plausibility judgments.

Conclusion
We introduced a narrative understanding benchmark focused on interpreting figurative language, specifically idioms and similes. Following the story cloze test, we designed discriminative and generative tasks with the goal of continuing a narrative. We found that pre-trained LMs, irrespective of their size, struggle to perform well in the zero-shot and few-shot settings, and that the supervised models, while competitive, still fall behind human performance by a significant margin. We further bridged some of this gap with knowledge-enhanced models inspired by the way humans interpret figurative expressions. Our analysis corroborated known findings that although LMs generate grammatical, human-like text, they are often inconsistent with the narrative, and their ability to distinguish between characters in a story is limited. We hope this work will spark additional interest in the research community to further advance the representation and modeling of figurative language, which is too common to ignore.