A Computational Framework for Slang Generation

Slang is a common type of informal language, but its flexible nature and paucity of data resources present challenges for existing natural language systems. We take an initial step toward machine generation of slang by developing a framework that models the speaker's word choice in slang context. Our framework encodes novel slang meaning by relating the conventional and slang senses of a word while incorporating syntactic and contextual knowledge in slang usage. We construct the framework using a combination of probabilistic inference and neural contrastive learning. We perform rigorous evaluations on three slang dictionaries and show that our approach not only outperforms state-of-the-art language models, but also better predicts the historical emergence of slang word usages from 1960s to 2000s. We interpret the proposed models and find that the contrastively learned semantic space is sensitive to the similarities between slang and conventional senses of words. Our work creates opportunities for the automated generation and interpretation of informal language.


Introduction
Slang is a common type of informal language that appears frequently in daily conversations, social media, and mobile platforms. The flexible and ephemeral nature of slang (Eble, 1989;Landau, 1984) poses a fundamental challenge for computational representation of slang in natural language systems. As of today, slang constitutes only a small portion of text corpora used in the natural language processing (NLP) community, and it is severely under-represented in standard lexical resources (Michel et al., 2011). Here we propose a novel framework for automated generation of slang with a focus on generative modeling of slang word meaning and choice.
I have a feeling he's gonna ice himself someday.
"frozen water; cold juice; unfriendliness" ice "to kill" Figure 1: A slang generation framework that models speaker's choice of a slang term (ice) based on the novel sense ("to kill") in context and relations with conventional senses (e.g., "frozen water").
Existing language models trained on large-scale text corpora have shown success in a variety of NLP tasks. However, they are typically biased toward formal language and under-represent slang. Consider the sentence "I have a feeling he's gonna ___ himself someday". Directly applying a stateof-the-art GPT-2 (Radford et al., 2019) based language infilling model (e.g., Donahue et al. (2020)) would result in the retrieval of kill as the most probable word choice (probability = 7.7%). However, such a language model is limited and near-insensitive to slang usage, e.g., ice-a common slang alternative for kill-received virtually 0 probability, suggesting that existing models of distributional semantics, even the transformer-type models, do not capture slang effectively, if at all.
Our goal is to extend the capacity of NLP systems toward slang in a principled framework. As an initial step, we focus on modeling the generative process of slang, specifically the problem of slang word choice that we illustrate in Figure 1. Given an intended slang sense such as "to kill", we ask how we can emulate the speaker's choice of slang word(s) in informal context 1 . We are particularly interested in how the speaker chooses existing words from the lexicon and makes innovative use of those words in novel slang context (such as the use of ice in Figure 1).
Our basic premise is that sensible slang word choice depends on linking conventional or established senses of a word (such as "frozen water" for ice) to its emergent slang senses (such as "to kill" for ice). For instance, the extended use of ice to express killing could have emerged from the use of cold ice to refrigerate one's remains. A principled semantic representation should adapt to such similarity relations. Our proposed framework is aimed at encoding slang that relates informal and conventional word senses, hence capturing semantic similarities beyond those from existing language models. In particular, contextualized embedding models such as BERT would consider "frozen water" to be semantically distant or irrelevant from "to kill", so they cannot predict ice to be appropriate for expressing "to kill" in slang context.
The capacity for generating novel slang word usages will have several implications and applications. From a scientific view, modeling the generative process of slang word choice will help explain the emergence of novel slang usages over timewe show how our framework can predict the emergence of slang in the history of English. From a practical perspective, automated slang generation paves the way for automated slang interpretation. Existing psycho-linguistic work suggests that language generation and comprehension rely on similar cognitive processes (e.g., Pickering and Garrod (2013); Ferreira Pinto Jr. and Xu (2021)). Similarly, a generative model of slang can be an integral component of slang comprehension that informs the relation between a candidate sense and a query word, where the mapping can be unseen during training. Furthermore, a generative approach to slang may also be applied to downstream tasks such as naturalistic chatbots, sentiment analysis, and sarcasm detection (see work by Aly and van der Haar (2020) and Wilson et al. (2020)).
We propose a neural-probabilistic framework that involves three components: 1) a probabilistic choice model that infers an appropriate word for expressing a query slang meaning given its context, 2) an encoder based on contrastive learning 1 We will use the terms meaning and sense interchangeably. that captures slang meaning in a modified embedding space, and 3) a prior that incorporates different forms of context. Specifically, the slang encoder we propose transforms slang and conventional senses of a word into a slang-sensitive embedding space where they will lie in close proximity. As such, senses like "frozen water", "unfriendliness" and "to kill" will be encouraged to be in close proximity in the learned embedding space. Furthermore, the resulting embedding space will also set apart slang senses of a word from senses of other unrelated words, and hence contrasting within-word senses from across-word senses in the lexicon. A practical advantage of this encoding method is that semantic similarities pertinent to slang can be extracted automatically from a small amount of training data, and the learned semantic space will be sensitive to slang.
Our framework also captures the flexible nature of slang usages in natural context. Here, we focus on syntax and linguistic context, although our framework should allow for the incorporation of social or extra-linguistic features as well. Recent work has found that the flexibility of slang is reflected prominently in syntactic shift (Pei et al., 2019). For example, ice-most commonly used as a noun-is used as a verb to express "to kill" (in Figure 1). We build on these findings by incorporating syntactic shift as a prior in the probabilistic model, which is integrated coherently with the contrastive neural encoder that captures flexibility in slang sense extension. We also show how a contextualized language infilling model can provide additional prior information from linguistic context (c.f. Erk (2016)).
To preview our results, we show that our framework yields a substantial improvement on the accuracy of slang generation against state-of-theart embedding methods including deep contextualized models, in both few-shot and zero-shot settings. We evaluate our framework rigorously on three datasets constructed from slang dictionaries and in a historical prediction task. We show evidence that the learned slang embedding space yields intuitive interpretation of slang and offers future opportunities for informal natural language processing.
Computational Studies of Slang. Slang has been extensively studied as a social phenomenon (Mattiello, 2005), where social variables such as gender (Blodgett et al., 2016), ethnicity (Bamman et al., 2014), and social-economic status (Labov, 1972(Labov, , 2006 have been shown to play important roles in slang construction. More recently, an analysis of social media text has shown that linguistic features also correlate with the survival of slang terms, where linguistically appropriate terms have a higher likelihood of being popularized (Stewart and Eisenstein, 2018).
Recent work in the NLP community has also analyzed slang. Ni and Wang (2017) studied slang comprehension as a translation task. In their study, both the spelling of a word and its context are provided as input to a translation model to decode a definition sentence. Pei et al. (2019) proposed end-to-end neural models to detect and identify slang automatically in natural sentences. Kulkarni and Wang (2018) have proposed computational models that derive novel word forms of slang from spellings of existing words. Here, we instead explore the generation of novel slang usage from existing words and focus on word sense extension toward slang context, based on the premise that new slang senses are often derived from existing conventional word senses.
Our work is inspired by previous research suggesting that slang relies on reusing words in the existing lexicon (Eble, 2012). Previous work has applied cognitive models of categorization to predict novel slang usage . In that work, the generative model is motivated by research on word sense extension (Ramiro et al., 2018). In particular, slang generation is opera-tionalized by categorizing slang senses based on their similarities to dictionary definitions of candidate words, enhanced by collaborative filtering (Goldberg et al., 1992) to capture the fact that words with similar senses are likely to extend to similar novel senses (Lehrer, 1985). However, this approach presupposes that slang senses are similar to conventional senses of a word represented in standard embedding space, an assumption that is not warranted and yet to be addressed.
Our work goes beyond the existing work in three important aspects: 1) We capture semantic flexibility of slang usage by contributing a novel method based on contrastive learning. Our method encodes slang meaning and conventional meaning of a word under a common embedding space, thereby improving the inadequate existing methodology for slang generation that uses common, slang-insensitive embeddings; 2) We capture syntactic and contextual flexibility of slang usage in a coherent probabilistic framework, an aspect that was ignored in previous work; 3) We rigorously test our framework against slang sense definition entries from three large slang dictionaries and contribute a new dataset for slang research to the community.
Contrastive Learning. Contrastive learning is a semi-supervised learning technique used to extract semantic representations in data-scarce situations. It can be incorporated into neural networks in the form of twin networks, where two exact copies of an encoder network are applied to two different examples. The encoded representations are then compared and back-propagated. Alternative loss schemes such as Triplet (Weinberger and Saul, 2009;Wang et al., 2014) and Quadruplet loss (Law et al., 2013) have also been proposed to enhance stability in training. In NLP, contrastive learning has been applied to learn similarities between text (Mueller and Thyagarajan, 2016;Neculoiu et al., 2016) and speech utterances (Kamper et al., 2016) with recurrent neural networks.
The contrastive learning method we develop has two main differences: 1) We do not use recurrent encoders because they perform poorly on dictionary-based definitions; 2) We propose a joint neural-probabilistic framework on the learned embedding space instead of resorting to methods such as nearest-neighbor search for generation.
Our computational framework for slang generation comprises three interrelated components: 1) A probabilistic formulation of word choice, extending that in  to leverage encapsulated slang senses from a modified embedding space; 2) A contrastive encoder-inspired by variants of twin network (Baldi and Chauvin, 1993;Bromley et al., 1994)-that constructs a modified embedding space for slang by adapting the conventional embeddings to incorporate new senses of slang words; 3) A contextually informed prior for capturing flexible uses of naturalistic slang.

Probabilistic Slang Choice Model
Given a query slang sense M S and its context C S , we cast the problem of slang generation as inference over candidate words w in our vocabulary. Assuming all candidate words w are drawn from a fixed vocabulary V , the posterior is as follows: Here, we define the prior P (w|C S ) based on regularities of syntax and/or linguistic context in slang usage (described in Section 3.4). We formulate the likelihood P (M S |w) 2 by specifying the relations between conventional senses of word w (denoted by M w = {M w 1 , M w 2 , · · · , M wm }, i.e., the set of senses drawn from a standard dictionary) and the query M S (i.e., slang sense that is outside the standard dictionary). Specifically, we model the likelihood by a similarity function that measures the proximity between the slang sense M S and the set of conventional senses M w of word w in a continuous, embedded semantic space: Here, f (·) is a similarity function in range [0, 1], while E S and E w represent semantic embeddings of the slang sense M S and the set of conventional senses M w . We derive these embeddings from contrastive learning which we describe in detail in Section 3.2, and we compare this proposed method with baseline methods that draw embeddings from existing sentence embedding models.
Our choice of the similarity function f (·) is motivated by prior work on few-shot classification. Specifically, we consider variants of two established methods: One Nearest Neighbor (1NN) matching (Koch et al., 2015;Vinyals et al., 2016) and Prototypical learning (Snell et al., 2017).
The 1NN model postulates that a candidate word should be chosen according to the similarity between the query slang sense and the closest conventional sense: In contrast, the prototype model postulates that a candidate word should be chosen if its aggregate (or average) sense is in close proximity of the query slang sense: In both cases, the similarity between two senses is defined by the exponentiated negative squared Euclidean distance in semantic embedding space: Here, h s is a learned kernel width parameter. We also consider an enhanced version of the posterior using collaborative filtering (Goldberg et al., 1992), where words with similar meaning are predicted to shift to similar novel slang meanings. We operationalize this by summing over the close neighborhood of candidate word L(w): Here, P (w |M S , C S ) is a fixed term calculated identically as in Equation (1) and P (w|w ) is the weighting of words in the close neighborhood of a candidate word w. This weighting probability is set proportional to the exponentiated negative cosine distance between w and its neighbor w defined in pre-trained word embedding space, and the kernel parameter h cf is also estimated from the training data: Here, d(w, w ) is the cosine distance between two words in a word embedding space.

Contrastive Semantic Encoding (CSE)
We develop a contrastive semantic encoder for constructing a new embedding space representing slang and conventional word senses that do not bear surface similarities. For instance, the conventional sense of kick such as "propel with foot" can hardly be related to the slang sense of kick such as "a strong flavor". The contrastive embedding space we construct seeks to redefine or warp similarities, such that the otherwise unrelated senses will be in closer proximity than they are under existing embedding methods. For example, two metaphorically related senses can bear strong similarity in slang usage, even though they may be far apart in a literal sense. We sample triplets of word senses as input to contrastive learning, following work on twin networks (Baldi and Chauvin, 1993;Bromley et al., 1994;Chopra et al., 2005;Koch et al., 2015). We use dictionary definitions of conventional and slang senses to obtain the initial sense embeddings (See Section 4.4 for details). Each triplet consists of 1) an anchor slang sense M S , 2) a positive conventional sense M P , and 3) a negative conventional sense M N . The positive sense should ideally be encouraged to lie closely to the anchor slang sense (in the resulting embedding space), whereas the negative sense should ideally be further away from both the positive conventional and anchor slang senses. Section 3.3 describes the detailed sampling procedures.
Our triplet network uses a single neural encoder g to project each word sense representation into a joint embedding space in R d .
We choose a 1-layer fully connected network with ReLU (Nair and Hinton, 2010) as the encoder g for pre-trained word vectors (e.g. fastText). For contextualized embedding models we consider, g will be a Transformer encoder (Vaswani et al., 2017) . In both cases, we apply the same encoder network g to each of the three inputs. We train the triplet network using the max-margin triplet loss (Weinberger and Saul, 2009), where the squared distance between the positive pair is constrained to be closer than that of the negative pair with a margin m:

Triplet Sampling
To train the triplet network, we build data triplets from every slang lexical entry in our training set. For each slang sense M S of word w, we create a positive pair with each conventional sense M w i of the same word w. Then for each positive pair, we sample a negative example every training epoch by randomly selecting a conventional sense M w from a word w that is sufficiently different from w, such that the corresponding definition sentence D w has less than 20% overlap in the set of content words compared to M S and any conventional definition sentence D w i of word w. We rank all candidate words in our vocabulary against w by computing cosine distances from pre-trained word embeddings and consider a word w to be sufficiently different if it is not in the top 20 percent.
Neighborhood Sampling (NS). In addition to using conventional senses of the matching word w for constructing positive pairs, we also sample positive senses from a small neighborhood L(w) of similar words. This sampling strategy provides linguistic knowledge from parallel semantic change to encourage neighborhood structure in the learned embedding space. Sampling from neighboring words also augments the size of the training data considerably in this data-scarce task. We sample negative senses in a similar way, except that we also consider all conventional definition sentences from neighboring words when checking for overlapping senses.

Contextual Prior
The final component of our framework is the prior P (w|C S ) (see Equation (1)) that captures flexible use of slang words with regard to syntax and distributional semantics. For example, slang exhibits flexible Part-of-Speech (POS) shift, e.g., noun→verb transition as in the example ice, and surprisals in linguistic context, e.g., ice in "I have a feeling he's gonna [blank] himself someday." Here, we formulate the context C S in two forms: 1) a syntactic-shift prior, namely the POS information P S to capture syntactic regularities in slang, and/or 2) a linguistic context prior, namely the linguistic context K S to capture distributional semantic context when this is available in the data.
Syntactic-Shift Prior (SSP). Given a query POS tag P S , we construct the syntactic prior by comparing POS distribution P w from literal natu-ral usage of a candidate word w with a smoothed POS distribution P S centered at P S . However, we cannot directly compare P S to P w because slang usage often involves shifting POS (Eble, 2012;Pei et al., 2019). To account for this, we apply a transformation T by counting the number of POS transitions for each slang-conventional definition pair in the training data. Each column of the transformation matrix T is then normalized, so column i of T can be interpreted as the expected slanginformed POS distribution given the i'th POS tag in conventional context (e.g., the noun column gives the expected slang POS distribution of a word that is used exclusively as a noun in conventional usage). The slang-contextualized POS distribution P * S can then be computed by applying T on P S : P * S = T × P S . The prior can be estimated by comparing the POS distributions P w and P * S via Kullback-Leibler (KL) divergence: (10) Intuitively, this prior captures the regularities of syntactic shift in slang usage, and it favors candidate words with POS characteristics that fits well with the queried POS tag in a slang context.
Linguistic Context Prior (LCP). We use a language model P LM to a given linguistic context K S to estimate the probability of each candidate word: P (w|C S ) = P (w|K S ) ∝ P LM (w|K S )+α (11) Here, α is a Laplace smoothing constant. We use the GPT-2 based language infilling model from Donahue et al. (2020) as P LM and discuss the implementation in Section 4.3.

Lexical Resources
We collected lexical entries of slang and conventional words/phrases from three separate online dictionaries: 3 1) Online Slang Dictionary (OSD), 4 2) Green's Dictionary of Slang (GDoS) (Green, 2010), 5 and 3) an open source subset of Urban Dictionary (UD) data from Kaggle. 6 In addition, Slang Dictionary. Both slang dictionaries (OSD and GDoS) are freely accessible online and contain slang definitions with meta-data such as Partof-Speech tags. Each data entry contains the word, its slang definition, and its part-of-speech (POS) tag. In particular, OSD includes example sentence(s) for each slang entry which we leverage as linguistic context, and GDoS contains timetagged references that allow us to perform historical prediction (described later). We removed all acronyms (i.e., fully capitalized words) as they generally do not extend meaning, and slang definitions that share more than 50% content words with any of their corresponding conventional definitions to account for conventionalized slang. We also removed slang with novel word forms where no conventional sense definitions are available. Slang phrases were treated as unigrams because our task only concerns the association between senses and lexical items. Each sense definition was considered a data point during both learning and prediction. We later partitioned definition entries from each dataset to be used for training, validation, and testing. Note that a word may appear in both training and testing but the pairing between word senses are unique (See Section 5.3 for discussion).
Conventional Word Senses. We focused on the subset of OD containing word forms that are also available in the slang datasets described. For each word entry, we removed all definitions that have been tagged as informal because these are likely to represent slang senses. This results in 10,091 and 29,640 conventional sense definitions corresponding to the OSD and GDoS datasets respectively. Data Split. We used all definition entries from the slang resources such that the corresponding slang word also exists in the collected OD subset. The resulting datasets (OSD and GDS) had 2,979 and 29,300 definition entries respectively, from 1,635 and 6,540 unique slang words, of which 1,253 are shared across both dictionaries. For each dataset, the slang definition entries were partitioned into a 90% training set and a 10% test set. 5% of the data in the training set were set aside for validation when training the contrastive encoder.
Urban Dictionary. In addition to the two datasets described above, we provide a third dataset based on Urban Dictionary (UD) that are made available via Kaggle. Unlike the previous two datasets, we are able to make this one publicly available without requiring one to obtain prior permission from the data owners. 8 To guard against the crowd-sourced and noisy nature of UD, we ensure quality by keeping definition entries such that 1) it has at least 10 more upvotes than downvotes, 2) the word entry exists in one of OSD or GDoS, and 3) at least one of the corresponding definition sentences in these dictionaries have a 20% or greater overlap in the set of content words with the UD definition sentence. We also remove entries with more than 50% overlap in content words with any other UD slang definitions under the same word to remove duplicated senses. This results in 2,631 definitions entries from 1,464 unique slang words. The corresponding OD subset has 10,357 conventional sense entries. We find entries from UD to be more stylistically variable and lengthier, with a mean entry length of 9.73 in comparison to 7.54 and 6.48 for OSD and GDoS respectively.

Part-of-Speech Data
The natural POS distribution P w for each candidate word w is obtained using POS counts from the most recent available decade of the HistWords project (Hamilton et al., 2016). For word entries that are not available, mostly phrases, we estimate P w by counting POS tags from Oxford Dictionary (OD) entries of w.
When estimating the slang POS transformation for the syntactic prior, we mapped all POS tags into one of the following six categories: {verb, other, adv, noun, interj, adj} for the OSD experiments. For GDS, the tag 'interj' was excluded as it is not present in the dataset.

Contextualized Language Model Baseline
We considered a state-of-the-art GPT-2 based language infilling model from Donahue et al. (2020) as both a baseline model and a prior to our framework (on the OSD data where context sentences are available for the slang entries). For each entry, we blanked out the corresponding slang word in the example sentence, effectively treating our task as a cloze task. We applied the infilling model to 8 Code and data available at: https://github.com/ zhewei-sun/slanggen obtain probability scores for each of the candidate words and apply a Laplace smoothing of 0.001. We fine-tuned the LM infilling model using all example sentences in the OSD training set until convergence. We also experiment with a combined prior where the two priors are combined using element-wise multiplication and re-normalization.

Baseline Embedding Methods
To compare with and compute the baseline embedding methods M for definition sentences, we used 300-dimensional fastText embeddings (Bojanowski et al., 2017) pre-trained with subword information on 600 billion tokens from Common Crawl 9 as well as 768-dimensional Sentence-Bert (SBERT) (Reimers and Gurevych, 2019) encoders pretrained on Wikipedia and fine-tuned on NLI datasets (Bowman et al., 2015;Williams et al., 2018). The fastText embeddings were also used to compute cosine distances d(w, w ) in Eq. 7. Embeddings for phrases and the fastText-based sentence embeddings were both computed by applying average pooling to normalized word-level embeddings of all content words. In the case of SBERT, we fed in the original definition sentence.

Training Procedures
We trained the triplet networks for a maximum of 20 epochs using Adam (Kingma and Ba, 2015) with a learning rate of 10 −4 for fastText and 2 −5 for SBERT based models. We preserved dimensions of the input sense vectors for the contrastive embeddings learned by the triplet network (that is, 300 for fastText and 768 for SBERT). We used 1,000 fully-connected units in the contrastive encoder's hidden layer for fastText based models. Triplet margins of 0.1 and 1.0 were used with fast-Text and SBERT embeddings respectively.
We trained the probabilistic classification framework by minimizing negative log likelihood of the posterior P (w * |M S , C S ) on the groundtruth words for all definition entries in the training set. We jointly optimized kernel width parameters using L- BFGS-B (Byrd et al., 1995). To construct a word w's neighborhood L(w) in both collaborative filtering and triplet sampling, we considered the 5 closest words in cosine distances of their fastText embeddings.

Model Evaluation
We first evaluated our models quantitatively by predicting slang word choices: Given a novel slang sense (a definition taken from a slang dictionary) and its part-of-speech, how likely is the model to predict the ground-truth slang recorded in the dictionary? To assess model performance, we allowed each model to make up to |V | ranked predictions where V is the vocabulary of the dataset being evaluated, and we used standard Area-Under-Curve (AUC) percentage from Receiver-Operator Characteristic (ROC) curves to assess overall performance.
We show the ROC curves for the OSD evaluation in Figure 2 as an illustration. The AUC metric is similar to and a continuous extension to an F1 score by comprehensively sweeping through the number of candidate words a model is allowed to predict. We find this metric to be the most appropriate because multiple words may be appropriate to express a probe slang sense.
To examine the effectiveness of the contrastive embedding method, we varied the semantic representation as input to the models by considering both fastText and SBERT (described in Sec 4.4). For both embeddings, we experimented with the baseline variant without the contrastive encoding (e.g., vanilla embeddings from fastText and SBERT). We then augmented the models incrementally with the contrastive encoder and the priors whenever applicable to examine their respective and joint effects on model performance in slang word choice prediction. We observed that, under both datasets, models leveraging the con-trastively learned sense embeddings more reliably predict the ground-truth slang words, indicated by both higher AUC scores and consistent improvement in precision over all retrieval ranks. Note that the vanilla SBERT model, despite being a much larger model trained on more data, only presented minor performance gains when compared with the plain fastText model. This suggests that simply training larger models on more data does not better encapsulate slang semantics.
We also analyzed whether the contrastive embeddings are robust under different choices of the probabilistic models. Specifically, we considered the following four variants of the models: 1) 1-Nearest Neighbor (1NN), 2) Prototype, 3) 1NN with collaborative filtering (CF), and 4) Prototype with CF. Our results show that applying contrastively learned semantic embeddings consistently improves predictive accuracy across all probabilistic choice models. The complete set of results for all 3 datasets is summarized in Table 1.
We noted that the syntactic information from the prior improves predictive accuracy in all settings, while by itself predicting significantly better than chance. On OSD, we used the context sentences alone in a contextualized language infilling model for prediction and also incorporating it as a prior. Again, while the prior consistently improves model prediction, both by itself and when paired with the syntactic-shift prior, the language model alone is not sufficient.
We found the syntactic-shift prior and linguistic context prior to be capturing complementary information (mean Spearman correlation of 0.054 ± 0.003 across all examples), resulting in improved performance when they are combined together.
However, the majority of the performance gain is attributed to the augmented contrastive embeddings, which highlights the importance and supports our premise that encoding of slang and conventional senses is crucial to slang word choice.

Historical Analysis of Slang Emergence
We next performed a temporal analysis to evaluate whether our model explains slang emergence over time. We used the time tags available in the GDoS dataset and predicted historically emerged slang from the past 50 years (1960s-2000s). For a given slang entry recorded in history, we tagged its emergent decade using the earliest dated reference available in the dictionary. For each future decade d, we trained our model using all entries before d and assessed whether our model can predict the choices of slang words for slang senses that emerged in the future decade. We scored the models on slang words that emerged during each subsequent decade, simulating a scenario where future slang usages are incrementally predicted. Table 2 summarizes the result from the historical analysis for the non-contrastive SBERT baseline and our full model (with contrastive embeddings), based on the GDoS data. AUC scores are similar to the previous findings but slightly lower for both models in this historical setting. Overall, we find the full model to improve the baseline consistently over the course of history examined and achieve similar performance as in the synchronic evaluation. This provides strong evidence that our framework is robust and has explanatory power over the historical emergence of slang.

Model Error Analysis and Interpretation
Few-shot vs Zero-shot Prediction. We analyze our model errors and note that one source of error stems from whether the probe slang word has appeared during training versus not. Here, each candidate word is treated as a class and each slang sense of a word seen in the training set is considered a 'shot'. In the few-shot case, although the slang sense in question was not observed in prediction, the model has some a priori knowledge  about its target word and how it has been used in slang context (because a word may have multiple slang senses), thus allowing the model to generalize toward novel slang usage of that word. In the zero-shot case, the model needs to select a novel slang word (i.e., one that never appeared in training) and hence has no direct knowledge about how that word should be extended in a slang context. Such knowledge must be inferred indirectly, and in this case, from the conventional senses of the candidate words. The model can then infer how words with similar conventional senses might extend to slang context. Table 3 outlines the AUC scores of the collaboratively filtered prototype models under few-shot and zero-shot settings. For each dataset, we partitioned the corresponding test set by whether the target word appears at least once within another definition entry in the training data. This results in 179, 2,661, and 165 few-shot definitions in OSD, GDoS and UD respectively, along with 120, 269, 96 zero-shot definitions. From our results, we observed that it is more challenging for the model to generalize usage patterns to unseen words, with AUC scores often being higher in the few-shot case. Overall, we found the model to have the most issues handling zero-shot cases from GDoS due to the fine-grained senses recorded in this dictionary, where a word has more slang senses on average (in comparison to the OSD and UD data). This issue caused the models to be more biased towards generalizing usage patterns from more commonly observed words. Finally, the SBERT-based models tend to be more robust towards unseen word-forms, potentially benefiting from their contextualized properties.  Table 3: Model AUC scores (%) for Few-shot and Zero-shot test sets ("CSE" for contrastive embedding, "SSP" for syntactic prior, "LCP" for contextual prior, and "FT" for fastText).
Synonymous Slang Senses. We also examined the influence of synonymy (or sense overlap) in the slang datasets. We quantified the degree of sense synonymy by checking each test sense against all training senses and computing the edit distance between the corresponding sets of constituent content words of the sense definitions. Figure 3 shows the distribution of degree of synonymy across all test examples where the edit distance to the closest training example is considered. We perform our evaluation by binning based on the degree of synonymy and summarize the results in Figure 4. We do not observe any substantial  changes in performance when controlling for the degree of synonymy, and in fact, the highly synonymous definitions appear to be more difficult (as opposed to easier) for the models. Overall, we find the models to yield consistent improvement across different degrees of synonymy, particularly with the SBERT based full model which offers improvement in all cases.
Semantic Distance. To understand the consequence of contrastive embedding, we examine the relative distance between conventional and slang senses of a word in embedding space and the extent to which the learned semantic relations might generalize. We measured the Euclidean distance between each slang embedding with the prototype sense vector of all candidate words, without applying the probabilistic choice models. Table 4 shows the ranks of the corresponding candidate words, averaged over all slang sense embeddings considered and normalized between 0 and 1. We observed that contrastive learning indeed brings closer slang and conventional senses (from the same word), as indicated by lower mean semantic distance after the embedding procedure is applied. Under both fastText and SBERT, we obtained sig-  Example slang word predictions from the contrastively learned full model and SBERT baseline (with no contrastive embedding) on slang usage from the Green's Dictionary. Each example shows the true slang, the probe slang sense, a sample usage, the alternative slang words predicted by each model, and the predicted rank (colored bars indicate inverse rank) of the true slang from a lexicon of 6,540 words.
nificant improvement on both the OSD and GDoS test sets (p < 0.001). On UD, the improvement is significant for SBERT (p = 0.018) but marginal for fastText (p = 0.087).
Examples of Model Prediction. Table 5 shows 5 example slangs from the GDoS test set and the top words predicted by both the baseline SBERT model and the full SBERT-based model with contrastive learning. The full model exhibits a greater tendency to choose words that appear remotely related to the queried sense (e.g., spill, swallow for the act of killing), while the baseline model favors words that share only surface semantic similarity (e.g., retrieving murder and homicide directly). We found cases where the model extends meaning metaphorically (e.g., animal to action, in the case of chirp), euphemistically (e.g., spill and swallow for kill), and generalization of a concept (e.g., brigade and mob for gang), all of which are commonly attested in slang usage (Eble, 2012).
We found the full model to achieve better retrieval accuracy in cases where the queried slang undergoes a non-literal sense extension, whereas the baseline model is situated at retrieving candidate words with incremental or literal changes in meaning. We also noted many cases where the true slang word is difficult to predict without appropriate background knowledge. For instance, the full-model suggested words such as orange and bluey to mean "a communist" but could not pinpoint the color red without knowing its cultural association to communism. Finally, we observed that our model to perform generally worse when the target slang sense can hardly be related to conventional senses of the target word, suggesting that cultural knowledge may be important to consider in the future.

Conclusion
We have presented a framework that combines probabilistic inference with neural contrastive learning to generate novel slang word usages. Our results suggest that capturing semantic and contextual flexibility simultaneously helps to improve the automated generation of slang word choices with limited training data. To our knowledge this work constitutes the first formal computational approach to modeling slang generation, and we have shown the promise of the learned semantic space for representing slang senses. Our framework will provide opportunities for future research in the natural language processing of informal language, particularly the automated interpretation of slang.