Abstract
Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
1 Introduction
Internet language is often popularly characterized as a messy variant of “standard” language (Desta, 2014; Magalhães, 2019). However, work in sociolinguistics has demonstrated that online language is not homogeneous (Herring and Paolillo, 2006; Nguyen et al., 2016; Eisenstein, 2013). Instead, it expresses immense amounts of variation, often driven by social variables. Online language contains lexical innovations, such as orthographic variants, but also repurposes words with new meanings (Pei et al., 2019; Stewart et al., 2017). Much attention has been paid to which words are used across these social groups, including work examining the frequency of types (Zhang et al., 2017; Danescu-Niculescu-Mizil et al., 2013). However, there is also increasing interest in how words are used in these online communities, including variation in meaning (Yang and Eisenstein, 2017; Del Tredici and Fernández, 2017). For example, a word such as python in Figure 1 has different usages depending on the community in which it is used. Our work examines both lexical and semantic variation, and operationalizes the study of the latter using BERT (Devlin et al., 2019).
Social media language is an especially interesting domain for studying lexical semantics because users’ word use is far more dynamic and varied than is typically captured in standard sense inventories like WordNet. Online communities that sustain linguistic norms have been characterized as virtual communities of practice (Eckert and McConnell-Ginet, 1992; Del Tredici and Fernández, 2017; Nguyen and Rosé, 2011). Users may develop wiki pages, or guides, for their communities that outline specific jargon and rules. However, some communities exhibit more language variation than others. One central goal in sociolinguistics is to investigate what social factors lead to variation, and how they relate to the growth and maintenance of sociolects, registers, and styles. To answer these types of questions from a computational perspective, we must first develop metrics for measuring variation.
Our work quantifies how much the language of an online community deviates from the norm and identifies communities that contain unique language varieties. We define community-specific language in two ways, one based on word choice variation, and another based on meaning variation using BERT. Words used with community-specific senses match words that appear in glossaries created by users for their communities. Finally, we test several hypotheses about user-based attributes of online English varieties drawn from sociolinguistics literature, showing that communities with more distinctive language tend to be medium- sized and have more loyal and active users in dense interaction networks. We release our code, our dataset of glossaries for 57 Reddit communities, and additional information about all communities in our study at https://github.com/lucy3/ingroup_lang.
2 Related Work
The sheer number of conversations on social media platforms allows for large-scale studies that were previously impractical using traditional sociolinguistics methods such as ethnographic interviews and surveys. Earlier work on computer-mediated communication identified the presence and growth of group norms in online settings (Postmes et al., 2000), and how new and veteran community members adapt to a community’s changing linguistic landscape (Nguyen and Rosé, 2011; Danescu-Niculescu-Mizil et al., 2013).
Much work in computational sociolinguistics has focused on lexical variation (Nguyen et al., 2016). Online language contains an abundance of “nonstandard” words, and these dynamic trends rise and decline based on social and linguistic factors (Rotabi and Kleinberg, 2016; Altmann et al., 2011; Stewart and Eisenstein, 2018; Del Tredici and Fernández, 2018; Eisenstein et al., 2014). Online communities’ linguistic norms and differences are often defined by which words are used. For example, Zhang et al. (2017) quantify the distinctiveness of a Reddit community’s identity by the average specificity of its words and utterances. They define specificity as the PMI of a word in a community relative to the entire set of communities, and find that distinctive communities are more likely to retain users. To identify community-specific language, we extend Zhang et al. (2017)’s approach to incorporate semantic variation, mirroring the sense-versus-type dichotomy of language variation put forth by previous work on slang detection (Dhuliawala et al., 2016; Pei et al., 2019).
There has been less previous work on cross-community semantic variation. Yang and Eisenstein (2017) use social networks to address sentiment variation across Twitter users, accounting for cases such as sick being positive in this sick beat or negative in I feel tired and sick. Del Tredici and Fernández (2017) adapt the model of Bamman et al. (2014) for learning dialect-aware word vectors to Reddit communities discussing programming and football. They find that sub-communities for each topic share meaning conventions, but also develop their own. A line of future work suggested by Del Tredici and Fernández (2017) is extending studies on semantic variation to a larger set of communities, which our present work aims to achieve.
BERT’s ability to capture word senses presents a new opportunity to measure semantic variation in online communities of practice. BERT embeddings have been shown to capture word meaning (Devlin et al., 2019), and different senses tend to be segregated into different regions of BERT’s embedding space (Wiedemann et al., 2019). Clustering these embeddings can reveal sense variation and change, where distinct senses are often represented as cluster centroids (Hu et al., 2019; Giulianelli et al., 2020). For example, Reif et al. (2019) use a nearest-neighbor classifier for word sense disambiguation, where word embeddings are assigned to the nearest centroid representing a word sense. Using BERT-base, they achieve an F1 of 71.1 on SemCor (Miller et al., 1993), beating the state of the art at that time. Part of our work examines how well the default behavior of contextualized embeddings, as depicted in Figure 1, can be used for identifying niche meanings in the domain of Internet discussions.
As online language may contain semantic innovations, our domain necessitates word sense induction (WSI) rather than disambiguation. We evaluate approaches for measuring usage or sense variation on two common WSI benchmarks, SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013) and SemEval 2010 Task 14 (Manandhar and Klapaftis, 2009), which provide evaluation metrics for unsupervised sense groupings of different occurrences of words. The current state of the art in WSI clusters representations consisting of substitutes, such as those shown in Figure 1, predicted by BERT for masked target words (Amrami and Goldberg, 2018, 2019). We also adapt this method to our Reddit dataset to detect semantic variation.
3 Data
Our data is a subset of all comments on Reddit made during May and June 2019 (Baumgartner et al., 2020). Reddit is broken up into forum-based communities called subreddits, which discuss different topics, such as parenting or gaming, or target users in different social groups, such as LGBTQ+ or women. We select the top 500 most popular subreddits based on number of comments and remove subreddits that have less than 85% English comments, using the language identification method proposed by Lui and Baldwin (2012). This process yields 474 subreddits, from which we randomly sample 80,000 comments each. The number of comments per subreddit originally ranged from over 13 million to over 80,000, so this sampling ensures that more popular communities do not skew comparisons of word usage across subreddits. Each sampled subreddit had around 20k unique users on average, where a user is defined as a unique username associated with comments.1 We lowercase the text, remove urls, and replace usernames, numbers, and subreddit names each with their own special token type. The resulting dataset has over 1.4 billion tokens.
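As a concrete illustration, the preprocessing above can be sketched as follows; the regular expressions and placeholder token strings are our own illustrative choices, not necessarily the exact ones used in our pipeline.

```python
import re

# Illustrative preprocessing sketch (Section 3): lowercase, strip urls, and
# replace usernames, numbers, and subreddit names with special tokens.
# The placeholder strings <user>, <num>, and <sub> are assumptions made here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"\bu/[A-Za-z0-9_-]+")
SUB_RE = re.compile(r"\br/[A-Za-z0-9_]+")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def preprocess_comment(text: str) -> str:
    text = text.lower()
    text = URL_RE.sub("", text)          # remove urls
    text = USER_RE.sub("<user>", text)   # usernames -> special token
    text = SUB_RE.sub("<sub>", text)     # subreddit names -> special token
    text = NUM_RE.sub("<num>", text)     # numbers -> special token
    return text

print(preprocess_comment("Check r/gardening: u/alice posted 3 clematis pics at https://imgur.com/x"))
```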
To understand how users in these communities define and catalog their own language, we also manually gather all available glossaries of the subreddits in our dataset. These glossaries are usually written as guides to newcomers to the community and can be found in or linked from community wiki pages. We exclude glossary links that are too general and not specific to that Reddit community, such as r/tennis’s link to the Wikipedia page for tennis terms. We provide the names of these communities and the links we used in our Github repo.2 Our 57 subreddit glossaries have an average of 72.4 terms per glossary, with a wide range from a minimum of 4 terms to a maximum of 251. We removed 1044 multi-word expressions from analysis, because counting phrases would conflate the distinction we make between examining which individual words are used (type) and how they are used (meaning). We evaluate on 2814 single-token words from these glossaries that appear in comments within their respective subreddits based on exact string matching. Since many of these words appear in multiple subreddits’ glossaries, we have 2226 unique glossary words overall.
4 Methods for Identifying Community-Specific Language
4.1 Type
Much previous work on Internet language has focused on lexical choice, examining the word types unique to a community. The subreddit r/vegan, for example, uses carnis, omnis, and omnivores to refer to people who eat meat.
For our type-based analysis, we only examine words that are within the 20% most frequent in a subreddit; even though much of a community’s unique language is in its long tail, words with fewer than 10 occurrences may be noisy misspellings or too rare for us to confidently determine usage patterns. To keep our vocabularies compatible with our sense-based method described in §4.2, we calculate word frequencies using the basic (non-WordPiece) tokenizer in Hugging Face’s transformers library3 (Wolf et al., 2020). Following Eisenstein et al. (2014), we define frequency for a word t in a subreddit s, fs(t), as the number of users that used it at least once in the subreddit. We experiment with several different methods for finding distinctive and salient words in subreddits.
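As a hedged sketch, type NPMI can be computed from these user-based frequencies as follows; we assume the standard normalization NPMI(t; s) = PMI(t; s) / (−log p(t, s)), and the exact estimator used in our experiments may differ in minor details such as smoothing and vocabulary filtering.

```python
import math
from collections import defaultdict

# Hedged sketch of type NPMI: probabilities are estimated from f_s(t), the
# number of users of word t in subreddit s, and scores compare a word's
# probability in one subreddit against all subreddits combined.
def type_npmi(user_counts):
    """user_counts[subreddit][word] = f_s(t)."""
    total = sum(sum(ws.values()) for ws in user_counts.values())
    word_totals = defaultdict(int)
    for ws in user_counts.values():
        for w, c in ws.items():
            word_totals[w] += c
    scores = {}
    for s, ws in user_counts.items():
        s_total = sum(ws.values())
        for w, c in ws.items():
            p_joint = c / total                               # p(t, s)
            pmi = math.log((c / s_total) / (word_totals[w] / total))
            scores[(s, w)] = pmi / -math.log(p_joint)
    return scores

counts = {"r/gardening": {"clematis": 150, "the": 9000},
          "r/ps4": {"ps5": 892, "the": 9500}}
print(type_npmi(counts)[("r/gardening", "clematis")])
```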
Table 1 shows example words with high NPMI in three subreddits. The community r/justnomil, whose name means “just no mother-in-law”, discusses negative family relationships, so many of its common and distinctive words refer to relatives. Words specific to other communities tend to be topical as well. The gaming community r/ps4 (PlayStation 4) uses acronyms to denote company and game entities and r/gardening has words for different types of plants.
| subreddit | word | definition | count | type NPMI |
|---|---|---|---|---|
| r/justnomil | fdh | “future damn husband” | 354 | 0.397 |
| | jnmom | “just no mom”, an annoying mother | 113 | 0.367 |
| | justnos | annoying family members | 110 | 0.366 |
| | jnso | “just no significant other”, an annoying romantic partner | 36 | 0.345 |
| r/gardening | clematis | a type of flower | 150 | 0.395 |
| | milkweed | a flowering plant | 156 | 0.389 |
| | perennials | plants that live for multiple years | 139 | 0.383 |
| | bindweed | a type of weed | 38 | 0.369 |
| r/ps4 | siea | Sony Interactive Entertainment America | 60 | 0.373 |
| | ps5 | PlayStation 5 | 892 | 0.371 |
| | tlou | The Last of Us, a video game | 193 | 0.358 |
| | hzd | Horizon Zero Dawn, a video game | 208 | 0.357 |
As another metric, we examine the use of TextRank, which is commonly used for extracting keywords from documents (Mihalcea and Tarau, 2004). TextRank applies the PageRank algorithm (Brin and Page, 1998) on a word co-occurrence graph, where the resulting scores based on words’ positions in the graph correspond to their importance in a document. For our use case, we construct a graph of unlemmatized tokens using the same parameter and model design choices as Mihalcea and Tarau (2004). This means we run PageRank on an unweighted, undirected graph of adjectives and nouns that co-occur in the same comment, using a window size of 2, a convergence threshold of 0.0001, and a damping factor of 0.85.
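A minimal sketch of this keyword-scoring step is shown below, using networkx’s PageRank; the part-of-speech filtering to adjectives and nouns is assumed to happen upstream and is omitted here.

```python
import networkx as nx

# TextRank-style sketch: an unweighted, undirected co-occurrence graph over
# tokens within a window of 2 (i.e., adjacent candidate words in a comment),
# scored with PageRank using a damping factor of 0.85 and tolerance 1e-4.
def textrank_scores(comments, window=2):
    graph = nx.Graph()
    for tokens in comments:
        for i, tok in enumerate(tokens):
            for other in tokens[i + 1 : i + window]:
                if tok != other:
                    graph.add_edge(tok, other)
    return nx.pagerank(graph, alpha=0.85, tol=1e-4)

comments = [["mechanical", "keyboard", "ortholinear", "switch"],
            ["keyboard", "switch", "sound"]]
print(textrank_scores(comments))
```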
4.2 Meaning
Some words may have low scores with our type-based metrics, yet their use should still be considered community-specific. For example, the word ow is common to many subreddits, but is used as an acronym for a video game name in r/overwatch, a clothing brand in r/sneakers, and how much a movie makes in its opening weekend in r/boxoffice. We use interpretable metrics for senses, analogous to type NPMI, that allow us to compare semantic variation across communities.
Because word use on social media is dynamic and niche, and thus difficult to catalog comprehensively, we frame our task as word sense induction. We investigate two types of methods: one that clusters BERT embeddings, and Amrami and Goldberg’s (2019) current state-of-the-art model that clusters representatives containing word substitutes predicted by BERT (Figure 1).
The current state-of-the-art WSI model associates each example of a target word with 15 representatives, each of which is a vector composed of 20 sampled substitutes for the masked target word (Amrami and Goldberg, 2019). This method then transforms these sparse vectors with tf-idf and clusters them using agglomerative clustering, dynamically merging less probable senses with more dominant ones. In our use of this model, each example is assigned to its most probable sense based on how its representatives are distributed across sense clusters. One version of their model uses Hearst-style patterns such as target (or even [MASK]), instead of simply masking out the target word. We do not use dynamic patterns in our study, because these patterns assume that target words are nouns, verbs, or adjectives, and our Reddit experiments do not filter out any words based on part of speech.
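The sketch below illustrates the substitution-based idea, but it is not Amrami and Goldberg’s released implementation: it draws a single list of top substitutes per example rather than 15 sampled representatives, and it uses a fixed number of clusters instead of dynamic sense merging.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def substitutes(sentence, target, k=20):
    # Mask the target word and take BERT's top-k predicted substitutes.
    masked = sentence.replace(target, tok.mask_token, 1)
    inputs = tok(masked, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**inputs).logits
    pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()[0, 0]
    top = logits[0, pos].topk(k).indices
    return " ".join(tok.convert_ids_to_tokens(top.tolist()))

sentences = ["get a python , stuff it with passenger cabins",
             "i self taught some python over the summer"]
reps = [substitutes(s, "python") for s in sentences]
# Represent each example by tf-idf over its substitutes, then cluster.
vecs = TfidfVectorizer().fit_transform(reps).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vecs)
print(labels)
```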
5 Word Sense Induction
We develop and evaluate word sense induction models using SemEval WSI tasks in a manner that is designed to parallel their later use on larger Reddit data.
5.1 Evaluation on SemEval Tasks
In SemEval 2010 Task 14 (Manandhar and Klapaftis, 2009) and SemEval 2013 Task 13 (Jurgens and Klapaftis, 2013), models are evaluated based on how well predicted sense clusters for different occurrences of a target word align with gold sense clusters.
Amrami and Goldberg (2019)’s performance scores reported in their paper were obtained from running their model directly on test set data for the two SemEval tasks, which typically had fewer than 150 examples per word. However, these tasks were released as multi-phase tasks and provide both training and test sets (Jurgens and Klapaftis, 2013; Manandhar and Klapaftis, 2009), and our study requires methods that can scale to larger datasets. Some words in our Reddit data appear very frequently, making it too memory-intensive to cluster all of their embeddings or representatives at once (for example, the word pass appears over 96k times). It is more feasible to learn senses from a fixed number of examples, and then match remaining examples to these senses. We evaluate how well induced senses generalize to new examples using separate train and test sets.
We tune parameters for models using SemEval 2010 Task 14. In this task, the test set contains 100 target noun and verb lemmas, where each occurrence of a lemma is labeled with a single sense (Manandhar and Klapaftis, 2009). We use WSI models to first induce senses for 500 randomly sampled training examples, and then match test examples to these senses. There are a few lemmas in SemEval 2010 that occur fewer than 500 times in the training set, in which case we use all instances. We also evaluate the top-performing versions of each model on SemEval 2013 Task 13, after clustering 500 instances of each noun, verb, or adjective lemma in their training corpus, ukWaC (Jurgens and Klapaftis, 2013; Baroni et al., 2009). In SemEval 2013 Task 13, each occurrence of a word is labeled with multiple senses, but we evaluate and report past scores using their single-sense evaluation key, where each word is mapped to one sense.
For the substitution-based method, we match test examples to clusters by pairing representatives with the sense label of their nearest neighbor in the training set. We found that Amrami and Goldberg’s (2019) default model is sensitive to the number of examples clustered. The majority of target words in the test data for the two SemEval tasks on which this model was developed have fewer than 150 examples. When this same model is applied on a larger set of 500 examples, the vast majority of examples often end up in a single cluster, leading to low or zero-value V-Measure scores for many words. To mitigate this problem, we experimented with different values for the upper-bound on number of clusters c, ranging from 10 to 35 in increments of 5. This upper-bound determines the distance threshold for flattening dendrograms, where allowing more clusters lowers these thresholds and breaks up large clusters. We found c = 25 produces the best SemEval 2010 results for our training set size, and use it for our Reddit experiments as well.
For the k-means embedding-based method, we match test examples to the nearest centroid representing an induced sense using cosine distance. During training, we initialize centroids using k-means++ (Arthur and Vassilvitskii, 2007). We experimented with different values of the weighting factor γ ranging from 1000 to 20,000 on SemEval 2010, and choose γ = 10,000 for our experiments on Reddit data. Preliminary experiments suggest that this method is less sensitive to the number of training examples, where directly clustering SemEval 2010’s smaller test set led to similar results with the same parameters.
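A minimal sketch of this induce-and-match procedure is given below, assuming a fixed number of clusters per word; the weighting factor γ and our procedure for choosing the number of clusters are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

def induce_senses(train_embs, k):
    # k-means with k-means++ initialization over training embeddings.
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(train_embs)
    return km.cluster_centers_

def match_to_senses(test_embs, centroids):
    # Assign each test embedding to the nearest centroid by cosine distance.
    return cosine_distances(test_embs, centroids).argmin(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(500, 768))   # e.g., 500 BERT embeddings of one word
test = rng.normal(size=(10, 768))
centroids = induce_senses(train, k=4)
print(match_to_senses(test, centroids))
```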
For the spectral embedding-based method, we match a test example to a cluster by assigning it the label of its nearest training example. To construct the K-nearest neighbor similarity graph during training, we experimented with different K around log(n), where for n = 500, K ≈ 6 (von Luxburg, 2007; Brito et al., 1997). For K = 6,...,10, we found that K = 7 worked best, though performance scores on SemEval 2010 for all other K were still within one standard deviation of K = 7’s average across multiple runs.
The bolded rows of Table 2 and Table 3 show performance scores of these models using our evaluation setup, compared against scores reported in previous work.6 These results show that for embedding-based WSI, k-means works better than spectral clustering. In addition, clustering BERT embeddings performs better than most methods, but not as well as clustering substitution-based representatives.
| Model | F Score | V Measure | Average |
|---|---|---|---|
| **BERT embeddings, k-means, γ = 10000** | 0.594 (0.004) | 0.306 (0.004) | 0.426 (0.003) |
| **BERT embeddings, spectral, K = 7** | 0.581 (0.025) | 0.283 (0.017) | 0.405 (0.020) |
| **BERT substitutes, Amrami and Goldberg (2019), c = 25** | 0.683 (0.003) | 0.339 (0.012) | 0.481 (0.009) |
| Amrami and Goldberg (2019), default parameters | 0.709 | 0.378 | 0.517 |
| Amplayo et al. (2019) | 0.617 | 0.098 | 0.246 |
| Song et al. (2016) | 0.551 | 0.098 | 0.232 |
| Chang et al. (2014) | 0.231 | 0.214 | 0.222 |
| MFS | 0.635 | 0.000 | 0.000 |
| Model | NMI | B-Cubed | Average |
|---|---|---|---|
| **BERT embeddings, k-means, γ = 10000** | 0.157 (0.006) | 0.575 (0.005) | 0.300 (0.007) |
| **BERT embeddings, spectral, K = 7** | 0.135 (0.010) | 0.588 (0.007) | 0.282 (0.010) |
| **BERT substitutes, Amrami and Goldberg (2019), c = 25** | 0.192 (0.011) | 0.638 (0.003) | 0.350 (0.010) |
| Amrami and Goldberg (2019), default parameters | 0.183 | 0.626 | 0.339 |
| Başkaya et al. (2013) | 0.045 | 0.351 | 0.126 |
| Lau et al. (2013) | 0.039 | 0.441 | 0.131 |
5.2 Adaptation to Reddit
We apply the k-means embedding-based method and Amrami and Goldberg’s (2019) substitution-based method to Reddit, with the parameters that performed best on SemEval 2010 Task 14. We induce senses for a vocabulary of 13,240 non-lemmatized tokens, including punctuation, that occur often enough for us to gain a strong signal of semantic deviation from the norm. These are non-emoji tokens that are very common in a community (in the top 10% most frequent tokens of a subreddit), frequent enough to be clustered (appear at least 500 times overall), and also used broadly (appear in at least 350 subreddits). When clustering BERT embeddings, we obtain the representation of a token split into wordpieces by averaging their vectors. With each WSI method, we induce senses using 500 randomly sampled comments containing the target token.7 Then, we match all occurrences of words in our selected vocabulary to their closest sense, as described earlier.
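A sketch of the wordpiece-averaging step is shown below; which hidden layer or combination of layers to use is a design choice, and here we simply take the last layer for illustration.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def token_embeddings(words):
    # One vector per whitespace token, averaging its wordpiece vectors.
    enc = tok(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    vecs = []
    for i in range(len(words)):
        piece_idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vecs.append(hidden[piece_idx].mean(dim=0))
    return torch.stack(vecs)

print(token_embeddings(["ortholinear", "keyboards", "are", "great"]).shape)
```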
Though the embedding-based method has lower performance than the substitution-based one on SemEval WSI tasks, the former is an order of magnitude faster and more efficient to scale (Table 4).8 During the training phase of clustering, both models learn sense clusters for each word by making a single pass over that word’s set of examples; we then match every vocab word in a subreddit to its appropriate cluster. While the substitution-based method is 1.7 times slower than the embedding-based method during the training phase, it becomes 47.9 times slower during the matching phase. The particularly large difference in runtime is due to the substitution-based method’s need to run BERT multiple times for each sentence (in order to individually mask each vocab word in the sentence), while the embedding-based method passes over each sentence once. We also noticed that the substitution-based method sometimes created very small clusters, which often led to very rare senses (e.g., occurring fewer than 5 times overall).
| Model | Clustering per word | Matching per subreddit |
|---|---|---|
| BERT embeddings, γ = 10000 | 47.60 sec | 28.85 min |
| Amrami and Goldberg (2019)’s BERT substitutes, c = 25 | 80.99 sec | 23.04 hr |
A word may map to more than one sense, so to determine if a word t has a community-specific sense in subreddit s, we use the NPMI of the word’s most common sense in s. We refer to this value as the sense NPMI, or ℳs(t). We calculate these scores using both the embedding-based method, denoted as ℳs*(t), and the substitution-based method, denoted as ℳs†(t).
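A hedged sketch of this computation is shown below; we assume the same NPMI normalization as for word types, applied to the joint distribution of induced senses and subreddits, and the exact estimator may differ in minor details.

```python
import math
from collections import Counter

# Sense NPMI sketch: for word t in subreddit s, take the most common induced
# sense of t in s and compute the NPMI of that sense with s over all
# occurrences of t across subreddits.
def sense_npmi(sense_assignments):
    """sense_assignments[subreddit] = sense ids for occurrences of word t."""
    total = sum(len(v) for v in sense_assignments.values())
    sense_totals = Counter()
    for labels in sense_assignments.values():
        sense_totals.update(labels)
    scores = {}
    for s, labels in sense_assignments.items():
        top_sense, count = Counter(labels).most_common(1)[0]
        p_joint = count / total                     # p(sense, s)
        p_sense_given_s = count / len(labels)       # p(sense | s)
        p_sense = sense_totals[top_sense] / total   # p(sense)
        pmi = math.log(p_sense_given_s / p_sense)
        scores[s] = pmi / -math.log(p_joint)
    return scores

assignments = {"r/elitedangerous": [0, 0, 0, 1], "r/learnpython": [1, 1, 1, 1]}
print(sense_npmi(assignments))
```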
These two sense NPMI metrics tend to score words very similarly across subreddits, with an overall Pearson’s correlation of 0.921 (p < 0.001). Words that have high NPMI with one model also tend to have high NPMI with the other (Table 5). There are some disagreements, such as the scores for flu in r/keto, which does not refer to influenza but instead refers to symptoms associated with starting a ketogenic diet (ℳ* = 0.388, ℳ† = 0.248). Still, both metrics place r/keto’s flu in the 98th percentile of scored words. Thus, for large datasets, it would be worthwhile to use the embedding-based method instead of the state-of-the-art substitution-based method to save substantial time and computing resources and yield similar results.
| subreddit | word | ℳ† | ℳ* | type NPMI | subreddit’s sense example | other sense example |
|---|---|---|---|---|---|---|
| r/elitedangerous | python | 0.383 | 0.347 | 0.286 | “Get a Python, stuff it with passenger cabins...” | “I self taught some Python over the summer...” |
| r/fashionreps | haul | 0.374 | 0.408 | 0.358 | “Plan your first haul, don’t just buy random nonsense...” | “...discipline is the long haul of getting it done...” |
| r/libertarian | nap | 0.370 | 0.351 | 0.185 | “The nap is just a social contract.” | “Move bedtime earlier to compensate for no nap...” |
| r/90dayfiance | nickel | 0.436 | 0.302 | 0.312 | “Nickel really believes that Azan loves her.” | “...raise burrito prices by a nickel per month...” |
| r/watches | dial | 0.461 | 0.463 | 0.408 | “...the dial has a really nice texturing...” | “...you didn’t have to dial the area code...” |
Some of the words with high sense NPMI in Table 5, such as haul (a set of purchased products) and dial (a watch face), have well-documented meanings in WordNet or the Oxford English Dictionary that are especially relevant to the topic of the community. Others are less standard, including python to refer to a ship in a game, nap as an acronym for “non-aggression principle”, and Nickel as a fan-created nickname for a character named Nicole in a reality TV show. Some terms have low ℳ across most subreddits, such as the period punctuation mark (average ℳ* = −0.008, ℳ† = −0.009).
6 Glossary Analysis
To provide additional validation for our metrics, we examine how they score words listed in user-created subreddit glossaries (as described in §3). New members may spend 8 to 9 months acquiring a community’s linguistic norms (Nguyen and Rosé, 2011), and some Reddit communities have such distinctive language that their posts can be difficult for outsiders to understand. This makes the manual annotation of linguistic norms across hundreds of communities difficult, and so for the purposes of our study, we use user-created glossaries to provide context for what our metrics find. Still, glossaries only contain words deemed by a few users to be important for their community, and the lack of labeled negative examples inhibits their use in a supervised machine learning task. Therefore, we focus on whether glossary words, on average, tend to have high scores using our methods.
Table 6 shows that glossary words have higher median scores than non-glossary words for all listed metrics (U-tests, p < 0.001). In addition, a substantial percentage of glossary words are in the 98th percentile of scored words for each metric.
| | metric | mean reciprocal rank | median, glossary words | median, non-glossary words | 98th percentile, all words | % of scored glossary words in 98th percentile |
|---|---|---|---|---|---|---|
| type | PMI | 0.0938 | 2.7539 | 0.2088 | 5.0063 | 18.13 |
| | NPMI | 0.4823 | 0.1793 | 0.0131 | 0.3035 | 22.30 |
| | TFIDF | 0.2060 | 0.5682 | 0.0237 | 3.0837 | 16.76 |
| | TextRank | 0.0616 | 6.95e-5 | 7.90e-5 | 0.0002 | 24.91 |
| | JSD | 0.2644 | 2.02e-5 | 2.44e-7 | 5.60e-5 | 29.07 |
| sense | BERT substitutes (ℳ†) | 0.2635 | 0.1165 | 0.0143 | 0.1745 | 28.75 |
| | BERT embeddings (ℳ*) | 0.3067 | 0.1304 | 0.0208 | 0.1799 | 30.73 |
We have five different possible metrics for scoring community-specific word types: type PMI, type NPMI, tf-idf, TextRank, and JSD. Of these, TextRank has the lowest MRR, but still scores a competitive percentage of glossary words in the 98th percentile. This is because the TextRank algorithm only determines how important a word is within each subreddit, without any comparison to other subreddits to determine how a word’s frequency in a subreddit differs from the norm. Type NPMI has the highest MRR, followed by JSD. Though JSD has more glossary words in the 98th percentile than type NPMI, we notice that many high-scoring JSD terms include words that have a very different probability in a subreddit compared to the rest of Reddit, but are not actually distinctive to that subreddit. For example, in r/justnomil, words such as husband, she, and her are within the top 10 ranked words by JSD score. This contrasts with the words in Table 1 with high NPMI scores that are more unique to r/justnomil’s vocabulary. Therefore, for the remainder of this paper, we focus on NPMI as our type-based metric for measuring lexical variation.
Figure 2 shows the normalized distributions of type NPMI and sense NPMI. Though glossary words tend to have higher NPMI scores than non-glossary words, there is still overlap between the two distributions: some glossary words have low scores and some non-glossary words have high ones. This is sometimes because glossary words with low type NPMI instead have high sense NPMI. For example, the glossary word envy in r/competitiveoverwatch refers to an esports team and has low type NPMI but sense NPMI in the 98th percentile (ℳ* = 0.2640, ℳ† = 0.2136). Only 21 glossary terms, such as aha, a popular type of skin exfoliant in r/skincareaddiction, are in the 98th percentile of type NPMI and also in the 98th percentiles of both ℳ* and ℳ† scores. Thus, examining variation in the meaning of broadly used words provides a complementary metric to counting distinctive word types, and overall provides a more comprehensive understanding of community-specific language.
Other cases of overlap are due to model error. Manual inspection reveals that some glossary words that actually have unique senses have low ℳ scores. Sometimes a WSI method splits a glossary term in a community into too many senses or fails to disambiguate different meanings. For example, the glossary word spawn in r/childfree refers to children, but the embedding-based method assigns it to the same sense used in gaming communities, where it instead refers to the creation of characters or items. As another example of a failure case, the substitution-based method splits the majority of occurrences of rep, an exercise movement, in r/bodybuilding into two large but separate senses. Though new methods using BERT have led to performance boosts, WSI is still a challenging task.
The use of glossaries in our study has several limitations. Some non-glossary terms have high scores because glossaries are not comprehensive. For example, dips (ℳ* = 0.2920, ℳ† = 0.2541) is not listed in r/fitness’s glossary, but it regularly refers to a type of exercise. This suggests the potential of our methods for uncovering possible additions to these glossaries. The vast majority of glossaries contain community-specific words, but a few also include common Internet terms that have low values across all metrics, such as lol, imo, and fyi. In addition, only 71.12% of all single-token glossary words occurred often enough to have scores calculated for them. Some words are relevant to the topic of the community (e.g., christadelphianism in r/christianity), but are actually rarely used in discussions. We do not compute scores for rarely occurring words, so they are excluded from our results. Despite these limitations, however, user-created glossaries are valuable resources for outsiders to understand the terminology used in niche online communities, and offer one of the only sources of in-domain validation for these methods.
7 Communities and Variation
In this section, we investigate how language variation relates to characteristics of users and communities in our dataset. For these analyses, we use the metrics that aligned the most with user-created glossaries (Table 6): type NPMI for lexical variation and ℳ* for semantic variation. We define the distinctiveness of a community’s language variety as the fraction of unique words in the community’s top 20% most frequent words that have type NPMI or ℳ* in the 98th percentile of all scores for each metric. That is, a word in a community is counted as a “community-specific word” if its type NPMI > 0.3035 or if its ℳ* > 0.1799. Though in the following subsections we report numerical results using these cutoffs, the U-tests relating community-level attributes to distinctiveness are statistically significant (p < 0.0001) for cutoffs as low as the 50th percentile.
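A minimal sketch of this distinctiveness score, with the 98th-percentile cutoffs taken from Table 6:

```python
# Fraction of a subreddit's top-20% most frequent words whose type NPMI or
# sense NPMI (M*) is in the 98th percentile of all scores (cutoffs per Table 6).
NPMI_CUTOFF = 0.3035   # 98th percentile of type NPMI
SENSE_CUTOFF = 0.1799  # 98th percentile of M* (BERT embeddings)

def distinctiveness(top_words, type_npmi, sense_npmi):
    specific = [w for w in top_words
                if type_npmi.get(w, 0) > NPMI_CUTOFF
                or sense_npmi.get(w, 0) > SENSE_CUTOFF]
    return len(specific) / len(top_words)

top_words = ["clematis", "milkweed", "the", "and", "plant"]
print(distinctiveness(top_words,
                      {"clematis": 0.395, "milkweed": 0.389},
                      {"plant": 0.05}))
```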
7.1 User Behavior
Online communities differ from those in the offline world due to increased anonymity of the speakers and a lack of face-to-face interactions. However, the formation and survival of online communities still tie back to social factors. One central goal of our work is to see what behavioral characteristics a community with unique language tends to have. We examine four user-based attributes of subreddits: community size, user activity, user loyalty, and network density. We calculate values corresponding to these attributes using the entire, unsampled dataset of users and comments. For each of these user-based attributes, we propose and test hypotheses on how they relate to how much a community’s language deviates from the norm. Some of these hypotheses are pulled from established sociolinguistic theories previously developed using offline communities and interactions, and we test their conclusions in our large-scale, digital domain. We conduct U-tests for each attribute after z-scoring it across subreddits, comparing subreddits separated into two equal-sized groups of high and low distinctiveness.
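As an illustration, the comparison for a single attribute might be set up as follows; the exact test configuration we used may differ in details.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Sketch: z-score an attribute across subreddits, split communities into
# equal-sized low- and high-distinctiveness halves, and run a U-test.
def compare_attribute(attribute, distinctiveness):
    attribute = np.asarray(attribute, dtype=float)
    z = (attribute - attribute.mean()) / attribute.std()
    order = np.argsort(distinctiveness)
    half = len(order) // 2
    low, high = z[order[:half]], z[order[half:]]
    return mannwhitneyu(high, low)

print(compare_attribute([100, 80, 120, 30, 20, 25],
                        [0.01, 0.02, 0.015, 0.05, 0.06, 0.055]))
```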
Del Tredici and Fernández (2018), when choosing communities for their study, claim that “small-to-medium sized” communities would be more likely to have lexical innovations. We define community size to be the number of unique users in a subreddit, and find that large communities tend to have less community-specific language (p < 0.001, Figure 3). Communities need to reach a “critical mass” to sustain meaningful interactions, but very large communities such as r/askreddit and r/news may suffer from communication overload, leading to simpler and shorter replies by users and fewer opportunities for group identity to form (Jones et al., 2004). We also collected subscriber counts from the last post of each subreddit made in our dataset’s timeframe, and found that communities with more subscribers have lower distinctiveness (p < 0.001), and communities with a higher ratio of subscribers to commenters also have lower distinctiveness (p < 0.001). Multiple subreddits were outliers with extremely large subscriber counts, perhaps due to past users being auto-subscribed to default communities or historical popularity spikes. Future work could look into more refined methods of estimating the number of users who browse but do not comment in communities (Sun et al., 2014).
Active communities of practice require regular interaction among their members (Holmes and Meyerhoff, 1999; Wenger, 2000). Our metric for measuring user activity is the average number of comments per user in that subreddit, and we find that communities with more community-specific language have more active users (p < 0.001, Figure 3). However, within each community, we did not find significant or meaningful correlations between a user’s number of comments in that community and the probability of them using a community-specific word.
Speakers with more local engagement tend to use more vernacular language, as it expresses local identity (Eckert, 2012; Bucholtz and Hall, 2005). Our proxy for measuring this kind of engagement is the fraction of loyal users in a community, where loyal users are those who have at least 50% of their comments in that particular subreddit. We use the definition of user loyalty introduced by Hamilton et al. (2017), filtering out users with fewer than 10 comments and counting only top-level comments. Communities with more community-specific language have more loyal users, which extends Hamilton et al. (2017)’s conclusion that loyal users value collective identity (p < 0.001, Figure 3). We also found that in 93% of all communities, loyal users had a higher probability of using a word with ℳ* in the 98th percentile than a non-loyal user (U-test, p < 0.001), and in 90% of all communities, loyal users had a higher probability of using a word with type NPMI in the 98th percentile (U-test, p < 0.001). Thus, users who use Reddit mostly to interact in a single community demonstrate deeper acculturation into the language of that community.
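A sketch of this loyalty measure, following the description above (with edge cases simplified):

```python
from collections import Counter

def loyal_fraction(user_comments, subreddit, min_comments=10):
    """user_comments[user] = subreddits of that user's top-level comments."""
    loyal, eligible = 0, 0
    for user, subs in user_comments.items():
        if subreddit not in subs or len(subs) < min_comments:
            continue
        eligible += 1
        if Counter(subs)[subreddit] / len(subs) >= 0.5:
            loyal += 1
    return loyal / eligible if eligible else 0.0

comments = {"alice": ["r/keto"] * 9 + ["r/fitness"],
            "bob": ["r/keto"] * 3 + ["r/news"] * 7}
print(loyal_fraction(comments, "r/keto"))  # alice is loyal, bob is not
```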
A speech community is driven by the density of its communication, and dense networks enforce shared norms (Guy, 2011; Milroy and Milroy, 1992; Sharma and Dodsworth, 2020). Previous studies of face-to-face social networks may define edges using friend or familial ties, but Reddit interactions can occur between strangers. For network density, we calculate the density of the undirected direct-reply network of a subreddit based on comment threads: an edge exists between two users if one replies to the other. Following Hamilton et al. (2017), we only consider the top 20% of users when constructing this network. Denser communities exhibit more community-specific language (p < 0.001, Figure 3). Previous work using ethnography and friendship naming data has shown that a speaker’s position in a social network is sometimes reflected in the language they use, where individuals on the periphery adopt less of the vernacular of a social group compared to those in the core (Labov, 1973; Milroy, 1987; Sharma and Dodsworth, 2020). To see whether users’ positions in Reddit direct-reply networks show a similar phenomenon, we use Cohen et al. (2014)’s method to approximate users’ closeness centrality (ϵ = 10^−7, k = 5000). Within each community, we did not find a meaningful correlation between closeness centrality and the probability of a user using a community-specific word. This finding suggests that conversation networks on Reddit may not convey a user’s degree of belonging to a community in the same manner as relationship networks in the physical world.
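A sketch of the direct-reply network density computation is given below; the restriction to the top 20% most active users is omitted for brevity.

```python
import networkx as nx

def reply_network_density(reply_pairs):
    """reply_pairs: (commenter, parent_author) username pairs from threads."""
    g = nx.Graph()
    g.add_edges_from((a, b) for a, b in reply_pairs if a != b)  # undirected
    return nx.density(g)

pairs = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("dan", "alice")]
print(reply_network_density(pairs))
```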
The four attributes we examine also have significant relationships with language variation when distinctiveness is separated into its lexical and semantic components (the fraction of words with type NPMI > 0.3035 and the fraction of words with ℳ* > 0.1799). In other words, the patterns in Figure 3 persist when counting only unique word types and when counting only unique meanings. This is because communities with greater lexical distinctiveness also exhibit greater semantic variation (Spearman’s rs = 0.7855, p < 0.001, Figure 4). So, communities with strong linguistic identities express both types of variation. Further causal investigations could reveal whether the same factors, such as users’ need for efficiency and expressivity, produce both unique words and unique meanings (Blank, 1999).
7.2 Topics
Language varieties can be based on interest or occupation (Fishman, 1972; Lewandowski, 2010), so we also examine what topics tend to be discussed by communities with distinctive language (Figure 5). We use r/ListofSubreddit’s categorization of subreddits, focusing on the 474 subreddits in our study.9 This categorization is hierarchical, and we choose a level of granularity so that each topic contains at least five of our subreddits. Video Games, TV, Sports, Hobbies/Occupations, and Technology tend to have more community-specific language. These communities often discuss a particular subset of the overall topic, such as a specific hobby or video game, that is rich with technical terminology. For example, r/mechanicalkeyboards is categorized under Hobbies/Occupations. Its highly community-specific words include keyboard stores (e.g., kprepublic), types of keyboards (e.g., ortholinear), and keyboard components (e.g., pudding, reds).
7.3 Modeling Variation
Finally, we run ordinary least squares regressions with attributes of Reddit communities as features and communities’ distinctiveness scores as the dependent variable. The first model has only user-based attributes as features, while the second includes a topic-related feature. These experiments help us untangle whether the topic discussed in a community has a greater impact on linguistic distinctiveness than the behaviors of the community’s users. For the topic variable, we code the value as 1 if the community belongs to a topic identified as having high distinctiveness (Technology, TV, Video Games, Hobbies/Occ., Sports, or Other), and 0 otherwise.
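A sketch of these regressions using statsmodels is shown below; the dataframe column names and the synthetic values are illustrative placeholders, not our released data.

```python
import pandas as pd
import statsmodels.api as sm

def fit_models(df):
    # Model (1): z-scored user-based attributes only; Model (2): adds the
    # binary indicator for high-distinctiveness topics.
    features = ["community_size", "user_activity", "user_loyalty", "network_density"]
    X = (df[features] - df[features].mean()) / df[features].std()
    y = df["distinctiveness"]
    model1 = sm.OLS(y, sm.add_constant(X)).fit()
    model2 = sm.OLS(y, sm.add_constant(X.assign(topic=df["topic"]))).fit()
    return model1, model2

df = pd.DataFrame({
    "community_size":  [1e5, 2e4, 5e4, 8e3, 3e4, 6e4, 1e4, 9e4],
    "user_activity":   [3.1, 8.2, 4.0, 9.5, 5.5, 2.9, 7.1, 3.3],
    "user_loyalty":    [0.02, 0.10, 0.05, 0.12, 0.07, 0.03, 0.09, 0.02],
    "network_density": [0.001, 0.004, 0.002, 0.006, 0.003, 0.001, 0.005, 0.001],
    "topic":           [0, 1, 0, 1, 1, 0, 1, 0],
    "distinctiveness": [0.010, 0.060, 0.020, 0.070, 0.040, 0.015, 0.055, 0.012],
})
model1, model2 = fit_models(df)
print(model2.params)
```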
Once we account for other user-based attributes, higher network density actually has a negative effect on variation (Table 7), suggesting that its earlier marginal positive effect is due to the presence of correlated features. We find that even when a community discusses a topic that tends to have high amounts of community-specific language, attributes related to user behavior still have a bigger and more significant relationship with language use, with similar coefficients for those variables between the two models. This suggests that who is involved in a community matters more than what these community members discuss.
Dependent variable: communities’ distinctiveness scores

| | (1) | (2) |
|---|---|---|
| intercept | 0.0318*** (0.001) | 0.0318*** (0.001) |
| community size | −0.0050*** (0.001) | −0.0042*** (0.001) |
| user activity | 0.0181*** (0.001) | 0.0179*** (0.001) |
| user loyalty | 0.0178*** (0.001) | 0.0162*** (0.001) |
| network density | −0.0091*** (0.001) | −0.0091*** (0.001) |
| topic | | 0.0057*** (0.001) |
| Observations | 474 | 474 |
| R2 | 0.505 | 0.529 |
| Adjusted R2 | 0.501 | 0.524 |
Note: *p < 0.05, **p < 0.01, ***p < 0.001
8 Ethical Considerations
The Reddit posts and comments in our study are accessible by the public and were crawled by Baumgartner et al. (2020). Our project was deemed exempt from institutional review board review for human subjects research by the relevant administrative office at our institution. Even so, there are important ethical considerations to take when using social media data (franzke et al., 2020; Webb et al., 2017). Users on Reddit are not typically aware of research being conducted using their data, and therefore care needs to be taken to ensure that these users remain anonymous and unidentifiable. In addition, posts and comments that are deleted by users after data collection still persist in the archived dataset. Our study minimizes risks by focusing on aggregated results, and our research questions do not involve understanding sensitive information about individual users. There is debate on whether to include direct quotes of users’ content in publications (Webb et al., 2017; Vitak et al., 2016). We include a few excerpts from comments in our paper to adequately illustrate our ideas, especially since the exact wording of text can influence the predictions of NLP models, but we choose examples that do not pertain to users’ personal information.
9 Conclusion
We use type- and sense-based methods to detect community-specific language in Reddit communities. Our results confirm several sociolinguistic hypotheses related to the behavior of users and their use of community-specific language. Future work could develop annotated WSI datasets for online language similar to the standard SemEval benchmarks we used, since models developed directly on this domain may better fit its rich diversity of meanings.
We set a foundation for further investigations on how BERT could help define unknown words or meanings in niche communities, or how linguistic norms vary across communities discussing similar topics. Our community-level analyses could be expanded to measure linguistic similarity between communities and map the dispersion of ideas among them. It is possible that the preferences of some communities towards specific senses are due to words being commonly polysemous and one meaning being particularly relevant to the topic of that community, while others might be linguistic innovations created by users. More research on semantic shifts may help untangle these differences.
Acknowledgments
We are grateful for the helpful feedback of the anonymous reviewers and our action editor, Walter Daelemans. In addition, Olivia Lewke helped us collect and organize subreddits’ glossaries. This work was supported by funding from the National Science Foundation (Graduate Research Fellowship DGE-1752814 and grant IIS-1813470).
Notes
Some Reddit users may have multiple usernames due to the creation of “throwaway” accounts (Leavitt, 2015), but we define a single user by its account username.
The single-sense scores for Amrami and Goldberg (2019) are not reported in their paper. To generate these scores, we ran the default model in their code base directly on the test set using SemEval 2013’s single-sense evaluation key, reporting average performance over ten runs.
To avoid sampling repeated comments written by bots, we disregarded comments where the context window around a target word (five tokens to the left and five tokens to the right) repeat 10 or more times.
We used a Tesla K80 GPU for the majority of these experiments, but we used a TITAN Xp GPU for three of the 474 subreddits for the substitution-based method.