Abstract
We propose a method that uses neural embeddings to improve the performance of any given LDA-style topic model. Our method, called neural embedding allocation (NEA), deconstructs topic models (LDA or otherwise) into interpretable vector-space embeddings of words, topics, documents, authors, and so on, by learning neural embeddings to mimic the topic model. We demonstrate that NEA improves the coherence of the original topic model by smoothing out noisy topics when the number of topics is large. Furthermore, we show NEA's effectiveness and generality in deconstructing and smoothing LDA, author-topic models, and the recent mixed membership skip-gram topic model, achieving better performance with the embeddings than several state-of-the-art models.
1 Introduction
In recent years, methods for automatically learning representations of text data have become an essential part of the Natural Language Processing (NLP) pipeline. Word embedding models such as the skip-gram improve the performance of NLP methods by revealing the latent structural relationship between words (Mikolov et al. 2013a, b). These embeddings have proven valuable for a variety of NLP tasks such as statistical machine translation (Vaswani et al. 2013), part-of-speech tagging, chunking, and named entity recognition (Collobert et al. 2011). Because word vectors encode distributional information, the similarity relationships between the semantic meanings of the words are reflected in the similarity of the vectors (Sahlgren 2008).
On the other hand, topic models such as latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan 2003) construct latent representations of topical themes and of documents. Unlike word embeddings, topic models recover human interpretable semantic themes in the corpus. However, because topic models represent words using only dictionary indices rather than using vector-space embeddings, they are not able to directly capture or leverage the nuanced/distinct similarity relationships between words that are afforded by such embeddings.
We therefore desire a unified method that gains the benefits of both word embeddings (nuanced semantic relationships) and topic models (interpretable topical themes) in a mutually informing manner. A number of models have been proposed that combine aspects of word embeddings and topic models, by modeling them conditionally or jointly (Das, Zaheer, and Dyer 2015; Liu et al. 2015; Nguyen et al. 2015; Moody 2016; Shi et al. 2017; Meng et al. 2020), or by using neural variational inference for topic models (Miao, Yu, and Blunsom 2016; Zhu, He, and Zhou 2020). These models have not yet supplanted standard word embedding and topic modeling techniques in most applications, perhaps due to their complexity.
More recently, transformer-based language models such as BERT (Devlin et al. 2019) and its variant RoBERTa (Liu et al. 2019) have emerged as a powerful technique to improve over word embeddings with state-of-the-art performance at learning text representations for many tasks, but they do not aim to learn representations of topical semantics that are meaningful to humans. Another transformer-based autoregressive language model called GPT (Radford et al. 2018, 2019; Brown et al. 2020) can produce human-like text and also perform various NLP tasks such as text summarization, question answering, textual entailment, and so forth. However, Bender et al. (2021) criticized these large language models for the environmental impact of training and storing the models.
A more parsimonious approach, first used in mixed membership word embeddings (Foulds 2018), and subsequently in the embedded topic model (ETM) (Dieng, Ruiz, and Blei 2020) (proposed independently of this work), is to parameterize a topic model’s categorical distributions via embeddings, thereby obtaining mutually informing topics and embeddings without complicated joint or conditional modeling. These two models improve performance over their corresponding topic models, but do not apply more generally.
Building on the aforementioned line of work, in this article we propose a method for efficiently and accurately training a general class of embedding-parameterized topic models. Our approach, which we call neural embedding allocation (NEA), is to deconstruct topic models by reparameterizing them using vector-space embeddings. Given as input any arbitrary pre-trained topic model that is parameterized by categorical distributions, NEA outputs vector-space embeddings that encode its topical semantics. As well as learning effective embeddings, we demonstrate that the embeddings can be used to improve the quality of the topics. To this end, NEA uses the learned lower-dimensional representations to smooth out noisy topics. It outputs a smoothed version of the categorical topics, which is typically less noisy and more semantically coherent than the original topic model.
We can view our NEA method as learning to mimic a topic model with a skip-gram style embedding model to reveal underlying semantic vector representations. Our approach is thus reminiscent of model distillation for supervised models (Buciluǎ, Caruana, and Niculescu-Mizil 2006; Hinton, Vinyals, and Dean 2015). Inspired by subtle connections between embedding and topic model learning algorithms, we train NEA by minimizing the KL-divergence to the data distribution of the corresponding topic model, using a stream of simulated data from the model (subsets of data are randomly drawn from the topic model’s parameters). The resulting embeddings allow us to:
- (1) improve the coherence of topic models by "smoothing out" noisy topics,
- (2) improve classification performance by producing topic-informed document vectors, and
- (3) construct embeddings and smoothed distributions over general topic modeling variables such as authors.
In short, NEA takes any general off-the-shelf LDA-style topic model, and improves it by making it more coherent and extracting powerful latent representations. Since it takes the pre-trained topic model as input, NEA can achieve this for a range of sophisticated models that extend LDA, such as those with additional latent variables, without complicating inference for the input topic model. This also enables it to be applied in use-cases where we are given a pre-trained topic model by someone else and we do not have access to the original data (e.g., due to privacy concerns).
NEA is related to the ETM (Dieng, Ruiz, and Blei 2020), a variant of neural network-based topic models, and to the mixed membership skip-gram (MMSG) (Foulds 2018), in that they each learn topic models that are parameterized by vector representations of both words and topics. However, whereas ETM and MMSG are models with specific architectures and associated special-purpose learning algorithms for those particular architectures, NEA is an algorithm that learns vector representations for any LDA-style topic model.
We show the benefits and generality of our NEA method by applying it to LDA, author-topic models (ATMs) (Rosen-Zvi et al. 2004), and the mixed membership skip gram topic model (MMSGTM) (Foulds 2018). NEA is compatible with sublinear algorithms for topic models (Li et al. 2014) and embeddings (Mikolov et al. 2013a; Mnih and Kavukcuoglu 2013), thereby readily scaling to tens of thousands of topics, unlike previous topical embedding methods. To the best of our knowledge, NEA is the first general method for improving arbitrary LDA-style topic models via embeddings.1
2 Background
For completeness, and to establish notation, we provide necessary background on topic models and word embeddings.
2.1 Latent Dirichlet Allocation
Probabilistic topic models, for example, LDA (Blei, Ng, and Jordan 2003), use latent variables to encode co-occurrences between words in text corpora and other bag-of-words represented data. A simple way to model text corpora is multinomial naive Bayes with a latent cluster assignment for each document; each cluster is associated with a multinomial distribution over words, called a topic k ∈ {1,…,K}. LDA topic models improve over naive Bayes using mixed membership, by relaxing the condition that all words in a document d belong to the same topic. In LDA's generative process, for each word wdi of a document d, a topic assignment zdi is sampled from the document-topic distribution θ(d), followed by drawing the word from the topic-word distribution ϕ(zdi) (see Table 1, bottom-right). Dirichlet priors encoded by αk and βw are used for these parameters, respectively.
| | Embedding Models | Topic Models |
|---|---|---|
| Words \| Input Word | Skip-gram | Naive Bayes skip-gram topic model (SGTM) |
| | • For each word wi in the corpus | • For each word wi in the corpus |
| | – Draw input word wi ∼ pdata(wi) | – Draw input word wi ∼ pdata(wi) |
| | – For each word wc ∈ context(i) | – For each word wc ∈ context(i) |
| | Draw wc \| wi ∝ exp(vwc′ · vwi) | Draw wc \| wi ∼ Discrete(ϕ(wi)) |
| Words \| Topics | Neural embedding allocation | Latent Dirichlet allocation |
| | • For each document d | • For each document d |
| | – For each word in the document wdi | – For each word in the document wdi |
| | Draw zdi \| d ∼ Discrete(θ(d)) | Draw zdi \| d ∼ Discrete(θ(d)) |
| | Draw wdi \| zdi ∝ exp(vwdi′ · vzdi) | Draw wdi \| zdi ∼ Discrete(ϕ(zdi)) |
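To make the generative view concrete, the following is a minimal sketch of LDA's generative process in NumPy. The corpus sizes, vocabulary, and hyperparameter values are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the paper).
K, V, D, N_d = 5, 1000, 10, 50   # topics, vocabulary, documents, words per document
alpha, beta = 0.1, 0.01          # symmetric Dirichlet hyperparameters

# Topic-word distributions phi^(k) and document-topic distributions theta^(d).
phi = rng.dirichlet(beta * np.ones(V), size=K)      # K x V
theta = rng.dirichlet(alpha * np.ones(K), size=D)   # D x K

corpus = []
for d in range(D):
    doc = []
    for _ in range(N_d):
        z = rng.choice(K, p=theta[d])   # draw topic assignment z_di ~ Discrete(theta^(d))
        w = rng.choice(V, p=phi[z])     # draw word w_di ~ Discrete(phi^(z_di))
        doc.append(w)
    corpus.append(doc)
```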
2.2 Author Topic Model
The ATM is a probabilistic model for both authors and topics that extends LDA to include authorship information (Rosen-Zvi et al. 2004). In the generative process of the ATM, for each word wdi of a document d, an author assignment adi is uniformly chosen from the set of authors Ad, and then a topic assignment zdi is sampled from the author-topic distribution θ(adi), followed by drawing the word from the topic-word distribution ϕ(zdi), as follows:

For each document d
- For each word in the document wdi
  - Draw an author adi ∼ Uniform(Ad)
  - Draw a topic zdi | adi ∼ Discrete(θ(adi))
  - Draw the word wdi | zdi ∼ Discrete(ϕ(zdi))

As for LDA, Dirichlet priors αa and βw are used for the θ(a) and ϕ(z) parameters, respectively.
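As a companion to the process above, here is a minimal sketch of the ATM's generative steps in NumPy; the dimensions, hyperparameter values, and example author set are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, A = 5, 1000, 8                               # topics, vocabulary size, number of authors (assumed)
phi = rng.dirichlet(0.01 * np.ones(V), size=K)     # topic-word distributions phi^(z)
theta = rng.dirichlet(0.1 * np.ones(K), size=A)    # author-topic distributions theta^(a)

def generate_document(author_set, n_words=50):
    """Generate one document given its set of authors A_d, following the ATM's generative process."""
    words, authors, topics = [], [], []
    for _ in range(n_words):
        a = rng.choice(author_set)       # a_di ~ Uniform(A_d)
        z = rng.choice(K, p=theta[a])    # z_di ~ Discrete(theta^(a_di))
        w = rng.choice(V, p=phi[z])      # w_di ~ Discrete(phi^(z_di))
        words.append(w)
        authors.append(a)
        topics.append(z)
    return words, authors, topics

doc_words, doc_authors, doc_topics = generate_document(author_set=[0, 3, 5])
```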
2.3 Word Embeddings
Traditional neural probabilistic language models predict words given their context words using a joint probability for sequences of words in a language (Bengio et al. 2003) based on distributed representations (Hinton et al. 1986) from neural network weights. Later, word embeddings were found to be useful for semantic representations of words, even without learning a full joint probabilistic language model. In particular, the skip-gram model is an effective method for learning better quality vector representations of words from big unstructured text data.
2.4 MMSG Topic Model
We also consider our prior work, a topic model called the MMSGTM (Foulds 2018), which combines ideas from topic models and word embeddings to recover domain specific embeddings for small data (e.g., 2,000 articles). The generative model for MMSGTM is:
For each word wi in the corpus
- Sample a topic zi | wi ∼ Discrete(θ(wi))
- For each word wc ∈ context(i)
  - Sample a context word wc | zi ∼ Discrete(ϕ(zi))
Finally, the MMSG learns word and topic embeddings with the topic assignments z as input and the surrounding context words wc as output. Because the MMSG training algorithm depends on the topic assignments for the whole corpus, it is not scalable to big data, unlike our proposed NEA method (introduced in Section 4.1). NEA is a general method that can be applied to train the MMSG model, and our experiments will demonstrate that it improves over the original MMSG training algorithm, as well as improving the MMSGTM's representations (see Section 5.3).
3 Connections Between Word Embeddings and Topic Models
According to the distributional hypothesis, words that typically occur in similar contexts are likely to have similar meanings (Sahlgren 2008). Hence, the skip-gram’s conditional distributions over context words, and the vector representations that encode these distributions, are expected to be informative of the semantic relationships between words (Mikolov et al. 2013b). Similarly, Griffiths, Steyvers, and Tenenbaum (2007) modeled semantic relationships between words based on LDA, which they successfully used to solve a word association task. This suggests that topic models implicitly encode semantic relationships between words, motivating methods to recover this information, as we shall propose.
In this section, we develop a bridge connecting word embedding methods such as the skip-gram with topic models, which will inform our approach going forward. First, we show how word embedding models such as the skip-gram can be reinterpreted as a version of a corresponding topic model. Then, we show how the learning algorithm for the skip-gram model can be understood as learning to mimic this topic model. We use this perspective to motivate our proposed NEA method in Section 4.
3.1 Interpreting Embedding Models as Topic Models
The relationship between the skip-gram and topic models goes beyond their common ability to recover semantic representations of words. The skip-gram (Mikolov et al. 2013b) and LDA (Blei, Ng, and Jordan 2003) models are summarized in Table 1 (top-left, bottom-right), where we have interpreted the skip-gram, which is discriminative, as a "conditionally generative" model. As the table makes clear, the skip-gram and LDA both model conditional discrete distributions over words; conditioned on an input word in the former, and conditioned on a topic in the latter. To relate the two models, we hence reinterpret the skip-gram's conditional distributions over words as "topics" ϕ(wi), and the input words wi as observed cluster assignments, analogous to topic assignments z. From this perspective, the skip-gram can be understood as a particular "topic model," in which the "topics" are parameterized via embeddings, and are assumed to generate context words.
In more detail, Table 1 (top) shows how the skip-gram (top-left) can be re-interpreted as a certain parameterization of a fully supervised naive Bayes topic model (top-right), which Foulds (2018) calls the (naive Bayes) skip-gram topic model (SGTM). These two models have the same assumed generative process, differing only in how they parameterize the distribution of context words wc given input words wi, namely, p(wc|wi). The skip-gram parameterizes this distribution using embeddings via a log-bilinear model, while the SGTM parameterizes it as a "topic," that is, a discrete distribution over words ϕ(wi). The models are equivalent up to this parameterization.
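The two parameterizations of p(wc|wi) can be written side by side; the notation below (input vectors v, output vectors v′, and topics ϕ) follows the conventions used in Table 1.

```latex
% Skip-gram: log-bilinear parameterization via input/output embeddings
p_{\mathrm{SG}}(w_c \mid w_i) = \frac{\exp(\mathbf{v}'^{\top}_{w_c} \mathbf{v}_{w_i})}
                                     {\sum_{w \in \mathcal{V}} \exp(\mathbf{v}'^{\top}_{w} \mathbf{v}_{w_i})}
% SGTM: the same conditional distribution, parameterized directly as a "topic" per input word
p_{\mathrm{SGTM}}(w_c \mid w_i) = \phi^{(w_i)}_{w_c}
```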
3.2 Interpreting Embedding Model Training as Mimicking a Topic Model
We next show how learning algorithms for the skip-gram are related to the SGTM. We will do this by introducing a variational interpretation of skip-gram training. This interpretation provides a new perspective on the skip-gram learning algorithm: training the skip-gram on a dataset corresponds to learning to mimic the optimal SGTM for that dataset. We first overview our argument before describing it in more precise detail.
3.2.1 Informal Summary of our Argument.
It is well known that maximizing the log likelihood for a model is equivalent to minimizing the KL-divergence to the model’s empirical data distribution (cf. Hinton 2002). When trained via maximum likelihood estimation (MLE), the skip-gram (SG) and its corresponding topic model both aim to approximate this same empirical data distribution. The SGTM can encode any set of conditional discrete distributions, and so its MLE recovers this distribution exactly. Thus, we can see that the skip-gram, trained via MLE, also aims to approximate the MLE skip-gram topic model in a variational sense.
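In symbols, writing \(\hat{p}(w_c \mid w_i)\) for the empirical conditional distribution of context words given input words, the standard identity used in this argument is:

```latex
\hat{\theta}_{\mathrm{MLE}}
  = \arg\max_{\theta} \; \mathbb{E}_{(w_i, w_c) \sim \hat{p}} \big[ \log p_{\theta}(w_c \mid w_i) \big]
  = \arg\min_{\theta} \; \mathbb{E}_{w_i \sim \hat{p}} \Big[ \mathrm{KL}\big( \hat{p}(\cdot \mid w_i) \,\|\, p_{\theta}(\cdot \mid w_i) \big) \Big],
```

since the entropy of \(\hat{p}\) does not depend on \(\theta\).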
3.2.2 More Mathematically Precise Argument.
While the above holds for maximum likelihood training, it should be noted that the skip-gram is more typically trained via negative sampling (NEG) (Mikolov et al. 2013b) or noise contrastive estimation (NCE) (Gutmann and Hyvärinen 2010, 2012), rather than MLE. These methods are, however, used for computational reasons, while MLE is the gold-standard “ideal” training procedure that NEG and NCE aim to approximate. The NCE algorithm was derived as an approximate method for MLE (Gutmann and Hyvärinen 2010, 2012), and NEG (Mikolov et al. 2013b) can be understood as an approximate version of NCE (Dyer 2014). We can therefore view both NCE and NEG as approximately solving the same variational problem as MLE, with some bias in their solutions due to the approximations that they make. In this sense, our claim that “skip-gram training aims to mimic a topic model” extends beyond the idealized MLE training procedure to the NEG and NCE implementations used in practice.3
4 Neural Embedding Allocation
We have seen that the skip-gram model (approximately) minimizes the KL-divergence to the distribution over data at the maximum likelihood estimate of its corresponding topic model. The skip-gram deconstructs its topic model into vector representations that aim to encode the topic model’s distributions over words. We can view this as learning to mimic a topic model with an embedding model.
In the skip-gram model, the training objective is to learn word vector representations that are good at predicting the nearby (or context) words (Mikolov et al. 2013b) as shown in Figure 1(a). The resulting vectors capture semantic relationships between words that were not directly available in its original “topic model” in which words were simply represented as dictionary indices. We therefore propose to apply this same approach, deconstructing topic models into neural embedding models, to other topic models. By doing so, we aim to similarly extract vector representations which encode the semantics of words (and topics, etc.) which were latent in the target topic model’s parameters.
The resulting method, which we refer to as neural embedding allocation (NEA), corresponds to reparameterizing the discrete distributions in topic models with embeddings. The neural embedding model generally loses some model capacity relative to the topic model, but it provides vector representations that encode valuable similarity information between words.
As we shall see, NEA's reconstruction of the discrete distributions also smooths out noisy estimates of the topics, informed by the vectors' similarity patterns, mitigating overfitting in the topic model training. For example, we show the "generative" model for NEA in Table 1 (bottom-left), which reparameterizes the LDA model with topic vectors vk and "output" word vectors vw′ that mimic LDA's topic distributions over words, ϕ(k), by re-encoding them using log-bilinear models. In the generative model, a topic is drawn for each word of a document from θ(d), and the corresponding topic vector is used as the input vector to draw a word via the output word vectors vw′.4 A schematic diagram of the NEA framework for deconstructing and reconstructing (i.e., smoothing out) a topic model is shown in Figure 2.
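Concretely, NEA replaces each categorical topic ϕ(k) with a softmax over inner products of the topic vector and the output word vectors; this is the log-bilinear form referred to above:

```latex
\phi^{(k)}_{\mathrm{NEA}, w} \;=\; p_{\mathrm{NEA}}(w \mid k)
  \;=\; \frac{\exp(\mathbf{v}'^{\top}_{w} \mathbf{v}_{k})}
             {\sum_{w' \in \mathcal{V}} \exp(\mathbf{v}'^{\top}_{w'} \mathbf{v}_{k})} .
```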
4.1 Training NEA for LDA
In addition to word and topic vectors, we construct document vectors by summing the corresponding (normalized) topic vectors according to the pre-trained LDA model's topic assignments Z, for each token of that document.5 We normalize all document vectors to unit length to avoid any impact of document length, producing the final document embeddings V(D). The pseudocode for training NEA to mimic LDA is shown in Algorithm 1.
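The following is a minimal sketch, in the spirit of Algorithm 1, of training NEA to mimic a pre-trained LDA model via negative sampling on simulated data. The minibatch construction, learning rate, unigram noise distribution, and the `topic_prior` used to simulate topic assignments are simplifying assumptions, not the exact choices made in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_nea_for_lda(phi, topic_prior, dim=300, n_steps=100_000, n_neg=5, lr=0.025, seed=0):
    """Sketch: learn topic vectors v_k and output word vectors v'_w that mimic a
    pre-trained LDA model's topic-word distributions phi (a K x V matrix)."""
    rng = np.random.default_rng(seed)
    K, V = phi.shape
    v_topic = 0.01 * rng.standard_normal((K, dim))   # input (topic) vectors
    v_word = 0.01 * rng.standard_normal((V, dim))    # output (word) vectors
    noise = phi.mean(axis=0)                         # assumed unigram noise distribution
    noise = noise / noise.sum()

    for _ in range(n_steps):
        k = rng.choice(K, p=topic_prior)             # simulate a topic assignment
        w = rng.choice(V, p=phi[k])                  # simulate a word from phi^(k)
        targets = np.concatenate(([w], rng.choice(V, size=n_neg, p=noise)))
        labels = np.zeros(n_neg + 1)
        labels[0] = 1.0                              # positive example first, then negatives
        # Negative-sampling gradient step on log sigma(+/- v'_w . v_k).
        scores = v_word[targets] @ v_topic[k]
        grad = sigmoid(scores) - labels              # shape (n_neg + 1,)
        grad_topic = grad @ v_word[targets]
        v_word[targets] -= lr * np.outer(grad, v_topic[k])
        v_topic[k] -= lr * grad_topic
    return v_topic, v_word
```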
4.2 General NEA Algorithm
More generally, the NEA method can be extended to encode any topic model's parameters, which are typically conditional distributions given a single parent assignment, P(ai|parent(ai)), into vector representations V(i), V(i)′ while also providing smoothed versions of the parameters PNEA(ai|parent(ai)). We illustrate the model architecture to train NEA for general topic models in Figure 3. In the embedding steps, for each iteration, we draw samples ai from the conditional discrete distributions for documents, authors, topics, words, and so on, followed by updating the input and output vectors by optimizing log-bilinear classification problems using negative sampling (discussed in Section 4.1). In the smoothing steps, we recover the smoothed versions of the parameters PNEA(ai|parent(ai)) by taking the dot products of the corresponding input and output vectors learned in the embedding steps, followed by a softmax projection onto the simplex. Our NEA algorithm for general topic models is shown in Algorithm 2.
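For the smoothing step, a minimal sketch of recovering PNEA(ai|parent(ai)) from learned input and output vectors, as described above (dot products followed by a softmax), assuming the vectors are stored as NumPy arrays:

```python
import numpy as np

def nea_smoothed_distributions(v_in, v_out):
    """Given input vectors v_in (one per parent value, P x dim) and output vectors
    v_out (one per child value, C x dim), return the smoothed conditional
    distributions P_NEA(child | parent) as a P x C matrix."""
    logits = v_in @ v_out.T
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```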
4.3 Relationship Between NEA and the ETM
Now that we have described NEA, at this juncture it is worth comparing and contrasting it with another related approach, the ETM (Dieng, Ruiz, and Blei 2020). The ETM is a model encoding topics via embedding vectors. Its assumed generative process is identical to the one shown in the bottom-left corner of Table 1. More concretely, the ETM parameterizes each topic with the inner product of vector representations for each word and its topic embedding, followed by a softmax function to produce word probabilities. The ETM further assumes a logistic normal prior on the document-topic proportions θ(d) for each document d. Dieng, Ruiz, and Blei (2020) train the ETM via a variational inference algorithm.
The main similarity between NEA and the ETM is that when NEA is applied to standard LDA, the same underlying generative model is assumed, up to the prior (Table 1). However, there are a number of differences. First, whereas the ETM (Dieng, Ruiz, and Blei 2020) is a model, NEA is an algorithm. While the ETM has a dedicated inference algorithm that applies specifically to that model, NEA is a general-purpose algorithm for deconstructing any given topic model into an embedding representation. For example, in this article we apply it to the ATM (Rosen-Zvi et al. 2004) and to the MMSG (Foulds 2018), in addition to LDA.
Furthermore, their learning algorithms differ significantly. To fit the ETM, we must solve a challenging inference problem over unobserved variables, which generally requires approximate inference algorithms, in this case, variational inference. The learning problem is simpler for NEA because it fits to a pre-trained topic model. NEA optimizes its objective, which aims to reconstruct the target topic model, using a standard stochastic gradient descent method via negative sampling on simulated data from the pre-trained topic model. We speculate that its superior performance to the ETM in our experiments (cf. Section 5.1.1) is due in part to fewer issues with local optima (since the latent variables in the topic model are circumvented), and in part due to avoidance of the need to use a variational approximation.
5 Experiments
The goals of our experiments were to evaluate the NEA algorithm both for topic modeling and as a feature engineering method for classification tasks. The code implementing NEA is provided in the following GitHub link: https://github.com/kkeya1/NEA.
We considered six datasets:
- NIPS (a.k.a. NeurIPS): 1,740 scientific articles from the years 1987–1999, with 2.3M tokens and a dictionary of 13,649 words.
- New York Times: 4,676 articles with a dictionary of 12,042 words. Another version of this corpus, denoted New York Times V2, contains 1.37M documents including all stop words and is used for direct comparison to the ETM.
- Bibtex6: 7,395 references treated as documents, with a dictionary of 1,643 words.
- Reuters-150: 15,500 news wire articles with a dictionary of 8,349 words.
- Ohsumed: 20,000 medical abstracts, where the document classes are 23 cardiovascular diseases.
- Wikipedia: a large corpus of 4.6M articles from the online encyclopedia, with 811M tokens and a dictionary of 7,700 words.

Note that we removed stop words from all datasets as a standard pre-processing step, except for the New York Times V2 dataset.
5.1 Performance for LDA
We start our analysis by evaluating how NEA performs at mimicking and improving LDA topic models. We fixed LDA's hyperparameters at α = 0.1 and β = 0.01 when K < 500, and otherwise used α = 0.01 and β = 0.001. We trained LDA via the Metropolis-Hastings-Walker (MHW) algorithm (Li et al. 2014), due to its scalability in K. For NEA, NEG was performed for 1 million minibatches of size 128 with 300-dimensional embeddings. We also considered an ensemble model in which each topic is chosen from either LDA or its corresponding NEA reconstruction, whichever has the higher coherence.
5.1.1 Comparison to ETM, LDA, and Other Baselines.
For a direct comparison to the reported results for our strongest baseline, the ETM (Dieng, Ruiz, and Blei 2020) (a model that parameterizes a topic model similar to LDA with embeddings and is trained via variational inference), we first study the performance of NEA on another version of the New York Times corpus (New York Times V2), which contains 1.37M documents with a dictionary size of 10,283 including all stop words.7 In Table 2 we directly report the results from Dieng, Ruiz, and Blei (2020) for LDA, the Δ-NVDM (a variant of the multinomial factor model of documents), and the labeled ETM (a variant of the ETM with pre-trained embeddings) with K = 300.8 Specifically, Dieng, Ruiz, and Blei (2020) report the normalized pointwise mutual information (NPMI) (Lau, Newman, and Baldwin 2014) coherence metric, topic diversity (the percentage of unique words in the top 25 words of all topics), and overall "topic quality" (simply the product of NPMI and diversity). NEA, trained to mimic MHW LDA (an approximation of LDA chosen for scalability in the number of topics), clearly outperformed the state-of-the-art ETM and the other baselines on all metrics on New York Times in the presence of stop words. Its success was due in part to clustering the stop words into separate topics rather than mixing them with the other topics, as LDA does; Table 3 illustrates this, showing that NEA forms high-quality topics while isolating stop words in a separate topic (see last topic, italicized).
| Model | NPMI | Diversity | Quality |
|---|---|---|---|
| LDA | 0.13 | 0.14 | 0.0173 |
| MHW LDA | 0.15 | 0.10 | 0.0152 |
| Δ-NVDM | 0.17 | 0.11 | 0.0187 |
| Labeled ETM | 0.18 | 0.22 | 0.0405 |
| NEA | 0.26 | 0.27 | 0.0693 |
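As a reference for how the diversity and quality columns in Table 2 are defined, here is a minimal sketch of the two computations (topic diversity as the fraction of unique words among the top 25 words of all topics, and quality as the product of NPMI and diversity); the NPMI value itself is assumed to come from an external coherence implementation.

```python
import numpy as np

def topic_diversity(phi, top_n=25):
    """Percentage of unique words in the top-n words of all topics (phi is K x V)."""
    top_words = np.argsort(-phi, axis=1)[:, :top_n]
    return len(np.unique(top_words)) / top_words.size

def topic_quality(npmi, diversity):
    """Overall topic quality, defined as the product of NPMI coherence and diversity."""
    return npmi * diversity
```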
| LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA |
|---|---|---|---|---|---|---|---|---|---|
the | republicans | the | book | the | health | and | wine | the | of |
to | democrats | of | books | health | patients | the | restaurant | of | is |
republican | republican | and | read | to | doctors | with | restaurants | and | in |
state | senator | to | author | of | medical | of | dishes | in | the |
for | senate | book | write | for | hospitals | is | food | to | be |
mr | democrat | is | authors | and | care | in | dinner | for | and |
senate | democratic | in | reading | care | drug | at | bar | is | on |
on | election | books | pages | in | drugs | to | menu | on | for |
democrats | governor | that | magazine | medical | patient | wine | street | as | at |
republicans | campaign | it | readers | drug | hospital | restaurant | wines | at | as |
Since clustering the stop words into separate topics is also one of the advantages of the ETM, it is notable that NEA outperforms the ETM on this task according to the performance measurements in Table 2. Note that even though the model parameterization of NEA mimicking MHW LDA is the same as that of the ETM, the training algorithm is very different. Because NEA fits to a pre-trained topic model, the learning problem is easier, and is likely less vulnerable to local optima. The ETM also needs to make a variational approximation, which may hurt its performance.
5.1.2 Quality of Topics
In the previous section, we demonstrated that NEA constructs high-quality topics on New York Times V2 corpus by disallowing the stop words to be mixed with other topics and by clustering stop words as separate topics. Here, we performed a comprehensive analysis on the other datasets, where stop words were removed, to better understand the performance of NEA. The analysis in this section demonstrates the advantage of NEA over LDA even if a dataset does not contain any stop words.
To get a quantitative comparison, we compared the topics’ UMass coherence metric, which measures the semantic quality of a topic based on its T most probable words (we choose T = 10 words), thereby quantifying the user’s viewing experience (Mimno et al. 2011). Larger coherence values indicate greater co-occurrence of the words, hence higher quality topics. Coherence is very closely related to NPMI but is simpler, more widely used, and correlates similarly with human judgment. In Figure 4, the average topic coherence of LDA, NEA, and their ensemble model9 is shown with respect to the number of topics K. LDA works well with small K values, but when K becomes large, NEA outperforms LDA in average topic coherence scores on all datasets. Since the ensemble model chooses the best topics between NEA and LDA, it always performs the best.
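A minimal sketch of the UMass coherence computation described above (log ratios of document co-occurrence counts over a topic's top T words), which can also be used to build the per-topic LDA/NEA ensemble by keeping whichever version of a topic scores higher; the representation of documents as sets of word IDs is an assumption for illustration.

```python
import numpy as np

def umass_coherence(top_words, docs_as_sets):
    """UMass coherence of one topic: sum over word pairs (w_m, w_l), m > l, of
    log((D(w_m, w_l) + 1) / D(w_l)), where D counts documents containing the word(s)."""
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            wm, wl = top_words[m], top_words[l]
            d_l = sum(1 for d in docs_as_sets if wl in d)
            d_ml = sum(1 for d in docs_as_sets if wl in d and wm in d)
            score += np.log((d_ml + 1.0) / max(d_l, 1))  # guard against zero counts
    return score

def ensemble_topic(lda_top_words, nea_top_words, docs_as_sets):
    """Pick the higher-coherence version of a topic (LDA vs. its NEA reconstruction)."""
    return max([lda_top_words, nea_top_words],
               key=lambda tw: umass_coherence(tw, docs_as_sets))
```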
We also conduct qualitative analysis that is complementary to our experiment in Figure 4. Instead of reporting results for the wide range of number of topics K, as in Figure 4, we pick some example K for the ease of demonstration in the qualitative analysis. First, we found that NEA generally recovers the same top words for LDA’s “good topics.” Most of the topics produced by both the LDA and NEA models are interpretable, and NEA was able to approximately recover the original LDA’s topics. In Table 4, we show a few randomly selected example topics from LDA and NEA, while LDA was trained on the NIPS corpus for K =2,000. In Table 5, we show the four worst topics from LDA, based on per-topic coherence score, and their corresponding NEA topics, on NIPS for K =7,000. In this case, NEA generated noticeably more meaningful topics than LDA. For fairness, we also show the four worst topics reconstructed by NEA, based on per-topic coherence score, and their corresponding LDA-generated topics for the same model in Table 6. In this case, LDA perhaps generates slightly more meaningful topics than NEA, although the relative performance is somewhat subjective. Note that in practice, we can always use the ensemble approach, choosing the best topic between LDA and NEA based on coherence.
| LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA |
|---|---|---|---|---|---|---|---|
| TC: −3.016 | TC: −1.270 | TC: −1.577 | TC: −1.272 | TC: −2.578 | TC: −1.376 | TC: −1.584 | TC: −1.316 |
bayesian | bayesian | images | images | phrase | sentences | regression | regression |
prior | bayes | image | image | sentences | phrase | linear | linear |
bayes | posterior | recognition | visual | clause | structure | ridge | ridge |
posterior | priors | vision | recognition | structure | sentence | quadratic | quadratic |
framework | likelihood | pixel | pixels | sentence | clause | squared | variables |
priors | prior | techniques | pixel | phrases | activation | nonparametric | nonparametric |
likelihood | framework | pixels | illumination | syntactic | connectionist | dimensionality | squared |
bars | note | visual | intensity | connectionist | phrases | variables | multivariate |
note | probability | computed | pairs | tolerance | roles | smoothing | kernel |
compute | bars | applied | matching | previous | agent | friedman | basis |
| LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA |
|---|---|---|---|---|---|---|---|
| TC: −8.184 | TC: −1.204 | TC: −8.023 | TC: −1.062 | TC: −7.984 | TC: −1.390 | TC: −7.787 | TC: −0.798 |
corresponds | parameters | symbolics | values | ryan | learning | paths | total |
change | important | addressing | case | learning | methods | close | paths |
cut | neural | choice | increase | bit | text | path | global |
exact | change | perturbing | systems | inhibited | space | make | path |
coincides | results | radii | rate | nice | combined | numbering | time |
duplicates | report | lefted | point | automatica | averaging | channels | fixed |
volatility | cut | damping | feedback | tucson | area | rep | function |
trapping | multiple | merits | input | infinitely | apply | scalars | yields |
reading | experiments | vax | reduces | stacked | recognition | anism | close |
ters | minimizing | unexplored | stage | exceeded | bit | viously | computation |
| LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA |
|---|---|---|---|---|---|---|---|
| TC: −2.027 | TC: −4.690 | TC: −2.615 | TC: −3.737 | TC: −2.433 | TC: −3.712 | TC: −3.183 | TC: −3.696 |
blake | models | strain | structure | insertion | space | learning | learning |
condensation | exp | mars | length | hole | reinforcement | steps | steps |
isard | blake | yield | variance | gullapalli | learning | computer | computer |
models | similar | rolling | equal | reinforcement | fig | testing | testing |
observations | condensation | mill | mars | smoothed | insertion | observation | people |
entire | modified | cart | strain | reactive | hole | predetermined | bin |
oxford | generally | tuning | weight | extreme | fit | cheng | observation |
rabiner | cortical | material | intelligence | ram | gullapalli | utilizes | efficient |
gelb | isard | friedman | cycle | gordon | maximum | efficient | utilizes |
north | consisting | plot | friedman | consecutive | regions | updating | birth |
In Table 7, we show the 4 topics with the largest improvement in coherence scores by NEA, for Reuters-150 with 7,000 topics. We observe that these LDA topics were uninterpretable, and likely had very few words assigned to them. NEA tends to improve the quality of these “bad” topics, for example, by replacing noisy or stop words with more semantically related ones. In particular, we found that NEA gave the most improvement for topics with few words assigned to them (see Figure 5 (left)) and when K becomes large, the majority of topics have few assigned words (see Figure 5 (right)). As a result, NEA improves the quality of most of the topics.
| LDA | NEA | LDA | NEA | LDA | NEA | LDA | NEA |
|---|---|---|---|---|---|---|---|
| TC: −18.928 | TC: −2.601 | TC: −19.367 | TC: −3.120 | TC: −20.805 | TC: −4.844 | TC: −17.906 | TC: −2.035 |
share | International | tonnes | announced | blah | blah | dlrs | debt |
pittsburgh | common | yr | tonnes | aa | company | aa | canadian |
aa | share | aa | addition | aaa | account | aaa | today |
aaa | pittsburgh | aaa | asked | ab | advantage | ab | canada |
ab | general | ab | accounts | abandon | acquisitions | abandon | decline |
abandon | agreement | abandon | shares | abandoned | loss | abandoned | competitive |
abandoned | tender | abandoned | surplus | abc | proposed | abc | conditions |
abc | market | abc | secretary | abdul | considered | abdul | dlrs |
abdul | june | abdul | heavy | aberrational | announced | aberrational | price |
aberrational | dividend | aberrational | held | abide | base | abide | week |
To further study this phenomenon, we showcase the improvement by the NEA model of "bad topics," those with fewer than 200 words assigned to them, for all datasets in Figure 6. For such topics, NEA leads to an improvement in coherence in almost all cases.
5.1.3 Performance for LDA Model on Big Data (Wikipedia).
In this experiment, we evaluated NEA in a big data setting, using the Wikipedia corpus with K = 10,000 topics. We scaled up LDA using a recent online inference algorithm for high-dimensional topic models, called SparseSCVB0 (Islam and Foulds 2019), which leverages both stochasticity and sparsity. The big data LDA model was trained on Wikipedia for 72 hours using SparseSCVB0 while NEA was trained on the SparseSCVB0 parameters for 24 hours with 128-dimensional embeddings. In Table 8, we see that NEA improves SparseSCVB0’s average topic coherence and topic diversity on the Wikipedia dataset.
5.2 Performance for Author-Topic Model (ATM)
| Random Chance | ATM | NEA-Embed | TF-IDF | NEA-Smooth |
|---|---|---|---|---|
0.043 | 0.083 | 0.086 | 0.091 | 0.106 |
Similarly to LDA, Figure 7 shows that NEA improves the ATM’s topics while the ensemble model outperforms NEA in terms of per-topic coherence, on the NIPS corpus with K = 1,000.
5.3 Performance for Mixed Membership Skip-Gram Topic Model (MMSGTM)
We trained NEA for the MMSGTM (Foulds 2018) using the same hyperparameter values as in previous experiments, while setting MMSGTM-specific hyperparameters to the values suggested by Foulds (2018). The original MMSG algorithm learns topic embeddings based on the MMSGTM’s topic assignments Z, while NEA uses simulated data from the topic model. NEA is arguably more principled than the algorithm of Foulds (2018) due to its global variational objective. We found that NEA smooths and improves the speed of the training process (shown in Figure 8), while greatly reducing memory requirements as the topic assignments Z need not be stored for NEA.
5.4 Downstream Task: Document Categorization
In this set of experiments, we tested the performance of the learned vectors by using NEA's document embeddings V(D) as features for document categorization/classification. We used two standard document categorization benchmark datasets: Reuters-150 and Ohsumed.10 We used the standard train/test splits from the literature (e.g., for Ohsumed, 50% of the documents were assigned to the training set and 50% to the test set). Note that we also held out 20% of the documents from the training data as a validation set to select hyperparameters, including the number of topics and the embedding size, via grid search in terms of accuracy on the validation set. The documents in the validation set were merged back into the training data after completing hyperparameter tuning. Logistic regression classifiers were trained on the features extracted on the training set for each method, while classification accuracy was computed on the held-out test data. Continuing from the previous section, we first evaluated document categorization with NEA and MMSG, where both models were trained based on the MMSGTM (Table 10). Both NEA and MMSG improved document categorization accuracy compared to the MMSGTM on Reuters-150 and Ohsumed, while NEA performed the best.
| Datasets | MMSGTM | MMSG | NEA |
|---|---|---|---|
Reuters-150 | 66.97 | 67.72 | 68.59 |
Ohsumed | 32.41 | 33.63 | 34.89 |
We studied NEA's performance in more detail in the context of LDA, which was trained with the same hyperparameters used in Section 5.1.2. We compared NEA with LDA and several popular models, including the SG (Mikolov et al. 2013a, b), the paragraph vector (Doc2Vec) (Le and Mikolov 2014), and a convolutional neural network (CNN) (Kim 2014). All baseline models were trained using the hyperparameters reported in the corresponding literature. The results are given in Table 11.
| Datasets | #Classes | #Topics | Doc2Vec | LDA | NEA | CNN | SG | SG+LDA | SG+NEA |
|---|---|---|---|---|---|---|---|---|---|
Reuters-150 | 116 | 500 | 55.89 | 64.26 | 67.15 | 69.43 | 70.80 | 69.13 | 72.29 |
Ohsumed | 23 | 500 | 34.02 | 32.05 | 34.38 | 27.17 | 37.26 | 37.33 | 38.88 |
We found that NEA had better classification accuracy than LDA and Doc2Vec. The CNN showed inconsistent results, outperforming Doc2Vec, LDA, and NEA on Reuters-150, but performing very poorly on Ohsumed. In NEA, the document vectors are encoded at the topic level rather than the word level, so they lose word-level information, which turned out to be beneficial for these specific classification tasks; accordingly, SG features outperformed NEA's features. Interestingly, however, when the SG and NEA features were concatenated (SG + NEA), the classification performance improved over each model's individual performance. This suggests that the topic-level NEA and word-level SG vectors complement each other and that both are valuable for performance.
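A minimal sketch of the feature-combination setup described above, using scikit-learn; the feature matrices are assumed to be precomputed (e.g., NEA document embeddings V(D) and averaged skip-gram vectors), and the regularization settings are illustrative rather than the ones used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def classify_with_features(train_feats, test_feats, y_train, y_test):
    """Train a logistic regression classifier on (possibly concatenated) document features."""
    X_train = np.hstack(train_feats)   # e.g., [sg_train, nea_train] for SG + NEA
    X_test = np.hstack(test_feats)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Example usage (assumed, precomputed feature matrices):
# acc_sg_nea = classify_with_features([sg_train, nea_train], [sg_test, nea_test], y_train, y_test)
```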
Note that TF-IDF, which is notoriously effective for document categorization, outperformed all embeddings. In Table 12, we show the results when concatenating TF-IDF with the other feature vectors from LDA, SG, and NEA, which in many cases improved performance over TF-IDF alone. We observed the highest improvement over TF-IDF for both document categorization tasks when we concatenated NEA vectors with TF-IDF (TF-IDF + NEA), although the difference was not statistically significant. This may be because the topical information in NEA features is complementary to TF-IDF, while SG’s word-based features were redundant, and hence actually reduced performance. While the differences in accuracy between the NEA-concatenated features and the best baselines (SG, TF-IDF) were not statistically significant, the NEA-concatenated features X + NEA outperformed the unconcatenated features X in all 6 cases in Tables 11 and 12 (X ∈{SG,TF-IDF,TF-IDF +SG}, computed over both datasets). The improvement of X + NEA over X was statistically significant according to a Wilcoxon signed-rank test (2-sided, p < 0.05).
| Datasets | #Classes | #Topics | TF-IDF | TF-IDF+LDA | TF-IDF+SG | TF-IDF+NEA | TF-IDF+SG+NEA |
|---|---|---|---|---|---|---|---|
Reuters-150 | 116 | 500 | 73.00 | 73.01 | 72.99 | 73.14 | 73.09 |
Ohsumed | 23 | 500 | 43.07 | 43.05 | 43.04 | 43.11 | 43.08 |
5.5 Case Study: Application to Mitigating Sociolinguistic Bias in Author Embeddings
We conducted a case study to demonstrate the practical use and benefits of the NEA method. The goal of the study was to investigate its use in identifying and mitigating gender bias in natural language processing models (Bolukbasi et al. 2016; Caliskan, Bryson, and Narayanan 2017; Gonen and Goldberg 2019). Differing patterns of language usage that are correlated with protected characteristics such as gender, race, age, and nationality can facilitate unwanted discrimination, called sociolinguistic bias, and this can be encoded by machine learning models (Deshpande, Pan, and Foulds 2020). To address this issue, our approach was to learn representations of the bloggers that encode their salient topical interests but not irrelevant gender information, a task known as fair representation learning (Zemel et al. 2013). Such debiased representations would potentially be valuable for fairness in recommendation systems, resume filtering for hiring purposes, information retrieval, and so forth. For this experiment, we used the Blog Authorship corpus (Schler et al. 2006), which consists of 681,288 posts of 19,320 bloggers (approximately 35 posts per person) with their self-provided gender information (male or female). Similarly to our experiments above, we trained an ATM on the dataset with K = 1,000 topics, and then trained NEA to mimic the ATM. We then used the NEA model for both visualization and debiasing purposes, as discussed below.11
5.5.1 Visualizing NEA Embeddings for Male and Female Authors.
We first used the NEA embeddings to visualize blog posts written by male and female authors, and thus expose any differences in their distributions that may potentially be a source of bias in downstream machine learning models. Because NEA allows us to learn latent vector-space embeddings for words, topics, documents, authors, and so on, we can analyze the demographic differences with corresponding embeddings. In Figure 9, we show the t-SNE projected NEA embeddings for authors to explore the relationship between them in terms of gender: male (green asterisks) and female (orange dots). We found that there is some partial separation between the male and female author embeddings in the t-SNE space, indicating that there are indeed systematic differences in topics between male and female authors. For example, more female authors are located on the upper region and more male authors are located on the lower right region of the t-SNE space, though many authors regardless of their gender partially overlapped as well, particularly in the middle of the t-SNE space.
We also found several clusters of authors in the t-SNE space that were dominated by a particular gender. As shown in Figure 9, we investigated this phenomenon by selecting two clusters to inspect more closely: Cluster-A, which was dominated by female authors, and Cluster-B, which was dominated by male authors. For each cluster, to show the zoomed-in visualizations, we picked the top-10 nearest authors from the cluster center and annotated them with the most similar topics of these authors in terms of the Euclidean distance between NEA-generated topics and corresponding author embeddings. These female- and male-author dominated clusters (Cluster-A and Cluster-B, respectively) showed a distinct topical trend on the t-SNE space. As seen in the figure, Cluster-A’s authors are close to topics relating to emotions, while Cluster-B’s authors are close to topics relating to the Internet, business, and travel. Furthermore, the Cluster-A topics may not be salient to the authors’ topical interests in a downstream task, while potentially acting as a proxy variable for gender which could encode sociolinguistic bias (Barocas and Selbst 2016). The NEA embedding-based visualization was thus helpful for showing that debiasing interventions are likely to be impactful for this dataset.
5.5.2 Most Gendered Topics Per Bias Direction
We can analyze the demographic bias by computing a bias direction (Bolukbasi et al. 2016; Dev and Phillips 2019; Islam et al. 2019, 2021) with respect to the protected attribute (e.g., gender or race). Following Islam et al. (2021), we constructed overall male (vm) and female (vf) vectors by taking the average of the NEA-generated author embeddings for male and female authors, respectively, and hence computed the overall gender bias direction vB = (vm − vf)/∥vm − vf∥. To get the "most male" and "most female" topics, we computed the dot products of the NEA topic embeddings with vB and sorted them accordingly. Note that the largest and smallest dot products correspond to the most male- and most female-associated topics, respectively. Finally, we report the top male and female topics in Table 13. The results show that male and female authors tend to use very different topics with very distinct content, suggesting that there are subtle (and not-so-subtle) differences in the use of language between authors of different genders. For instance, the top male topics identified by the method were related to the military, politics, business, and computers, while the selected top female topics were related to moods, interpersonal relations, and informal language.
Male | soldier | kerry | efforts | web | fox | company | america |
government | campaign | element | urllink | republican | market | attack | |
iraq | election | useful | software | mission | industry | americans | |
report | dean | particularly | service | draft | successful | nation | |
unite | chief | conference | program | map | sales | political | |
foreign | moore | merely | windows | goals | increase | source | |
fund | political | effectively | page | pursue | benefit | conservative | |
military | bush | simply | resolution | hire | politics | ||
weapons | vote | television | source | agent | manager | evidence | |
countries | state | expression | network | expand | offer | generation | |
Female | care | hehe | shes | cuz | girls | bout | haha |
stop | wow | girlfriend | lil | friend | bday | den | |
things | rain | shell | kinda | dance | hav | dun | |
figure | ugh | girl | tho | boyfriend | luv | dunno | |
mind | yep | flower | goin | guy | sum | sch | |
days | hahahaha | shed | wanna | fun | talkin | lor | |
happy | hehehe | theyll | lol | conversation | nite | wad | |
dont | absolutely | hes | omg | night | jus | rite | |
mean | whew | birthday | mad | wait | wit | wanna | |
ill | geez | girls | gotta | talk | nothin | sia |
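A minimal sketch of how the gender bias direction and the most gendered topics can be computed from the NEA author and topic embeddings, under the assumption that the embeddings are stored as NumPy arrays and the gender labels as a boolean mask.

```python
import numpy as np

def gender_bias_direction(author_vecs, is_male):
    """Bias direction v_B: normalized difference of the mean male and mean female author embeddings."""
    v_m = author_vecs[is_male].mean(axis=0)
    v_f = author_vecs[~is_male].mean(axis=0)
    v_b = v_m - v_f
    return v_b / np.linalg.norm(v_b)

def most_gendered_topics(topic_vecs, v_b, top_n=10):
    """Topics sorted by dot product with v_B: largest = most male-associated, smallest = most female-associated."""
    scores = topic_vecs @ v_b
    order = np.argsort(-scores)
    return order[:top_n], order[-top_n:][::-1]   # (most male, most female)
```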
The NEA method allowed us to identify the gender-associated topics, information which can potentially help to mitigate gender bias in topic models and other machine learning models trained on this dataset. For instance, the most gender-associated topics could be removed from the set of features used by a classifier or recommendation system. Potential applications of this debiasing approach include fair recommendation of bloggers to followers, or more generally, combating discrimination in AI-based resume filtering for hiring (Deshpande, Pan, and Foulds 2020) and reducing gender bias in a model for recommending which articles should be cited by a new manuscript.
5.5.3 Mitigating Gender Bias in Blog Author Embeddings.
In this experiment, we demonstrate a debiasing approach to mitigate gender bias in NEA-generated author embeddings. The Blog Authorship corpus contains the categories of bloggers with respect to the contents in their blog posts. There are 40 categories of bloggers such as advertising, arts, banking, education, engineering, fashion, law, religion, science, sports, technology, and so on. Our goal was to learn representations of blog authors that captured content information, that is, by being predictive of the category labels, without encoding irrelevant gender information exposing sociolinguistic bias.
Table 14 shows accuracy and fairness metrics on held-out data for the LR models trained on the different feature sets. These features were: the ATM’s original (i.e., not debiased) author-topic distribution features and their debiased version (via removing the top-10 most male and most female topics), the original NEA-generated author embeddings, and the two debiased NEA embedding approaches (the simple method which removes the most gendered topics prior to constructing author embeddings, and our gold standard version which uses the linear projection of authors in Equation (9)). We found that the LR model trained on our linear projection-based debiased author embeddings was the fairest model. It substantially outperformed all other models in terms of all fairness metrics, and as desired, it had the lowest accuracy in predicting gender compared to the other models, which indicates lower dependence between the authors’ representations and their gender. Although the LR model with ATM’s original author-topic distributions had the highest accuracy for blogger categorization, gender prediction accuracy was also highest, which is undesirable in this context. Moreover, this model performed worse in terms of the fairness metrics compared to all of the debiased models. Finally, we also found that LR models with our debiased author representations slightly improved the bloggers categorization accuracy compared to the LR model with the original author embeddings. It may seem counter-intuitive that debiasing improves accuracy, but this likely occurred because fairness interventions can reduce overfitting, which in some cases can result in improved accuracy due to improved generalization to unseen data (Keya et al. 2021; Islam, Pan, and Foulds 2021). In this case, removing gendered topics that are irrelevant to the categorization class labels was a win-win for both accuracy and fairness.
| Models | Gender Prediction: Accuracy ↓ | Blogger Categorization: Accuracy ↑ | ϵ-DF ↓ | δ-DP ↓ | p%-Rule ↑ |
|---|---|---|---|---|---|
| LR on original author-topic distributions | 0.743 | 0.387 | 0.436 | 0.041 | 64.664 |
| LR on debiased author-topic distributions (gendered topics removed) | 0.738 | 0.384 | 0.384 | 0.039 | 68.119 |
| LR on original author embeddings | 0.684 | 0.370 | 0.459 | 0.033 | 63.185 |
| LR on debiased author embeddings (gendered topics removed) | 0.672 | 0.372 | 0.429 | 0.029 | 65.133 |
| LR on debiased author embeddings (linear projection) | 0.457 | 0.373 | 0.183 | 0.013 | 83.270 |
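The fairness metrics in Table 14 can be computed from a trained classifier's held-out predictions together with the authors' gender labels. The sketch below shows one plausible way to compute δ-DP (demographic parity gap), the p%-rule, and ϵ-DF (differential fairness, a bound on the log-ratio of outcome rates across groups), treating each predicted category as the outcome of interest in turn; the smoothing constant and aggregation over outcomes are assumptions and may differ from the paper's implementation.

```python
import numpy as np

def fairness_metrics(y_pred, gender, smoothing=1e-6):
    """Compute delta-DP, p%-rule, and epsilon-DF from held-out predictions.

    y_pred: (N,) predicted category labels.
    gender: (N,) binary protected attribute (0 or 1).
    Rates are lightly smoothed to avoid division by zero and log of zero.
    """
    outcomes = np.unique(y_pred)
    groups = [y_pred[gender == g] for g in (0, 1)]

    delta_dp, p_rule, eps_df = 0.0, 100.0, 0.0
    for y in outcomes:
        # P(y_hat = y | gender = g) for each group, with smoothing
        rates = [np.mean(preds == y) + smoothing for preds in groups]
        delta_dp = max(delta_dp, abs(rates[0] - rates[1]))                # parity gap
        p_rule = min(p_rule, 100.0 * min(rates) / max(rates))             # p%-rule
        eps_df = max(eps_df, abs(np.log(rates[0]) - np.log(rates[1])))    # differential fairness
    return delta_dp, p_rule, eps_df
```

Lower δ-DP and ϵ-DF, and a higher p%-rule, indicate less dependence of the categorization decisions on gender, matching the arrows in Table 14.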
6 Related Work
Some prior research has aimed to combine aspects of topic models and word embeddings. The Gaussian LDA model (Das, Zaheer, and Dyer 2015) aims to improve topic modeling using the semantic information encoded in word embeddings; however, the topics do not inform the embeddings. The reverse is true for the topical word embedding model (Liu et al. 2015), which uses LDA topic assignments of words to improve the resulting word embeddings. The Skip-gram Topical Embedding model (Shi et al. 2017) learns embeddings and topics jointly by conditioning the embeddings on topics as well as words, alternating between updates for each model component in an EM-style algorithm. There are also other neurally inspired ways of combining LDA and embeddings, such as mixing the likelihood of LDA with that of the skip-gram model (Nguyen et al. 2015), learning word vectors jointly with document-level distributions of topic vectors (Moody 2016), or learning continuous dense document representations with neural variational inference (Miao, Yu, and Blunsom 2016). The main similarity between NEA and the above methods is that each incorporates word embeddings and topics within a single model or algorithm. These approaches aim to achieve synergy by conditioning a topic model on word embeddings, conditioning a word embedding model on topics, or jointly modeling and training both.
In contrast to these methods, instead of conditional or joint modeling of embeddings and topics, our NEA methodology views embeddings and topic distributions as alternate representations of the same model. The idea of parameterizing a topic model via embeddings was previously used by the ETM (Dieng, Ruiz, and Blei 2020), a variant of neural network-based topic models that trains using vector representations of both words and topics, and by the MMSG (Foulds 2018). Our goals and methods are somewhat different. The ETM and MMSG are models with their own specific topic modeling architecture, while NEA is an algorithm that is applicable to general LDA-style topic models. By learning to re-represent a topic model as an embedding model, our NEA method uses this representation to smooth the topics of a given topic model to improve coherence. It also produces topical embeddings that encode information which may be complementary to traditional neural network-based word embeddings (Mikolov et al. 2013b).
The utility of word embeddings can be further improved using feature-based approaches such as ELMo (Peters et al. 2018) or fine-tuning approaches such as BERT (Devlin et al. 2019) along with pre-trained language representations. These models can construct contextual representations, in which word representations are influenced by the context in which the words appear. Such models currently provide state-of-the-art performance at representation learning for many (perhaps even most) natural language processing tasks, particularly when using deep architectures based on the transformer (Vaswani et al. 2017), pre-trained on big data and fine-tuned on task-specific data. We view our work as complementary to that line of research, in that we focus on improving the coherence of topic models, which create interpretable representations designed for human consumption, rather than focusing on uninterpretable big data models designed for accurate prediction.
Work has recently begun on combining topic models with contextual embeddings and transformer-based language models, although much remains to be done. One method, called tBERT (Peinelt, Nguyen, and Liakata 2020), feeds both BERT embeddings and topics as features into a neural network that performs semantic similarity detection. This approach fuses information from both topics and BERT to solve a downstream task, but it does not aim to fundamentally unify the models in order to improve their representational abilities.
Another approach is to identify topics by performing clustering on the embeddings produced by models such as BERT, which can be done on vocabulary-level word embeddings (Sia, Dalmia, and Mielke 2020), document embeddings (Grootendorst 2022), or contextual word embeddings (Thompson and Mimno 2020). This strategy aims to learn a topic model based on a given (BERT-style) neural probabilistic language model, while NEA aims to learn a (word embedding-style) neural probabilistic language model based on a given topic model. An extension of NEA that uses contextual word embeddings is an exciting potential avenue for future research.
7 Conclusion and Future Work
We have proposed neural embedding allocation (NEA), a method for improving general LDA-style topic models by deconstructing them to reveal underlying semantic vector representations. Our experimental results show that NEA improves the coherence of several diverse topic models and outperforms them on many tasks. We demonstrated the practical utility of the NEA algorithm by using it to address gender bias in NLP.
In future work, we plan to extend NEA to leverage transformer models such as BERT. For example, instead of learning fixed word embeddings based on a topic model, we could adapt NEA to learn BERT-style contextual word embeddings for each token while jointly learning topic and document embeddings. These embeddings could be seeded from a pre-trained BERT model in order to capture knowledge from a big data corpus. In this NEA extension, when simulating a word w from an LDA topic model in order to train the embedding model to mimic it, we would retrieve a context sentence from the corpus in which w occurs and use the BERT model to convert the fixed "input" word embedding to a contextual embedding. Thus, the topic embeddings would be trained on contextual rather than fixed word representations, and the contextual word embeddings would be further fine-tuned within the NEA process. A further possible extension is to make the topic embeddings contextual as well.
Alternatively, instead of extending NEA to mimic a traditional topic model while leveraging BERT's contextual embeddings, we could adapt NEA to perform the reverse task: learning a topic model based on a given BERT-style model. Just as NEA deconstructs a topic model to learn hidden vector representations, the NEA methodology could be leveraged to deconstruct a BERT-style model to recover hidden topic vectors. To accomplish this, consider a variation of BERT in which the contextual embeddings Ti for each token i are each mapped to one of a set of K topic embeddings, corresponding to the token's topic assignment zi, before performing BERT's training tasks such as masked language modeling. We would then use a version of the NEA algorithm to teach this "BERT topic model" to mimic the original target BERT model's behavior on its pre-training and/or fine-tuning tasks. This approach would "compress" the BERT model into a smaller topic model that aims to encode its latent semantic knowledge.
Acknowledgments
This work was performed under the following financial assistance award: 60NANB18D227 from U.S. Department of Commerce, National Institute of Standards and Technology. This material is based upon work supported by the National Science Foundation under grants IIS2046381 and IIS1850023. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Notes
With sufficiently high-dimensional vectors, the skip-gram will be able to solve this optimization problem exactly and perfectly reconstruct the MLE SGTM, assuming that a global optimum can be found. For example, if V = W, the skip-gram can trivially encode any set of "topic" distributions Φ by setting each input word's vector to a one-hot vector that selects a single dimension of the output embeddings, which encodes p(wc|wi). Thus, we can see that whenever V ≥ W, the skip-gram can encode any topic, including those of the SGTM's MLE. Alternatively, if the skip-gram cannot encode the MLE topic model due to having too low dimensionality, which is more typically the case in practice, it will find a local optimum in the KL-divergence objective function. This is actually desirable, as the embeddings are forced to perform compression when encoding the topics, which forces the embeddings to capture patterns in the data, and hence encode meaningful information.
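As a concrete illustration of this one-hot construction, the following toy NumPy sketch (illustrative only, not the paper's code) builds one-hot input vectors and log-probability output vectors for a tiny vocabulary and checks that the skip-gram softmax reproduces an arbitrary set of conditional distributions p(wc|wi) exactly when the embedding dimensionality equals the vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)
W = 5                                   # vocabulary size; embedding dimension set equal to W

# An arbitrary set of target conditionals p(w_c | w_i), one row per input word.
target = rng.random((W, W))
target /= target.sum(axis=1, keepdims=True)

V_in = np.eye(W)                        # input vectors: one-hot, v_{w_i} = e_i
V_out = np.log(target).T                # output vectors: dimension i holds log p(. | w_i)

logits = V_in @ V_out.T                 # row i: v_{w_i} . v'_{w_c} for every w_c
recon = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax over w_c

assert np.allclose(recon, target)       # skip-gram reproduces the target distributions exactly
```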
As a side note, we can also see from Equation (6) that the SGTM and SG’s MLEs can be completely computed using the input/output word co-occurrence count matrix as sufficient statistics. The skip-gram then has a global objective function that can be defined in terms of the word co-occurrence matrix, and the development of the GloVe model (Pennington, Socher, and Manning 2014) as an alternative with a global objective function seems unnecessary in hindsight. Levy and Goldberg (2014) further illustrated a closely related point by constructing global training objectives for NEG and NCE based on matrix factorization interpretations of these methods.
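To illustrate this remark, here is a minimal sketch of the skip-gram's global (full-softmax) negative log-likelihood written purely in terms of a word co-occurrence count matrix; the function and variable names are illustrative, and in practice approximations such as NEG or NCE are used rather than the full softmax.

```python
import numpy as np

def skipgram_global_nll(counts, V_in, V_out):
    """Global skip-gram negative log-likelihood from co-occurrence counts alone.

    counts[i, c]: number of times word c appears in the context of word i
                  (the sufficient statistics mentioned in the note).
    V_in, V_out:  (W, D) input and output embedding matrices.
    """
    logits = V_in @ V_out.T                               # (W, W) scores
    m = logits.max(axis=1, keepdims=True)                 # stabilized log-softmax over contexts
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return -float((counts * log_probs).sum())
```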
We can also consider a model variant where θ(d) is reparameterized using a log-bilinear model; however, we obtained better performance by constructing document vectors based on topic vectors, as below.
Note that other aggregation approaches, such as concatenation, can also be used here, but for a large number of topics the concatenation approach may suffer from the curse of dimensionality.
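For illustration, the sketch below contrasts the two aggregation options discussed in these notes, assuming document-topic proportions theta (num_docs × K) and topic vectors (K × D) are available; the weighted-mean construction is an assumption about the form referred to in the main text, and the names are illustrative.

```python
import numpy as np

def doc_vectors_weighted_mean(theta, topic_vecs):
    """Document vectors as topic-proportion-weighted averages of topic vectors.
    theta: (num_docs, K); topic_vecs: (K, D); returns (num_docs, D)."""
    return theta @ topic_vecs

def doc_vectors_concat(theta, topic_vecs):
    """Alternative: concatenate topic vectors scaled by their proportions.
    Dimensionality grows as K * D, which becomes unwieldy for large K."""
    num_docs, K = theta.shape
    D = topic_vecs.shape[1]
    return (theta[:, :, None] * topic_vecs[None, :, :]).reshape(num_docs, K * D)
```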
The New York Times V2 with stop words corpus was obtained from https://github.com/adjidieng/ETM.
We directly compare our NEA model with the reported performance of Δ-NVDM and labeled ETM in Table 4 of the Dieng, Ruiz, and Blei (2020) paper on exactly the same dataset.
Each topic is chosen between LDA and its corresponding NEA reconstruction, whichever has the highest coherence.
Document categorization datasets available at http://disi.unitn.it/moschitti/corpora.htm.
This analysis is an observational study on one particular dataset. Therefore, our results should not be used to support any claims about the nature of gender differences, their causes, or their implications.
References