Abstract
Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To this end, we develop the embedded topic model (etm), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the etm models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the etm, we develop an efficient amortized variational inference algorithm. The etm discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
1 Introduction
Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012). Topic models and their extensions have been applied to many fields, such as marketing, sociology, political science, and the digital humanities. Boyd-Graber et al. (2017) provide a review.
Most topic models build on latent Dirichlet allocation (lda) (Blei et al., 2003). lda is a hierarchical probabilistic model that represents each topic as a distribution over terms and represents each document as a mixture of the topics. When fit to a collection of documents, the topics summarize their contents, and the topic proportions provide a low-dimensional representation of each document. lda can be fit to large datasets of text by using variational inference and stochastic optimization (Hoffman et al., 2010, 2013).
lda is a powerful model and it is widely used. However, it suffers from a pervasive technical problem—it fails in the face of large vocabularies. Practitioners must severely prune their vocabularies in order to fit good topic models—namely, those that are both predictive and interpretable. This is typically done by removing the most and least frequent words. On large collections, this pruning may remove important terms and limit the scope of the models. The problem of topic modeling with large vocabularies has yet to be addressed in the research literature.
In parallel with topic modeling came the idea of word embeddings. Research in word embeddings begins with the neural language model of Bengio et al. (2003), published in the same year and journal as Blei et al. (2003). Word embeddings eschew the “one-hot” representation of words—a vocabulary-length vector of zeros with a single one—to learn a distributed representation, one where words with similar meanings are close in a lower-dimensional vector space (Rumelhart and Abrahamson, 1973; Bengio et al., 2006). As for topic models, researchers scaled up embedding methods to large datasets (Mikolov et al., 2013a,b; Pennington et al., 2014; Levy and Goldberg, 2014; Mnih and Kavukcuoglu, 2013). Word embeddings have been extended and developed in many ways. They have become crucial in many applications of natural language processing (Maas et al., 2011; Li and Yang, 2018), and they have also been extended to datasets beyond text (Rudolph et al., 2016).
In this paper, we develop the embedded topic model (etm), a document model that marries lda and word embeddings. The etm enjoys the good properties of topic models and the good properties of word embeddings. As a topic model, it discovers an interpretable latent semantic structure of the documents; as a word embedding model, it provides a low-dimensional representation of the meaning of words. The etm robustly accommodates large vocabularies and the long tail of language data.
Figure 1 illustrates the advantages. This figure shows the ratio between the perplexity on held-out documents (a measure of predictive performance) and the topic coherence (a measure of the quality of the topics), as a function of the size of the vocabulary. (The perplexity has been normalized by the vocabulary size.) This is for a corpus of 11.2K articles from 20Newsgroups and for 100 topics. The red line is lda; its performance deteriorates as the vocabulary size increases—the predictive performance and the quality of the topics get worse. The blue line is the etm; it maintains good performance, even as the vocabulary size becomes large.
Like lda, the etm is a generative probabilistic model: Each document is a mixture of topics and each observed word is assigned to a particular topic. In contrast to lda, the per-topic conditional probability of a term has a log-linear form that involves a low-dimensional representation of the vocabulary. Each term is represented by an embedding and each topic is a point in that embedding space. The topic’s distribution over terms is proportional to the exponentiated inner product of the topic’s embedding and each term’s embedding. Figures 2 and 3 show topics from a 300-topic etm of The New York Times. The figures show each topic’s embedding and its closest words; these topics are about Christianity and sports.
Representing topics as points in the embedding space allows the etm to be robust to the presence of stop words, unlike most topic models. When stop words are included in the vocabulary, the etm assigns topics to the corresponding area of the embedding space (we demonstrate this in Section 6).
As for most topic models, the posterior of the topic proportions is intractable to compute. We derive an efficient algorithm for approximating the posterior with variational inference (Jordan et al., 1999; Hoffman et al., 2013; Blei et al., 2017) and additionally use amortized inference to efficiently approximate the topic proportions (Kingma and Welling, 2014; Rezende et al., 2014). The resulting algorithm fits the etm to large corpora with large vocabularies. This algorithm can either use previously fitted word embeddings, or fit them jointly with the rest of the parameters. (In particular, Figures 1 to 3 were made using the version of the etm that uses pre-fitted skip-gram word embeddings.)
We compared the performance of the etm to lda, the neural variational document model (nvdm) (Miao et al., 2016), and prodlda (Srivastava and Sutton, 2017).1 The nvdm is a form of multinomial matrix factorization and prodlda is a modern version of lda that uses a product of experts to model the distribution over words. We also compare to a document model that combines prodlda with pre-fitted word embeddings. The etm yields better predictive performance, as measured by held-out log-likelihood on a document completion task (Wallach et al., 2009b). It also discovers more meaningful topics, as measured by topic coherence (Mimno et al., 2011) and topic diversity. The latter is a metric we introduce in this paper that, together with topic coherence, gives a better indication of the quality of the topics. The etm is especially robust to large vocabularies.
2 Related Work
This work develops a new topic model that extends lda. lda has been extended in many ways, and topic modeling has become a subfield of its own. For a review, see Blei (2012) and Boyd-Graber et al. (2017).
A broader set of related works are neural topic models. These mainly focus on improving topic modeling inference through deep neural networks (Srivastava and Sutton, 2017; Card et al., 2017; Cong et al., 2017; Zhang et al., 2018). Specifically, these methods reduce the dimension of the text data through amortized inference and the variational auto-encoder (Kingma and Welling, 2014; Rezende et al., 2014). To perform inference in the etm, we also avail ourselves of amortized inference methods (Gershman and Goodman, 2014).
As a document model, the etm also relates to works that learn per-document representations as part of an embedding model (Le and Mikolov, 2014; Moody, 2016; Miao et al., 2016; Li et al., 2016). In contrast to these works, the document variables in the etm are part of a larger probabilistic topic model.
One of the goals in developing the etm is to incorporate word similarity into the topic model, and there is previous research that shares this goal. These methods either modify the topic priors (Petterson et al., 2010; Zhao et al., 2017b; Shi et al., 2017; Zhao et al., 2017a) or the topic assignment priors (Xie et al., 2015). For example, Petterson et al. (2010) use a word similarity graph (as given by a thesaurus) to bias lda towards assigning similar words to similar topics. As another example, Xie et al. (2015) model the per-word topic assignments of lda using a Markov random field to account for both the topic proportions and the topic assignments of similar words. These methods use word similarity as a type of “side information” about language; in contrast, the etm directly models the similarity (via embeddings) in its generative process of words.
However, a more closely related set of works directly combine topic modeling and word embeddings. One common strategy is to convert the discrete text into continuous observations of embeddings, and then adapt lda to generate real-valued data (Das et al., 2015; Xun et al., 2016; Batmanghelich et al., 2016; Xun et al., 2017). With this strategy, topics are Gaussian distributions with latent means and covariances, and the likelihood over the embeddings is modeled with a Gaussian (Das et al., 2015) or a Von-Mises Fisher distribution (Batmanghelich et al., 2016). The etm differs from these approaches in that it is a model of categorical data, one that goes through the embeddings matrix. Thus it does not require pre-fitted embeddings and, indeed, can learn embeddings as part of its inference process. The etm also differs from these approaches in that it is amenable to large datasets with large vocabularies.
There are a few other ways of combining lda and embeddings. Nguyen et al. (2015) mix the likelihood defined by lda with a log-linear model that uses pre-fitted word embeddings; Bunk and Krestel (2018) randomly replace words drawn from a topic with their embeddings drawn from a Gaussian; Xu et al. (2018) adopt a geometric perspective, using Wasserstein distances to learn topics and word embeddings jointly; and Keya et al. (2019) propose the neural embedding allocation (nea), which has a similar generative process to the etm but is fit using a pre-fitted lda model as a target distribution. Because it requires lda, the nea suffers from the same limitations as lda. Moreover, these models are typically fit with Gibbs sampling, which limits their scalability both to large corpora and to large vocabularies.
3 Background
The etm builds on two main ideas, lda and word embeddings. Consider a corpus of D documents, where the vocabulary contains V distinct terms. Let wdn ∈{1,…,V } denote the nth word in the dth document.
Latent Dirichlet Allocation.
lda is a probabilistic generative model of documents (Blei et al., 2003). It posits K topics β1:K, each of which is a distribution over the vocabulary. lda assumes each document comes from a mixture of topics, where the topics are shared across the corpus and the mixture proportions are unique for each document. The generative process for each document is the following:
1. Draw topic proportions θd ∼Dirichlet(αθ).
2. For each word n in the document:
   a. Draw topic assignment zdn ∼Cat(θd).
   b. Draw word wdn ∼Cat(βzdn).
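Marginalizing the topic assignments, this process implies the following marginal likelihood for document wd:

p(wd | αθ, β1:K) = ∫ p(θd ; αθ) ∏n ∑k θdk βk,wdn dθd,

where βk,wdn is the probability of word wdn under topic k. The integral over θd is what makes exact posterior inference intractable and motivates the approximate inference methods mentioned above.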
Word Embeddings.
Word embeddings provide models of language that use vector representations of words (Rumelhart and Abrahamson, 1973; Bengio et al., 2003). The word representations are fitted to relate to meaning, in that words with similar meanings will have representations that are close. (In embeddings, the “meaning” of a word comes from the contexts in which it is used [Harris, 1954].)
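In particular, the continuous bag-of-words (cbow) variant of word embeddings (Mikolov et al., 2013b) models each word as drawn from its surrounding context,

wdn ∼softmax(ρ⊤αdn),  (1)

where ρ is the L × V word embedding matrix whose columns contain the word embeddings, and the context vector αdn is the sum of the context embeddings of the words surrounding wdn. Eq. 1 is the likelihood that the etm mirrors in its generative process (Section 4).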
4 The Embedded Topic Model
The etm is a topic model that uses embedding representations of both words and topics. It contains two notions of latent dimension. First, it embeds the vocabulary in an L-dimensional space. These embeddings are similar in spirit to classical word embeddings. Second, it represents each document in terms of K latent topics.
In traditional topic modeling, each topic is a full distribution over the vocabulary. In the etm, however, the kth topic is a vector αk ∈ℝL in the embedding space. We call αk a topic embedding— it is a distributed representation of the kth topic in the semantic space of words.
In its generative process, the etm uses the topic embedding to form a per-topic distribution over the vocabulary. Specifically, the etm uses a log-linear model that takes the inner product of the word embedding matrix and the topic embedding. With this form, the etm assigns high probability to a word v in topic k by measuring the agreement between the word’s embedding and the topic’s embedding.
Denote the L × V word embedding matrix by ρ; the column ρv is the embedding of term v. Under the etm, the generative process of the dth document is the following:
1. Draw topic proportions θd ∼LN(0,I).
2. For each word n in the document:
   a. Draw topic assignment zdn ∼Cat(θd).
   b. Draw the word wdn ∼softmax(ρ⊤αzdn).
Steps 1 and 2a are standard for topic modeling: They represent documents as distributions over topics and draw a topic assignment for each observed word. Step 2b is different; it uses the embeddings of the vocabulary ρ and the assigned topic embedding to draw the observed word from the assigned topic, as given by zdn.
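Written as a distribution over the vocabulary, the kth topic implied by Step 2b is

βk = softmax(ρ⊤αk),

whose entry βkv is the probability of term v under topic k; it is large when the embedding of v and the topic embedding αk agree.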
The topic distribution in Step 2b mirrors the cbow likelihood in Eq. 1. Recall cbow uses the surrounding words to form the context vector αdn. In contrast, the etm uses the topic embedding as the context vector, where the assigned topic zdn is drawn from the per-document variable θd. The etm draws its words from a document context, rather than from a window of surrounding words.
The etm likelihood uses a matrix of word embeddings ρ, a representation of the vocabulary in a lower dimensional space. In practice, it can either rely on previously fitted embeddings or learn them as part of its overall fitting procedure. When the etm learns the embeddings as part of the fitting procedure, it simultaneously finds topics and an embedding space.
When the etm uses previously fitted embeddings, it learns the topics of a corpus in a particular embedding space. This strategy is particularly useful when there are words in the embedding that are not used in the corpus. The etm can hypothesize how those words fit into the topics because it can calculate the topic probabilities βkv even for words v that do not appear in the corpus.
5 Inference and Estimation
We are given a corpus of documents {w1,…,wD}, where the dth document wd is a collection of Nd words. How do we fit the etm to this corpus?
The Marginal Likelihood.
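The parameters of the etm are the word embeddings ρ and the topic embeddings α1:K. We fit them by maximizing the log marginal likelihood of the documents,

L(α, ρ) = ∑d log p(wd | α, ρ).  (2)

The difficulty is that the marginal likelihood of each document is intractable to compute. Writing δd for the untransformed topic proportions, with δd ∼ N(0, I) and θd = softmax(δd),

p(wd | α, ρ) = ∫ p(δd) ∏n p(wdn | δd, α, ρ) dδd,  (3)

where the conditional distribution of each word marginalizes out its topic assignment zdn,

p(wdn | δd, α, ρ) = ∑k θdk βk,wdn,  (4)

and βk = softmax(ρ⊤αk) is the per-topic distribution over the vocabulary from Section 4. The integral in Eq. 3 has no closed form.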
Variational Inference.
We sidestep the intractable integral in Eq. 3 with variational inference (Jordan et al., 1999; Blei et al., 2017). Variational inference optimizes a sum of per-document bounds on the log of the marginal likelihood of Eq. 3.
To begin, posit a family of distributions of the untransformed topic proportions q(δd ; wd,ν). This family of distributions is parameterized by ν. We use amortized inference, where q(δd ; wd,ν) (called a variational distribution) depends on both the document wd and shared parameters ν. In particular, q(δd ; wd,ν) is a Gaussian whose mean and variance come from an “inference network,” a neural network parameterized by ν (Kingma and Welling, 2014). The inference network ingests a bag-of-words representation of the document wd and outputs the mean and covariance of δd. (To accommodate documents of varying length, we form the input of the inference network by normalizing the bag-of-words representation of the document by the number of words Nd.)
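The following PyTorch sketch illustrates this inference network and the resulting per-document bound. The layer widths, the softplus activations, and the names (ETMEncoder, elbo_terms) are illustrative choices for exposition, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ETMEncoder(nn.Module):
    """Inference network for q(delta_d ; w_d, nu): bag-of-words -> Gaussian."""
    def __init__(self, vocab_size, num_topics, hidden=800):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)       # mean of delta_d
        self.logvar = nn.Linear(hidden, num_topics)   # log-variance of delta_d

    def forward(self, bow):
        # Normalize the bag-of-words input by the document length N_d.
        x = bow / bow.sum(dim=1, keepdim=True)
        h = self.body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization: delta_d = mu + sigma * eps, eps ~ N(0, I).
        delta = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        theta = F.softmax(delta, dim=1)               # topic proportions theta_d
        return theta, mu, logvar

def elbo_terms(theta, mu, logvar, bow, beta):
    """Per-document bound with one Monte Carlo sample of delta_d.

    beta is the K x V matrix of topics, beta_k = softmax(rho^T alpha_k).
    """
    recon = (bow * torch.log(theta @ beta + 1e-10)).sum(1)      # E_q[log p(w_d | delta_d)]
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1)  # KL(q || N(0, I))
    return recon - kl
```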
6 Empirical Study
We study the performance of the etm and compare it to other unsupervised document models. A good document model should provide both coherent patterns of language and an accurate distribution of words, so we measure performance in terms of both predictive accuracy and topic interpretability. We measure accuracy with log-likelihood on a document completion task (Rosen-Zvi et al., 2004; Wallach et al., 2009b); we measure topic interpretability as a blend of topic coherence and diversity. We find that, of the interpretable models, the etm is the one that provides better predictions and topics.
In a separate analysis (Section 6.1), we study the robustness of each method in the presence of stop words. Standard topic models fail in this regime—because stop words appear in many documents, every learned topic includes some stop words, leading to poor topic interpretability. In contrast, the etm is able to use the information from the word embeddings to provide interpretable topics.
Corpora.
We study the 20Newsgroups corpus and the New York Times corpus; the statistics of both corpora are summarized in Table 1.
Dataset | Minimum df | #Tokens Train | #Tokens Valid | #Tokens Test | Vocabulary
---|---|---|---|---|---
20Newsgroups | 100 | 604.9 K | 5,998 | 399.6 K | 3,102
 | 30 | 778.0 K | 7,231 | 512.5 K | 8,496
 | 10 | 880.3 K | 6,769 | 578.8 K | 18,625
 | 5 | 922.3 K | 8,494 | 605.9 K | 29,461
 | 2 | 966.3 K | 8,600 | 622.9 K | 52,258
New York Times | 5,000 | 226.9 M | 13.4 M | 26.8 M | 9,842
 | 200 | 270.1 M | 15.9 M | 31.8 M | 55,627
 | 100 | 272.3 M | 16.0 M | 32.1 M | 74,095
 | 30 | 274.8 M | 16.1 M | 32.3 M | 124,725
 | 10 | 276.0 M | 16.1 M | 32.5 M | 212,237
The 20Newsgroups corpus is a collection of newsgroup posts. We preprocess the corpus by tokenizing it, filtering stop words, and removing words with document frequency above 70%. To form the vocabulary, we keep all words that appear in more than a certain number of documents, and we vary the threshold from 100 (a smaller vocabulary, where V = 3,102) to 2 (a larger vocabulary, where V = 52,258). After preprocessing, we further remove one-word documents from the validation and test sets. We split the corpus into a training set of 11,260 documents, a test set of 7,532 documents, and a validation set of 100 documents.
The New York Times corpus is a larger collection of news articles. It contains more than 1.8 million articles, spanning the years 1987–2007. We follow the same preprocessing steps as for 20Newsgroups. We form versions of this corpus with vocabularies ranging from V = 9,842 to V = 212,237. After preprocessing, we use 85% of the documents for training, 10% for testing, and 5% for validation.
Models.
We compare the performance of the etm against several document models. We briefly describe each below.
We consider latent Dirichlet allocation (lda) (Blei et al., 2003), a standard topic model that posits Dirichlet priors for the topics βk and topic proportions θd. (We set the prior hyperparameters to 1.) It is a conditionally conjugate model, amenable to variational inference with coordinate ascent. We consider lda because it is the most commonly used topic model, and it has a similar generative process as the etm.
We also consider the neural variational document model (nvdm) (Miao et al., 2016). The nvdm is a multinomial factor model of documents; it posits the likelihood wdn ∼softmax(β⊤θd), where the K-dimensional vector θd ∼ N(0, IK) is a per-document latent variable and β is a real-valued matrix of size K × V. The nvdm uses the per-document real-valued latent vector θd to average over the embedding matrix β in the logit space. Like the etm, the nvdm uses amortized variational inference to jointly learn the approximate posterior over the document representation θd and the model parameter β.
nvdm is not interpretable as a topic model; its latent variables are unconstrained. We study a more interpretable variant of the nvdm which constrains θd to lie in the simplex, replacing its Gaussian prior with a logistic normal (Aitchison and Shen, 1980). (This can be thought of as a semi-nonnegative matrix factorization.) We call this document model Δ-nvdm.
We also consider prodlda (Srivastava and Sutton, 2017). It posits the likelihood wdn ∼softmax(β⊤θd), where the topic proportions θd lie on the simplex. In contrast to lda, the topic matrix β is unconstrained.
prodlda shares the generative model with Δ-nvdm but it is fit differently. prodlda uses amortized variational inference with batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014).
Finally, we consider a document model that combines prodlda with pre-fitted word embeddings ρ, by using the likelihood wdn ∼softmax(ρ⊤θd). We call this document model prodlda-PWE, where PWE stands for Pre-fitted Word Embeddings.
We study two variants of the etm, one where the word embeddings are pre-fitted and one where they are learned jointly with the rest of the parameters. The variant with pre-fitted embeddings is called the etm-PWE.
For prodlda-PWE and the etm-PWE, we first obtain the word embeddings (Mikolov et al., 2013b) by training skip-gram on each corpus. (We reuse the same embeddings across the experiments with varying vocabulary sizes.)
Algorithm Settings.
Given a corpus, each model comes with an approximate posterior inference problem. We use variational inference for all of the models and employ svi (Hoffman et al., 2013) to speed up the optimization. The minibatch size is 1,000 documents. For lda, we set the learning rate as suggested by Hoffman et al. (2013): the delay is 10 and the forgetting factor is 0.85.
Within svi, lda enjoys coordinate ascent variational updates; we use five inner steps to optimize the local variables. For the other models, we use amortized inference over the local variables θd. We use 3-layer inference networks and we set the local learning rate to 0.002. We use ℓ2 regularization on the variational parameters (the weight decay parameter is 1.2 × 10−6).
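For concreteness, one way to instantiate these settings for the amortized models is sketched below, reusing the ETMEncoder and elbo_terms sketch from Section 5. The use of Adam and the particular vocabulary and topic sizes are assumptions; the text fixes only the learning rate, the minibatch size, and the weight-decay coefficient.

```python
import torch

# Illustrative sizes; only lr, minibatch size, and weight decay come from the text.
encoder = ETMEncoder(vocab_size=52258, num_topics=100)
optimizer = torch.optim.Adam(encoder.parameters(),
                             lr=0.002, weight_decay=1.2e-6)

def training_step(bow_batch, beta):
    """One stochastic step on a minibatch of 1,000 bag-of-words vectors."""
    optimizer.zero_grad()
    theta, mu, logvar = encoder(bow_batch)
    loss = -elbo_terms(theta, mu, logvar, bow_batch, beta).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```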
Qualitative Results.
We first examine the embeddings. The etm, nvdm, Δ-nvdm, and prodlda all learn word embeddings. We illustrate them by fixing a set of terms and showing the closest words in the embedding space (as measured by cosine distance). For comparison, we also illustrate word embeddings learned by the skip-gram model.
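The nearest-neighbor lists can be produced with a small helper like the following sketch, which ranks words by cosine similarity to a query; the array and function names are illustrative.

```python
import numpy as np

def nearest_neighbors(query, embeddings, vocab, k=6):
    """Closest words to `query` by cosine similarity.

    embeddings: V x L array of word vectors (the transpose of the matrix rho
    in the text); vocab: list of V terms.
    """
    idx = vocab.index(query)
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[idx]
    order = np.argsort(-sims)
    return [vocab[i] for i in order if i != idx][:k]
```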
Table 2 illustrates the embeddings of the different models. All the methods provide interpretable embeddings—words with related meanings are close to each other. The etm, the nvdm, and prodlda learn embeddings that are similar to those from the skip-gram. The embeddings of Δ-nvdm are different; the simplex constraint on the local variable and the inference procedure change the nature of the embeddings.
Skip-gram embeddings | etm embeddings |
---|---|---|---|---|---|---|---|
love | family | woman | politics | love | family | woman | politics |
loved | families | man | political | joy | children | girl | political |
passion | grandparents | girl | religion | loves | son | boy | politician |
loves | mother | boy | politicking | loved | mother | mother | ideology |
affection | friends | teenager | ideology | passion | father | daughter | speeches |
adore | relatives | person | partisanship | wonderful | wife | pregnant | ideological |
nvdm embeddings | Δ-nvdm embeddings | ||||||
love | family | woman | politics | love | family | woman | politics |
loves | sons | girl | political | miss | home | life | political |
passion | life | women | politician | young | father | marriage | faith |
wonderful | brother | man | politicians | born | son | women | marriage |
joy | son | pregnant | politically | dream | day | read | politicians |
beautiful | lived | boyfriend | democratic | younger | mrs | young | election |
prodlda embeddings | |||||||
love | family | woman | politics | ||||
loves | husband | girl | political | ||||
affection | wife | boyfriend | politician | ||||
sentimental | daughters | boy | liberal | ||||
dreams | sister | teenager | politicians | ||||
laugh | friends | ager | ideological |
We next look at the learned topics. Table 3 displays the seven most used topics for all methods, as given by the average of the topic proportions θd. lda and both variants of the etm provide interpretable topics. The rest of the models do not provide interpretable topics; their matrices β are unconstrained and thus are not interpretable as distributions over the vocabulary that mix to form documents. Δ-nvdm also suffers from this effect although it is less apparent (see, e.g., the fifth listed topic for Δ-nvdm).
lda |
---|---|---|---|---|---|---|
time | year | officials | mr | city | percent | state |
day | million | public | president | building | million | republican |
back | money | department | bush | street | company | party |
good | pay | report | white | park | year | bill |
long | tax | state | clinton | house | billion | mr |
nvdm | ||||||
scholars | japan | gansler | spratt | assn | ridership | pryce |
gingrich | tokyo | wellstone | tabitha | assoc | mtv | mickens |
funds | pacific | mccain | mccorkle | qtr | straphangers | mckechnie |
institutions | europe | shalikashvili | cheetos | yr | freierman | mfume |
endowment | zealand | coached | vols | nyse | riders | filkins |
Δ-nvdm | ||||||
concerto | servings | nato | innings | treas | patients | democrats |
solos | tablespoons | soviet | scored | yr | doctors | republicans |
sonata | tablespoon | iraqi | inning | qtr | medicare | republican |
melodies | preheat | gorbachev | shutout | outst | dr | senate |
soloist | minced | arab | scoreless | telerate | physicians | dole |
prodlda | ||||||
temptation | grasp | electron | played | amato | briefly | giant |
repressed | unruly | nuclei | lou | model | precious | boarding |
drowsy | choke | macal | greg | delaware | serving | bundle |
addiction | drowsy | trained | bobby | morita | set | distance |
conquering | drift | mediaone | steve | dual | virgin | foray |
prodlda-PWE | ||||||
mercies | cheesecloth | scoreless | chapels | distinguishable | floured | gillers |
lockbox | overcook | floured | magnolias | cocktails | impartiality | lacerated |
pharm | strainer | hitless | asea | punishable | knead | polshek |
shims | kirberger | asterisk | bogeyed | checkpoints | refrigerate | decimated |
cp | browned | knead | birdie | disobeying | tablespoons | inhuman |
etm-PWE | ||||||
music | republican | yankees | game | wine | court | company |
dance | bush | game | points | restaurant | judge | million |
songs | campaign | baseball | season | food | case | stock |
opera | senator | season | team | dishes | justice | shares |
concert | democrats | mets | play | restaurants | trial | billion |
etm | ||||||
game | music | united | wine | company | yankees | art |
team | mr | israel | food | stock | game | museum |
season | dance | government | sauce | million | baseball | show |
coach | opera | israeli | minutes | companies | mets | work |
play | band | mr | restaurant | billion | season | artist |
Quantitative Results.
We next study the models quantitatively. We measure the quality of the topics and the predictive performance of the model. We found that among the models with interpretable topics, the etm provides the best predictions.
The idea behind topic coherence is that a coherent topic will display words that tend to occur in the same documents. In other words, the most likely words in a coherent topic should have high mutual information. Document models with higher topic coherence are more interpretable topic models.
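Concretely, one standard instantiation of this idea scores each pair of a topic's most likely words by its normalized pointwise mutual information,

f(wi, wj) = (log P(wi, wj) − log P(wi) − log P(wj)) / (−log P(wi, wj)),

where P(wi, wj) is the probability that wi and wj co-occur in a document and P(wi) is the marginal probability of wi; the coherence of a model is the average of this score over the top words of each topic. (The number of top words used is an implementation choice; 10 is common.)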
We combine coherence with a second metric, topic diversity. We define topic diversity to be the percentage of unique words in the top 25 words of all topics. Diversity close to 0 indicates redundant topics; diversity close to 1 indicates more varied topics.
We define the overall quality of a model’s topics as the product of its topic diversity and topic coherence.
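The two interpretability metrics can be computed as in the following sketch; beta is assumed to be a K × V array of per-topic word probabilities, and the coherence value is taken as given.

```python
import numpy as np

def topic_diversity(beta, top_n=25):
    """Percentage of unique words among the top-25 words of all topics.

    beta: K x V array of per-topic word probabilities.
    """
    tops = [np.argsort(-row)[:top_n] for row in beta]
    unique_words = set(np.concatenate(tops).tolist())
    return len(unique_words) / (top_n * len(beta))

def topic_quality(coherence, diversity):
    """Overall topic quality: the product of coherence and diversity."""
    return coherence * diversity
```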
A good topic model also provides a good distribution of language. To measure predictive power, we calculate log likelihood on a document completion task (Rosen-Zvi et al., 2004; Wallach et al., 2009b). We divide each test document into two sets of words. The first half is observed: it induces a distribution over topics which, in turn, induces a distribution over the next words in the document. We then evaluate the second half under this distribution. A good document model should provide high log-likelihood on the second half. (For all methods, we approximate the likelihood by setting θd to the variational mean.)
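The document completion evaluation can be sketched as follows; the encoder and the K × V topic matrix beta stand in for the fitted model components of Section 5, and θd is set to the variational mean as described above.

```python
import torch

def completion_log_likelihood(bow_first_half, bow_second_half, encoder, beta):
    """Held-out log-likelihood of the second half of each test document.

    Both halves are bag-of-words tensors of shape D x V; encoder is a fitted
    inference network such as the ETMEncoder sketched in Section 5.
    """
    with torch.no_grad():
        _, mu, _ = encoder(bow_first_half)
        theta = torch.softmax(mu, dim=1)   # variational mean of the proportions
        word_probs = theta @ beta          # induced distribution over the vocabulary
        log_lik = (bow_second_half * torch.log(word_probs + 1e-10)).sum(1)
    return log_lik
```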
We study both corpora with different vocabulary sizes. Figures 4 and 5 show the interpretability of the topics as a function of predictive power. (To ease visualization, we exponentiate the topic quality and normalize all metrics by subtracting the mean and dividing by the standard deviation across methods.) The best models are in the upper right corner.
lda predicts worst in almost all settings. On 20Newsgroups, the nvdm’s predictions are in general better than lda’s but worse than those of the other methods; on the New York Times, the nvdm gives the best predictions. However, topic quality for the nvdm is far below that of the other methods. (It does not provide “topics,” so we assess the interpretability of its β matrix.) In prediction, both versions of the etm are at least as good as the simplex-constrained Δ-nvdm. More importantly, both versions of the etm outperform prodlda-PWE, signaling that the etm provides a better way of integrating word embeddings into a topic model.
These figures show that, of the interpretable models, the etm provides the best predictive performance while keeping interpretable topics. It is robust to large vocabularies.
6.1 Stop Words
We now study a version of the New York Times corpus that includes all stop words. We remove infrequent words to form a vocabulary of size 10,283. Our goal is to show that the etm-PWE provides interpretable topics even in the presence of stop words, another regime where topic models typically fail. In particular, given that stop words appear in many documents, traditional topic models learn topics that contain stop words, regardless of the actual semantics of the topic. This leads to poor topic interpretability. There are extensions of topic models specifically designed to cope with stop words (Griffiths et al., 2004; Chemudugunta et al., 2006; Wallach et al., 2009a); our goal here is not to establish comparisons with these methods but to show the performance of the etm-PWE in the presence of stop words.
We fit lda, the Δ-nvdm, the prodlda-PWE, and the etm-PWE with K = 300 topics. (We do not report the nvdm because it does not provide interpretable topics.) Table 4 shows the topic quality (the product of topic coherence and topic diversity). Overall, the etm-PWE gives the best performance in terms of topic quality.
Method | Topic coherence | Topic diversity | Quality |
---|---|---|---|
lda | 0.13 | 0.14 | 0.0182 |
Δ-nvdm | 0.17 | 0.11 | 0.0187 |
prodlda-PWE | 0.03 | 0.53 | 0.0159 |
etm-PWE | 0.18 | 0.22 | 0.0396 |
While the etm has a few “stop topics” that are specific to stop words (see, e.g., Figure 6), Δ-nvdm and lda have stop words in almost every topic. (We do not display these topics due to space constraints.) The reason is that stop words co-occur in the same documents as every other word; therefore traditional topic models have difficulty telling content words apart from stop words. The etm-PWE recognizes the location of stop words in the embedding space and sets them off in their own topics.
7 Conclusion
We developed the etm, a generative model of documents that marries lda with word embeddings. The etm assumes that topics and words live in the same embedding space, and that words are generated from a categorical distribution whose natural parameter is the inner product of the word embeddings and the embedding of the assigned topic.
The etm learns interpretable word embeddings and topics, even in corpora with large vocabularies. We studied the performance of the etm against several document models. The etm learns both coherent patterns of language and an accurate distribution of words.
Acknowledgments
DB and AD are supported by ONR N00014-17-1-2131, ONR N00014-15-1-2209, NIH 1U01MH115727-01, NSF CCF-1740833, DARPA SD2 FA8750-18-C-0130, Amazon, NVIDIA, and the Simons Foundation. FR received funding from the EU’s Horizon 2020 R&I programme under the Marie Skłodowska-Curie grant agreement 706760. AD is supported by a Google PhD Fellowship.
Notes
1. Code is available at https://github.com/adjidieng/ETM.
References
Author notes
Work done while at Columbia University and the University of Cambridge.