Abstract
We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings. In particular, we assume that global latent topics are shared across documents, a word is generated by a hidden semantic vector encoding its contextual semantic meaning, and its context words are generated conditional on both the hidden semantic vector and global latent topics. Topics are trained jointly with the word embeddings. The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy. Experimental results show that the proposed model outperforms the word-level embedding methods in both word similarity evaluation and word sense disambiguation. Furthermore, the model also extracts more coherent topics compared with existing neural topic models or other models for joint learning of topics and word embeddings. Finally, the model can be easily integrated with existing deep contextualized word embedding learning methods to further improve the performance of downstream tasks such as sentiment classification.
1 Introduction
Probabilistic topic models assume that words are generated from latent topics that can be inferred from word co-occurrence patterns, taking a document as the global context. In recent years, various neural topic models have been proposed. Some of them are built on the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014), which utilizes deep neural networks to approximate the intractable posterior distribution of latent topics given the observed words (Miao et al., 2016; Srivastava and Sutton, 2017; Bouchacourt et al., 2018). However, these models take the bag-of-words (BOW) representation of a given document as the input to the VAE and aim to learn hidden topics that can be used to reconstruct the original document. They do not learn word embeddings concurrently.
Other topic modeling approaches exploit pre-trained word embeddings to extract more semantically coherent topics, since word embeddings capture syntactic and semantic regularities by encoding the local context of word co-occurrence patterns. For example, the topic-word generation process in traditional topic models can be replaced by generating word embeddings given latent topics (Das et al., 2015) or by a two-component mixture of a Dirichlet multinomial component and a word embedding component (Nguyen et al., 2015). Alternatively, the information derived from word embeddings can be used to promote semantically related words in the Polya Urn sampling process of topic models (Li et al., 2017) or to generate topic hierarchies (Zhao et al., 2018). However, all these models use pre-trained word embeddings and do not learn word embeddings jointly with topics.
Word embeddings could improve the topic modeling results, but conversely, the topic information could also benefit word embedding learning. Early word embedding learning methods (Mikolov et al., 2013a) learn a mapping function to project a word to a single vector in an embedding space. Such one-to-one mapping cannot deal with word polysemy, as a word could have multiple meanings depending on its context. For example, the word ‘patient’ has two possible meanings ‘enduring trying circumstances with even temper’ and ‘a person who requires medical care’. When analyzing reviews about restaurants and health services, the semantic meaning of ‘patient’ could be inferred depending on which topic it is associated with. One solution is to first extract topics using the standard latent Dirichlet allocation (LDA) model and then incorporate the topical information into word embedding learning by treating each topic as a pseudo-word (Liu et al., 2015).
Whereas the aforementioned approaches adopt a two-step process, either using pre-trained word embeddings to improve topic extraction in topic modeling, or incorporating topics extracted by a standard topic model into word embedding learning, Shi et al. (2017) developed a Skip-Gram based model to jointly learn topics and word embeddings based on Probabilistic Latent Semantic Analysis (PLSA), where each word is associated with two matrices rather than a vector to induce topic-dependent embeddings. This is a rather cumbersome setup. Foulds (2018) used the Skip-Gram to mimic a probabilistic topic model in which each word is represented as an importance vector over topics for context generation.
In this paper, we propose a neural generative model built on VAE, called the Joint Topic Word-embedding (JTW) model, for jointly learning topics and topic-specific word embeddings. More concretely, we introduce topics as explicit parameters that are shared across all context windows. We assume that the pivot word is generated by the hidden semantics encoding the local context in which it occurs. The hidden semantics is then transformed into a topical distribution that takes into account the global topics, which enables the generation of context words. Our rationale is that the context words are generated by the hidden semantics of the pivot word together with a global topic matrix, which captures the notion that a word has multiple meanings that should be shared across the corpus. We are thus able to learn topics and generate topic-dependent word embeddings jointly. The results of our model also allow the visualization of word semantics, because topics can be visualized via their top words and words can be encoded as distributions over the topics.1
In summary, our contribution is three-fold:
- We propose a novel Joint Topic Word-embedding (JTW) model built on VAE, for jointly learning topics and topic-specific word embeddings;
- We perform extensive experiments and show that JTW outperforms other Skip-Gram-based or Bayesian alternatives in both word similarity evaluation and word sense disambiguation tasks, and can extract semantically more coherent topics from data;
- We also show that JTW can be easily integrated with existing deep contextualized word embedding learning models to further improve the performance of downstream tasks such as sentiment classification.
2 Related Work
Our work is related to two lines of research:
Skip-Gram approaches for word embedding learning.
The Skip-Gram, also known as Word2Vec (Mikolov et al., 2013b), maximizes the probability of the context words wn given a pivot word xn. Pennington et al. (2014) pointed out that the Skip-Gram neglects global word co-occurrence statistics. They thus formulated the Skip-Gram as a non-negative matrix factorization (NMF) with the cross-entropy loss replaced by the least-squares error. Another NMF-based method was proposed by Xu et al. (2018), in which the Euclidean distance was substituted with the Wasserstein distance. Jameel and Schockaert (2019) rewrote the NMF objective as a cumulative product of normal distributions, in which each factor is multiplied by a von Mises-Fisher (vMF) distribution of context word vectors, so as to cluster the context words, since the vMF density is based on cosine similarity.
Although Skip-Gram-based methods have attracted extensive attention, they were criticized for their inability to capture polysemy (Pilehvar and Collier, 2016). A pioneering solution to this problem is the Multiple-Sense Skip-Gram model (Neelakantan et al., 2014), where word vectors in a context are first averaged and then clustered with other contexts to obtain a sense representation for the pivot word. In the same vein, Iacobacci and Navigli (2019) leveraged sense tags annotated by BabelNet (Navigli and Ponzetto, 2012) to jointly learn word and sense representations in the Skip-Gram manner, where the context words are parameterized via a shared look-up table and fed into a BiLSTM to match the pivot word vector.
There have also been Bayesian extensions of the Skip-Gram models for word embedding learning. Barkan (2017) inherited the probabilistic generative line while extending the Skip-Gram by placing a Gaussian prior on the parameterized word vectors. The parameters were estimated via variational inference. In a similar vein, Rios et al. (2018) proposed to generate words in bilingual parallel sentences by shared hidden semantics. They introduced a latent index variable to align the hidden semantics of a word in the source language to its equivalence in the target language. More recently, Bražinskas et al. (2018) proposed the Bayesian Skip-Gram (BSG) model, in which each word type with its related word senses collapsed is associated with a ‘prior’ or static embedding and then, depending on the context, the representation of each word is updated by ‘posterior’ or dynamic embedding. Through Bayesian modeling, BSG is able to learn context-dependent word embeddings. It does not explicitly model topics, however. In our proposed JTW, global topics are shared among all documents and learned from data. Also, whereas BSG only models the generation of context words given a pivot word, JTW explicitly models the generation of both the pivot word and the context words with different generative routes.
Combining word embeddings with topic modeling.
Pre-trained word embeddings can be used to improve topic modeling performance. For example, Das et al. (2015) proposed the Gaussian LDA model, which, instead of generating discrete word tokens given latent topics, draws word embeddings from topic-specific multivariate Gaussians. Nguyen et al. (2015) also replaced the topic-word Dirichlet multinomial component in traditional topic models, but with a two-component mixture of a Dirichlet multinomial component and a word embedding component. Li et al. (2017) proposed to modify the Polya Urn sampling process of the LDA model by promoting semantically related words obtained from word embeddings. More recently, Zhao et al. (2018) proposed to adapt a multi-layer Gamma Belief Network to generate topic hierarchies as well as fine-grained interpretations of local topics, both of which are informed by word embeddings.
Instead of using word embeddings for topic modeling, Liu et al. (2015) proposed the Topical Word Embedding model, which incorporates the topical information derived from standard topic models into word embedding learning by treating each topic as a pseudo-word. Briakou et al. (2019) followed this route and proposed a four-stage model in which topics are first extracted from a corpus by LDA and the topic-based word embeddings are then mapped to a shared space using anchor words retrieved from WordNet.
There are also approaches proposed to jointly learn topics and word embeddings built on Skip- Gram models. Shi et al. (2017) developed a Skip-Gram Topical word Embedding (STE) model built on PLSA where each word is associated with two matrices—one matrix used when the word is a pivot word and another used when the word is considered as a context word. Expectation-Maximization is used to estimate model parameters. Foulds (2018) proposed the Mixed-Membership Skip-Gram model (MMSG), which assumes a topic is drawn for each context and the word in the context is drawn from the log-bilinear model based on the topic embeddings. Foulds trained their model by alternating between Gibbs sampling and noise-contrastive estimation. MMSG only models the generation of context words, but not pivot words.
While our proposed JTW also resembles the Skip-Gram model in that it predicts the context words given the pivot word, it differs from existing approaches in that it assumes global latent topics shared across all documents, and the generation of the pivot word and the context words follows different generative routes. Moreover, it is built on VAE and trained using neural networks for more efficient parameter inference.
3 Joint Topic Word-embedding (JTW) Model
In this section, we describe our proposed Joint Topic Word-embedding (JTW) model built on VAE, as shown in Figure 1. We first give an overview of JTW, then present each component of the model, followed by the training details.
Following the problem setup in the Skip-Gram model, we consider a pivot word xn and its context window wn = wn,1:C. We assume there are a total of N pivot word tokens and each context window contains C context words. However, as opposed to Skip-Gram, we do not compute the joint probability as a product chain of conditional probabilities of the context word given the pivot. Instead, in our model, context words are represented as BOWs for each context window by assuming the exchangeability of context words within the local context window.
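To make this setup concrete, below is a minimal sketch (with hypothetical helper names, not the authors' code) of how pivot words and the bag-of-words representation of their context windows could be extracted from a tokenized review:

```python
from collections import Counter

def context_pairs(tokens, half_window=5):
    """Yield (pivot word, bag-of-words Counter of its context window)."""
    for i, pivot in enumerate(tokens):
        left = tokens[max(0, i - half_window):i]
        right = tokens[i + 1:i + 1 + half_window]
        yield pivot, Counter(left + right)

tokens = "the clinic staff were patient and caring with every patient".split()
for pivot, context in context_pairs(tokens, half_window=2):
    print(pivot, dict(context))
```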
We hypothesize that the hidden semantic vector zn of each word xn induces a topical distribution that is combined with the global corpus-wide latent topics to generate context words. Topics are represented as a probability matrix where each row is a multinomial distribution measuring the importance of each word within a topic. The hidden semantics zn of the pivot word xn is transformed to a topical distribution ζn, which participates in the generation of context words. Our assumption is that each word embodies a finite set of meanings that can be interpreted as topics, thus each word representation can be transformed to a distribution over topics. Context words are generated by first selecting a topic and then sampled according to the corresponding multinomial distribution. This enables a quick understanding of word semantics through the topical distribution and at the same time learning the latent topics from the corpus. The generative process is given below:
- For each word position n ∈ {1, 2, 3, …, N}:
  - Draw a hidden semantic representation zn ∼ N(0, I)
  - Choose a pivot word xn ∼ p(xn|zn)
  - Transform zn to ζn with a multi-layered perceptron: ζn = MLP(zn)
  - For each context word position c ∈ {1, 2, 3, …, C}:
    - Choose a topic indicator tn,c ∼ Categorical(ζn)
    - Choose a context word wn,c ∼ p(wn,c|tn,c, β)
Here, all the distributions are functions approximated by neural networks, e.g., p(xn|zn) ∝ exp(Mx zn + bx), which will be discussed in more detail in the Decoder section; tn,c indexes a row in the topic matrix β. We could implicitly marginalize out the topic indicators, in which case the probability of a context word would be written as wn,c|ζn, β ∼ Categorical(σ(β⊤ζn)), where σ(⋅) denotes the softmax function. The prior distribution for zn is a multivariate Gaussian with mean 0 and covariance I, whose posterior, conditioned on {xn, wn}, captures the hidden semantics of the pivot word.
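The generative story can be illustrated with a short ancestral-sampling sketch. This is purely illustrative: the dimensions follow the experimental setup later in the paper, but the single-layer stand-in for the MLP and the randomly initialized decoder and topic matrices are assumptions, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, V, C = 100, 50, 8000, 10        # latent dims, topics, vocabulary size, window size

M_x, b_x = rng.normal(0, 0.1, (V, D)), np.zeros(V)   # pivot-word decoder parameters
W_mlp = rng.normal(0, 0.1, (K, D))                   # single-layer stand-in for the MLP
beta = rng.normal(0, 0.1, (K, V))                    # global topic matrix, one row per topic

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

z = rng.standard_normal(D)                           # z_n ~ N(0, I)
pivot = rng.choice(V, p=softmax(M_x @ z + b_x))      # x_n ~ p(x_n | z_n)
zeta = softmax(W_mlp @ z)                            # zeta_n = MLP(z_n), a topic distribution
topics = rng.choice(K, size=C, p=zeta)               # t_{n,c} ~ Categorical(zeta_n)
context = [rng.choice(V, p=softmax(beta[t])) for t in topics]  # w_{n,c} | t_{n,c}, beta

# Marginalizing the topic indicators gives w_{n,c} ~ Categorical(softmax(beta^T zeta_n)).
marginal_word_dist = softmax(beta.T @ zeta)
```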
Although both JTW and BSG assume that a word can have multiple senses and use a latent embedding z to represent the hidden semantic meaning of each pivot word, there are some key differences in their generative processes. JTW first draws a latent embedding z from a standard Gaussian prior, which is deterministically transformed into a topic distribution and a distribution over pivot words. The pivot word is conditionally independent of its context given the latent embedding. Each context word is assigned a latent topic drawn from a shared topic distribution that leverages the global topic information, and the context words are then drawn independently of one another. In BSG, the latent embedding z is also drawn from a Gaussian prior, but the context words are generated directly from the latent embedding z, as opposed to via a mixture model as in JTW. Therefore, JTW is able to group semantically similar words into topics, which is not the case in BSG.
Given the observed variables {x1:N, w1:N}, the objective of the model is to infer the posterior p(z|x, w). This is achieved within the VAE framework. As illustrated in Figure 1, the JTW model is composed of an encoder and a decoder, each of which is constructed by neural networks. The family of distributions used to approximate the posterior is Gaussian, in which μn and σn are optimized. As in VAE, we optimize μn and σn through the training of the neural network parameters (e.g., we optimize the weight matrix Mπ that parameterizes μn instead of updating μn directly).
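As a rough illustration of the inference network, the PyTorch sketch below shows one way an encoder could map a pivot word and its context BOW to μn and σn and sample zn with the reparameterization trick; the layer sizes and architecture are assumptions rather than the exact network used in the paper.

```python
import torch
import torch.nn as nn

class JTWEncoder(nn.Module):
    def __init__(self, vocab_size=8000, latent_dim=100, hidden=256):
        super().__init__()
        self.pivot_emb = nn.Embedding(vocab_size, hidden)
        self.context_proj = nn.Linear(vocab_size, hidden)   # context window as a BOW count vector
        self.to_mu = nn.Linear(2 * hidden, latent_dim)
        self.to_logvar = nn.Linear(2 * hidden, latent_dim)

    def forward(self, pivot_ids, context_bow):
        h = torch.cat([self.pivot_emb(pivot_ids), self.context_proj(context_bow)], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return z, mu, logvar

encoder = JTWEncoder()
z, mu, logvar = encoder(torch.tensor([3]), torch.zeros(1, 8000))  # one pivot word, empty context
```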
3.1 ELBO
3.2 Encoder
3.3 Decoder
3.4 Loss Function
3.5 Prediction
After training, we are able to map the words to their respective representations using the Encoder part of JTW. The Encoder takes a pivot word together with its context window as an input and outputs the parameters of the variational distribution considered to be the approximated posterior q𝜙(z|xn,wn), which is a Gaussian distribution in our case. The word representations are Gaussian parameters {μn,σn}. Because the output of the Encoder is formulated as a Gaussian distribution, the word similarity of two words can be either computed by the KL-divergence between the Gaussian distributions, or by the cosine similarity between their means. We use the Gaussian mean μ to represent a word given its context. The universal representation of a word type can be obtained by averaging the posterior means of all occurrences over the corpus.
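A small sketch of this prediction step is given below; the helper names are hypothetical, and the random vectors merely stand in for posterior means produced by the trained encoder.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def universal_vector(occurrence_means):
    """Average the posterior means mu_n of all occurrences of a word type."""
    return np.mean(np.stack(occurrence_means), axis=0)

# e.g., similarity of two word types from their averaged per-occurrence posterior means
mu_patient = universal_vector([np.random.randn(100) for _ in range(5)])
mu_doctor = universal_vector([np.random.randn(100) for _ in range(5)])
print(cosine(mu_patient, mu_doctor))
```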
4 Experimental Setup
Dataset.
We train the proposed JTW model on the Yelp dataset, which is a collection of more than 4 million reviews covering over 140k businesses. Although these businesses span a large number of categories, the vast majority of reviews fall into 5 business categories. The top Restaurant category accounts for more than 40% of the reviews. The next top 4 categories, Shopping, Beauty & Spas, Automotive, and Clinical, contain about 8%, 6%, 4%, and 3% of the reviews, respectively. The Clinical documents are further filtered by the business subcategories defined in Tran and Lee (2017), which are recognized as core clinical businesses. This results in 176,733 documents for the Clinical category. Because the dataset is extremely imbalanced, simply training the model on the original dataset will likely overfit to the Restaurant category. We thus balance the dataset by sampling roughly an equal number of documents from each of the top 5 categories. The vocabulary size is set to 8,000. We use Mallet to filter out stopwords. The final dataset consists of 865,616 documents with a total of 101,468,071 tokens.
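The balancing step can be sketched as follows (the category labels and per-category quota are placeholders; this is not the exact preprocessing script):

```python
import random

def balance(docs_by_category, per_category, seed=0):
    """Sample roughly the same number of documents from each top-level category."""
    random.seed(seed)
    balanced = []
    for category, docs in docs_by_category.items():
        balanced.extend(random.sample(docs, min(per_category, len(docs))))
    random.shuffle(balanced)
    return balanced
```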
Parameter Setting.
Word semantics are represented as 100-dimensional vectors (i.e., D = 100), which is a default configuration for word representations (Mikolov et al., 2013a; Bražinskas et al., 2018). The number of latent topics is set to 50. Kingma and Welling (2014) showed that the number of samples per data point can be set to 1 if the batch size is large (e.g., >100). In our experiments, we set the batch size to 2,048 and the number of samples per data point, S, to 1. The context window size is set to 10. Network parameters (i.e., θ, 𝜙) are all initialized from a normal distribution with zero mean and 0.1 variance.
Baselines.
We compare our model against four baselines:
- CvMF (Jameel and Schockaert, 2019). CvMF can be viewed as an extension of GloVe that modifies the objective function by multiplying in a mixture of vMFs, whose distance is measured by cosine similarity instead of Euclidean distance. The mixture captures the underlying semantics by which the words can be clustered.
- Bayesian Skip-Gram (BSG) (Bražinskas et al., 2018). BSG is a probabilistic word-embedding method also built on VAE, which achieved the state of the art among Bayesian word-embedding alternatives (Vilnis and McCallum, 2015; Barkan, 2017). BSG infers the posterior or dynamic embedding given a pivot word and its observed context and is thus able to learn context-dependent word embeddings.
- Skip-gram Topical word Embedding (STE) (Shi et al., 2017). STE adapts the commonly known Skip-Gram by associating each word with an input matrix and an output matrix, and uses the Expectation-Maximization method with negative sampling for model parameter inference. For topic generation, it evaluates the probability p(wt+j|z, wt) for each topic z and each skip-gram ⟨wt, wt+j⟩, and represents each topic as a ranked list of bigrams.
- Mixed Membership Skip-Gram (MMSG) (Foulds, 2018). MMSG leverages mixed membership modeling, in which words are assumed to be clustered into topics and the words in the context of a given pivot word are drawn from a log-bilinear model using the vector representations of the context-dependent topic. Model inference is performed using the Metropolis-Hastings-Walker algorithm with noise-contrastive estimation.
Among the aforementioned baselines, CvMF and BSG only generate word embeddings and do not model topics explicitly. Also, CvMF only maps each word to a single word embedding, whereas BSG can output context-dependent word embeddings. Both STE and MMSG can learn topics and topic-dependent embeddings at the same time. However, in STE the topic dependence is stored in the rows of the word matrices and the word representations themselves are context-independent. In contrast, MMSG associates each word with a topic distribution; it can produce contextualized word embeddings by summing up topic vectors weighted by the posterior topic distribution given a context. We experiment with different topic counts and choose the best setting for the methods involving topics or mixtures. For all the baselines, the dimensionality of word embeddings is tuned and finally set to 100.
5 Experimental Results
We compare JTW with baselines on both word similarity and word-sense disambiguation tasks for the learned word embeddings, and also present the topic coherence and qualitative evaluation results for the extracted topics. Furthermore, we show that JTW can be easily integrated with deep contextualized word embeddings to further improve the performance of downstream tasks such as sentiment classification.
5.1 Word Similarity
The word similarity task (Finkelstein et al., 2001) has been widely adopted to measure the quality of word embeddings. In the word similarity task, a number of word pairs are given. Each pair of words is assigned a score that indicates their relatedness. The calculated scores are then compared with the gold-standard scores by means of the Spearman rank-order correlation coefficient. Because the word similarity task requires context-free word representations, we aggregate all the occurrences and obtain a universal vector for each word. The similarity scores are computed with cosine similarity. For STE, we use AvgSimC following Shi et al. (2017). We further compare with the results of the Skip-Gram (SG) model, which maps each word to a single point in a Euclidean space without considering the different senses of words. All the approaches are evaluated on 7 commonly used benchmarking datasets. For JTW, we average the results over 10 runs and also report the standard deviations.
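The evaluation protocol amounts to the following sketch, where the embedding dictionary is a hypothetical stand-in for the averaged JTW vectors:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs_with_gold, word_vectors):
    """pairs_with_gold: iterable of (word1, word2, gold_score)."""
    predicted, gold = [], []
    for w1, w2, score in pairs_with_gold:
        if w1 in word_vectors and w2 in word_vectors:
            u, v = word_vectors[w1], word_vectors[w2]
            predicted.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            gold.append(score)
    return spearmanr(predicted, gold).correlation

vecs = {w: np.random.randn(100) for w in ("coffee", "tea", "car")}
print(evaluate_similarity([("coffee", "tea", 8.4), ("coffee", "car", 1.2), ("tea", "car", 1.0)], vecs))
```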
The results are reported in Table 1. It can be observed that, among the baselines, BSG achieves the lowest score on average, followed by MMSG. Although JTW clearly beats all the other models only on SimLex-999, it performs just slightly worse than the top model on 5 of the remaining 6 benchmarks. Overall, JTW gives superior results on average. A noticeable gap can be observed on the Stanford's Contextual Word Similarities (SCWS) dataset, where JTW, MMSG, and BSG give better results than SG, CvMF, and STE. This can be explained by the fact that, in SCWS, gold-standard scores are annotated together with the context, whereas SG, CvMF, and STE can only produce context-independent word vectors. The results show the clear benefit of learning contextualized word vectors. Among the topic-dependent word embedding methods, JTW, built on VAE, appears to be more effective than the PLSA-based STE and the mixed membership model MMSG, achieving the best overall score when averaging the evaluation results across all seven benchmarking datasets. The small standard deviation of JTW indicates that its performance is consistent across multiple runs.
Table 1: Word similarity results (Spearman rank correlation).

| Benchmarks | SG | CvMF | BSG | STE | MMSG | JTW (std. dev.) |
|---|---|---|---|---|---|---|
| WS353-SIM | 0.610 | 0.597 | 0.529 | 0.582 | 0.579 | 0.598 (.014) |
| WS353-ALL | 0.571 | 0.615 | 0.551 | 0.538 | 0.558 | 0.606 (.012) |
| MEN | 0.649 | 0.632 | 0.656 | 0.650 | 0.627 | 0.653 (.006) |
| SimLex-999 | 0.321 | 0.313 | 0.271 | 0.301 | 0.281 | 0.344 (.005) |
| SCWS | 0.620 | 0.637 | 0.652 | 0.622 | 0.624 | 0.640 (.010) |
| MTurk771 | 0.548 | 0.524 | 0.555 | 0.554 | 0.596 | 0.546 (.010) |
| MTurk287 | 0.534 | 0.517 | 0.572 | 0.641 | 0.599 | 0.639 (.006) |
| Average | 0.550 | 0.548 | 0.541 | 0.555 | 0.552 | 0.575 (.004) |
5.2 Lexical Substitution
While the word similarity tasks focus more on the general meaning of a word (since word pairs are presented without context), in this section we turn to the lexical substitution task (Yuret, 2007; Thater et al., 2011), which was designed to evaluate word-embedding learning methods with respect to their ability to disambiguate word senses. The lexical substitution task can be described by the following scenario: given a sentence and one of its member words, find the most related replacement from a list of candidate words. As stated in Thater et al. (2011), a good lexical substitute should not only capture the relatedness between the candidate word and the original word, but also be appropriate with respect to the context.
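The exact scoring rule used in our experiments is not reproduced here, but a hedged sketch of how candidates could be ranked with contextualized embeddings is shown below: each candidate is scored by the cosine similarity between its vector and the contextualized posterior mean of the target word.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def rank_substitutes(target_context_mu, candidate_vectors):
    """Rank candidate words by similarity to the target's contextualized representation."""
    scores = {cand: cosine(target_context_mu, vec) for cand, vec in candidate_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)
```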
We report in Table 2 the accuracy scores of the different methods. Context-sensitive word embeddings generally perform better than context-free alternatives. STE can only learn context-independent word embeddings and hence gives the lowest score. BSG is able to learn context-dependent word embeddings and outperforms CvMF. Among the joint topic and word embedding learning methods, STE performs the worst, suggesting that associating each word with two matrices and learning topic-dependent word embeddings based on PLSA is less effective. Both JTW and MMSG show superior performance compared to BSG. JTW outperforms MMSG because JTW also models the generation of the pivot word in addition to the context words, and because the VAE framework for parameter inference is more effective than the noise-contrastive estimation used in MMSG.
5.3 Topic Coherence
Because, among the baselines, only STE and MMSG can jointly learn topics and word embeddings, we compare our proposed JTW with these two models in terms of topic quality. The evaluation metric we employ is the topic coherence metric proposed by Röder et al. (2015). The metric extracts co-occurrence counts of the topic words in Wikipedia using a sliding window of size 110. For each top word, a vector is calculated whose elements are the normalized pointwise mutual information (NPMI) between that word and every other top word. Given a topic, the arithmetic mean of the cosine similarities of all vector pairs is treated as the coherence measure. We calculate the topic coherence score of each extracted topic based on its associated top ten words using Palmetto (Rosner et al., 2014). The topic coherence results with the topic number varying between 10 and 200 are plotted in Figure 2. The graph shows that JTW scores the highest under all the topic settings. It gives the best coherence score of 0.416 at 50 topics, and gradually flattens with an increasing number of topics. MMSG exhibits an upward trend up to 100 topics, then drops to 0.365 when the topic number is set to 150. STE undergoes a gradual decrease and then stabilizes once the topic number exceeds 150.
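A simplified sketch of the coherence computation described above is given below; the real evaluation uses Palmetto's implementation over Wikipedia, whereas this toy version takes user-supplied window and co-occurrence counts.

```python
import math
from itertools import combinations
import numpy as np

def npmi(p_ij, p_i, p_j, eps=1e-12):
    return math.log((p_ij + eps) / (p_i * p_j + eps)) / (-math.log(p_ij + eps))

def topic_coherence(top_words, word_counts, pair_counts, n_windows):
    # NPMI vector for each top word against every other top word
    vectors = []
    for w in top_words:
        vec = []
        for v in top_words:
            p_i = word_counts.get(w, 0) / n_windows
            p_j = word_counts.get(v, 0) / n_windows
            p_ij = p_i if w == v else pair_counts.get(frozenset((w, v)), 0) / n_windows
            vec.append(npmi(p_ij, p_i, p_j))
        vectors.append(np.array(vec))
    # arithmetic mean of the cosine similarities of all vector pairs
    sims = [u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
            for u, v in combinations(vectors, 2)]
    return float(np.mean(sims))

counts = {"pizza": 120, "cheese": 90}
pairs = {frozenset(("pizza", "cheese")): 60}
print(topic_coherence(["pizza", "cheese"], counts, pairs, n_windows=1000))
```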
5.4 Extracted Topics
We present in Table 3 example topics extracted by JTW and MMSG. It can be easily inferred from the top words generated by JTW that Topic 1 is related to ‘Food’, whereas Topic 5 is about ‘Clinical Service’, identified by the words ‘caring’ and ‘physician’. It can also be deduced from the top words that Topics 2, 3, and 4 represent ‘Shopping’, ‘Beauty’, and ‘Automotive’, respectively. In contrast, topics produced by MMSG contain more semantically incoherent words, highlighted in italics. For example, Topic 1 in MMSG contains words relating to both food and staff. This might be caused by the fact that, in MMSG, training is performed as a two-stage process: topics are first assigned to words using Gibbs sampling, then the topic vectors and word vectors are estimated from word co-occurrences and topic assignments via maximum likelihood estimation. This is equivalent to a topic model with parameterized word embeddings. Conversely, in JTW, the latent variables in the generative process serve as word representations. The parameters reside in the generative network and are inferred by the VAE. No extra parameters are introduced to encode the words. Therefore, the topics extracted tend to be more identifiable.
Table 3: Example topics extracted by JTW and MMSG (top ten words per topic).

| | Topic 1 (Food) | Topic 2 (Shopping) | Topic 3 (Beauty) | Topic 4 (Automotive) | Topic 5 (Clinical) |
|---|---|---|---|---|---|
| JTW | good | great | hair | car | compassionate |
| | food | friendly | recommend | told | caring |
| | chicken | service | highly | phone | personable |
| | place | staff | place | called | courteous |
| | pizza | shop | experience | care | therapy |
| | love | clean | fabulous | vehicle | competent |
| | cheese | helpful | great | time | knowledgeable |
| | salad | nice | nail | BMW | passionate |
| | red | amazing | nails | insurance | physician |
| | delicious | customer | awesome | wanted | respectful |
| MMSG | food | friendly | massage | place | therapy |
| | service | staff | spa | service | physical |
| | great | great | back | time | pain |
| | good | helpful | great | back | back |
| | place | service | time | customer | massage |
| | friendly | clean | good | car | recommend |
| | staff | place | massages | people | great |
| | nice | nice | facial | good | therapist |
| | back | store | therapist | money | work |
| | prices | super | body | give | highly |
5.5 Visualization of Word Semantics
The extracted topics allow the visualization of word semantics. In JTW, a word's semantic meaning can be interpreted as a distribution over the discovered latent topics. This is achieved by aggregating all the contextualized topical distributions of a particular word throughout the corpus. Meanwhile, when a word is placed in a specific context, its topical distribution can be directly transformed from its contextualized representation. We chose three words, ‘plastic’, ‘bar’, and ‘patient’, to illustrate their polysemous nature. To further demonstrate their context-dependent meanings, we also visualize the topic distributions of the following three sentences: (1) Effective patient care requires clinical knowledge and understanding of physical therapy; (2) Restaurant servers require patient temperament; (3) You have to bring your own bags or boxes but you can also purchase plastic bags. The topical distributions for the pivot words and the three example sentences are shown in Figure 3.
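A minimal sketch of the aggregation described above (with hypothetical inputs standing in for the per-occurrence topical distributions ζn) is:

```python
import numpy as np

def word_topic_distribution(occurrence_zetas):
    """Average the per-occurrence topic distributions zeta_n of one word type."""
    dist = np.mean(np.stack(occurrence_zetas), axis=0)
    return dist / dist.sum()

# e.g., 'patient' occurring three times in the corpus, with 5 topics for brevity
zetas = [np.array([0.10, 0.05, 0.05, 0.10, 0.70]),
         np.array([0.50, 0.10, 0.10, 0.10, 0.20]),
         np.array([0.10, 0.10, 0.05, 0.05, 0.70])]
print(word_topic_distribution(zetas))
```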
We can deduce from the overall distributions that the semantic meaning of ‘plastic’ is distributed almost equally over two topics, ‘shopping’ and ‘beauty’, while the meaning of ‘bar’ is more prominent on the ‘food’ and ‘shopping’ topics. ‘Patient’ has a strong connection with the ‘clinical’ topic, though it is also associated with the ‘food’ topic. When considering a specific context about patient care, Sentence 1 has its topic distribution peaked at the ‘clinical’ topic. Sentence 2 also contains the word ‘patient’, but its topic distribution now peaks at ‘food’. Sentence 3 mentions ‘plastic bags’ and its most prominent topic is ‘shopping’. These results show that JTW can indeed jointly learn latent topics and topic-specific word embeddings.
5.6 Integration with Deep Contextualized Word Embeddings
Recent advances in deep contextualized word representation learning have had a significant impact on natural language processing. Different from traditional word embedding learning methods such as Word2Vec or GloVe, where each word is mapped to a single vector representation, deep contextualized word representation methods are typically trained by language modeling and generate a different vector for each word depending on the context in which it is used. A notable work is ELMo (Peters et al., 2018), commonly regarded as the pioneering approach to deep contextualized word embeddings (Devlin et al., 2019). ELMo calculates a weighted sum of the different layers of a multi-layered BiLSTM-based language model and uses the normalized vector as the representation of the corresponding word. More recently, in contrast to ELMo, BERT (Devlin et al., 2019) applies bidirectional Transformer training to masked language modeling. Because of its capability of effectively encoding contextualized knowledge from huge external corpora into word embeddings, BERT has refreshed the state-of-the-art results on a number of NLP tasks.
We turn to the sentiment classification task on Yelp and compare the performance of JTW, ELMo, and BERT, as well as the integrated models JTW-ELMo and JTW-BERT, by 10-fold cross-validation. In all the experiments, we fine-tune the models on a training set consisting of 90% of the documents sampled from the dataset described in Section 4 and evaluate on the remaining 10%, which serves as the test set. We employ the further pre-training scheme (Sun et al., 2019), in which different learning rates are applied to each layer and slanted triangular learning rates are imposed across epochs when adapting the language model to the training corpus (Howard and Ruder, 2018). The classifier used for all the methods is an attention hop over a BiLSTM with a softmax layer. The ground-truth labels are the five-scale review ratings included in the original dataset. The 5-class sentiment classification results, measured by precision, recall, macro-F1, and micro-F1, are reported in Table 4.
Table 4: 5-class sentiment classification results on Yelp.

| Model | Precision | Recall | Macro-F1 | Micro-F1 |
|---|---|---|---|---|
| JTW | 0.5713±.021 | 0.5639±.014 | 0.5599±.016 | 0.7339±.015 |
| ELMo | 0.6091±.005 | 0.6053±.001 | 0.6056±.002 | 0.7610±.005 |
| BERT | 0.6293±.014 | 0.5952±.006 | 0.6041±.012 | 0.7626±.005 |
| JTW-ELMo | 0.6286±.008 | 0.6110±.004 | 0.6168±.008 | 0.7783±.004 |
| JTW-BERT | 0.6354±.014 | 0.6081±.009 | 0.6045±.014 | 0.7806±.005 |
It can be observed from Table 4 that a sentiment classifier trained on JTW-produced word embeddings gives worse results than classifiers using the deep contextualized word embeddings generated by ELMo or BERT. Nevertheless, when the ELMo or BERT front-end is integrated with JTW, the combined models, JTW-ELMo and JTW-BERT, outperform the respective original deep contextualized word representation models. A paired t-test verifies that JTW-ELMo outperforms ELMo and BERT on micro-F1 at the 95% significance level. The results show that our proposed JTW is flexible and can be easily integrated with pre-trained contextualized word embeddings to capture domain-specific semantics better than directly fine-tuning the pre-trained ELMo or BERT on the target domain, hence leading to improved sentiment classification performance.
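For concreteness, the sketch below shows one plausible way such an integration could be wired up (the exact fusion used for JTW-ELMo and JTW-BERT is not reproduced here): per-token JTW vectors are concatenated with the contextualized embeddings and fed to a BiLSTM with a single attention hop, matching the classifier described above.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, ctx_dim=768, jtw_dim=100, hidden=256, num_classes=5):
        super().__init__()
        self.bilstm = nn.LSTM(ctx_dim + jtw_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, ctx_emb, jtw_emb):
        h, _ = self.bilstm(torch.cat([ctx_emb, jtw_emb], dim=-1))
        weights = torch.softmax(self.attn(h), dim=1)          # one attention hop over tokens
        pooled = (weights * h).sum(dim=1)
        return self.out(pooled)

clf = FusionClassifier()
logits = clf(torch.randn(2, 12, 768), torch.randn(2, 12, 100))  # batch of 2 reviews, 12 tokens each
```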
6 Conclusion
Driven by the motivation that word embedding learning and topic modeling can mutually benefit each other, we propose a probabilistic generative framework that can jointly discover semantically coherent latent topics from the global context and learn topic-specific word embeddings, which naturally addresses the problem of word polysemy. Experimental results verify the effectiveness of the model on word similarity evaluation and word sense disambiguation. Furthermore, the model can discover latent topics shared across documents, and the encoder of JTW can generate the topical distribution for each word, which enables an intuitive understanding of word semantics. We have also shown that our proposed JTW can be easily integrated with deep contextualized word embeddings to further improve the performance of downstream tasks. In future work, we will explore the discourse relationships between context windows to model, for example, the semantic shift between neighboring sentences.
Acknowledgments
The authors would like to thank the anonymous reviewers for insightful comments and helpful suggestions. This work was funded in part by EPSRC (grant no. EP/T017112/1). LZ was funded by the Chancellor’s International Scholarship at the University of Warwick. DZ was partially funded by the National Key Research and Development Program of China (2017YFB1002801) and the National Natural Science Foundation of China (61772132).
Notes
1. Our source code is made available at http://github.com/somethingx02/topical_wordvec_models.
References