A Neural Generative Model for Joint Learning Topics and Topic-Specific Word Embeddings

We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings. In particular, we assume that global latent topics are shared across documents, a word is generated by a hidden semantic vector encoding its contextual semantic meaning, and its context words are generated conditional on both the hidden semantic vector and global latent topics. Topics are trained jointly with the word embeddings. The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy. Experimental results show that the proposed model outperforms the word-level embedding methods in both word similarity evaluation and word sense disambiguation. Furthermore, the model also extracts more coherent topics compared with existing neural topic models or other models for joint learning of topics and word embeddings. Finally, the model can be easily integrated with existing deep contextualized word embedding learning methods to further improve the performance of downstream tasks such as sentiment classification.


Introduction
Probabilistic topic models assume words are generated from latent topics which can be inferred from word co-occurrence patterns taking a document as global context. In recent years, various neural topic models have been proposed. Some of them are built on the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014), which utilizes deep neural networks to approximate the intractable posterior distribution of latent topics given observed words (Miao et al., 2016; Srivastava and Sutton, 2017; Bouchacourt et al., 2018). However, these models take the bag-of-words (BOW) representation of a given document as the input to the VAE and aim to learn hidden topics that can be used to reconstruct the original document. They do not learn word embeddings concurrently.
Other topic modeling approaches exploit pre-trained word embeddings to extract more semantically coherent topics, since word embeddings capture syntactic and semantic regularities by encoding the local context of word co-occurrence patterns. For example, the topic-word generation process in traditional topic models can be replaced by generating word embeddings given latent topics (Das et al., 2015) or by a two-component mixture of a Dirichlet multinomial component and a word embedding component (Nguyen et al., 2015). Alternatively, the information derived from word embeddings can be used to promote semantically-related words in the Polya Urn sampling process of topic models (Li et al., 2017) or to generate topic hierarchies (Zhao et al., 2018). However, all these models use pre-trained word embeddings and do not learn word embeddings jointly with topics.
While word embeddings can improve topic modeling results, topic information can conversely benefit word embedding learning. Early word embedding learning methods (Mikolov et al., 2013a) learn a mapping function to project a word to a single vector in an embedding space. Such one-to-one mapping cannot deal with word polysemy, as a word could have multiple meanings depending on its context. For example, the word 'patient' has two possible meanings: 'enduring trying circumstances with even temper' and 'a person who requires medical care'. When analyzing reviews about restaurants and health services, the semantic meaning of 'patient' could be inferred depending on which topic it is associated with. One solution is to first extract topics using the standard Latent Dirichlet Allocation (LDA) model and then incorporate the topical information into word embedding learning by treating each topic as a pseudo-word (Liu et al., 2015).
Whereas the aforementioned approaches adopt a two-step process, by either using pre-trained word embeddings to improve the topic extraction results in topic modeling, or incorporating topics extracted using a standard topic model into word embedding learning, Shi et al. (2017) developed a Skip-Gram based model to jointly learn topics and word embeddings based on the Probabilistic Latent Semantic Analysis (PLSA), where each word is associated with two matrices rather than a vector to induce topic-dependent embeddings. This is a rather cumbersome setup. Foulds (2018) used the Skip-Gram to imitate the probabilistic topic model, representing each word as an importance vector over topics for context generation.
In this paper we propose a neural generative model built on VAE, called the Joint Topic Word-embedding (JTW) model, for jointly learning topics and topic-specific word embeddings. More concretely, we introduce topics as tangible parameters that are shared across all the context windows. We assume that the pivot word is generated by the hidden semantics encoding the local context where it occurred. Then the hidden semantics is transformed to a topical distribution taking into account the global topics, and this enables the generation of context words. Our rationale is that the context words are generated by the hidden semantics of the pivot word together with a global topic matrix, which captures the notion that the word has multiple meanings that should be shared across the corpus. We are thus able to learn topics and generate topic-dependent word embeddings jointly. The results of our model also allow the visualization of word semantics because topics can be visualized via the top words and words can be encoded as distributions over the topics.
In summary, our contribution is three-fold: •

Related Work
Our work is related to two lines of research: Skip-Gram approaches for word embedding learning.
The Skip-Gram, also known as WORD2VEC (Mikolov et al., 2013b), maximizes the probability of the context words $w_n$ given a centroid word $x_n$. Pennington et al. (2014) pointed out that Skip-Gram neglects the global word co-occurrence statistics. They thus formulated the Skip-Gram as a non-negative matrix factorization (NMF) with the cross-entropy loss switched to the least square error. Another NMF-based method was proposed by Xu et al. (2018), in which the Euclidean distance was substituted with Wasserstein distance. Jameel and Schockaert (2019) rewrote the NMF objective as a cumulative product of normal distributions, in which each factor is multiplied by a von Mises-Fisher (vMF) distribution of context word vectors, with the aim of clustering the context words since the vMF density retains the cosine similarity.
Although the Skip-Gram-based methods attracted extensive attention, they were criticized for their inability to capture polysemy (Pilehvar and Collier, 2016). A pioneering solution to this problem is the Multiple-Sense Skip-Gram (MSSG) model (Neelakantan et al., 2014), where word vectors in a context are first averaged then clustered with other contexts to obtain a sense representation for the pivot word. In the same vein, Iacobacci and Navigli (2019) leveraged sense tags annotated by BabelNet (Navigli and Ponzetto, 2012) to jointly learn word and sense representations in the Skip-Gram manner, where the context words are parameterized via a shared look-up table and sent to a BiLSTM to match the pivot word vector.
There have also been Bayesian extensions of the Skip-Gram models for word embedding learning. Barkan (2017) inherited the probabilistic generative line while extending the Skip-Gram by placing a Gaussian prior on the parameterized word vectors. The parameters were estimated via variational inference. In a similar vein, Rios et al. (2018) proposed to generate words in bilingual parallel sentences by shared hidden semantics. They introduced a latent index variable to align the hidden semantics of a word in the source language to its equivalence in the target language. More recently, Bražinskas et al. (2018) proposed the Bayesian Skip-Gram (BSG) model, in which each word type with its related word senses collapsed is associated with a 'prior' or static embedding and then, depending on the context, the representation of each word is updated by 'posterior' or dynamic embedding. Through Bayesian modeling, BSG is able to learn context-dependent word embeddings. It does not explicitly model topics, however. In our proposed JTW, global topics are shared among all documents and learned from data. Also, whereas BSG only models the generation of context words given a pivot word, JTW explicitly models the generation of both the pivot word and the context words with different generative routes.
Combining word embeddings with topic modeling. Pre-trained word embeddings can be used to improve the topic modeling performance. For example, Das et al. (2015) proposed the Gaussian LDA model, which, instead of generating discrete word tokens given latent topics, generates draws from a multivariate Gaussian of word embeddings. Nguyen et al. (2015) also replaced the topic-word Dirichlet multinomial component in traditional topic models, but by a two-component mixture of a Dirichlet multinomial component and a word embedding component. Li et al. (2017) proposed to modify the Polya Urn sampling process of the LDA model by promoting semantically-related words obtained from word embeddings. More recently, Zhao et al. (2018) proposed to adapt a multilayer Gamma Belief Network to generate topic hierarchies and also fine-grained interpretation of local topics, both of which are informed by word embeddings.
Instead of using word embeddings for topic modeling, Liu et al. (2015) proposed the Topical Word Embedding model which incorporates the topical information derived from standard topic models into word embedding learning by treating each topic as a pseudo-word. Briakou et al. (2019) followed this route and proposed a four-stage model in which topics were first extracted from a corpus by LDA and then the topic-based word embeddings are mapped to a shared space using anchor words which were retrieved from the WordNet.
There are also approaches proposed to jointly learn topics and word embeddings built on Skip-Gram models. Shi et al. (2017) developed a Skip-Gram Topical word Embedding (STE) model built on PLSA where each word is associated with two matrices-one matrix used when the word is a pivot word and another used when the word is considered as a context word. Expectation-Maximization (EM) is used to estimate model parameters. Foulds (2018) proposed the Mixed-Membership Skip-Gram model (MMSG), which assumes a topic is drawn for each context and the word in the context is drawn from the log-bilinear model based on the topic embeddings. Foulds trained their model by alternating between Gibbs sampling and noise-contrastive estimation. MMSG only models the generation of context words, but not pivot words.
While our proposed JTW also resembles the Skip-Gram model in that it predicts the context words given the pivot word, it differs from the existing approaches in that it assumes global latent topics shared across all documents, and the generation of the pivot word and the context words follows different generative routes. Moreover, it is built on VAE and is trained using neural networks for more efficient parameter inference.

Joint Topic Word-embedding (JTW) Model
In this section, we describe our proposed Joint Topic Word-embedding (JTW) model built on VAE, as shown in Figure 1. We first give an overview of JTW, then present each component of the model, followed by the training details. Following the problem setup in the Skip-Gram model, we consider a pivot word $x_n$ and its context window $w_n = w_{n,1:C}$. We assume there are a total of $N$ pivot word tokens and each context window contains $C$ context words. However, as opposed to Skip-Gram, we do not compute the joint probability as a product chain of conditional probabilities of the context word given the pivot. Instead, in our model, context words are represented as a BOW for each context window by assuming the exchangeability of context words within the local context window.
We hypothesize that the hidden semantic vector $z_n$ of each word $x_n$ induces a topical distribution that is combined with the global corpus-wide latent topics to generate context words. Topics are represented as a probability matrix $\beta$ where each row is a multinomial distribution measuring the importance of each word within a topic. The hidden semantics $z_n$ of the pivot word $x_n$ is transformed to a topical distribution $\zeta_n$, which participates in the generation of context words. Our assumption is that each word embodies a finite set of meanings that can be interpreted as topics, thus each word representation can be transformed to a distribution over topics. Context words are generated by first selecting a topic and then sampling a word according to the corresponding multinomial distribution. This enables a quick understanding of word semantics through the topical distribution and at the same time learning the latent topics from the corpus. The generative process is given below:
• For each word position $n \in \{1, 2, 3, \ldots, N\}$:
  - Draw hidden semantic representation $z_n \sim \mathcal{N}(0, I)$
  - Choose a pivot word $x_n \sim p(x_n | z_n)$
  - Transform $z_n$ to $\zeta_n$ with a multi-layered perceptron: $\zeta_n = \mathrm{MLP}(z_n)$
  - For each context word position $c \in \{1, 2, 3, \ldots, C\}$:
    * Choose a topic indicator $t_{n,c} \sim \mathrm{Categorical}(\zeta_n)$
    * Choose a context word $w_{n,c} \sim p(w_{n,c} | \beta_{t_{n,c}})$
Here, all the distributions are functions approximated by neural networks, e.g., $p(x_n | z_n) \propto \exp(M_x z_n + b_x)$, which will be discussed in more detail in the Decoder section, and $t_{n,c}$ indexes a row $\beta_{t_{n,c}}$ in the topic matrix. We could implicitly marginalise out the topic indicators, in which case the probability of a word would be written as $w_{n,c} | \zeta_n, \beta \sim \mathrm{Categorical}(\sigma(\beta^T \zeta_n))$, where $\sigma(\cdot)$ denotes the softmax function. The prior distribution for $z_n$ is a multivariate Gaussian distribution with mean $0$ and covariance $I$, of which the posterior indicates the hidden semantics of the pivot word when conditioned on $\{x_n, w_n\}$.
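The generative story above can be sketched as ancestral sampling. This is a minimal illustration rather than the authors' implementation: the single linear layer `W_zeta` standing in for the MLP, the toy sizes, and the randomly drawn parameters are all assumptions, and each row of `beta` is assumed to be already normalized so that a word can be drawn directly once a topic indicator is chosen.

```python
import math
import random


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def sample_window(M_x, b_x, W_zeta, beta, C, rng):
    """Draw one pivot word and C context words by ancestral sampling.

    M_x (V x D), b_x (V): pivot-word layer, p(x|z) proportional to exp(M_x z + b_x).
    W_zeta (K x D): hypothetical one-layer stand-in for the MLP producing zeta.
    beta (K x V): topic matrix; each row is assumed to sum to one.
    """
    D = len(M_x[0])
    z = [rng.gauss(0.0, 1.0) for _ in range(D)]             # z_n ~ N(0, I)
    p_x = softmax([a + b for a, b in zip(matvec(M_x, z), b_x)])
    x = rng.choices(range(len(p_x)), weights=p_x)[0]         # pivot word x_n
    zeta = softmax(matvec(W_zeta, z))                        # topical distribution
    context = []
    for _ in range(C):
        t = rng.choices(range(len(zeta)), weights=zeta)[0]   # topic indicator t_{n,c}
        w = rng.choices(range(len(beta[t])), weights=beta[t])[0]
        context.append(w)                                    # context word w_{n,c}
    return x, zeta, context


# Toy sizes (illustrative only): V words, D latent dims, K topics, C context words.
V, D, K, C = 6, 3, 2, 4
rng = random.Random(0)
M_x = [[rng.gauss(0, 0.3) for _ in range(D)] for _ in range(V)]
b_x = [0.0] * V
W_zeta = [[rng.gauss(0, 0.3) for _ in range(D)] for _ in range(K)]
beta = [softmax([rng.gauss(0, 1) for _ in range(V)]) for _ in range(K)]
x, zeta, context = sample_window(M_x, b_x, W_zeta, beta, C, rng)
```

Sampling an explicit topic indicator per context position mirrors the mixture view of the model; the marginalised form with a softmax over the mixed topic scores would skip the indicator and draw each context word from a single categorical.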
Although both JTW and BSG assume that a word can have multiple senses and use a latent embedding z to represent the hidden semantic meaning of each pivot word, there are some key differences in their generative processes. JTW first draws a latent embedding z from a standard Gaussian prior which is deterministically transformed into topic distributions and a distribution over pivot words. The pivot word is conditionally independent of its context given the latent embedding. At the same time, each context word is assigned a latent topic, drawn from a shared topic distribution which leverages the global topic information, and then drawn independently of one another. In BSG the latent embedding z is also drawn from a Gaussian prior but the context words are generated directly from the latent embedding z, as opposed to via a mixture model as in JTW. Therefore, JTW is able to group semantically-similar words into topics, which is not the case in BSG.
Given the observed variables $\{x_{1:N}, w_{1:N}\}$, the objective of the model is to infer the posterior $p(z | x, w)$. This is achieved by the VAE framework. As illustrated in Figure 1, the JTW model is composed of an encoder and a decoder, each of which is constructed by neural networks. The family of distributions to approximate the posterior is Gaussian, in which $\mu_n$ and $\sigma_n$ are optimized. As in VAE, we optimize $\mu_n$ and $\sigma_n$ through the training of parameters in neural networks (e.g., we optimize $M_\mu$ in $\mu_n = M_\mu^T \pi_n + b_\mu$ instead of updating $\mu_n$ directly).

ELBO
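The VAE naturally simulates variational inference (Jordan et al., 1999), where a family of parameterized distributions $q_\phi(z_n | x_n, w_n)$ is optimized to approximate the intractable true posterior $p_\theta(z_n | x_n, w_n)$. This is achieved by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior for each data point:
$$\mathrm{KL}\big(q_\phi(z_n | x_n, w_n) \,\|\, p_\theta(z_n | x_n, w_n)\big) = \log p_\theta(x_n, w_n) - \mathbb{E}_{q_\phi(z_n | x_n, w_n)}\big[\log p_\theta(x_n, w_n, z_n) - \log q_\phi(z_n | x_n, w_n)\big], \tag{1}$$
where the expectation term is called the Evidence Lower Bound (ELBO), denoted as $\mathcal{L}(\theta, \phi; x_n, w_n)$. Because $\log p_\theta(x_n, w_n)$ does not depend on $\phi$, maximizing the ELBO minimizes the KL divergence. The ELBO is further derived as
$$\mathcal{L}(\theta, \phi; x_n, w_n) = \mathbb{E}_{q_\phi(z_n | x_n, w_n)}\big[\log p_\theta(x_n, w_n | z_n)\big] - \mathrm{KL}\big(q_\phi(z_n | x_n, w_n) \,\|\, p(z_n)\big). \tag{2}$$
The first term on the right-hand side of Equation 2, which is an expectation with respect to $q_\phi(z_n | x_n, w_n)$, is intractable and is therefore estimated by sampling. That is:
$$\mathbb{E}_{q_\phi(z_n | x_n, w_n)}\big[\log p_\theta(x_n, w_n | z_n)\big] \approx \frac{1}{S} \sum_{s=1}^{S} \log p_\theta\big(x_n, w_n | z_n^{(s)}\big), \tag{3}$$
where $z_n^{(s)} \sim q_\phi(z_n | x_n, w_n)$. Here we use $z_n^{(s)}$ to represent the samples since the sampled distribution is related to $x_n$.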

Encoder
The Encoder corresponds to $q_\phi(z_n | x_n, w_n)$ in Equation 3. Recall that the variational family for approximating the true posterior is a Gaussian distribution parameterized by $\{\mu_n, \sigma_n\}$. As such, the encoder is essentially a set of neural functions mapping from observations to Gaussian parameters $\{\mu_n, \sigma_n\}$. The neural functions are defined as: $\pi_n = \mathrm{MLP}(x_n, w_n)$, $\mu_n = M_\mu^T \pi_n + b_\mu$, $\sigma_n = M_\sigma^T \pi_n + b_\sigma$, where MLP denotes the multi-layered perceptron and the context window $w_n$ is represented as a BOW that is a $V$-dimensional vector. The encoder outputs Gaussian parameters $\{\mu_n, \sigma_n\}$, which constitute the variational distribution $q_\phi(z_n | x_n, w_n)$. In order to differentiate $q_\phi(z_n | x_n, w_n)$ with respect to $\phi$, we apply the reparameterization trick (Kingma and Welling, 2014) by using the following transformation:
$$z_n^{(s)} = \mu_n + \sigma_n \odot \epsilon^{(s)}, \qquad \epsilon^{(s)} \sim \mathcal{N}(0, I). \tag{4}$$
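As an illustration of the encoder, the following sketch maps a pivot word and its BOW context to Gaussian parameters and then draws a reparameterized sample. It is a sketch under stated assumptions, not the authors' code: the one-hidden-layer `tanh` MLP, the one-hot encoding of the pivot word, the parameter names, and the toy sizes are hypothetical, and the sample uses the standard form of the reparameterization trick, z = mu + sigma * eps with eps drawn from N(0, I).

```python
import math
import random


def linear(W, b, v):
    """Affine map W v + b, with W stored as (out x in); W plays the role of M^T."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def encode(x_id, context_ids, V, params):
    """Map a pivot word and its BOW context to Gaussian parameters (mu, sigma)."""
    one_hot = [0.0] * V
    one_hot[x_id] = 1.0
    bow = [0.0] * V
    for w in context_ids:
        bow[w] += 1.0                      # word counts within the context window
    # Hypothetical one-hidden-layer MLP over the concatenated input.
    pi = [math.tanh(a) for a in linear(params["W_h"], params["b_h"], one_hot + bow)]
    mu = linear(params["W_mu"], params["b_mu"], pi)
    sigma = linear(params["W_sigma"], params["b_sigma"], pi)
    return mu, sigma


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def init(rows, cols, rng, std=0.1):
    """Small random weight matrix (toy initialization, chosen for illustration)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: V words, H hidden units, D latent dimensions.
V, H, D = 5, 4, 3
rng = random.Random(1)
params = {
    "W_h": init(H, 2 * V, rng), "b_h": [0.0] * H,
    "W_mu": init(D, H, rng), "b_mu": [0.0] * D,
    "W_sigma": init(D, H, rng), "b_sigma": [1.0] * D,
}
mu, sigma = encode(x_id=2, context_ids=[0, 1, 1, 4], V=V, params=params)
z = reparameterize(mu, sigma, rng)
```

Because the randomness is isolated in `eps`, the sample is a deterministic, differentiable function of `mu` and `sigma`, which is what lets gradients reach the encoder parameters in a framework with automatic differentiation.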

Decoder
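The Decoder corresponds to $p_\theta(x_n, w_n | z_n)$ with random variables instantiated by $x_n$ and $w_n$. More concretely, we define two neural functions to generate the pivot word and the context words separately. Both functions involve an MLP, while the context words are generated independently from each other by the topic mixture weighted by the hidden topic distributions. The neural functions are expressed as:
$$p_\theta(x_n | z_n) = \mathrm{softmax}(M_x z_n + b_x), \qquad \zeta_n = \mathrm{MLP}(z_n), \qquad p_\theta(w_{n,c} | z_n) = \mathrm{softmax}(\beta^T \zeta_n).$$
In this case, the MLP for the pivot word is specified as a fully-connected layer. Recall that we represent the context window $w_n$ as a BOW; the instantiated probability $p_\theta(x_n, w_n | z_n^{(s)})$ can therefore be derived as:
$$p_\theta\big(x_n, w_n | z_n^{(s)}\big) = \frac{\exp\big(M_x z_n^{(s)} + b_x\big)_{x_n}}{\sum_{v=1}^{V} \exp\big(M_x z_n^{(s)} + b_x\big)_v} \prod_{c=1}^{C} \frac{\exp\big(\beta^T \zeta_n^{(s)}\big)_{w_{n,c}}}{\sum_{v=1}^{V} \exp\big(\beta^T \zeta_n^{(s)}\big)_v},$$
where $\exp(M_x z_n^{(s)} \ldots$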
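To make the decoder concrete, the sketch below evaluates the log-likelihood of a pivot word and its context window for a given latent sample. It assumes the forms stated earlier in this section, namely a softmax over `M_x z + b_x` for the pivot word and a softmax over `beta^T zeta` for every context word; the single layer `W_zeta` standing in for the MLP and all numeric values are hypothetical.

```python
import math


def log_softmax(v):
    """Log of the softmax of a list of scores, computed stably."""
    m = max(v)
    lse = m + math.log(sum(math.exp(a - m) for a in v))
    return [a - lse for a in v]


def decoder_log_likelihood(x_id, context_ids, z, M_x, b_x, W_zeta, beta):
    """log p(x, w | z) for one pivot word and its context window.

    W_zeta is a hypothetical single layer standing in for the MLP that maps z
    to zeta; beta is the K x V topic matrix, treated here as scores that are
    mixed by zeta and then passed through a softmax.
    """
    # Pivot word: softmax(M_x z + b_x).
    pivot_scores = [sum(m * a for m, a in zip(row, z)) + b
                    for row, b in zip(M_x, b_x)]
    log_p_x = log_softmax(pivot_scores)
    # Topical distribution zeta over the K topics.
    topic_scores = [sum(w * a for w, a in zip(row, z)) for row in W_zeta]
    zeta = [math.exp(a) for a in log_softmax(topic_scores)]
    # Context words: softmax(beta^T zeta), shared by every position in the window.
    V = len(beta[0])
    mixed = [sum(zeta[k] * beta[k][v] for k in range(len(beta))) for v in range(V)]
    log_p_w = log_softmax(mixed)
    # Context words are exchangeable, so their log-probabilities simply add up.
    return log_p_x[x_id] + sum(log_p_w[w] for w in context_ids)


z = [0.5, -1.0]                                   # D = 2
M_x = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.5]]       # V = 3
b_x = [0.0, 0.1, -0.1]
W_zeta = [[1.0, 0.0], [0.0, 1.0]]                 # K = 2
beta = [[2.0, 0.0, -1.0], [-1.0, 0.5, 1.5]]       # K x V
ll = decoder_log_likelihood(0, [1, 2, 2], z, M_x, b_x, W_zeta, beta)
```

Since the context window is a bag of words, the order of `context_ids` does not affect the result; only the counts of each word matter.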

Loss Function
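We are now ready to compute the ELBO in Equation 2 with the specified $q_\phi(z_n | x_n, w_n)$ and $p_\theta(x_n, w_n | z_n^{(s)})$ in hand. Our final objective function that needs to be maximized is:
$$\mathcal{L}(\theta, \phi) = \sum_{n=1}^{N} \left[ \frac{1}{S} \sum_{s=1}^{S} \log p_\theta\big(x_n, w_n | z_n^{(s)}\big) + \frac{1}{2} \sum_{d=1}^{D} \big(1 + \log \sigma_{n,d}^2 - \mu_{n,d}^2 - \sigma_{n,d}^2\big) \right],$$
where $D$ denotes the dimension of $\mu$ and $S$ denotes the number of sample points required for the computation of the expectation term. The second term is the negative KL divergence between the Gaussian variational distribution and the standard Gaussian prior, which is available in closed form. The loss function is the negative of the objective function. The learning procedure is summarized in Algorithm 1.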
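For a single data point, the objective can be sketched as a Monte Carlo estimate of the expected log-likelihood plus the negative KL divergence to the prior. The closed-form KL expression used here is the standard one for a diagonal Gaussian against N(0, I), which is an assumption about the exact form intended; the function names and the numbers in the example are made up for illustration.

```python
import math


def negative_kl_to_standard_normal(mu, sigma):
    """-KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form; sigma must be non-zero."""
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                     for m, s in zip(mu, sigma))


def elbo(sample_log_likelihoods, mu, sigma):
    """Monte Carlo ELBO for one data point from S sampled log-likelihoods."""
    S = len(sample_log_likelihoods)
    reconstruction = sum(sample_log_likelihoods) / S
    return reconstruction + negative_kl_to_standard_normal(mu, sigma)


# Toy numbers (made up): S = 1 sample, D = 2 latent dimensions.
value = elbo([-7.3], mu=[0.4, -0.2], sigma=[0.9, 1.1])
loss = -value  # the quantity minimized during training
```

The KL term vanishes only when the variational distribution equals the prior, so any informative posterior pays a penalty that the reconstruction term has to outweigh.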

Prediction
After training, we are able to map the words to their respective representations using the Encoder part of JTW. The Encoder takes a pivot word together with its context window as an input and outputs the parameters of the variational distribution considered to be the approximated posterior q φ (z|x n , w n ), which is a Gaussian distribution in our case. The word representations are Gaussian parameters {µ n , σ n }. Because the output of the Encoder is formulated as a Gaussian distribution, the word similarity of two words can be either computed by the KL-divergence between the Gaussian distributions, or by the cosine similarity between their means. We use the Gaussian mean µ to represent a word given its context. The universal representation of a word type can be obtained by averaging the posterior means of all occurrences over the corpus.
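The two similarity options and the averaging step described above can be sketched as follows. This is an illustrative sketch only: the function names and the toy vectors are hypothetical, the KL divergence assumes diagonal Gaussians with strictly positive standard deviations, and note that the KL divergence is asymmetric, so a smaller value means the first distribution is closer to the second.

```python
import math


def kl_diag_gaussians(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, diag(sigma1^2)) || N(mu2, diag(sigma2^2))), sigmas > 0."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2):
        total += math.log(s2 / s1) + (s1 * s1 + (m1 - m2) ** 2) / (2.0 * s2 * s2) - 0.5
    return total


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_means(means):
    """Context-free vector of a word type: mean of its per-occurrence posterior means."""
    n = len(means)
    return [sum(col) / n for col in zip(*means)]


# Hypothetical posterior means of one word type in three different contexts.
occurrences = [[0.9, 0.1], [1.1, -0.1], [1.0, 0.3]]
word_vector = average_means(occurrences)
similarity = cosine(word_vector, [1.0, 0.0])
divergence = kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
```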

Experimental Setup
Dataset. We train the proposed JTW model on the Yelp dataset, which is a collection of more than 4 million reviews on over 140k business categories. Although the number of business categories is large, the vast majority of reviews fall into 5 business categories.
Parameter Setting. The word semantics are represented as 100-dimensional vectors (i.e., D = 100), which is a default configuration for word representations (Mikolov et al., 2013a; Bražinskas et al., 2018). The number of latent topics is set to 50. Kingma and Welling (2014) showed that the number of samples per data point can be set to 1 if the batch size is large (e.g., > 100). In our experiments, we set the batch size to 2,048 and the number of samples per data point, S, to 1. The context window size is set to 10. Network parameters (i.e., θ, φ) are all initialized from a normal distribution with zero mean and 0.1 variance.
Baselines. We compare our model against four baselines:
• CvMF (Jameel and Schockaert, 2019). CvMF can be viewed as an extension of GloVe that modifies the objective function by multiplying a mixture of vMFs, whose distance is measured by cosine similarity instead of Euclidean distance. The mixture depicts the underlying semantics with which the words could be clustered.
• Bayesian Skip-Gram (BSG) (Bražinskas et al., 2018). BSG is a probabilistic word-embedding method also built on VAE, which achieved state-of-the-art results among Bayesian word-embedding alternatives (Vilnis and McCallum, 2015; Barkan, 2017). BSG infers the posterior or dynamic embedding given a pivot word and its observed context and is able to learn context-dependent word embeddings.
• Skip-gram Topical word Embedding (STE) (Shi et al., 2017). STE adapted the commonly known Skip-Gram by associating each word with an input matrix and an output matrix and used the Expectation-Maximization (EM) method with negative sampling for model parameter inference. For topic generation, they need to evaluate the probability $p(w_{t+j} | z, w_t)$ for each topic $z$ and each skip-gram $\langle w_t; w_{t+j} \rangle$, and represent each topic as a ranked list of bi-grams.
• Mixed Membership Skip-Gram (MMSG) (Foulds, 2018). MMSG leverages mixed membership modeling in which words are assumed to be clustered into topics and the words in the context of a given pivot word are drawn from the log-bilinear model using the vector representations of the context-dependent topic. Model inference is performed using the Metropolis-Hastings-Walker algorithm with noise-contrastive estimation.
Among the aforementioned baselines, CvMF and BSG only generate word embeddings and do not model topics explicitly. Also, CvMF only maps each word to a single word embedding whereas BSG can output context-dependent word embeddings. Both STE and MMSG can learn topics and topic-dependent embeddings at the same time. However, in STE the topic dependence is stored in the lines of word matrices and the word representations themselves are context independent. In contrast, MMSG associates each word with a topic distribution; it could produce contextualized word embeddings by summing up topic vectors weighted by the posterior topic distribution given a context. We probe into different topic counts and find the best setting for methods with topics or mixtures. In all the baselines, the dimensionality of word embeddings is tuned and finally set to 100. The BSG implementation is available at https://github.com/ixlan/BSG.

Experimental Results
We compare JTW with baselines on both word similarity and word-sense disambiguation tasks for the learned word embeddings, and also present the topic coherence and qualitative evaluation results for the extracted topics. Furthermore, we show that JTW can be easily integrated with deep contextualized word embeddings to further improve the performance of downstream tasks such as sentiment classification.

Word Similarity
The word similarity task (Finkelstein et al., 2001) has been widely adopted to measure the quality of word embeddings. In the word similarity task, a number of word pairs are given. Each pair of words should be assigned a score that indicates their relatedness. The calculated scores are then compared with the gold-standard scores by means of the Spearman rank-order correlation coefficient. Because the word similarity task requires context-free word representations, we aggregate all the occurrences and obtain a universal vector for each word. The measure used for similarity scores is cosine similarity. For STE, we use AvgSimC following Shi et al. (2017). We further make a comparison with the results of the Skip-Gram (SG) model, which maps each word token to a single point in a Euclidean space without considering different senses of words. All the approaches are evaluated on the 7 commonly used benchmarking datasets. For JTW, we average the results over 10 runs and also report the standard deviations.
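The evaluation protocol can be sketched as below: score each word pair by cosine similarity, then compare the model scores with the human scores via the Spearman rank-order correlation. This is a self-contained sketch rather than the evaluation script used for the reported results; the toy vectors and gold scores are invented, and Spearman's coefficient is computed as the Pearson correlation of average ranks, which assumes neither score list is constant.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)


# Invented context-free vectors and gold relatedness scores, for illustration only.
vectors = {"good": [1.0, 0.1], "great": [0.9, 0.2],
           "bad": [-1.0, 0.1], "table": [0.0, 1.0]}
pairs = [("good", "great", 9.0), ("good", "bad", 3.0), ("good", "table", 1.0)]
model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
gold_scores = [g for _, _, g in pairs]
rho = spearman(model_scores, gold_scores)
```

Only the ordering of the scores matters for this metric, so models whose similarity values live on different scales can still be compared directly.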
The results are reported in Table 1. It can be observed that among the baselines, BSG achieves the lowest score on average, followed by MMSG. Although JTW clearly beats all the other models only on SimLex-999, it performs only slightly worse than the top model on 5 of the remaining 6 benchmarks. Overall, JTW gives the best results on average. A noticeable gap can be observed on the Stanford's Contextual Word Similarities (SCWS) dataset, where JTW, MMSG and BSG give better results compared with SG, CvMF and STE. This can be explained by the fact that, in … The small standard deviation of JTW indicates that the performance is consistent across multiple runs.

Lexical Substitution
While the word similarity tasks focus more on the general meaning of a word (since word pairs are presented without context), in this section, we turn to the lexical substitution task (Yuret, 2007;Thater et al., 2011), which was designed to evaluate the word-embedding learning methods regarding their ability to disambiguate word senses. The lexical substitution task can be described by the following scenario: Given a sentence and one of its member words, find the most related replacement from a list of candidate words. As stated in Thater et al. (2011), a good lexical substitution should not only capture the relatedness between the candidate word and the original word, but also imply the correctness with respect to the context.
Following Bražinskas et al. (2018), we derive the setting from Melamud et al. (2015) to ensure a fair comparison between the context-free word embedding methods and the context-dependent ones. In detail, for JTW and BSG, we capture the context of a given word using the BOW representation, and derive the representation of each candidate word taking account of the context. For CvMF and STE, the similarity score is computed using a measure in which y is the candidate word and x denotes the original word. For MMSG, the original word's representation is calculated as the sum of its associated topic vectors weighted by the word's posterior topical distribution. Given an original word and its context, we choose the candidate word with the highest similarity score. We compare the performance of various models on lexical substitution using the dataset from SemEval 2007 Task 10 (McCarthy and Navigli, 2007), which consists of 1,688 instances. Because some words have multiple synonyms as annotated in the dataset, we consider a chosen candidate word a correct prediction if it hits one of the ground-truth replacements.
We report in Table 2 the accuracy scores of different methods. Context-sensitive word embeddings generally perform better than context-free alternatives. STE can only learn context-independent word embeddings and hence gives the lowest score. BSG is able to learn context-dependent word embeddings and outperforms CvMF. Among the joint topic and word embedding learning methods, STE performs the worst, showing that associating each word with two matrices and learning topic-dependent word embeddings based on PLSA appears to be less effective. Both JTW and MMSG show superior performance compared to BSG. JTW outperforms MMSG because JTW also models the generation of the pivot word in addition to context words, and the VAE framework for parameter inference is more effective than the annealed negative contrastive estimation used in MMSG.

Topic Coherence
Because only STE and MMSG can jointly learn topics and word embeddings among the baselines, we compare our proposed JTW with these two models in terms of topic quality. The evaluation metric we employed is the topic coherence metric proposed in Röder et al. (2015). The metric extracts co-occurrence counts of the topic words in Wikipedia using a sliding window of size 110. For each top word, a vector is calculated whose elements are the normalized point-wise mutual information between the word and every other top word. Given a topic, the arithmetic mean of the cosine similarities of all vector pairs is treated as the coherence measure. We calculate the topic coherence score of each extracted topic based on its associated top ten words using Palmetto (Rosner et al., 2014). The topic coherence results with the topic number varying between 10 and 200 are plotted in Figure 2. The graph shows that JTW scores the highest under all the topic settings. It gives the best coherence score of 0.416 at 50 topics, and gradually flattens with the increasing number of topics. MMSG exhibits an upward trend up to 100 topics, and drops to 0.365 when the topic number is set to 150. STE undergoes a gradual decrease and then stabilizes with the topic number beyond 150.
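The coherence computation described in this section can be sketched on a toy corpus as follows. This is a simplified illustration of the description in the text, not a reimplementation of Palmetto: probabilities are estimated from a small list of pre-built windows instead of sliding windows over Wikipedia, each vector also includes the word's NPMI with itself, and the boundary conventions (NPMI of -1 for words that never co-occur, 1 when the joint probability is 1) are assumptions.

```python
import math
from itertools import combinations


def npmi(p_i, p_j, p_ij):
    """Normalized point-wise mutual information in [-1, 1]."""
    if p_ij == 0.0:
        return -1.0          # convention: the words never co-occur
    if p_ij == 1.0:
        return 1.0           # convention: the words co-occur in every window
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def topic_coherence(top_words, windows):
    """Mean pairwise cosine similarity of the top words' NPMI vectors."""
    n = len(windows)

    def prob(*words):
        # Fraction of windows that contain all the given words.
        return sum(1 for win in windows if all(w in win for w in words)) / n

    vectors = [[npmi(prob(a), prob(b), prob(a, b)) for b in top_words]
               for a in top_words]
    sims = [cosine(u, v) for u, v in combinations(vectors, 2)]
    return sum(sims) / len(sims)


# Toy windows (sets of tokens), invented for illustration.
windows = [{"pizza", "pasta"}, {"pizza", "pasta"}, {"car"}, {"car"}]
coherent = topic_coherence(["pizza", "pasta"], windows)
incoherent = topic_coherence(["pizza", "car"], windows)
```

Comparing NPMI vectors rather than raw NPMI values rewards words that relate to the rest of the topic in the same way, even when the two words themselves rarely appear in the same window.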

Extracted Topics
We present in Table 3 the example topics extracted by JTW and MMSG. It can be easily inferred from the top words generated by JTW that Topic 1 is related to 'Food', whereas Topic 5 is about the 'Clinical Service', which is identified by the words 'caring' and 'physician'. It can also be deduced from the top words that Topics 2, 3 and 4 represent 'Shopping', 'Beauty' and 'Automotive', respectively. In contrast, topics produced by MMSG contain more semantically incoherent words, as highlighted by italics. For example, Topic 1 in MMSG contains words relating to both food and staff. This might be caused by the fact that, in MMSG, training is performed as a two-stage process by first assigning topics to words using Gibbs sampling, then estimating the topic vectors and word vectors from word co-occurrences and topic assignments via a maximum likelihood estimator. This is equivalent to a topic model with parameterized word embeddings. Conversely, in JTW, latent variables in the generative process are recognized as word representations.
Parameters reside in the generative network, and are inferred by the VAE. No extra parameters are introduced to encode the words. Therefore, the topics extracted tend to be more identifiable.

Visualization of Word Semantics
The extracted topics allow the visualization of word semantics. In JTW, a word's semantic meanings can be interpreted as a distribution over the discovered latent topics. This is achieved by aggregating all the contextualized topical distributions of a particular word throughout the corpus. Meanwhile, when a word is placed under a specific context, its topical distribution can be directly transformed from its contextualized representation. We chose three words, 'plastic', 'bar' and 'patient', to illustrate their polysemous nature. To further demonstrate their context-dependent meanings, we also visualize the topic distribution of the following three sentences: (1) Effective patient care requires clinical knowledge and understanding of physical therapy; (2) Restaurant servers require patient temperament; (3) You have to bring your own bags or boxes but you can also purchase plastic bags. The topical distributions for the pivot words and the three example sentences are shown in Figure 3.
We can deduce from the overall distributions that the semantic meaning of 'plastic' distributes almost equally on two topics, 'shopping' and 'beauty', while the meaning of 'bar' is more prominent on the 'food' and 'shopping' topics. 'Patient' has a strong connection with the 'clinical' topic, though it is also associated with the 'food' topic. When considering a specific context about the patient care, Sentence 1 has its topic distribution peaked at the 'clinical' topic. Sentence 2 also contains the word 'patient', but it now has its topic distribution peaked at 'food'. Sentence 3 mentioned 'plastic bags' and its most prominent topic is 'shopping'. These results show that JTW can indeed jointly learn latent topics and topic-specific word embeddings.

Integration with Deep Contextualized Word Embeddings
Recent advances in deep contextualized word representation learning have generated significant impact in natural language processing. Different from traditional word embedding learning methods such as Word2Vec or GloVe, where each word is mapped to a single vector representation, deep contextualized word representation learning methods are typically trained by language modelling and generate a different word vector for each word depending on the context in which it is used. A notable work is ELMo (Peters et al., 2018), which is commonly regarded as the pioneer for deriving deep contextualized word embeddings (Devlin et al., 2019). ELMo calculates the weighted sum of different layers of a multi-layered BiLSTM-based language model, using the normalized vector as a representation for the corresponding word. More recently, in contrast to ELMo, BERT (Devlin et al., 2019) was proposed to apply the bidirectional training of Transformer to masked language modelling. Because of its capability of effectively encoding contextualized knowledge from huge external corpora in word embeddings, BERT has refreshed the state-of-the-art results on a number of NLP tasks.
While Word2Vec/GloVe and ELMo/BERT represent two opposite extremes in word embedding learning, with the former learning a single vector representation for each word and the latter learning a separate vector representation for each occurrence of a word, our proposed JTW sits in the middle in that it learns different word vectors depending on which topic a word is associated with. Nevertheless, we can incorporate ELMo/BERT embeddings into JTW. This is achieved by replacing the BOW input with the pre-trained ELMo/BERT word embeddings in the Encoder-Decoder architecture of JTW, making the resulting word embeddings better at capturing semantic topics in a specific domain. More precisely, the training objective is switched to the cosine of half the angle $\theta$ between the input ELMo/BERT vector and the decoded output vector. Recall that the input to the model has been encoded by pre-trained word vectors (e.g., 300-dimensional vectors). Our training objective is to make the reconstructed $x_{n,c}$ as close as possible to their original input word embeddings. The difference is measured by the angle between the input and the output vectors. Normalized ELMo/BERT vectors can be transformed to the polar coordinate system with trigonometric functions, which forms a probability distribution since $\int_0^{\pi} \frac{1}{2}\cos\frac{\theta}{2}\,d\theta = 1$, and the function is monotone in the similarity between the input ELMo/BERT embeddings and the reconstructed output embeddings, reaching its peak when $x_n = x_n^{(p)}$ (i.e., $\theta = 0$). Therefore, we are able to replace Equation 8 with Equation 11 when ELMo/BERT is attached. The input vectors of the Encoder are then the embeddings produced by ELMo/BERT, and the Decoder outputs are the reconstructed word embeddings aligned with the input.
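As a sanity check on the half-angle objective, the sketch below computes $\cos(\theta/2)$ for two vectors through the identity $\cos(\theta/2) = \sqrt{(1 + \cos\theta)/2}$, valid for $\theta \in [0, \pi]$, and numerically confirms that $\frac{1}{2}\cos\frac{\theta}{2}$ integrates to 1 over $[0, \pi]$. The function names are hypothetical, and the code illustrates the quantity only, not the training procedure of JTW.

```python
import math


def half_angle_cosine(x, x_rec):
    """Return cos(theta / 2), where theta is the angle between x and x_rec."""
    dot = sum(a * b for a, b in zip(x, x_rec))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_rec))
    if norm == 0.0:
        raise ValueError("vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    # Half-angle identity, valid because theta lies in [0, pi].
    return math.sqrt((1.0 + cos_theta) / 2.0)


def density_integral(steps=100000):
    """Midpoint-rule estimate of the integral of 0.5 * cos(theta / 2) over [0, pi]."""
    h = math.pi / steps
    return h * sum(0.5 * math.cos((i + 0.5) * h / 2.0) for i in range(steps))
```

The value is 1 for identical directions ($\theta = 0$), $\sqrt{1/2}$ for orthogonal vectors, and 0 for opposite directions, so it decreases monotonically as the reconstruction drifts away from the input.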
We use the sentiment classification task on Yelp to compare the performance of JTW, ELMo and BERT (https://github.com/google-research/bert), and the integration of both, JTW-ELMo and JTW-BERT, by 10-fold cross validation. In all the experiments, we fine-tune the models on the training set consisting of 90% of the documents sampled from the dataset described in Section 4 and evaluate on the remaining 10%, which serves as the test set. We employ the further pre-training scheme (Sun et al., 2019), in which different learning rates are applied to each layer and slanted triangular learning rates are imposed across epochs when adapting the language model to the training corpus (Howard and Ruder, 2018). The classifier used for all the methods is an attention hop over a BiLSTM with a softmax layer. The ground truth labels are the five-scale review ratings included in the original dataset. The 5-class sentiment classification results in precision, recall, macro-F1 and micro-F1 scores are reported in Table 4.
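For readers unfamiliar with the two learning-rate schemes mentioned above, the following sketch shows the slanted triangular schedule as defined by Howard and Ruder (2018) and a simple geometric layer-wise decay in the spirit of Sun et al. (2019). The function names and the default values (`lr_max`, `cut_frac`, `ratio`, `decay`) are illustrative assumptions and are not the settings used in our experiments.

```python
import math


def slanted_triangular_lr(t, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t: short linear warm-up, then long linear decay."""
    cut = math.floor(total_steps * cut_frac)  # step at which the rate peaks
    if cut <= 0:
        raise ValueError("total_steps * cut_frac must be at least 1")
    if t < cut:
        p = t / cut  # increasing phase
    else:
        p = 1.0 - (t - cut) / (cut * (1.0 / cut_frac - 1.0))  # decreasing phase
    return lr_max * (1.0 + p * (ratio - 1.0)) / ratio


def layerwise_lrs(top_lr, num_layers, decay=0.95):
    """Per-layer rates, index 0 = bottom layer; lower layers get smaller rates."""
    return [top_lr * decay ** (num_layers - 1 - k) for k in range(num_layers)]
```

With `total_steps=1000` and the defaults above, the rate starts at `lr_max / ratio`, peaks at `lr_max` at step 100, and returns to `lr_max / ratio` at the final step.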
It can be observed from Table 4 that a sentiment classifier trained on JTW-produced word embeddings gives worse results than one using the deep contextualized word embeddings generated by ELMo or BERT. Nevertheless, when the ELMo or BERT front-end is integrated with JTW, the combined models, JTW-ELMo and JTW-BERT, outperform their respective original deep contextualized word representation models. A paired t-test verifies that JTW-ELMo outperforms ELMo and BERT at the 95% confidence level on Micro-F1. The results show that our proposed JTW is flexible and can be easily integrated with pre-trained contextualized word embeddings to capture domain-specific semantics better than directly fine-tuning the pre-trained ELMo or BERT on the target domain, hence leading to improved sentiment classification performance.

Conclusion
Motivated by the observation that word embedding learning and topic modeling can benefit each other, we propose a probabilistic generative framework that jointly discovers more semantically coherent latent topics from the global context and learns topic-specific word embeddings, which naturally address the problem of word polysemy. Experimental results verify the effectiveness of the model on word similarity evaluation and word sense disambiguation. Furthermore, the model can discover latent topics shared across documents, and the encoder of JTW can generate the topical distribution for each word. This enables an intuitive understanding of word semantics. We have also shown that our proposed JTW can be easily integrated with deep contextualized word embeddings to further improve the performance of downstream tasks. In future work, we will explore the discourse relationships between context windows to model, for example, the semantic shift between neighboring sentences.