Abstract
Probabilistic topic modeling is a common first step in crosslingual tasks to enable knowledge transfer and extract multilingual features. Although many multilingual topic models have been developed, their assumptions about the training corpus are quite varied, and it is not clear how well the different models can be utilized under various training conditions. In this article, the knowledge transfer mechanisms behind different multilingual topic models are systematically studied, and through a broad set of experiments with four models on ten languages, we provide empirical insights that can inform the selection and future development of multilingual topic models.
1 Introduction
Popularized by Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003), probabilistic topic models have been an important tool for analyzing large collections of texts (Blei 2012, 2018). Their simplicity and interpretability make topic models popular for many natural language processing tasks, such as discovery of document networks (Chen et al. 2013; Chang and Blei 2009) and authorship attribution (Seroussi, Zukerman, and Bohnert 2014).
Topic models take a corpus D as input, where each document d ∈ D is usually represented as a sparse vector in a vocabulary space, and project these documents to a lower-dimensional topic space. In this sense, topic models are often used as a dimensionality reduction technique to extract representative and human-interpretable features.
Text collections, however, are often not in a single language, and thus there has been a need to generalize topic models from monolingual to multilingual settings. Given a corpus with documents in languages ℓ ∈ {1, …, L}, multilingual topic models learn topics in each of the languages. From a human's view, each topic should be related to the same theme, even if the words are not in the same language (Figure 1(b)). From a machine's view, the word probabilities within a topic should be similar across languages, so that the low-dimensional representation of a document does not depend on its language. In other words, the topic space in multilingual topic models is language agnostic (Figure 1(a)).
This article presents two major contributions to multilingual topic models. We first provide an alternative view of multilingual topic models by explicitly formulating a crosslingual knowledge transfer process during posterior inference (Section 3). Based on this analysis, we unify different multilingual topic models by defining a function called the transfer operation. This function provides an abstracted view of the knowledge transfer mechanism behind these models, while enabling further generalizations and improvements. Using this formulation, we analyze several existing multilingual topic models (Section 4).
Second, in our experiments we compare four representative models under different training conditions (Section 5). The models are trained and evaluated on ten languages from various language families to increase language diversity in the experiments. In particular, we include five languages with relatively high resources and five others with low resources. To quantitatively evaluate the models, we focus on topic quality in Section 5.3.1, and performance of downstream tasks using crosslingual document classification in Section 5.3.2. We investigate how sensitive the models are to different language resources (i.e., parallel/comparable corpus and dictionaries), and analyze what factors cause this difference (Sections 6 and 7).
2 Background
We first review monolingual topic models, focusing on Latent Dirichlet Allocation, and then describe two families of multilingual extensions. Based on the types of supervision added to multilingual topic models, we separate the two model families into document-level and word-level supervision.
Topic models provide a high-level view of latent thematic structures in a corpus. Two main branches for topic models are non-probabilistic approaches such as Latent Semantic Analysis (LSA; Deerwester et al. 1990) and Non-Negative Matrix Factorization (Xu, Liu, and Gong 2003), and probabilistic ones such as Latent Dirichlet Allocation (LDA; Blei, Ng, and Jordan 2003) and probabilistic LSA (pLSA; Hofmann 1999). All these models were originally developed for monolingual data and later adapted to multilingual situations. Though there has been work to adapt non-probabilistic models, for example, based on “pseudo-bilingual” corpora approaches (Littman, Dumais, and Landauer 1998), most multilingual topic models that are trained on multilingual corpora are based on probabilistic models, especially LDA. Therefore, our work is focused on the probabilistic topic models, and in the following section we start by describing LDA.
2.1 Monolingual Topic Models
The most popular topic model is LDA, introduced by Blei, Ng, and Jordan (2003). This model assumes each document d is represented by a multinomial distribution θd over topics, and that each “topic” k is a multinomial distribution ϕ(k) over the vocabulary V. In the generative process, each θ and ϕ is generated from a Dirichlet distribution parameterized by α and β, respectively. The hyperparameters of the Dirichlet distributions can be asymmetric (Wallach, Mimno, and McCallum 2009), though in this work we use symmetric priors. Figure 2 shows the plate notation of LDA.
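As a concrete illustration, the generative process of LDA can be simulated in a few lines of NumPy (a minimal sketch; the corpus size, vocabulary size, and hyperparameter values below are arbitrary and do not correspond to our experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, D, doc_len = 20, 5000, 100, 50   # topics, vocabulary size, documents, tokens per document
alpha, beta = 0.1, 0.01                # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=K)       # topic-word distributions, phi_k ~ Dir(beta)
corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(K, alpha))      # document-topic distribution, theta_d ~ Dir(alpha)
    z = rng.choice(K, size=doc_len, p=theta_d)      # topic assignment for every token
    w = [rng.choice(V, p=phi[k]) for k in z]        # each word drawn from its assigned topic
    corpus.append(w)
```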
2.2 Multilingual Topic Models
We now describe a variety of multilingual topic models, organized into two families based on the type of supervision they use. Later, in Section 4, we focus on a subset of the models described here for deeper analysis using our knowledge transfer formulation, selecting the most general and representative models.
2.2.1 Document Level
The first model proposed to process multilingual corpora using LDA is the Polylingual Topic Model (PLTM; Mimno et al. 2009; Ni et al. 2009). This model extracts language-consistent topics from parallel or highly comparable multilingual corpora (for example, Wikipedia articles aligned across languages), assuming that document translations share the same topic distributions. This model has been extensively used and adapted in various ways for different crosslingual tasks (Krstovski and Smith 2011; Moens and Vulic 2013; Vulić and Moens 2014; Liu, Duh, and Matsumoto 2015; Krstovski and Smith 2016).
In the generative process, PLTM first generates language-specific topic-word distributions ϕ(ℓ,k) ∼ Dir(β(ℓ)), for topics k = 1, …, K and languages ℓ = 1, …, L. Then, for each document tuple d, it generates a tuple-topic distribution θd ∼ Dir(α). Every topic in this document tuple is generated from θd, and the word tokens in the tuple are then generated from the language-specific word distributions ϕ(ℓ,k) for each language. To apply PLTM, the corpus must be parallel or closely comparable to provide document-level supervision. We refer to this as the document links model (doclink).
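A minimal sketch of this process for a single bilingual document tuple is given below (the vocabulary sizes and document lengths are hypothetical; the point is that one θd is shared by both languages while each language keeps its own ϕ(ℓ,k)):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha, beta = 20, 0.1, 0.01
V = {"l1": 5000, "l2": 6000}             # language-specific vocabulary sizes (illustrative)
doc_len = {"l1": 60, "l2": 50}

# language-specific topic-word distributions phi^(l,k) ~ Dir(beta^(l))
phi = {l: rng.dirichlet(np.full(V[l], beta), size=K) for l in V}

# one document tuple: a single tuple-topic distribution theta_d shared across languages
theta_d = rng.dirichlet(np.full(K, alpha))
tuple_tokens = {}
for l in V:
    z = rng.choice(K, size=doc_len[l], p=theta_d)            # topics drawn from the shared theta_d
    tuple_tokens[l] = [rng.choice(V[l], p=phi[l][k]) for k in z]
```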
Models that transfer knowledge on the document level have many variants, including softlink (Hao and Paul 2018), comparable bilingual LDA (c-bilda; Heyman, Vulic, and Moens 2016), the partially connected multilingual topic model (pcMLTM; Liu, Duh, and Matsumoto 2015), and multi-level hyperprior polylingual topic model (mlhPLTM; Krstovski, Smith, and Kurtz 2016). softlink generalizes doclink by using a dictionary, so that documents can be linked based on overlap in their vocabulary, even if the corpus is not parallel or comparable. c-bilda is a direct extension of doclink that also models language-specific distributions to distinguish topics that are shared across languages from language-specific topics. pcMLTM adds an additional observed variable to indicate the absence of a language in a document tuple. mlhPLTM uses a hierarchy of hyperparameters to generate section-topic distributions. This model was motivated by applications to scientific research articles, where each section s has its own topic distribution θ(s) shared by both languages.
2.2.2 Word Level
Instead of document-level connections between languages, Boyd-Graber and Blei (2009) and Jagarlamudi and Daumé III (2010) proposed to model connections between languages through words using a multilingual dictionary and apply hyper-Dirichlet Type-I distributions (Andrzejewski, Zhu, and Craven 2009; Dennis III 1991). We refer to these approaches as the vocabulary links model (voclink).
Specifically, voclink uses a dictionary to create a tree structure in which each internal node contains word translations, and words that are not translated are attached directly to the root r of the tree as leaves. In the generative process, for each language ℓ, voclink first generates K multinomial distributions over all internal nodes and untranslated word types, ϕ(r,ℓ,k) ∼ Dir(β(r,ℓ)), where β(r,ℓ) is a vector of Dirichlet priors from the root r to the internal nodes and untranslated words in language ℓ. Then, under each internal node i, for each language ℓ, voclink generates a multinomial ϕ(i,ℓ,k) ∼ Dir(β(i,ℓ)) over the word types of language ℓ attached to node i. Note that both β(r,ℓ) and β(i,ℓ) are vectors. In β(r,ℓ), each cell is parameterized by a scalar β′ and scaled by the number of word translations under the corresponding internal node. In β(i,ℓ), every cell takes the same scalar β′′, that is, the prior is symmetric. See Figure 3 for an illustration.
Document-topic distributions θd are generated in the same way as monolingual LDA, because no document translation is required.
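To make the two kinds of prior vectors concrete, the following sketch builds β(r,ℓ) and β(i,ℓ) from a toy dictionary (the dictionary format, the per-node scaling, and the treatment of untranslated words follow the description above but are otherwise our own simplification):

```python
import numpy as np

beta_prime, beta_dprime = 0.01, 100.0       # scalar priors beta' and beta'' (values as in Section 5)
# toy dictionary: each entry becomes an internal node linking an English word to translations in language l
dictionary = {"dog": ["hund"], "cat": ["katze"], "house": ["haus", "heim"]}
vocab_l = {"hund", "katze", "haus", "heim", "zebra"}   # "zebra" has no translation

internal_nodes = list(dictionary)
translated = {w for words in dictionary.values() for w in words}
untranslated = sorted(vocab_l - translated)            # attached directly under the root

# beta^(r,l): root-level prior; each internal-node cell is beta' scaled by the number of translations
# under that node, and each untranslated word type gets an (assumed unscaled) cell of beta'
beta_r = np.array([beta_prime * len(dictionary[n]) for n in internal_nodes]
                  + [beta_prime] * len(untranslated))

# beta^(i,l): symmetric prior over the word types of language l under each internal node i
beta_i = {n: np.full(len(dictionary[n]), beta_dprime) for n in internal_nodes}
```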
The use of dictionaries to model similarities across topic-word distributions has been formulated in other ways as well. ProbBiLDA (Ma and Nasukawa 2017) uses inverted indexing (Søgaard et al. 2015) to encode the assumption that word translations are generated from the same distributions. ProbBiLDA does not use tree structures in its parameters as voclink does, but the general idea of sharing distributions among word translations is similar. Gutiérrez et al. (2016) use part-of-speech taggers to separate topic words (nouns) and perspective words (adjectives and verbs), developed for the application of detecting cultural differences, such as how different languages take different perspectives on the same topic. Topic words are modeled in the same way as in voclink, whereas perspective words are modeled in a monolingual fashion.
3 Crosslingual Transfer in Probabilistic Topic Models
Conceptually, the term “knowledge transfer” indicates that there is a process of carrying information from a source to a destination. Using the representations of graphical models, the process can be visualized as the dependence of random variables. For example, X → Y implies that the generation of variable Y is conditioned on X, and thus the information in X is carried to Y. If X represents a probability distribution, the distribution of Y is informed by X, presenting a process of knowledge transfer as we define it in this work.
In our study, “knowledge” can be loosely defined as the K multinomial distributions over the vocabularies, ϕ(ℓ,k) for topics k = 1, …, K and languages ℓ = 1, …, L. Thus, to study the transfer mechanisms in topic models is to reveal how the models transfer these distributions from one language to another. To date, this transfer process has not been made explicit in most models, because typical multilingual topic models assume that the tokens in multiple languages are generated jointly.
In this section, we present a reformulation of these models that breaks down the co-generation assumption of current models and instead explicitly shows the dependencies between languages. Starting with a simple example in Section 3.1, we show that our alternative formulation derives the same collapsed Gibbs sampler, and thus the same posterior distribution over samples, as the original model. With this prerequisite, in Section 3.3 we introduce the transfer operation, which will be used to generalize and extend current multilingual topic models in Section 4.
3.1 Transfer Dependencies
We start with a simple graphical model, where θ is a K-dimensional categorical distribution drawn from a Dirichlet parameterized by α, a symmetric hyperparameter (Figure 4(a)). Using θ, the model generates two variables, X and Y, and we use x and y to denote the generated observations. Under the co-generation assumption, the variables X and Y are generated from the same θ at the same time, without dependencies between each other. We thus call this the joint model; conditioned on θ, the probability of the sample (x, y) factorizes as P(x, y | θ) = P(x | θ) P(y | θ).
In the alternative, conditional formulation, the model first generates θx from Dirichlet(α) and uses θx to generate the sample x. Using the histogram of x, denoted nx = [n1|x, n2|x, …, nK|x] where nk|x is the number of instances of X assigned to category k, together with the hyperparameter α, the model then generates a categorical distribution θy|x ∼ Dir(nx + α), from which the sample y is drawn.
This differs from the original joint model in that the original parameter vector θ has been replaced with two variable-specific parameter vectors. The next section derives posterior inference with Gibbs sampling after integrating out the θ parameters, and we show that the samplers of the two model formulations are equivalent and thus sample from an equivalent posterior distribution over x and y.
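The conditional formulation is also straightforward to simulate (a minimal sketch; K, α, and the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
K, alpha = 5, 0.1
n_x, n_y = 100, 100

theta_x = rng.dirichlet(np.full(K, alpha))        # theta_x ~ Dir(alpha)
x = rng.choice(K, size=n_x, p=theta_x)            # sample x

hist_x = np.bincount(x, minlength=K)              # n_x = [n_{1|x}, ..., n_{K|x}]
theta_y_given_x = rng.dirichlet(hist_x + alpha)   # theta_{y|x} ~ Dir(n_x + alpha)
y = rng.choice(K, size=n_y, p=theta_y_given_x)    # sample y conditioned on x
```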
3.2 Collapsed Gibbs Sampling
General approaches to infer posterior distributions over graphical model variables include Gibbs sampling, variational inference, and hybrid approaches (Kim, Voelker, and Saul 2013). We focus on collapsed Gibbs sampling (Griffiths and Steyvers 2004), which marginalizes out the parameters (θ in the example above) to focus on the variables of interest (x and y in the example).
From a computational perspective, although the meanings of Equations (7), (10), and (11) differ, their formulae are identical. This allows us to analyze similar models using the conditional formulation without changing the posterior estimation. A similar approach is the pseudo-likelihood approximation, in which a joint model is reformulated as the combination of two conditional models, and the optimal parameters for the pseudo-likelihood function are the same as for the original joint likelihood function (Besag 1975; Koller and Friedman 2009; Leppä-aho et al. 2017).
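As an illustration of this equivalence in the toy model of Section 3.1 (a schematic version of the derivation, showing only the conditional for a single observation y_j):

$$
p(y_j = k \mid \mathbf{x}, \mathbf{y}_{-j}, \alpha) \;\propto\; n_{k|x} + n^{-j}_{k|y} + \alpha_k \qquad \text{(joint model, $\theta$ integrated out)}
$$
$$
p(y_j = k \mid \mathbf{x}, \mathbf{y}_{-j}, \alpha) \;\propto\; n^{-j}_{k|y} + \bigl(n_{k|x} + \alpha_k\bigr) \qquad \text{(conditional model, $\theta_{y|x}$ integrated out)}
$$

The two right-hand sides are identical term by term, so both formulations yield the same collapsed sampler for y.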
3.3 Transfer Operation
Now that we have made the transfer process explicit and shown that the alternative formulation yields the same collapsed posterior, we can describe a similar process in detail in the context of multilingual topic models.
If we treat X and Y in the previous example as two languages, and the samples x and y as words, tokens, or documents in those languages, we have a bilingual data set (x, y). Topic models have more complex graphical structures, where the examples (tokens) are organized within certain scopes (e.g., documents). To define the transfer process of a specific topic model, we therefore have to specify which examples from the other language are used when generating samples in one language, to what extent they are used, and where they are used. To this end, we define the transfer operation, which allows us to examine different models under a unified framework and compare them systematically.
Definition 1 (Transfer operation)
In this definition, the first argument of the transfer operation specifies where the two languages connect to each other, and can be any bilingual supervision needed to enable transfer. The actual values of L1 and L2 depend on the specific model. For example, when generating a document in language ℓ2, L1 is the number of documents in language ℓ1 and L2 = 1, and δ could be a binary vector where δi = 1 if document i in ℓ1 is the translation of the current document in ℓ2, and zero otherwise. This argument is the core of crosslingual transfer through the transfer operation; later we will see that different multilingual topic models mostly differ only in this argument, and designing this matrix is critical for efficient knowledge transfer.
The second argument in the transfer operation is the sufficient statistics of the transfer source (ℓ1 in the definition). After the instances in language ℓ1 are generated, their statistics are organized into a matrix. The last argument is a prior distribution over the possible target distributions Ω.
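Schematically, and using the notation of Table 1, a transfer operation can therefore be thought of as a mapping (the exact functional form varies by model and is given in Section 4):

$$
h\bigl(\boldsymbol{\delta},\, \mathbf{N}^{(\ell_1)},\, \Omega\bigr) \;\longrightarrow\; \text{Dirichlet prior for the target distribution in } \ell_2,
$$

where δ is an L1 × L2 matrix encoding the bilingual supervision, N(ℓ1) is the matrix of sufficient statistics collected from language ℓ1, and Ω is the prior on the destination.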
In summary, this definition highlights three elements that are necessary to enable transfer:
language transformations or supervision from the transfer source to destination;
data statistics in the source; and
a prior on the destination.
In the next section, we show how different topic models can be formulated with transfer operations, as well as how transfer operations can be used in the design of new models.
4 Representative Models
In this section, we describe four representative multilingual topic models in terms of the transfer operation formulation. These are also the models we will experiment on in Section 5. The plate notations of these models are shown in Figure 7, and we provide notations frequently used in these models in Table 1.
Notation | Description
---|---
z | The topic assignment to a token.
w(ℓ) | A word type in language ℓ.
V(ℓ) | The size of the vocabulary in language ℓ.
D(ℓ) | The size of the corpus in language ℓ.
D(ℓ1,ℓ2) | The number of document pairs in languages ℓ1 and ℓ2.
α | A symmetric Dirichlet prior vector of size K, where K is the number of topics; each cell is denoted αk.
θd,ℓ | Multinomial distribution over topics for a document d in language ℓ.
β(ℓ) | A symmetric Dirichlet prior vector of size V(ℓ), where V(ℓ) is the size of the vocabulary in language ℓ.
β(r,ℓ) | An asymmetric Dirichlet prior vector of size I + V(ℓ,−), where I is the number of internal nodes in a Dirichlet tree and V(ℓ,−) is the number of untranslated words in language ℓ. Each cell is a scalar prior for a specific node i or an untranslated word type.
β(i,ℓ) | A symmetric Dirichlet prior vector whose size is the number of word types in language ℓ under internal node i.
ϕ(ℓ,k) | Multinomial distribution over word types in language ℓ for topic k.
ϕ(r,ℓ,k) | Multinomial distribution over internal nodes in a Dirichlet tree for topic k.
ϕ(i,ℓ,k) | Multinomial distribution over all word types in language ℓ under internal node i for topic k.
4.1 Standard Models
Typical multilingual topic models are designed based on simple observations of multilingual data, such as parallel corpora and dictionaries. We focus on three popular models, and re-formulate them using the conditional generation assumption and the transfer operation we introduced in the previous sections.
4.1.1 DOCLINK
4.1.2 C-BILDA
Before diving into the specific definition of the transfer operation for this model, we need to take a closer look at the generative process of c-bilda, because in this model language itself is a random variable as well. We describe the generative process in terms of the conditional formulation, where one language is conditioned on the other. As before, a monolingual model first generates the documents in ℓ1, so at this point each document pair d only has tokens in one language. Then, for each document pair d, the conditional model additionally generates a number of topics z using the transfer operation on θ as defined in Equation (16). Instead of directly drawing a new word type in language ℓ2 according to z, c-bilda adds a step that generates a language ℓ′ from η(z, d). Because the current token is supposed to be in language ℓ2, if ℓ′ ≠ ℓ2, the token is dropped and the model draws the next topic z; otherwise, a word type is drawn from ϕ(ℓ2,z) and attached to the document pair d. Once this process is over, each document pair d contains tokens from both languages, and by separating the tokens by language we obtain the corresponding set of comparable document pairs. Conceptually, c-bilda adds a “selector” to the generative process that decides whether a topic should appear more in ℓ2 based on the topics in ℓ1. Figure 8 illustrates the difference between doclink and c-bilda.
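A rough sketch of this conditional generative step for the ℓ2 tokens of one document pair is given below (the η values are illustrative placeholders; the actual model draws them from their own prior):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V2, n_tokens, alpha, beta = 20, 6000, 50, 0.1, 0.01

theta_d = rng.dirichlet(np.full(K, alpha))          # tuple-topic distribution shared with l1
phi_l2 = rng.dirichlet(np.full(V2, beta), size=K)   # topic-word distributions for l2
eta_d = rng.beta(2.0, 2.0, size=K)                  # P(language = l2 | topic k, document d); placeholder values

tokens_l2 = []
while len(tokens_l2) < n_tokens:
    z = rng.choice(K, p=theta_d)                    # draw a topic from the shared theta_d
    if rng.random() >= eta_d[z]:                    # language "selector": tokens assigned to l1 are dropped
        continue
    tokens_l2.append(rng.choice(V2, p=phi_l2[z]))   # otherwise emit a word type in l2
```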
4.1.3 VOCLINK
4.2 softlink: A Transfer Operation–Based Model
We have formulated three representative multilingual topic models by defining transfer operations for each of them. Our recent work, softlink (Hao and Paul 2018), is explicitly designed around this understanding of the transfer process. We present this model as a demonstration of how, by modifying the transfer operation, transfer operations can be used to build new multilingual topic models that might not have an equivalent formulation as a standard co-generation model.
In doclink, the supervision argument δ in the transfer operation is constructed from comparable data sets. This requirement, however, substantially limits the data that can be used. Moreover, the supervision δ is itself limited by the data: If no translation is available for a target document, δ is an all-zero vector, and the transfer operation defined in Equation (16) cancels out all the available information for that document, which is an ineffective use of the resource. Unlike parallel corpora, dictionaries are widely available and often easy to obtain for many languages. Thus, the general idea of softlink is to use a dictionary to retrieve as much information as possible from ℓ1 when constructing δ, linking potentially comparable documents together even if the corpus itself does not explicitly link documents.
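The following sketch shows one simple way such a transfer distribution δ could be computed from a dictionary; the overlap measure and normalization here are deliberately simplified illustrations, not the exact thresholded construction used by softlink:

```python
import numpy as np

# toy data: documents represented as sets of word types
docs_l1 = [{"dog", "cat", "house"}, {"election", "vote", "party"}]
doc_l2 = {"hund", "katze", "wahl"}
dictionary = {("dog", "hund"), ("cat", "katze"), ("election", "wahl")}

# count dictionary-linked word pairs shared by the target l2 document and each l1 document
overlap = np.array([
    sum(1 for (w1, w2) in dictionary if w1 in d1 and w2 in doc_l2)
    for d1 in docs_l1
], dtype=float)

# normalize into a transfer distribution over l1 documents; fall back to uniform if nothing matches
delta = overlap / overlap.sum() if overlap.sum() > 0 else np.full(len(docs_l1), 1.0 / len(docs_l1))
```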
4.3 Summary: Transfer Levels and Transfer Models
We categorize transfer operations into two groups based on the target transfer distribution. Document-level operations transfer knowledge on distributions related to the entire document, such as θ in doclink, c-bilda, and softlink, and η in c-bilda. Word-level operations transfer knowledge on those related to the entire vocabulary or specific word types, such as ϕ in voclink.
When a model has transfer operations on only one level, we also use the transfer level to refer to the model. For example, doclink, c-bilda, and softlink are all document-level models, while voclink is a word-level model. Models that transfer knowledge on multiple levels, such as Hu et al. (2014b), are called mixed-level models.
Model | Document level | Word level | Parameters of h
---|---|---|---
LDA | α | β(ℓ) | —
doclink | h(δ, N(ℓ1), α) | β(ℓ) | δ: indicator vector; N(ℓ1): doc-by-topic matrix; supervision: comparable documents
c-bilda | h(δ, N(ℓ1), α) | β(ℓ) | δ: indicator vector; N(ℓ1): doc-by-topic matrix; supervision: comparable documents
softlink | h(δ, N(ℓ1), α) | β(ℓ) | δ: transfer distribution; N(ℓ1): doc-by-topic matrix; supervision: dictionary
voclink | α | h(δ, N(ℓ1), β) | δ: indicator vector; N(ℓ1): node-by-word matrix; supervision: dictionary
5 Experiment Settings
From the discussion above, we are able to describe various multilingual topic models by defining different transfer operations, which explicitly represent the crosslingual transfer process. When designing and applying these transfer operations in practice, some natural questions arise, such as which transfer operation is more effective in which situations, and how to design a model that generalizes well regardless of the availability of multilingual resources.
To study the model behaviors empirically, we train the four models described in the previous section—doclink, c-bilda, softlink, and voclink—in ten languages. Considering the resources available, we separate the ten languages into two groups: high-resource languages (HighLan) and low-resource languages (LowLan). For HighLan, we have relatively abundant resources such as dictionary entries and document translations. We additionally use these languages to simulate the settings of LowLan by training multilingual topic models with different amounts of resources. For LowLan, we use all resources available to verify experiment results and conclusions from HighLan.
5.1 Language Groups and Preprocessing
We separate the ten languages into two groups: HighLan and LowLan. In this section, we describe the preprocessing details of these languages.
5.1.1 HIGHLAN
Languages in this group have a relatively large amount of resources and have been widely experimented on in multilingual studies. Considering language diversity, we select representative languages from five different families: Arabic (ar, Semitic), German (de, Germanic), Spanish (es, Romance), Russian (ru, Slavic), and Chinese (zh, Sinitic). We follow standard preprocessing procedures: We first use stemmers to process both documents and dictionaries (a segmenter for Chinese), and then remove stopwords based on a fixed list as well as the 100 most frequent word types in the training corpus. The tools for preprocessing are listed in Table 3.
Language | Family | Stemmer | Stopwords
---|---|---|---
en | Germanic | SnowBallStemmer | NLTK
de | Germanic | SnowBallStemmer | NLTK
es | Romance | SnowBallStemmer | NLTK
ru | Slavic | SnowBallStemmer | NLTK
ar | Semitic | Assem’s Arabic Light Stemmer | GitHub
zh | Sinitic | Jieba | GitHub
5.1.2 LOWLAN
Languages in this group have far fewer resources than those in HighLan and are treated as low-resource languages. We similarly select five languages from different families: Amharic (am, Afro-Asiatic), Aymara (ay, Aymaran), Macedonian (mk, Indo-European), Swahili (sw, Niger-Congo), and Tagalog (tl, Austronesian). Note that some of these are not strictly “low-resource” compared with many endangered languages. For truly low-resource languages, it is very difficult to test the models with enough data, and we therefore choose languages that are understudied in the natural language processing literature.
Preprocessing in this language group needs more consideration. Because they represent low-resource languages that most natural language processing tools are not available for, we do not use a fixed stopword list. Stemmers are also not available for these languages, so we do not apply stemming.
5.2 Training Sets and Model Configurations
There are many resources available for multilingual research, such as the European Parliament Proceedings parallel corpus (EuroParl; Koehn 2005), the Bible, and Wikipedia. EuroParl provides a perfectly parallel corpus with precise translations, but it only contains 21 European languages, which limits its generalizability to most languages. The Bible, on the other hand, is also perfectly parallel and is widely available in 2,530 languages. Its disadvantages, however, are that its content is very limited (mostly about family and religion), the data set is small (1,189 chapters), and many languages are not available in digital form (Christodoulopoulos and Steedman 2015).
Compared with EuroParl and the Bible, Wikipedia provides comparable documents in many languages covering a large range of content, making it a very popular choice for many multilingual studies. In our experiments, we create ten bilingual Wikipedia corpora, each containing documents in one of the languages in HighLan or LowLan, paired with documents in English (en). Though most multilingual topic models are not restricted to bilingual corpora paired with English, this restriction is a helpful way to focus our experiments and analysis.
We present the statistics of the training corpus of Wikipedia and the dictionary we use (from Wiktionary) in the experiments in Table 4. Note that we train topic models on bilingual pairs, where one of the languages is always English, so in the table we show statistics of English in every bilingual pair as well.
| | English (en) | | | Paired language | | | Wiktionary |
|---|---|---|---|---|---|---|---|
| | #docs | #tokens | #types | #docs | #tokens | #types | #entries |
| HighLan | | | | | | | |
| ar | 2,000 | 616,524 | 48,133 | 2,000 | 181,946 | 25,510 | 16,127 |
| de | 2,000 | 332,794 | 35,921 | 2,000 | 254,179 | 55,610 | 32,225 |
| es | 2,000 | 369,181 | 37,100 | 2,000 | 239,189 | 30,258 | 31,563 |
| ru | 2,000 | 410,530 | 39,870 | 2,000 | 227,987 | 37,928 | 33,574 |
| zh | 2,000 | 392,745 | 38,217 | 2,000 | 168,804 | 44,228 | 23,276 |
| LowLan | | | | | | | |
| am | 2,000 | 3,589,268 | 161,879 | 2,000 | 251,708 | 65,368 | 4,588 |
| ay | 2,000 | 1,758,811 | 84,064 | 2,000 | 169,439 | 24,136 | 1,982 |
| mk | 2,000 | 1,777,081 | 100,767 | 2,000 | 489,953 | 87,329 | 6,895 |
| sw | 2,000 | 2,513,838 | 143,691 | 2,000 | 353,038 | 46,359 | 15,257 |
| tl | 2,000 | 2,017,643 | 261,919 | 2,000 | 232,891 | 41,618 | 6,552 |
Lastly, we summarize the model configurations in Table 5. The goal of this study is to bring current multilingual topic models together and study their respective strengths and limitations. To keep the experiments as comparable as possible, we use constant hyperparameters that are consistent across the models. For all models, we set the Dirichlet hyperparameter αk = 0.1 for each topic k = 1, …, K. We run 1,000 Gibbs sampling iterations on the training set and 200 iterations on the test sets. The number of topics K is set to 20 by default for efficiency reasons.
Model | Hyperparameters
---|---
doclink | We set β to be a symmetric vector where each cell βi = 0.01 for all word types of all the languages, and use the MALLET implementation for training (McCallum 2002). To enable consistent comparison, we disable the hyperparameter optimization provided in the package.
c-bilda | Following the experimental results of Heyman, Vulic, and Moens (2016), we set χ = 2 to make the results more competitive with doclink. The rest of the settings are the same as for doclink.
softlink | We use the document-wise thresholding approach for calculating the transfer distributions. The focus threshold is set to 0.8. The rest of the settings are the same as for doclink.
voclink | We set the scalar β′ = 0.01 for the hyperparameter β(r,ℓ) from the root to both internal nodes and leaves. For edges from internal nodes to leaves, we set β′′ = 100, following the settings in Hu et al. (2014b).
5.3 Evaluation
We evaluate all models using both intrinsic and extrinsic metrics. Intrinsic evaluation is used to measure the topic quality or coherence learned from the training set, and extrinsic evaluation measures performance after applying the trained distributions to downstream crosslingual applications. For all the following experiments and tasks, we start by analyzing languages in HighLan. Then we apply the analyzed results to LowLan.
We choose topic coherence (Hao, Boyd-Graber, and Paul 2018) and crosslingual document classification (Smet, Tang, and Moens 2011) as intrinsic and extrinsic evaluation tasks, respectively. The reason for choosing these two tasks is that they examine the models from different angles: Topic coherence looks at topic-word distributions, whereas classification focuses on document-topic distributions. Other evaluation tasks, such as word translation detection and crosslingual information retrieval, also utilize the trained distributions, but here we focus on a straightforward and representative task.
5.3.1 Intrinsic Evaluation: Topic Quality
Intrinsic evaluation refers to evaluating the learned model directly, without applying it to any particular task; for topic models, this is usually based on the quality of the topics. Standard evaluation measures for monolingual models, such as perplexity (or held-out likelihood; Wallach et al. 2009) and Normalized Pointwise Mutual Information (npmi; Lau, Newman, and Baldwin 2014), could potentially be considered for crosslingual models. However, when evaluating multilingual topics, how words in different languages make sense together is a critical criterion in addition to coherence within each of the languages.
In monolingual studies, Chang et al. (2009) show that held-out likelihood is not always positively correlated with human judgments of topics. Held-out likelihood is additionally suboptimal for multilingual topic models, because this measure is only calculated within each language, and the important crosslingual information is ignored.
Crosslingual Normalized Pointwise Mutual Information (cnpmi; Hao, Boyd-Graber, and Paul 2018) is a measure designed specifically for multilingual topic models. Extended from the widely used npmi to measure topic quality in multilingual settings, cnpmi uses a parallel reference corpus to extract crosslingual coherence. cnpmi correlates well with bilingual speakers’ judgments on topic quality and predictive performance in downstream applications. Therefore, we use cnpmi for intrinsic evaluations.
Definition 2 (Crosslingual Normalized Pointwise Mutual Information, cnpmi)
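As a schematic form of the score (our gloss of the definition in Hao, Boyd-Graber, and Paul (2018); C denotes the number of top words taken from each topic, and the probabilities are estimated from the linked document pairs of the reference corpus):

$$
\textsc{cnpmi}(k;\ell_1,\ell_2) \;=\; \frac{1}{C^2}\sum_{i=1}^{C}\sum_{j=1}^{C}
\frac{\log \dfrac{P\bigl(w^{(\ell_1)}_i,\, w^{(\ell_2)}_j\bigr)}{P\bigl(w^{(\ell_1)}_i\bigr)\,P\bigl(w^{(\ell_2)}_j\bigr)}}{-\log P\bigl(w^{(\ell_1)}_i,\, w^{(\ell_2)}_j\bigr)},
$$

where P(w(ℓ1)i, w(ℓ2)j) is the probability that the bilingual word pair co-occurs in a linked document pair.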
Intuitively, a coherent topic should contain words that make sense or fit a specific context together. In the multilingual case, cnpmi measures how likely it is that a bilingual word pair appears in a similar context, as provided by the parallel reference corpus. We provide toy examples in Figure 10, where we show three bilingual topics. In Topic A, both languages are about “language,” and all the bilingual word pairs have a high probability of appearing in the same comparable document pairs; Topic A is therefore coherent crosslingually and is expected to have a high cnpmi score. Although we can identify the themes within each language in Topic B, that is, education in English and biology in Swahili, most of the bilingual word pairs do not make sense or appear in the same context, which gives a low cnpmi score. The last topic is not coherent even within each language, so it also has a low cnpmi. Through this example, we see that cnpmi detects crosslingual coherence in multiple ways, unlike other intrinsic measures that might be adapted for crosslingual models.
In our experiments, we use 10,000 linked Wikipedia article pairs for each language pair (en, ℓ) (20,000 articles in total) as the reference corpus, and set C = 10 by default. Note that HighLan has more Wikipedia articles available, and we ensure that the articles used for computing cnpmi scores do not appear in the training set. For LowLan, however, because the number of linked Wikipedia articles is extremely limited, we use all available pairs to evaluate cnpmi scores. The statistics are shown in Table 6.
| | English | | | Paired language | | |
|---|---|---|---|---|---|---|
| | #docs | #tokens | #types | #docs | #tokens | #types |
| HighLan | | | | | | |
| ar | 10,000 | 3,597,322 | 128,926 | 10,000 | 996,801 | 64,197 |
| de | 10,000 | 2,155,680 | 103,812 | 10,000 | 1,459,015 | 166,763 |
| es | 10,000 | 3,021,732 | 149,423 | 10,000 | 1,737,312 | 142,086 |
| ru | 10,000 | 3,016,795 | 154,442 | 10,000 | 2,299,332 | 284,447 |
| zh | 10,000 | 1,982,452 | 112,174 | 10,000 | 1,335,922 | 144,936 |
| LowLan | | | | | | |
| am | 4,316 | 9,632,700 | 269,772 | 4,316 | 403,158 | 91,295 |
| ay | 4,187 | 5,231,260 | 167,531 | 4,187 | 280,194 | 32,424 |
| mk | 10,000 | 11,080,304 | 301,026 | 10,000 | 3,175,182 | 245,687 |
| sw | 10,000 | 13,931,839 | 341,231 | 10,000 | 1,755,514 | 134,152 |
| tl | 6,471 | 7,720,517 | 645,534 | 6,471 | 1,124,049 | 83,967 |
5.3.2 Extrinsic Evaluation: Crosslingual Classification
Crosslingual document classification is the most common downstream application of multilingual topic models (Smet, Tang, and Moens 2011; Vulić et al. 2015; Heyman, Vulic, and Moens 2016). Typically, a model is trained on a multilingual training set in languages ℓ1 and ℓ2. Using the trained topic-word distributions ϕ, the model then infers topics for held-out test documents in each of the two languages.
In multilingual topic models, document-topic distributions θ can be used as features for classification, where the vectors in language ℓ1 train a classifier tested by the vectors in language ℓ2. A better classification performance indicates more consistent features across languages. See Figure 11 for an illustration. In our experiments, we use a linear support vector machine to train multilabel classifiers with five-fold cross-validation. Then, we use micro-averaged F-1 scores to evaluate and compare performance across different models.
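A minimal sketch of this evaluation pipeline with scikit-learn is shown below; the document-topic matrices and label matrices are random placeholders standing in for the inferred θ vectors and the actual document labels, and the cross-validation step is omitted for brevity:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
K, n_l1, n_l2, n_labels = 20, 1000, 1000, 3

theta_l1 = rng.dirichlet(np.full(K, 0.1), size=n_l1)   # inferred theta for l1 documents (placeholder)
theta_l2 = rng.dirichlet(np.full(K, 0.1), size=n_l2)   # inferred theta for l2 documents (placeholder)
y_l1 = rng.integers(0, 2, size=(n_l1, n_labels))       # multilabel targets (placeholder)
y_l2 = rng.integers(0, 2, size=(n_l2, n_labels))

clf = OneVsRestClassifier(LinearSVC())                  # linear SVM, one binary classifier per label
clf.fit(theta_l1, y_l1)                                 # train on l1 document-topic vectors
pred = clf.predict(theta_l2)                            # test on l2 document-topic vectors
print("micro-averaged F-1:", f1_score(y_l2, pred, average="micro"))
```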
For crosslingual classification, we also require held-out test data with labels or annotations. In our experiments, we construct test sets from two sources: TED Talks 2013 (ted) and Global Voices (gv). ted contains parallel documents in all languages in HighLan, whereas gv contains all languages from both HighLan and LowLan.
Using the two multilingual sources, we create two types of test sets for HighLan—ted + ted and ted + gv—and only one type for LowLan—ted + gv. In ted + ted, we infer document-topic distributions on documents from ted in English and the paired language. This only applies to HighLan, because ted does not have documents in LowLan. In ted + gv, we infer topics on English documents from ted, and on documents from gv in the paired language (both HighLan and LowLan). The two types of test sets also represent different application situations: ted + ted implies that the test documents in both languages are parallel and come from the same source, whereas ted + gv shows how the topic models perform when the two languages have different data sources.
Both corpora are retrieved from http://opus.nlpl.eu/ (Tiedemann 2012). The labels, however, are manually retrieved from http://ted.com/ and http://globalvoices.org. In the ted corpus, each document is a transcript of a talk and is assigned to multiple categories on its Web page, such as “technology,” “arts,” and so forth. We collect all categories for the entire ted corpus and use the three most frequent—technology, culture, and science—as document labels. Similarly, in the gv corpus, each document is a news story that has been labeled with multiple categories on the story's Web page. Because the two sets in ted + gv come from different sources, and training and testing are only possible when both sets share the same labels, we apply the same three ted labels to gv as well. This processing requires minor mappings, for example, from “arts-culture” in gv to “culture” in ted. The data statistics are presented in Table 7.
| | Corpus statistics | | | Label distributions | | |
|---|---|---|---|---|---|---|
| | #docs | #tokens | #types | technology | culture | science |
| ted | | | | | | |
| ar | 1,112 | 1,066,754 | 15,124 | 384 | 304 | 290 |
| de | 1,063 | 774,734 | 19,826 | 364 | 289 | 276 |
| es | 1,152 | 933,376 | 13,088 | 401 | 312 | 295 |
| ru | 1,010 | 831,873 | 17,020 | 346 | 275 | 261 |
| zh | 1,123 | 1,032,708 | 19,594 | 386 | 315 | 290 |
| gv (HighLan) | | | | | | |
| ar | 2,000 | 325,879 | 13,072 | 510 | 489 | 33 |
| de | 1,481 | 269,470 | 16,031 | 346 | 344 | 42 |
| es | 2,000 | 367,631 | 11,104 | 457 | 387 | 38 |
| ru | 2,000 | 488,878 | 16,157 | 516 | 369 | 62 |
| zh | 2,000 | 528,370 | 18,194 | 499 | 366 | 56 |
| gv (LowLan) | | | | | | |
| am | 39 | 10,589 | 4,047 | 3 | 3 | 1 |
| ay | 674 | 66,076 | 4,939 | 76 | 100 | 46 |
| mk | 1,992 | 388,713 | 29,022 | 343 | 426 | 182 |
| sw | 1,383 | 359,066 | 14,072 | 137 | 110 | 71 |
| tl | 254 | 26,072 | 6,138 | 32 | 67 | 19 |
6 Document-Level Transfer and Its Limitations
We first explore the empirical characteristics of document-level transfer, using doclink, c-bilda, and softlink.
Multilingual corpora can be loosely categorized into three types: parallel, comparable, and incomparable. A parallel corpus contains exact document translations across languages, of which EuroParl and the Bible, discussed before, are examples. A comparable corpus contains document pairs (in the bilingual case), where each document in one language has a related counterpart in the other language. However, these document pairs are not exact translations of each other, and they can only be connected through a loosely defined “theme.” Wikipedia is an example, where document pairs are linked by article titles. Incomparable corpora contain potentially unrelated documents across languages, with no explicit indicators of document pairs.
Different levels of comparability come with different availability of such corpora: It is much harder to find parallel corpora for low-resource languages. Therefore, we first focus on HighLan and use Wikipedia to simulate the low-resource situation in Section 6.1, where we find that doclink and c-bilda are very sensitive to the training corpus and thus might not be the best option for low-resource languages. We then examine LowLan in Section 6.2.
6.1 Sensitivity to Training Corpus
We first vary the comparability of the training corpus and study how different models behave under different situations. All models are potentially affected by the comparability of the training set, although only doclink and c-bilda explicitly rely on this information to define transfer operations. This experiment shows that models transferring knowledge on the document level (doclink and c-bilda) are very sensitive to the training set, but can be almost entirely insensitive with appropriate modifications to the transfer operation as in softlink.
6.1.1 Experiment Settings
For each language pair (en, ℓ), we construct a random subsample of 2,000 documents from Wikipedia in each language (4,000 in total). To vary the comparability, we vary the proportion of linked Wikipedia articles between the two languages over 0.0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, and 1. When the proportion is zero, the bilingual corpus is entirely incomparable, that is, no document-level translations can be found in the other language, and doclink and c-bilda degrade into monolingual LDA. The indicator matrix used by the transfer operations in Section 4.1.1 is then a zero matrix, δ = 0. When the proportion is one, meaning each document in one language is linked to a document in the other language, the corpus is fully comparable, and δ is an identity matrix 1. Any value between 0 and 1 makes the corpus partially comparable to a different degree. The cnpmi and crosslingual classification results are shown in Figure 12, where the shaded regions indicate the standard deviations across five Gibbs sampling chains. For voclink and softlink, we use all the dictionary entries.
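For reference, the supervision matrix used in this simulation can be constructed as follows (a small sketch; the choice of which pairs remain linked is our own illustration):

```python
import numpy as np

def make_delta(n_docs: int, proportion: float, seed: int = 0) -> np.ndarray:
    """Indicator matrix linking a random subset of the document pairs.

    proportion = 0.0 yields the all-zero matrix (incomparable corpus);
    proportion = 1.0 yields the identity matrix (fully comparable corpus).
    """
    rng = np.random.default_rng(seed)
    delta = np.zeros((n_docs, n_docs))
    linked = rng.choice(n_docs, size=int(round(proportion * n_docs)), replace=False)
    delta[linked, linked] = 1.0
    return delta

delta = make_delta(2000, 0.2)   # e.g., 20% of the 2,000 document pairs remain linked
```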
6.1.2 Results
In terms of topic coherence (cnpmi), doclink and c-bilda perform comparably and achieve their full potential when the corpus is fully comparable. As expected, models transferring knowledge at the document level (doclink and c-bilda) are very sensitive to the training corpus: The more aligned the corpus is, the better the topics the model learns. For the word-level model, voclink stays at roughly the same performance level, which is also expected, because this model does not use linked documents as supervision. However, its performance on Russian is surprisingly low compared with the other languages and models. In the next section, we look more closely at this problem by investigating the impact of dictionaries.
It is notable that softlink, a document-level model, is also insensitive to the training corpus and outperforms other models most of the time. Recall that on the document level, softlink defines transfer operation on document-topic distributions θ, similarly to doclink and c-bilda, but using dictionary resources. This implies that good design of the supervision δ in the transfer operation could lead to a more stable performance across different training situations.
When it comes to the classification task, the F-1 scores of doclink and c-bilda show very large variation, and the increasing trend of F-1 scores is less obvious than with cnpmi. This is especially true when the proportion of linked documents is very small. For one thing, when the proportion is small, the transfer at the document level is less constrained, making the projection of the two languages into the same topic space less predictive. The evaluation scope of cnpmi is also much smaller and more concentrated than that of classification, because it only considers the top C words, which does not lead to large variation.
One consistent result is that softlink still performs well on classification, with very small variation and stable F-1 scores, which again benefits from the definition of its transfer operation. When transferring topics to another language, softlink uses dictionary constraints as voclink does, but instead of a simple one-to-one word type mapping, it expands the transfer scope to the entire document. Additionally, softlink transfers knowledge distributionally from the entire corpus in the other language, which reinforces transfer efficiency without relying on direct supervision at the document level.
6.2 Performance on LowLan
In this section, we take a look at languages in LowLan. For softlink and voclink, we use all dictionary entries to train languages in LowLan, because the sizes of dictionaries in these languages are already very small. We again use a subsample of 2,000 Wikipedia document pairs with English to make the results comparable with HighLan. In Figure 13(a), we also present results of models for HighLan using fully comparable training corpora and full dictionaries for direct comparison of the effect of language resources.
In most cases, transfer at the document level (particularly c-bilda) performs better than transfer at the word level, in both HighLan and LowLan. Considering the number of dictionary entries available (Table 4), it is reasonable to suspect that the dictionary is a major factor affecting the performance of word-level transfer.
On the other hand, although softlink does not model vocabularies directly as in voclink, transferring knowledge at the document level with a limited dictionary still yields competitive cnpmi scores. Therefore, in this experiment on LowLan, we see that with the same lexicon resource, it is generally more efficient to transfer knowledge at the document level. We will also explore this in detail in Section 7.
We also present a comparison of micro-averaged F-1 scores between HighLan and LowLan in Figure 13(b). The test set used for this comparison is ted + gv, since ted does not have articles available in LowLan. Also, languages such as Amharic (am) have fewer than 50 gv articles available, which is an extremely small number for training a robust classifier, so in these experiments, we only train classifiers on English (ted articles) and test them on languages in HighLan and LowLan (gv articles).
Similarly, the classification results are generally better in document-level transfer, and both c-bilda and softlink give similar scores. However, it is worth noting that voclink has very large variations in all languages, and the F-1 scores are very low. This again suggests that transferring knowledge on the word level is less effective, and in Section 7 we study in detail why this is the case.
7 Word-Level Transfer and Its Limitations
In the previous section, we compared different multilingual topic models with a focus on document-level models. We concluded that doclink and c-bilda are very sensitive to the training corpus, which is natural given that they define supervision as a one-to-one document pair mapping. The word-level model voclink, on the other hand, generally has lower performance, especially on LowLan, even when the corpus is entirely comparable.
One interesting result we observed from the previous section is that softlink and voclink use the same dictionary resource while transferring topics on different levels, and softlink generally has better performance than voclink. Therefore, in this section, we explore the characteristics of the word-level model voclink and compare it with softlink to study why it does not use the same dictionary resource as effectively.
To this end, we first vary the amount of dictionary entries available and compare how softlink and voclink perform (Section 7.1). Based on the results, we analyze word-level transfer from three different angles: dictionary usage (Section 7.2) as an intuitive explanation of the models, topic analysis (Section 7.3) from a more qualitative perspective, and comparing transfer strength (Section 7.4) as a quantitative analysis.
7.1 Sensitivity to Dictionaries
Word-level models such as voclink use a dictionary as supervision, and thus will naturally be affected by the dictionary used. Although softlink transfers knowledge on the document level, it uses the dictionary to calculate the transfer distributions used in its document-level transfer operation. In this section, we focus on the comparison of softlink and voclink.
7.1.1 Sampling the Dictionary Resource
The dictionary is the essential part of softlink and voclink and is used in different ways to define transfer operations. The availability of dictionaries, however, varies among different languages. From Table 4, we notice that for LowLan the number of available dictionary entries is very limited, which suggests it could be a major factor affecting the performance of word-level topic models. Therefore, in this experiment, we sample different numbers of dictionary entries in HighLan to study how this alters performance of softlink and voclink.
Given a bilingual dictionary, we add only a proportion of entries in it to softlink and voclink. As in the previous experiments varying the proportion of document links, we change the proportion from 0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, to 1.0. When the proportion is 0, both softlink and voclink become monolingual LDA and no transfer happens; when the proportion is 1, both models reach their highest potential with all the dictionary entries available.
We also sample the dictionary in two ways: random-based and frequency-based. In random-based sampling, the entries are randomly chosen from the dictionary, and each of the five chains has a different set of entries added to the models. In frequency-based sampling, we select the entries for the most frequent word types in the training corpus.
Figure 14 shows a detailed comparison among different evaluations and languages. As expected, adding more dictionary entries helps both softlink and voclink, with increasing cnpmi scores and F-1 scores in general. However, we notice that adding more dictionary entries can boost softlink’s performance very quickly, whereas the increase in voclink’s cnpmi scores is slower. Similar trends can be observed in the classification task as well, where adding more words does not necessarily increase voclink’s F-1 scores, and the variations are very high.
This comparison provides an interesting insight into increasing lexical resources efficiently. In some applications, especially those involving low-resource languages, the amount of available lexical resources is very small, and one way to address this problem is to incorporate human feedback, such as the interactive topic modeling proposed by Hu et al. (2014a). In our case, a native speaker of the low-resource language could provide word translations to be incorporated into topic models. Because of limited time and financial budget, however, it is impossible to translate all the word types that appear in the corpus, so the challenge is to boost the performance of the target task as much as possible with as little human effort as possible. In this comparison, we see that if the target task is to train coherent multilingual topics, training softlink is a more efficient choice than voclink.
7.1.2 Varying Comparability of the Corpus
For softlink and voclink, the dictionary is only one aspect of the training situation. As discussed in our document-level experiments, the training corpus is also an important factor that could affect the performance of all topic models. Although corpus comparability is not an explicit requirement of softlink or voclink, it might affect the coverage provided by the dictionary or affect performance in other ways. In softlink, comparability could also affect the transfer operation's ability to find similar documents to link to. In this section, we study the relationship between dictionary coverage and the comparability of the training corpus.
Similar to the previous section, we vary the dictionary coverage over 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, and 1, using the frequency-based method from the last experiment. We also vary the proportion of linked Wikipedia articles over 0, 0.01, 0.05, 0.1, 0.2, 0.4, 0.8, and 1. We present cnpmi scores in Figure 15(a), where the results are averaged over all five languages in HighLan. It is clear that softlink outperforms voclink regardless of training corpus and dictionary size. This implies that softlink could learn coherent multilingual topics even when the training conditions are unfavorable, for example, when the training corpus lacks comparability and only a small number of dictionary entries is available.
The results of crosslingual classification are shown in Figure 15(b). When the test sets are from the same source (ted + ted), softlink utilizes the dictionary more efficiently and performs better than voclink. In particular, the F-1 scores of softlink using only 20% of the dictionary entries already exceed those of voclink using the full dictionary. A similar comparison can be drawn when the test sets are from different sources, such as ted + gv.
7.1.3 Discussion
From the results so far, it is empirically clear that transferring knowledge on the word level tends to be less efficient than on the document level. This is arguably counter-intuitive. Recall that the goal of multilingual topic models is to let semantically related words and translations have similar distributions over topics. The word-level model voclink directly uses this information (dictionary entries) to define transfer operations, yet its cnpmi scores are lower. In the following sections, therefore, we try to explain this apparent contradiction. We first analyze the dictionary usage of voclink (Section 7.2), and then examine the trained topics and compare the transfer strengths of the document and word levels for all models (Sections 7.3 and 7.4).
7.2 Dictionary Usage
In practice, the assumption of voclink is also often weakened by another important factor: the presence of word translations in the training corpus. Given a word pair (w(ℓ1), w(ℓ2)) in the dictionary, the assumption of voclink is valid only when both words appear in the training corpus of their respective languages. If w(ℓ2) does not appear in the corpus of language ℓ2, w(ℓ1) will be treated as an untranslated word instead. Figure 16 shows an example of how the tree structures in voclink are affected by the corpus and the dictionary.
In Figure 17, we present the statistics of word types from different sources on a logarithmic scale. “Dictionary” is the number of word types that appear in the original dictionary, as shown in the last column of Table 4; we apply the same preprocessing to the dictionary as to the training corpus to make sure the quantities are comparable. “Training set” is the number of word types that appear in the training set, and “Linked by voclink” is the number of word types actually used by voclink, that is, the number of non-zero entries of δ in the transfer operation.
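The gap between these three quantities can be made concrete with a small bookkeeping sketch; the data structures below are illustrative assumptions, not the code used to produce Figure 17.

```python
# A minimal sketch (an assumption for illustration, not the authors' code) of
# the three word-type counts reported in Figure 17.
def count_word_types(dict_entries, train_types_l1, train_types_l2):
    """dict_entries: (w_l1, w_l2) pairs, preprocessed like the corpus;
    train_types_l1 / train_types_l2: sets of word types observed in the
    training sets of the two languages."""
    dict_types = {w1 for w1, _ in dict_entries}
    # "Linked by voclink": both sides of a translation pair must occur in
    # their respective training corpora (a non-zero entry of delta).
    linked_types = {w1 for w1, w2 in dict_entries
                    if w1 in train_types_l1 and w2 in train_types_l2}
    return {"Dictionary": len(dict_types),
            "Training set": len(train_types_l1),
            "Linked by voclink": len(linked_types)}
```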
Note that even when we use the complete dictionary to create the tree structure in voclink, in LowLan there are far more word types in the training set than in the dictionary. In other words, the supervision matrix δ used by voclink is never actually full rank, and thus the full potential of voclink is very difficult to achieve due to the properties of the training corpus. This situation is as if the document-level model doclink had only half of the linked documents in the training corpus.
On the other hand, we notice that in HighLan, the number of word types in the dictionary is usually comparable to that of the training set (except in ar). For LowLan, however, the situation is quite the contrary: There are more word types in the training set than in the dictionary. Thus, the availability of sufficient dictionary entries is especially a problem for LowLan.
We conclude from Figure 15(a) that adding more dictionary entries slowly improves voclink, but even when enough dictionary entries are available, due to its model assumptions, voclink will not achieve its full potential unless every word in the training corpus is in the dictionary. A possible solution is to first extract word alignments from parallel corpora and then create a tree structure from those alignments, as explored in Hu et al. (2014b). However, when parallel corpora are available, we have shown that document-level models such as doclink work better anyway, and the accuracy of word aligners is another possible limitation to consider.
7.3 Topic Analysis
Whereas voclink uses a dictionary to directly model word translations, softlink uses the same dictionary to define the supervision in its transfer operation on the document level. Experiments show that transferring knowledge on the document level with a dictionary (i.e., softlink) is more efficient, resulting in stable, low-variance topic quality in various training situations. A natural question is why the same resource results in different performance at different levels of transfer operations. To answer this question from another angle, in this section we look into the actual topics trained by softlink and voclink. The general idea is to examine the same topic output from softlink and voclink and see which topic words they have in common (denoted W∩), and which words are exclusive to each model, denoted Wsoft and Wvoc for softlink and voclink, respectively. The words in Wvoc are those with lower topic coherence and are thus the key to understanding the suboptimal performance of voclink.
7.3.1 Aligning Topics
7.3.2 Comparing Document Frequency
Using the approximate alignment algorithm described above, we are now able to compare each aligned topic pair between voclink and softlink. We then calculate the average document frequency over all the words in each of the three sets (W∩, Wsoft, and Wvoc) and show the results in Figure 18.
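A minimal sketch of this comparison follows, assuming each aligned topic is given as a list of its top words and that document frequencies have been precomputed; the function and variable names are ours, not the evaluation code.

```python
# A minimal sketch (our illustration) of splitting an aligned topic pair into
# shared and model-exclusive word sets and comparing average document frequency.
def compare_topic_words(top_soft, top_voc, doc_freq):
    """top_soft, top_voc: top words of an aligned topic from softlink and
    voclink; doc_freq: maps a word to the number of training documents
    containing it."""
    soft, voc = set(top_soft), set(top_voc)
    sets = {
        "common (W_common)": soft & voc,        # words both models agree on
        "softlink only (W_soft)": soft - voc,   # exclusive to softlink
        "voclink only (W_voc)": voc - soft,     # exclusive to voclink
    }
    avg = lambda ws: sum(doc_freq.get(w, 0) for w in ws) / max(len(ws), 1)
    return {name: avg(ws) for name, ws in sets.items()}
```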
We observe that the average document frequencies over words in Wvoc are consistently lower in every language, whereas those in Wsoft are higher. This implies that voclink tends to give rare words higher probability in the topic-word distributions. In other words, voclink gives high probabilities to words that only appear in specific contexts, such as named entities. Thus, when evaluating topics using a reference corpus, the co-occurrence of such words with other words is relatively low due to the lack of that specific context in the reference corpus.
We show an example of an aligned topic in Figure 19. In this example, we see that although both voclink and softlink can discover semantically coherent words, shown in W∩, voclink focuses more on words that only appear in specific contexts: there are many words (mostly named entities) in Wvoc that only appear in one document. Due to the lack of this very specific context in the reference corpora, the co-occurrence of these words with other, more general words is likely to be zero, resulting in lower cnpmi.
7.4 Comparing Transfer Strength
Having looked at the topics to identify which words produced by voclink lower its performance relative to softlink, in this section we try to explain why this happens by analyzing the models' transfer operations. Recall that voclink defines its transfer operation on the topic-node distributions (Equation (23)), while softlink defines transfer on the document-topic distributions θ. The performance differences between transfer levels given the same resources lead to a suspicion that the document level has “stronger” transfer power.
To test this idea, for each token we first obtain the three distributions described before: the token's full conditional distribution over topics, its document-level component (from the document-topic distribution), and its word-level component (from the topic-word, or topic-node, distributions). We then calculate the cosine similarity between the conditional distribution and each component, yielding a document-level similarity and a word-level similarity. If their ratio r is greater than 1, the document-level component is dominant and shapes the conditional distribution; in other words, the document-level transfer is stronger. We calculate this ratio for all the tokens in every model and take the model-wise average over all tokens (Figure 20). The most balanced situation is r = 1, meaning that transfers on the word and document levels contribute equally to the conditional distributions.
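The per-token ratio can be computed as in the following sketch; the argument names are assumptions, standing in for the conditional topic distribution and its document- and word-level components.

```python
# A minimal sketch (not the evaluation code) of the per-token transfer-strength ratio.
import numpy as np

def transfer_strength_ratio(conditional, doc_component, word_component):
    """conditional: the token's full conditional distribution over topics;
    doc_component: the document-level factor (document-topic distribution);
    word_component: the word-level factor (topic-word / topic-node distribution)."""
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_doc = cos(conditional, doc_component)    # similarity to the document level
    s_word = cos(conditional, word_component)  # similarity to the word level
    # r > 1: the document-level component dominates; r = 1: both levels
    # contribute equally to the conditional distribution.
    return s_doc / s_word
```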
From the results, we notice that both doclink and c-bilda have stronger transfer strength on the document level, which means that their document-level transfer operations are indeed informing the decision of a token's topic. However, we also notice that voclink has transfer strength ratios very comparable to those of doclink and c-bilda, which is counter-intuitive, because voclink defines its transfer operation on the word level. This implies that transferring knowledge on the word level is weaker. It also explains why, in the previous section, voclink tends to find topic words that appear in only a few documents.
It is also interesting that softlink maintains a relatively good balance between the document and word levels, showing the most balanced transfer strengths across all models and languages.
8 Remarks and Conclusions
Multilingual topic models use corpora in multiple languages as input with additional language resources as supervision. The traits of these models inevitably lead to a wide variety of training scenarios, especially when a language’s resources are scarce, whereas most previous studies on multilingual topic models have not analyzed in depth the appropriateness of different models for different training situations and resource availability. For example, experiments are most often done in European languages, with models that are typically trained on parallel or comparable corpora.
The contributions of our study are a unifying framework for these different models and a systematic analysis of their efficacy in different training situations. We conclude by summarizing our findings along two dimensions, the characteristics of the training corpora and of the dictionaries, since these are the components necessary to enable crosslingual knowledge transfer.
8.1 Model Selection
Document-level models are shown to work best when the corpus is parallel or at least comparable. In terms of learning high-quality topics, doclink and c-bilda yield very similar results. However, because c-bilda has a “language selector” mechanism in its generative process, it is slightly more effective when training on Wikipedia articles in low-resource languages, whose document lengths differ greatly from their English counterparts. softlink, on the other hand, needs only a small dictionary to enable document-level transfer and yields very competitive results. This is especially useful for low-resource languages, where the dictionary is small and only a small number of comparable document pairs is available for training.
It is harder for word-level models to achieve the full potential of transfer, due to limits in dictionary size and training sets and to the unrealistic assumption the generative process makes about dictionary coverage. The representative model, voclink, performs as well as the other models on document classification, but its topic quality according to coherence-based metrics is lower. Compared with softlink, which also requires a dictionary as a resource, directly modeling word translations in voclink turns out to be a less efficient way of transferring dictionary knowledge. Therefore, when using dictionary information, we recommend softlink over voclink.
8.2 Crosslingual Representations
As an alternative method for learning crosslingual representations, crosslingual word embeddings have been gaining attention (Ruder, Vulić, and Søgaard 2019; Upadhyay et al. 2016). Recent crosslingual embedding architectures have been applied to a wide range of natural language processing applications and achieve state-of-the-art performance. Similar to the topic space in multilingual topic models, crosslingual embeddings learn semantically consistent features in a shared embedding space for all languages.
Both approaches, topic modeling and embedding, have advantages and limitations. Multilingual topic models still rely on supervised data to learn crosslingual representations, and the choice of such supervision and model is important, which is the main focus of this work. Topic models have the advantage of being interpretable. Embedding methods are powerful in many natural language processing tasks, and their representations are more fine-grained. Recent advances in crosslingual embedding training do not require crosslingual supervision such as dictionaries or parallel data (Artetxe, Labaka, and Agirre 2018; Lample et al. 2018), which is a large step toward the generalization of crosslingual modeling. Although how to interpret the results and how to reduce the heavy computational requirements remain open problems, embedding-based methods are a promising research direction.
Relations to Topic Models
A very common strategy for learning crosslingual embeddings is to use supervision or a sub-objective to learn a projection matrix that maps independently trained monolingual embeddings into a shared crosslingual space (Dinu and Baroni 2014; Faruqui and Dyer 2014; Tsvetkov and Dyer 2016; Vulić and Korhonen 2016).
In multilingual topic models, the supervision matrix δ plays the role of a projection matrix between languages. In doclink, for example, δ projects a document in language ℓ2 to the document space of ℓ1 (Equation (15)). softlink provides a simple extension by forming δ as a matrix of transfer distributions based on word-level document similarities. voclink applies projections in the form of word translations.
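As a rough illustration of this correspondence, the sketch below contrasts the two forms δ takes on the document level: an identity matrix linking paired documents (as in doclink) and a row-normalized document-similarity matrix giving softlink's transfer distributions. This is a schematic assumption for exposition, not the models' implementation; in particular, the row normalization is our simplification.

```python
# A minimal sketch contrasting the supervision matrix delta in doclink and softlink.
import numpy as np

def doclink_delta(num_linked_docs):
    # Each document in language l2 is linked to exactly one document in l1.
    return np.eye(num_linked_docs)

def softlink_delta(similarity):
    """similarity: a (docs in l2) x (docs in l1) matrix of word-level document
    similarities computed with the dictionary."""
    row_sums = similarity.sum(axis=1, keepdims=True)
    # Each row becomes a transfer distribution over documents in l1.
    return similarity / np.where(row_sums == 0, 1.0, row_sums)
```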
Thus, we can see that the formation of projection matrices in multilingual topic models is still static, restricted to an identity matrix or a simple pre-calculated matrix. A generalization would be to add learning the projection matrix itself as an objective of multilingual topic models. This could be a way to improve voclink by extending word associations to capture polysemy across languages and making the model less dependent on specific contexts.
8.3 Future Directions
Our study inspires future work in two directions. The first is to increase the efficiency of word-level knowledge transfer. For example, it is possible to use collocation information from translated words to transfer knowledge, cautiously, to untranslated words. It has been shown that word-level models can help find new word translations, for example, by using the existing dictionary as a “seed” and gradually adding more internal nodes to the tree structure based on the trained topic-word distributions. Additionally, our analysis showed the benefits of the “language selector” in c-bilda, which makes the generative process of doclink more realistic; a similar mechanism could be implemented in voclink to make the conditional distributions for tokens less dependent on specific contexts.
The second direction is more general. By systematically synthesizing various models and abstracting the knowledge transfer mechanism through an explicit transfer operation, we can construct models that shape the probabilistic distributions of a target language using those of a source language. By defining different transfer operations, more complex and robust models can be developed, and this transfer formulation may provide new ways of constructing models beyond the traditional joint formulation (Hao and Paul 2019). For example, softlink is a generalization of doclink based on transfer operations that does not have an equivalent joint formulation. This framework for thinking about multilingual topic models may lead to new ideas for other models.
Notes
Although some models, as in Hu et al. (2014b), transfer knowledge at both document and word levels, in this analysis, we only focus on the word level where no transfer happens on the document level. The generalization simply involves using the same transfer operation on θ that is used in doclink.