Abstract
Topic taxonomy discovery aims to uncover topics at different abstraction levels and to construct hierarchical relations between them. Unfortunately, most prior work assumes a Euclidean embedding space and therefore can hardly model the semantic scopes of words and topics. Worse still, these methods infer asymmetric hierarchical relations from symmetric distances between topic embeddings. As a result, existing methods suffer from low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into a box embedding space, where an asymmetric metric is defined to properly infer hierarchical relations among topics. Additionally, BoxTM explicitly infers upper-level topics from the correlations among specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate the high quality of the topic taxonomy learned by BoxTM.
1 Introduction
Taxonomy knowledge discovery, the process of extracting latent semantic hierarchies from text corpora, is a crucial yet challenging research field. For text mining applications, it can serve as the foundation of complex question answering (Luo et al., 2018) and recommendation systems (Xie et al., 2022). An important line of research focuses on learning word-level or entity-level taxonomies (Miller, 1995; Jiang et al., 2022), but the resulting taxonomies may suffer from low coverage, high redundancy, and limited information (Zhang et al., 2018). Since a topic can cover the semantics of a set of coherent words, some works propose to use topics as the basic taxonomic units. Taking the topic taxonomy of the arXiv website as an example, “computer science” is an academic discipline highlighted by general keywords such as “information”, “computation”, and “automation”. It involves various sub-fields such as “computation and language” and “computer vision”, which have specific keywords such as “language” and “image”, respectively. With this topic taxonomy, users can readily retrieve papers of interest and explore related research fields.
Early methods for topic taxonomy discovery (Blei et al., 2003a; Kim et al., 2012; Mimno et al., 2007) take a probabilistic perspective originating from LDA (Blei et al., 2003b). In these approaches, each topic is a distribution over words. A document is generated by sampling topics at different levels and then iteratively sampling words from the selected topics. As a more flexible and efficient alternative to probabilistic models, Hierarchical Neural Topic Models (HNTMs), which adopt deep generative models and Neural Variational Inference (NVI), have been developed in recent years (Isonuma et al., 2020). With the remarkable progress of text representation learning (Pennington et al., 2014; Devlin et al., 2019; Vilnis, 2021), mining topic taxonomies in a high-quality embedding space has become a promising idea. In particular, the latest HNTMs (Chen et al., 2021b; Duan et al., 2021a) extend the Embedded Topic Model (ETM) (Dieng et al., 2020) to topic taxonomy discovery. Under the assumption that topics and their keywords are close in the embedding space, these models use dot products between topic and word embeddings to infer topic-word distributions.
In parallel, some other methods conduct recursive clustering on word embeddings to construct topic taxonomy directly (Zhang et al., 2018; Grootendorst, 2022). Such clustering-based methods often train the word embedding space on local contexts, which helps them capture accurate word semantics. Unfortunately, they have difficulty in exploiting global statistics of word occurrences, such as Bag-of-Words and TF-IDF representations. As a result, topics mined by these methods are highly coherent but may not be representative of the entire corpus. Due to this flaw of clustering-based methods, HNTMs persist as the prevailing paradigm for topic taxonomy discovery.
Despite the impressive performance of existing HNTMs, they suffer from the following problems. (1) Suboptimal representation: Most of these methods, which use classic point embeddings (Pennington et al., 2014), are limited in modeling the semantic scopes of words and topics at different abstraction levels. In contrast, geometric embeddings such as hyperbolic and box embeddings are more effective representations for structured data, including knowledge graphs and taxonomies (Bai et al., 2021; Abboud et al., 2020). Although HyperMiner (Xu et al., 2022) attempts to uncover topic taxonomies in a geometric embedding space, it simply replaces the point embeddings of traditional HNTMs with hyperbolic embeddings and lacks in-depth analysis. Hence, HyperMiner still suffers from the following two problems. (2) Topic collapse: Prior models struggle to learn high-quality topics, especially at higher abstraction levels. In particular, their top-level topics often degenerate into clusters of meaningless common words (Wang et al., 2023; Wu et al., 2023). (3) Inaccurate hierarchical relations: Many existing HNTMs rely on a symmetric distance metric (i.e., the dot product) to infer the asymmetric hierarchical relations among topics. Such an approximation results in an inaccurate hierarchical topic structure.
Considering the above challenges, we propose to learn topic taxonomies in the box embedding space (Vilnis et al., 2018) and develop a Box embedding-based Topic Model (BoxTM)1 under the framework of NVI. Figure 1 contrasts the topic taxonomy discovery processes in the point embedding space and the box embedding space, which are adopted by most existing HNTMs and by our BoxTM, respectively; the process in the hyperbolic embedding space is similar to that in the point embedding space. Specifically, BoxTM represents a topic or word as a hyperrectangle instead of a point, whose volume is proportional to the size of its semantic scope. In other words, the box embedding of a general topic covers a relatively larger region than that of a specific topic. Additionally, we conduct recursive clustering on the box embeddings of lower-level topics to extract upper-level topics. This approach leverages the connections among descendant topics to precisely capture the semantics of upper-level topics, which addresses the topic collapse problem caused by unguided upper-level topic mining. Moreover, we employ symmetric and asymmetric distance metrics defined in the box embedding space to capture the similarity and hierarchy relations among topics, respectively. In summary, the main contributions of this paper are as follows:
We propose representing topics and words as box embeddings to capture their semantic scopes and accurately infer the hierarchical relations among these topics.
We propose to conduct recursive clustering on leaf topics to mine upper-level topics, which is an interpretable and effective way to capture the semantics of upper-level topics.
We conduct intrinsic evaluation, extrinsic evaluation, human evaluation, and qualitative analysis to validate the effectiveness of our model compared to state-of-the-art baselines.
2 Related Work
2.1 Document Generation-based Methods
The classic topic model, i.e., LDA (Blei et al., 2003b), uses a document generative process under the framework of probabilistic graphical models to extract flat topics. As extensions of LDA to topic taxonomy discovery, a series of hierarchical topic models has been proposed, such as nCRP (Blei et al., 2003a) and rCRP (Kim et al., 2012). Despite their popularity, they suffer from the high complexity of posterior inference. Recently, HNTMs (Isonuma et al., 2020; Chen et al., 2021a), based on NVI and deep generative models, have been developed to tackle this problem.
Inspired by the Embedded Topic Model (ETM) (Dieng et al., 2020), nTSNTM (Chen et al., 2021b) and SawETM (Duan et al., 2021a) project topics and words into the same Euclidean embedding space and construct the topic taxonomy via symmetric distances between topic and word points. Owing to the advantage of hyperbolic space in modeling tree-structured data (Nickel and Kiela, 2017), HyperMiner (Xu et al., 2022) adopts a hyperbolic embedding space to discover topic taxonomies. However, HyperMiner still uses a symmetric distance metric (i.e., the dot product) to infer the complex relations among topics and randomly initializes topic embeddings, following prior HNTMs. Such approximation of asymmetric relations and the “cold start” of embedding learning create a risk of top-level topics collapsing into meaningless common words. To alleviate the latter problem, C-HNTM (Wang et al., 2023) attempts to learn topics at different levels from different semantic patterns. Specifically, C-HNTM learns level-2 topics by clustering word embeddings and adopts ETM to mine leaf topics. Unfortunately, C-HNTM lacks the flexibility to learn topic taxonomies of different depths.
2.2 Clustering-based Methods
Since pre-trained embedding models (Devlin et al., 2019; Pennington et al., 2014) have boosted the performance of many text mining tasks in recent years, a branch of research attempts to mine flat (Sia et al., 2020; Meng et al., 2022) or hierarchical topics (Zhang et al., 2018; Grootendorst, 2022) directly from high-quality embedding spaces. As a representative clustering-based method, TaxoGen (Zhang et al., 2018) conducts hierarchical clustering to group similar words into clusters (topics) and split coarse clusters (topics) into specific ones. Additionally, it ranks the importance of each word to its topic using manually designed metrics, such as the symmetric distance between a word and its cluster centroid. Importantly, most clustering-based methods train word embedding spaces on local contexts, which enables them to capture accurate word semantics but hinders them from obtaining high-quality topics, because the boundaries between clusters are blurred in such fine-grained embedding spaces. Moreover, since topics are semantic summaries of corpora, global semantic information is more critical for topic mining than local contexts. However, clustering-based methods have trouble utilizing global statistics of word occurrences effectively. For example, both BERTopic (Grootendorst, 2022) and TaxoGen (Zhang et al., 2018) simply apply TF-IDF information as weights for topic keyword ranking.
2.3 Supervised Methods
Apart from self-supervised topic taxonomy discovery, another line of research adopts a word-level knowledge graph (Lee et al., 2022; Meng et al., 2020) or a manually built topic hierarchy (Duan et al., 2021b) as the “framework” of the topic taxonomy. As a representative supervised HNTM, TopicNet (Duan et al., 2021b) adopts prior knowledge from WordNet (Miller, 1995). Specifically, TopicNet guides each topic with a seed word and each hierarchical relation with the hypernym-hyponym relation between seed words. Similarly, a clustering-based method called TaxoCom (Lee et al., 2022) uses manually defined seed words as the centers of topic clusters. Unfortunately, there may be a semantic gap between a general knowledge graph and the target corpus, and it is difficult and costly to determine a complete topic hierarchy manually. Therefore, self-supervised topic taxonomy discovery is more flexible and versatile, since it does not rely on prior knowledge.
3 Background Knowledge
As a representative geometric embedding technique, the box embedding method represents a word or topic as a box (i.e., an axis-aligned hyperrectangle) rather than as a point, as in traditional Euclidean embedding methods. With these extra degrees of freedom, box embeddings can capture the semantic scopes of objects and the asymmetric relations between them (Vilnis et al., 2018; Li et al., 2019; Dasgupta et al., 2020).
Definition (box embedding). A D-dimensional box is determined by its minimum and maximum coordinates along each axis, parameterized by a pair of vectors (x_m, x_M), where x_m, x_M ∈ [0, 1]^D and x_{m,i} ≤ x_{M,i} for all i ∈ {1, …, D}.
Definition (box operations). Let Box(A) = (x_m^A, x_M^A) and Box(B) = (x_m^B, x_M^B) denote the box embeddings of objects A and B, respectively. The basic box operations are defined as follows:
(volume). The volume of Box(A) is defined as Vol(Box(A)) := ∏_{i=1}^{D} (x_{M,i}^A − x_{m,i}^A).
(intersection). If there is an overlap between Box(A) and Box(B), their intersection box is defined to be Box(A) ∧ Box(B) := (max(x_m^A, x_m^B), min(x_M^A, x_M^B)), where max and min are taken element-wise; otherwise, it is defined to be Box(A) ∧ Box(B) := ⊥, the empty box with zero volume.
(union). The union box of Box(A) and Box(B) is defined as Box(A) ∨ Box(B) := (min(x_m^A, x_m^B), max(x_M^A, x_M^B)), i.e., the smallest box enclosing both.
Note that box embeddings are closed under the intersection and union operations. For simplicity, the base box operations are described above, while in practice we adopt the Gumbel version that is more stable for training (Dasgupta et al., 2020).
The symmetric metric between two objects is defined through the volume of their intersection box, Rs(A, B) := Vol(Box(A) ∧ Box(B)), while the asymmetric metric is its conditional counterpart, Ra(A|B) := Vol(Box(A) ∧ Box(B)) / Vol(Box(B)). Accordingly, we have Rs(A, B) = Rs(B, A). To mitigate the bias towards large boxes, we can regularize the Rs(A, B) metric through division by Vol(Box(A) ∨ Box(B)) in practice.
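For concreteness, the following Python sketch implements the box operations and the two metrics above with hard (non-Gumbel) boxes; the function names, the IoU-style normalization of the symmetric metric, and the toy boxes are illustrative choices rather than BoxTM's exact trained formulation.

```python
import numpy as np

# A box is a pair (lo, hi) of equal-length numpy arrays with lo <= hi elementwise.
def volume(box):
    lo, hi = box
    return float(np.prod(hi - lo))

def intersection(box_a, box_b):
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    return None if np.any(hi <= lo) else (lo, hi)   # None stands for the empty box ⊥

def union_box(box_a, box_b):
    # smallest box enclosing both inputs
    return np.minimum(box_a[0], box_b[0]), np.maximum(box_a[1], box_b[1])

def sym_metric(box_a, box_b):
    # Rs: overlap volume, here normalized by the enclosing box (IoU-style)
    inter = intersection(box_a, box_b)
    return 0.0 if inter is None else volume(inter) / volume(union_box(box_a, box_b))

def asym_metric(box_a, box_b):
    # Ra(A|B): conditional overlap Vol(A ∧ B) / Vol(B)
    inter = intersection(box_a, box_b)
    return 0.0 if inter is None else volume(inter) / volume(box_b)

# toy usage: a "general" box that covers a "specific" one
general = (np.array([0.1, 0.1]), np.array([0.9, 0.9]))
specific = (np.array([0.2, 0.2]), np.array([0.4, 0.4]))
print(sym_metric(general, specific))                                    # symmetric overlap score
print(asym_metric(specific, general), asym_metric(general, specific))   # 0.0625 vs. 1.0
```

The asymmetry is what makes the metric suitable for hierarchy modeling: Ra(general | specific) equals 1.0 because the general box fully contains the specific one, whereas Ra(specific | general) is small.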
Discussion of Box Embeddings for Taxonomy Learning
Most of the previous works (Vilnis et al., 2018; Lees et al., 2020; Dasgupta et al., 2020) learn box embeddings of pre-defined entities or words for taxonomy completion in a supervised manner. For instance, Vilnis et al. (2018) first proposed to train box embeddings for words on the incomplete ontology, in order to infer missing hypernym relations. Unlike these supervised methods, this paper aims at self-supervised topic taxonomy construction from unstructured text via box embeddings. This research problem poses new challenges for box embedding learning. Accordingly, we propose a recursive clustering algorithm for self-supervised box embedding learning, which is integrated with a VAE framework to provide an efficient solution for topic taxonomy construction based on box embeddings.
4 Proposed Method
In this section, we introduce the proposed BoxTM in detail. Firstly, we propose the box embedding-based document generative process in Section 4.1, which is the main framework of BoxTM. In general, BoxTM infers topic distributions via the symmetric affinities and semantic scopes of topics and words in the box embedding space. Additionally, the hierarchical relations are modeled by the values of the asymmetric metric between topic boxes. Subsequently, we introduce more detailed designs of BoxTM, including a novel workflow of recursive topic clustering for upper-level topic mining (Section 4.2) and two self-training tasks for modeling the semantic scopes of words and topics better (Section 4.3). Finally, we introduce the learning strategy of BoxTM in Section 4.4.
4.1 Document Generative Process
BoxTM assumes that a document can be generated by topics at any level of the topic taxonomy and adopts a bottom-up hierarchical topic discovery method following Chen et al. (2021b). For NVI, BoxTM adopts a classic Variational AutoEncoder (VAE) with a logistic normal distribution (Atchison and Shen, 1980) as the prior of the topic proportion. The VAE consists of an encoder that learns hierarchical topic proportions from document representations and a decoder that reconstructs documents based on the hierarchical topic proportions and topic distributions. Figure 2 shows the main framework of BoxTM.
In summary, we describe the document generative process of BoxTM as follows:
⊳ For global topics, k ∈ {1, …, K−1}:
  – Obtain the level-(k+1) topics by clustering on the level-k topic boxes (Section 4.2), and infer each topic's distribution over words from the affinities between topic and word boxes.
⊳ For each document:
  – Draw the leaf topic proportion π^1 from the logistic normal prior.
  – Infer the upper-level topic proportion π^{k+1} by Eq. (7), for level k ∈ {1, …, K−1}.
  – For each word wj in the document:
    · Draw topic level k ∼ Uniform(K).
    · Draw topic assignment zj ∼ Cat(π^k).
    · Draw word wj from the word distribution of topic zj at level k.
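To make the sampling procedure above concrete, the sketch below simulates it with plain NumPy. Here pi (the per-level topic proportions) and beta (the per-level topic-word distributions) are hypothetical stand-ins for the quantities that BoxTM actually derives from the encoder and the box embeddings.

```python
import numpy as np

def generate_document(pi, beta, doc_len, seed=0):
    """Simulate the generative process: pi[k] is the topic proportion at level k
    (a probability vector), beta[k] is a (num_topics_k, vocab_size) row-stochastic
    topic-word matrix, and doc_len is the number of words to draw."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    words = []
    for _ in range(doc_len):
        k = rng.integers(K)                              # topic level ~ Uniform(K)
        z = rng.choice(len(pi[k]), p=pi[k])              # topic assignment ~ Cat(pi^k)
        w = rng.choice(beta[k].shape[1], p=beta[k][z])   # word ~ Cat(beta^k_z)
        words.append(int(w))
    return words

# toy usage: a 2-level taxonomy over a 5-word vocabulary
pi = [np.array([0.7, 0.3]), np.array([1.0])]
beta = [np.array([[0.4, 0.4, 0.1, 0.05, 0.05],
                  [0.05, 0.05, 0.1, 0.4, 0.4]]),
        np.array([[0.2, 0.2, 0.2, 0.2, 0.2]])]
print(generate_document(pi, beta, doc_len=8))
```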
4.2 Recursive Topic Clustering
Unlike most HNTMs that randomly initialize embeddings of topics in different abstraction levels, BoxTM conducts recursive clustering on topic boxes to learn upper-level topics. Notably, such a method can alleviate the problem of topic collapse, since the upper-level topic mining is guided by the correlation between lower-level topics. For the selection of clustering algorithms, we adopt the Affinity Propagation (AP) (Frey and Dueck, 2007) algorithm for its flexibility and interpretability.2
After conducting (K−1) rounds of topic clustering recursively, BoxTM mines topics at K levels in a bottom-up manner.
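The sketch below illustrates one way such recursive clustering could look: Affinity Propagation is run on a pairwise affinity matrix between topic boxes, and each cluster yields a parent box. The affinity (log-volume of box intersections) and the choice of the enclosing box as the parent are simplifying assumptions for illustration; BoxTM's actual upper-level topics are learned jointly with the rest of the model.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def recursive_topic_clustering(box_lo, box_hi, num_upper_levels):
    """box_lo/box_hi: (num_topics, D) arrays of min/max corners of leaf topic boxes.
    Returns a list of (lo, hi) pairs, one per level, from the leaves up to the top."""
    levels = [(box_lo, box_hi)]
    for _ in range(num_upper_levels):
        lo, hi = levels[-1]
        # pairwise intersection boxes between all topics at the current level
        inter_lo = np.maximum(lo[:, None, :], lo[None, :, :])
        inter_hi = np.minimum(hi[:, None, :], hi[None, :, :])
        side = np.clip(inter_hi - inter_lo, 1e-8, None)
        affinity = np.log(side).sum(-1)          # log-volume of intersections as similarity
        ap = AffinityPropagation(damping=0.9, affinity="precomputed", random_state=0).fit(affinity)
        # simple parent box: the smallest box enclosing each cluster's member boxes
        parent_lo = np.stack([lo[ap.labels_ == c].min(0) for c in np.unique(ap.labels_)])
        parent_hi = np.stack([hi[ap.labels_ == c].max(0) for c in np.unique(ap.labels_)])
        levels.append((parent_lo, parent_hi))
    return levels

# toy usage: 20 random leaf boxes in 8 dimensions, clustered into two extra levels
rng = np.random.default_rng(0)
lo = rng.uniform(0.0, 0.5, size=(20, 8))
hi = lo + rng.uniform(0.1, 0.5, size=(20, 8))
print([level[0].shape[0] for level in recursive_topic_clustering(lo, hi, 2)])
```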
4.3 Semantic Scope Modeling
The effectiveness of our box embedding-based document generative process with recursive topic clustering is based on an important premise that box embeddings can accurately model the semantic scopes of words and topics. Here we propose two self-supervised tasks by means of word-level and topic-level constraints for semantic scope modeling.
4.3.1 Word-level Constraint
Importantly, the semantic scope of each word consists of its abstraction level and semantics, which correspond to the volume and position of its box, respectively. Inspired by GloVe (Pennington et al., 2014), we propose to encode the (co-)occurrence patterns of words into word boxes.
Our key insight is that the marginal probability P(wj) of word wj reveals its abstraction level. Besides, as the distributional hypothesis states that similar words wi and wi′ tend to co-occur with the same word wj, the joint probability P(wi, wj) may reflect the correlation between the semantics of wi and wj. In practice, the joint and marginal probabilities can be estimated by P(wi, wj) ∝ Xij and P(wj) ∝ Xj, where Xij is the number of times wi and wj co-occur in the corpus and Xj = Σi Xij. Integrating these patterns, we propose that the values of the asymmetric metric Ra(wi|wj) in the box embedding space should be consistent with the conditional probability Pi|j = P(wi|wj) = Xij/Xj.
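As a concrete reference for these statistics, the sketch below estimates the conditional targets P(wi|wj) = Xij/Xj from a tokenized corpus. Document-level co-occurrence counting and the variable names are illustrative assumptions, since the exact counting scheme (e.g., a context window) is not restated here.

```python
import numpy as np
from itertools import combinations

def conditional_targets(docs, vocab_size):
    """docs: list of documents, each a list of integer word ids.
    Returns P_cond with P_cond[i, j] ≈ P(w_i | w_j) = X_ij / X_j."""
    X = np.zeros((vocab_size, vocab_size))
    for doc in docs:
        for i, j in combinations(sorted(set(doc)), 2):   # count each pair once per document
            X[i, j] += 1.0
            X[j, i] += 1.0
    X_j = X.sum(axis=0)                                  # X_j = sum_i X_ij
    return X / np.clip(X_j[None, :], 1.0, None)

# toy usage: word 0 always co-occurs with word 1, but word 1 also appears elsewhere
docs = [[0, 1], [0, 1, 2], [1, 2], [1, 3]]
P = conditional_targets(docs, vocab_size=4)
print(P[1, 0], P[0, 1])   # P(w1|w0)≈0.67 > P(w0|w1)=0.4, an asymmetry that Ra should mirror
```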
4.3.2 Topic-level Constraint
4.4 Learning Strategy
Then we adopt the Adam optimizer to update the network parameters of the encoder and the box embeddings of topics and words. Based on the updated topic boxes, we perform a correction of the topic taxonomy using Eq. (13). The training workflow of BoxTM is shown in Algorithm 1. Intuitively, topic boxes overlap less and less as training proceeds in order to capture diverse semantics, which limits the effectiveness of our recursive clustering module in the late phase of training. To tackle this problem, we use an early stopping trick that stops recursive clustering after the γ-th iteration. In the following experiments, γ is set to 100.
5 Experiments
5.1 Experimental Settings
5.1.1 Datasets
We conduct comprehensive evaluations on three benchmark datasets with latent topic hierarchies: (1) 20news: a corpus consisting of 20 newsgroups (Song and Roth, 2014); (2) NYT: a set of news articles from the New York Times, categorized into 25 classes; (3) arXiv: a set of paper abstracts covering 53 classes from the arXiv website. The latter two datasets were collected by Meng et al. (2019). Table 1 shows the statistics of all datasets. After preprocessing (removing stopwords and low-frequency words), we split the documents into a training set and a testing set with a ratio of 6:4. In addition, we use 20% of the documents in the training set as a validation set.
| dataset | #document (train) | #document (valid) | #document (test) | #word | #class |
| --- | --- | --- | --- | --- | --- |
| 20news | 9,007 | 2,251 | 7,487 | 1,838 | 20 |
| NYT | 6,279 | 1,569 | 5,233 | 8,171 | 25 |
| arXiv | 110,451 | 27,612 | 92,042 | 11,799 | 53 |
5.1.2 Baselines
We compare our model with state-of-the-art topic taxonomy discovery models based on different frameworks, including the document generation-based methods nTSNTM (Chen et al., 2021b), SawETM (Duan et al., 2021a), HyperMiner (Xu et al., 2022), and C-HNTM (Wang et al., 2023), as well as the clustering-based method TaxoGen (Zhang et al., 2018). Notably, HyperMiner adopts a hyperbolic embedding space, while the others hold the Euclidean embedding space assumption.
5.1.3 Hyperparameter Settings
The maximum depth of the topic taxonomy is set to 3 for the 20news and NYT datasets, following Chen et al. (2021b). To evaluate the flexibility of BoxTM and the baseline models, the maximum depth for the large arXiv dataset is set to 5. Additionally, the maximum number of leaf topics of nTSNTM is set to 200 following the setting in its paper, as nTSNTM can adaptively obtain a reasonable number of topics based on the stick-breaking process. According to the number of active topics obtained by nTSNTM, the number of leaf topics of BoxTM and the other HNTMs is set to 50/50/100 for the three datasets, respectively. For TaxoGen, the maximum number of clusters is set to 5/5/3. The embedding dimension of BoxTM is set to 50 following Vilnis et al. (2018). Since box embeddings have 2 parameters per dimension, the embedding size of the baselines is set to 100 for a fair comparison.
Other hyperparameters of the baselines take the optimal values reported in their papers. For BoxTM, the learning rate is 5e-3, the dimension of the hidden layers is 256, and the max margin m is set to 10. The weight of the HT loss gradually increases to its maximum value (βmax = 0.005) during training, while the constant weight of the CO loss is set to 3.
5.2 Intrinsic Evaluation of Topic Taxonomy
For a reasonable topic taxonomy, each topic should consist of closely coherent words and be distinct from the other topics. Also, the keywords of a parent topic tp and its child topic tc should be coherent while differing in semantic abstraction level. Thus we validate the quality of the topic taxonomy from the following perspectives: (1) Topic Coherence (C): we adopt the classic NPMI metric (Lau et al., 2014) to quantify the coherence of the mined topics. (2) Topic Diversity (D): the widely used TU metric (Nan et al., 2019) assesses the diversity among all topics, based on the proportion of unique keywords across topics. (3) Hierarchical Coherence (HC): we adopt the CLNPMI metric (Chen et al., 2021b) to evaluate the hierarchical coherence between topics tp and tc.
Because highly overlapping topics may inflate coherence scores, the product of NPMI and TU is used as an integrated metric (C*D) for a comprehensive validation (Dieng et al., 2020). For the aforementioned metrics, we average the scores computed over the top-5, top-10, and top-15 topic words. Because the source code of nTSNTM and the algorithm of C-HNTM cannot adapt to topic taxonomies with more than 3 levels, their results on the arXiv dataset are not reported.
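To make the diversity metric concrete, the following self-contained re-implementation computes a TU score in the spirit of Nan et al. (2019): each topic's score averages the reciprocal of how many topics each of its top-k keywords appears in, so TU = 1 when no keyword is shared. This is an illustrative re-implementation, not the exact evaluation script used in the experiments.

```python
from collections import Counter

def topic_uniqueness(topics, top_k=15):
    """topics: list of keyword lists, each ordered by topic-word probability."""
    counts = Counter(w for t in topics for w in t[:top_k])
    per_topic = [sum(1.0 / counts[w] for w in t[:top_k]) / top_k for t in topics]
    return sum(per_topic) / len(per_topic)

# toy usage: two topics sharing one of their three keywords
print(topic_uniqueness([["nhl", "hockey", "game"], ["nba", "basketball", "game"]], top_k=3))
```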
As shown in Table 2, BoxTM achieves new state-of-the-art results on most metrics across the three datasets, while HyperMiner, which uses hyperbolic embeddings, outperforms SawETM. These results validate the advantage of geometric (i.e., hyperbolic and box) embeddings over traditional point embeddings for topic taxonomy discovery. Compared to C-HNTM, which performs poorly on the HC metric, the proposed recursive topic clustering module of BoxTM can effectively learn topics at different levels. While both SawETM and HyperMiner fail to learn a deep topic taxonomy on the arXiv dataset with massive documents, BoxTM maintains outstanding performance on topic quality and hierarchical coherence. This validates that BoxTM not only scales to large corpora but also has the flexibility to learn topic taxonomies of different structures. In terms of the clustering-based method, TaxoGen obtains high topic diversity (D) scores because each word belongs to only one topic at each level in its approach. However, this neglects the polysemy of some words, i.e., that a word can be a keyword of different topics, which leads to its decline in topic coherence. For example, the word “driver” could be a keyword of both the “hardware” and “motorcycles” topics.
| model | 20news | | | | NYT | | | | arXiv | | | |
| | C | D | C*D | HC | C | D | C*D | HC | C | D | C*D | HC |
| nTSNTM | 0.212 | 0.728 | 0.154 | 0.134 | 0.221 | 0.420 | 0.093 | 0.079 | – | – | – | – |
| SawETM | 0.221 | 0.404 | 0.089 | 0.098 | 0.228 | 0.476 | 0.109 | 0.084 | 0.134 | 0.256 | 0.034 | 0.047 |
| HyperMiner | 0.224 | 0.459 | 0.103 | 0.102 | 0.231 | 0.500 | 0.115 | 0.101 | 0.142 | 0.382 | 0.054 | 0.050 |
| C-HNTM | 0.196 | 0.633 | 0.124 | 0.090 | 0.152 | 0.458 | 0.070 | 0.036 | – | – | – | – |
| TaxoGen | 0.202 | 0.789 | 0.159 | 0.123 | 0.239 | 0.881 | 0.210 | 0.111 | 0.214 | 0.681 | 0.146 | 0.084 |
| BoxTM | 0.301 | 0.661 | 0.199 | 0.159 | 0.409 | 0.648 | 0.265 | 0.177 | 0.257 | 0.672 | 0.173 | 0.113 |
Furthermore, Figure 3 illustrates the per-level C*D scores of BoxTM and the baselines on the NYT dataset. For all models, both the coherence and diversity of level-2 topics improve to different degrees compared to the leaf topics. However, most baselines fail to learn high-quality topics at the root level; that is, they encounter the topic collapse problem. In contrast, topics mined by BoxTM remain of high quality at all levels, owing to the effectiveness of the proposed recursive topic clustering module.
5.3 Extrinsic Evaluation of Topic Taxonomy
As an important application scenario for topic taxonomy discovery, the tree structure and keywords of the mined topic taxonomy can serve as auxiliary knowledge to improve the performance of hierarchical text clustering (Lee et al., 2022). Specifically, each topic is regarded as a cluster, characterized by its keywords. We utilize the topic structure and the top-15 keywords of all topics learned by our BoxTM and baseline models as the inputs of a hierarchical text clustering model named WeSHClass (Meng et al., 2019). For the evaluation metrics, we adopt two external criteria of clustering (i.e., ARI and Fβ) using golden labels of documents (Steinbach et al., 2005).
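As a minimal reference for the clustering criteria used here, the snippet below computes ARI with scikit-learn on toy labels; the pair-counting Fβ measure from Steinbach et al. (2005) is derived from the same contingency counts but is omitted for brevity.

```python
from sklearn.metrics import adjusted_rand_score

gold = [0, 0, 1, 1, 2, 2]   # toy gold classes of six documents
pred = [1, 1, 0, 0, 2, 2]   # toy predicted clusters (cluster ids are arbitrary)
print(adjusted_rand_score(gold, pred))   # 1.0: identical partitions up to relabeling
```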
Table 3 shows the results of BoxTM and the baseline models on the hierarchical text clustering task. In particular, BoxTM and the other HNTMs significantly outperform C-HNTM and TaxoGen, which conduct clustering on word embeddings to mine topics; this reveals the limitation of the latter methods in learning document-level semantics. Among the HNTMs, BoxTM achieves the best results overall (ARI = 0.254 and Fβ = 0.296 on average), followed by SawETM (ARI = 0.226 and Fβ = 0.267 on average). Although SawETM outperforms BoxTM on the arXiv dataset, it cannot discover coherent topics according to the intrinsic evaluation. These results show that there is a tradeoff between learning high-quality topics and learning document-level semantics for topic modeling methods, and our BoxTM strikes a good balance.
| model | 20news | | NYT | | arXiv | |
| | ARI | Fβ | ARI | Fβ | ARI | Fβ |
| nTSNTM | 0.081 | 0.133 | 0.389 | 0.448 | – | – |
| SawETM | 0.074 | 0.123 | 0.452 | 0.494 | 0.151 | 0.184 |
| HyperMiner | 0.075 | 0.127 | 0.421 | 0.466 | 0.115 | 0.151 |
| C-HNTM | 0.056 | 0.104 | 0.143 | 0.216 | – | – |
| TaxoGen | 0.066 | 0.132 | 0.310 | 0.367 | 0.097 | 0.133 |
| BoxTM | 0.117 | 0.168 | 0.541 | 0.577 | 0.103 | 0.143 |
5.4 Human Evaluation
To complement the automatic metrics above, we also use a manual topic intrusion task (Chang et al., 2009) to further validate how well topics at different levels describe documents. As shown in Figure 4 (left), human raters are shown a document from the NYT testing set along with four topics represented by their top-10 keywords. Three of them are the top-3 topics at the same level assigned to the given document by the topic model, while the remaining intruder topic is sampled randomly from the other low-probability topics. We recruit ten graduate students majoring in computer science as raters and instruct them to choose the topic that is not relevant to the document. For evaluation, we compare our BoxTM with two strong baselines, i.e., SawETM and HyperMiner, excluding TaxoGen, which cannot infer the topic distributions of documents. According to the value of Light's kappa (Light, 2011) (κ = 0.607), the annotations of the ten raters show a fairly high degree of agreement.
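For reference, Light's kappa is simply the average pairwise Cohen's kappa over all rater pairs; the sketch below computes it on toy annotations (the rater data shown are illustrative, not the study's actual ratings).

```python
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def lights_kappa(ratings):
    """ratings: one label sequence per rater, all over the same items."""
    pairs = list(combinations(range(len(ratings)), 2))
    return sum(cohen_kappa_score(ratings[i], ratings[j]) for i, j in pairs) / len(pairs)

# toy usage: three raters picking the intruder topic (ids 0-3) for five documents
print(lights_kappa([[0, 1, 2, 3, 0],
                    [0, 1, 2, 3, 0],
                    [0, 1, 2, 1, 0]]))
```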
Figure 4 (right) shows the precision scores of the different models on this task. The performance of all three models on this manual assessment is generally consistent with their performance on the extrinsic evaluation. Notably, our BoxTM achieves the best overall result, which indicates that it generates topics at different levels that describe documents in alignment with human judgment.
5.5 Ablation Analysis
In this section, we conduct an ablation study to analyze the roles of several key components of BoxTM; the results are shown in Table 4. Most importantly, the ablation models that replace box embeddings with traditional point embeddings (i.e., the point models) experience a drastic performance drop in both topic quality and the extrinsic evaluation compared to BoxTM. Among the clustering algorithms, the point model using AP clustering (w/ AP) performs better than those with kmeans++ (w/ kmeans) or agglomerative clustering (w/ hier).
| embedding | model | C*D | HC | ARI | Fβ |
| --- | --- | --- | --- | --- | --- |
| box | BoxTM | 0.265 | 0.177 | 0.541 | 0.577 |
| box | wo/ CO | 0.266 | 0.191 | 0.449 | 0.489 |
| box | wo/ HT | 0.276 | 0.157 | 0.299 | 0.355 |
| box | wo/ clus | 0.256 | 0.139 | 0.337 | 0.394 |
| point | w/ kmeans | 0.201 | 0.174 | 0.397 | 0.441 |
| point | w/ AP | 0.241 | 0.158 | 0.444 | 0.488 |
| point | w/ hier | 0.208 | 0.162 | 0.417 | 0.458 |
| point | wo/ clus | 0.193 | 0.153 | 0.376 | 0.423 |
In terms of the proposed box embedding regularizations, BoxTM wo/ HT fails to capture the proper semantic scopes of topics at different levels, leading to worse performance on the HC metric as well as the downstream task. Though BoxTM wo/ CO remains competitive on intrinsic evaluation, its performance on the hierarchical text clustering task drops compared to BoxTM.
5.6 Case Study of Topic Taxonomy
In this section, we evaluate the mined topic taxonomy qualitatively via a case study. Figure 5(a) illustrates some sample topics from the 5-level topic taxonomy learned by BoxTM on the arXiv dataset. A level-4 topic about “network” branches into child topics related to “computer communication networks” (left), “optimization algorithms” (middle), and “applications” (right). Furthermore, in the field of “applications”, there are sub-fields that focus on different research problems, including “computation and language” and “computer vision and pattern recognition”. Moreover, Figure 5(b) shows some topics related to “sports” and “administration” mined by BoxTM on NYT.
5.7 Analysis of Taxonomy Depth
In the aforementioned experiments, we set the maximum depth to the same value for all models, following Chen et al. (2021b). As a complement, Figure 6 illustrates the performance of our BoxTM compared to the two best-performing baselines (i.e., TaxoGen and HyperMiner) under different settings of taxonomy depth. In most cases, BoxTM outperforms the baselines at the same taxonomy depth. Nevertheless, how to determine an appropriate taxonomy depth in real-life applications is a valuable but challenging problem.
Considering that the automatic metrics (e.g., C and HC) may be sensitive to the taxonomy depth, we also conduct a qualitative analysis of the influence of taxonomy depth on our BoxTM. As shown in Figure 7, the leaf topic about “Galerkin methods” is assigned to the parent topic related to “numerical analysis” for K = 3. When K = 4, BoxTM further extracts a level-4 topic related to “general algorithms”. Interestingly, when the taxonomy deepens further (K = 5), BoxTM identifies that “Galerkin methods” are commonly applied in the field of “physics” as a classic PDE solver. Overall, our BoxTM can discover topics of different granularity and their hierarchical relations under varying settings of taxonomy depth. Therefore, users can set the taxonomy depth according to their practical requirements.
Moreover, unlike most HTMs that require a fixed taxonomy depth, the recursive topic clustering module in BoxTM provides a promising solution for determining the taxonomy depth adaptively. Specifically, BoxTM can halt topic clustering when the number of topics at the top level is smaller than a threshold, which is easier to determine compared to the taxonomy depth. Figure 7 (adaptive) illustrates the topic pathway mined by BoxTM when the threshold is set to 10.
5.8 Qualitative Analysis of Box Embeddings
In this section, we examine whether box embeddings reflect the asymmetric relation between parent and child topics. For example, topic 2-5 (i.e., the 5th topic at level 2) learned by BoxTM on NYT is related to “religion” and topic 1-13 is one of its children, while topic 1-27 is about “hardware”, characterized by keywords such as “drive” and “controller”. As shown in Figure 8(a), the boxes of upper-level topics entail those of their children. Besides, Figure 8(b) illustrates that the box embedding of the child topic 1-13 has a larger overlap with its parent topic 2-5 than with a randomly sampled topic 2-11, with p = 0.007 < 0.05 according to a paired-sample t-test.
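The sketch below reproduces this kind of check on synthetic boxes rather than the learned embeddings: a child box nested inside its parent should share a larger fraction of its volume with the parent than with an unrelated random box, and a paired t-test quantifies the difference. All box coordinates here are simulated for illustration only.

```python
import numpy as np
from scipy.stats import ttest_rel

def overlap_with(child_lo, child_hi, other_lo, other_hi):
    """Fraction of the child box's volume covered by the other box."""
    side = np.clip(np.minimum(child_hi, other_hi) - np.maximum(child_lo, other_lo), 0, None)
    return np.prod(side) / np.prod(child_hi - child_lo)

rng = np.random.default_rng(0)
with_parent, with_random = [], []
for _ in range(50):
    parent_lo, parent_hi = rng.uniform(0.0, 0.3, 10), rng.uniform(0.7, 1.0, 10)
    child_lo, child_hi = parent_lo + 0.1, parent_hi - 0.1        # child nested in its parent
    rand_lo = rng.uniform(0.0, 0.5, 10)
    rand_hi = rand_lo + rng.uniform(0.1, 0.4, 10)                # unrelated random box
    with_parent.append(overlap_with(child_lo, child_hi, parent_lo, parent_hi))
    with_random.append(overlap_with(child_lo, child_hi, rand_lo, rand_hi))

print(np.mean(with_parent), np.mean(with_random))
print(ttest_rel(with_parent, with_random))    # paired t-test, as in this section
```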
6 Conclusion
This paper proposes a novel model called BoxTM for self-supervised topic taxonomy discovery in the box embedding space. Specifically, BoxTM embeds both topics and words into the same box embedding space, where the symmetric and asymmetric metrics are defined to infer the complex relations among topics and words properly. Additionally, instead of initializing topic embeddings randomly, BoxTM uncovers upper-level topics via recursive clustering on topic boxes.
While our BoxTM achieves state-of-the-art performance across multiple evaluations, it also has a limitation in efficiency. The point model, a variant of BoxTM that replaces box embeddings with point embeddings, trains in 0.22 GPU hours (GTX 1080 Ti) on the 20news dataset. Due to the extra computation of box operations compared to dot products, BoxTM takes about 1.0 GPU hour, which points to room for research on efficient computation of box embeddings.
Acknowledgments
We express our profound gratitude to the action editor and reviewers for their valuable comments and suggestions. This research has been supported by the National Natural Science Foundation of China (62372483), the Faculty Research Grants (DB24A4 and DB24C5) of Lingnan University, Hong Kong, the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS16/E01/19), and the Hong Kong Research Grants Council under the General Research Fund (project no. PolyU 15200021).
Notes
The source code of our model is publicly available at: https://github.com/luyy9apples/BoxTM.
References
Author notes
Action Editor: Ivan Titov