Abstract
Topic modeling has been a widely used tool for unsupervised text analysis. However, comprehensive evaluations of a topic model remain challenging. Existing evaluation methods are either less comparable across different models (e.g., perplexity) or focus on only one specific aspect of a model (e.g., topic quality or document representation quality) at a time, which is insufficient to reflect the overall model performance. In this paper, we propose WALM (Word Agreement with Language Model), a new evaluation method for topic modeling that considers the semantic quality of document representations and topics in a joint manner, leveraging the power of Large Language Models (LLMs). With extensive experiments involving different types of topic models, WALM is shown to align with human judgment and can serve as a complementary evaluation method to the existing ones, bringing a new perspective to topic modeling. Our software package is available at https://github.com/Xiaohao-Yang/Topic_Model_Evaluation.
1 Introduction
Topic modeling (Blei et al., 2003), a popular unsupervised text analysis technique, has been applied to various domains, including information retrieval (Yi and Allan, 2009), marketing analysis (Reisenbichler and Reutterer, 2019), social media analysis (Laureate et al., 2023), bioinformatics (Liu et al., 2016), and more. A topic model typically learns a set of global topics to interpret a text corpus and the topic proportion of a document as its semantic representation.
Although topic models have been time-tested for two decades, comprehensive evaluation of these unsupervised models remains challenging (Zhao et al., 2021a). Originally, topic models were implemented as probabilistic graphical models such as Latent Dirichlet Allocation (Blei et al., 2003) and many of its Bayesian extensions (e.g., Blei et al., 2010; Paisley et al., 2015; Gan et al., 2015; Zhou et al., 2016; Zhao et al., 2018a, 2018b). For these models, it has been common practice to measure the log-likelihood or perplexity of a model on held-out test documents. While log-likelihood or perplexity provides a straightforward quantitative comparison between models, several issues persist. Since topic models are not primarily designed to predict words in documents but rather to learn semantically meaningful topics and interpretable document representations, these metrics fail to capture those core aspects. Furthermore, estimating the predictive probability is often intractable for Bayesian models, and different papers may employ different sampling or approximation techniques (Wallach et al., 2009; Buntine, 2009). For recently proposed Neural Topic Models (NTMs) (Zhao et al., 2021a), the computation of log-likelihood is even more inconsistent.
In addition to log-likelihood or perplexity, document representation quality and topic quality are evaluated separately. For document representation quality, downstream task performance is typically used as a metric, such as document classification (Yang et al., 2023), clustering (Zhao et al., 2021a), and retrieval (Larochelle and Lauly, 2012). For topic quality, the ultimate evaluation method is human evaluation, which is time-consuming and expensive. Thus, various automatic metrics have been proposed, such as topic coherence (Lau et al., 2014), which measures how semantically coherent the representative words in a topic are, and topic diversity (Dieng et al., 2020), which measures how diverse the discovered topics are. To comprehensively evaluate the performance of a topic model, one needs to report multiple metrics covering both document representation and topic quality. However, these metrics can be contradictory: a topic model with good topic quality may not produce good document representations, and vice versa. This discrepancy complicates model selection for topic models in practice.
In this paper, we aim to develop a new evaluation approach for topic modeling that considers both the semantic quality of document representations and topics in a joint manner, leveraging the power of Large Language Models (LLMs). Our key idea is as follows: After being trained, a topic model can infer a document’s distribution over topics and each topic is a distribution over vocabulary words. With these two distributions, a model can generate a set of “topical” words given a document, such as by looking at its representative topics and the representative words of each topic. The generation of the topical words takes both the topic distribution of a document and the word distributions of the topics into account, which captures the semantic summary of the document and is expected to align with the keywords identified by humans. Given the high cost of human evaluation, we propose using LLMs as a proxy by employing appropriate prompts to generate keywords for the document, which are then compared with the topical words produced by a topic model. Finally, to quantify the agreement between the words from the topic model and the LLM, a series of WALM (Word Agreement with Language Model) metrics are proposed. WALM has the following appealing properties:
It is a joint metric that evaluates the quality of both document representations and topics.
It assesses how effectively a topic model captures the semantics of a document, which is a core objective of topic modeling.
It allows for comparisons across various types of topic models.
To examine the WALM series of metrics, we conduct extensive experiments using various popular topic models on different datasets, comparing WALM with other widely used topic model evaluation metrics. Human evaluation is also conducted to demonstrate the alignment of WALM with human judgment.
2 Related Work
Because topic modeling is an unsupervised technique for uncovering hidden themes in text, evaluating topic models remains challenging. Early evaluations of a topic model relied on the log-likelihood or perplexity of held-out documents (Blei et al., 2003), which measures how well the model predicts the words of documents. As the computation of predictive probability is often intractable for conventional Bayesian topic models, various sampling or approximation techniques have been proposed (Wallach et al., 2009; Buntine, 2009). Apart from this inconsistent estimation, held-out likelihood is regarded as uncorrelated with the interpretability of topics from a human perspective (Chang et al., 2009), prompting the direct evaluation of topic and document representation quality.
As for the evaluation of topics, Chang et al. (2009) design word and topic intrusion tasks for human annotators, where high-quality topics or document representations are those for which annotators can easily identify the intruders. Newman et al. (2010) and Mimno et al. (2011) evaluate topic coherence through direct ratings by human experts. Although human judgment is commonly regarded as the gold standard, it is expensive and impractical for large-scale evaluation. Automated evaluation of topic coherence is more practical, such as Normalized Pointwise Mutual Information (NPMI) (Lau et al., 2014), which relies on the co-occurrence of a topic's top words in a reference corpus, under the assumption that a large reference corpus such as Wikipedia captures prevalent language patterns. Although such metrics automate the evaluation of topics and correlate strongly with human judgment (Newman et al., 2010), counting word co-occurrences in a large reference corpus is still relatively expensive. Moreover, coherence scores can vary depending on the reference corpus, and there is no single "right" reference corpus that suits all datasets (Doogan and Buntine, 2021). Recent works propose leveraging word embeddings (Nikolenko, 2016) or contextualized embeddings (Hoover et al., 2021) to evaluate topic coherence efficiently, incorporating semantics from pre-trained embeddings. Due to the common posterior collapse issue (Lucas et al., 2019) in the growing field of neural topic models (Zhao et al., 2021a), recent works also consider topic diversity (Dieng et al., 2020) during evaluation, which measures how distinct the top words of each topic are.
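For concreteness, the NPMI score underlying these coherence metrics is typically computed for a pair of top words as follows (exact window sizes and smoothing constants vary across implementations):

```latex
\mathrm{NPMI}(w_i, w_j) =
\frac{\log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}}
     {-\log\big(P(w_i, w_j) + \epsilon\big)},
```

where the probabilities are estimated from (windowed) co-occurrence counts in the reference corpus, ε is a small smoothing constant, and a topic's coherence is the average NPMI over its top-word pairs.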
As for the evaluation of document representations, early works focus on how well the topic proportions of a document represent its content, assessed through a topic intrusion task by human annotators (Chang et al., 2009), which was later extended into automated metrics (Bhatia et al., 2017, 2018). Recent topic models often use the topic proportions as document representations, whose quality is commonly investigated through downstream tasks, including their use as features for document classification (Nguyen and Luu, 2021), clustering (Zhao et al., 2021b), and retrieval (Larochelle and Lauly, 2012). Recently, the generalization ability of topic models has been investigated by evaluating the quality of their document representations across different unseen corpora (Yang et al., 2023).
In the era of LLMs (Brown et al., 2020; Thoppilan et al., 2022; Touvron et al., 2023a, b; Chowdhery et al., 2024), recent research has begun leveraging LLMs to evaluate topic models, such as using ChatGPT as a proxy for human annotators in word intrusion and topic rating tasks for evaluating topic coherence (Stammbach et al., 2023; Rahimi et al., 2024). The focus of these works is still on topic quality only.
In this work, we propose new evaluation metrics for topic models that differ from previous work in the following ways: (1) Unlike evaluations that focus only on sub-components of a topic model (i.e., topics or document representations), our metrics offer a joint approach that considers topics and document representations together. (2) Compared with log-likelihood or perplexity, which also evaluate a model based on documents, our metrics consider the semantics of documents and align with the focus of topic modeling. (3) Unlike recent LLM-based evaluations that use LLMs for topic quality only, ours considers both topic quality and document representation quality, and our use of LLMs is quite different from previous works.
3 Background
4 Method
4.1 Motivation
Both topics and document representations are important components of a topic model. To comprehensively evaluate a topic model, it is common practice to report the performance of both parts. This can be done by measuring topic quality using metrics such as NPMI and assessing document representation quality through downstream classification accuracy (ACC) (see Section 5.1 for details of how these metrics are calculated). However, a model that prioritizes topic quality (e.g., NPMI) may not perform well in terms of document representations (e.g., ACC), and vice versa, which creates difficulty during model selection, as illustrated in Figure 1. This inconsistency between the two components is also noted by Bhatia et al. (2017). Therefore, evaluating a topic model based only on sub-components is insufficient to reveal the whole model's performance. Recent topic models often focus on improving topic quality, such as clustering-based models (Sia et al., 2020; Grootendorst, 2022), but do not evaluate their effectiveness in representing documents. In this work, we aim to introduce a novel evaluation method for topic modeling that jointly assesses the semantic quality of both topics and document representations, with the help of large language models.
Performance rankings of topic quality (NPMI) and document representation quality (ACC) during model selection. The best model state/checkpoint can be determined using either NPMI or ACC as the selection criterion. However, it can be observed that the rankings for topic quality and document representation quality are inconsistent under the same selection criteria. Experiments are conducted five times, with the number of topics set to 50.
4.2 Key Idea
4.3 Word Suggestion by LLM
Keyword Suggestion
An example prompt and output of keyword suggestion by the LLM. In this example, the number of keywords (i.e., N) is set to 5.
Topic-Aware Keyword Suggestion
Using these corpus-level topics, we prompt the LLM to generate keywords for each document in a two-stage process that takes the overarching themes of the collection into account. In the first stage, the LLM selects relevant topics for the target document from the corpus-level topics. In the second stage, we prompt the LLM to generate indexing words for the document based on each selected topic. The final set of keywords is obtained by merging the words generated for each selected topic. An example prompt and output for topic-aware keyword suggestion are shown in Figure 3.
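A minimal sketch of this two-stage procedure is given below. The `generate` helper and the prompt wording are our own illustrative placeholders for an instruction-tuned LLM call; they are not the exact prompts used in the paper (an example of those is shown in Figure 3).

```python
def generate(prompt: str) -> str:
    """Placeholder for an instruction-tuned LLM call with greedy decoding."""
    raise NotImplementedError  # wire up an LLM client (e.g., transformers) here


def topic_aware_keywords(document: str, corpus_topics: list, n_words: int = 5) -> set:
    # Stage 1: ask the LLM which corpus-level topics are relevant to this document.
    stage1_prompt = (
        "Here is a list of topics for a document collection: "
        + ", ".join(corpus_topics) + ".\n"
        + "Which of these topics are relevant to the following document? "
        + "Answer with a comma-separated list of topics.\n" + document
    )
    selected = [t.strip() for t in generate(stage1_prompt).split(",") if t.strip()]

    # Stage 2: generate indexing words for each selected topic, then merge the results.
    keywords = set()
    for topic in selected:
        stage2_prompt = (
            f"Suggest {n_words} indexing words for the following document, "
            f"focusing on the topic '{topic}'. "
            "Answer with a comma-separated list of words.\n" + document
        )
        keywords.update(w.strip().lower() for w in generate(stage2_prompt).split(",") if w.strip())
    return keywords
```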
An illustration of the topic-aware keyword suggestion pipeline. The words highlighted in green represent collection-level topics generated by the LLM. Each topic selected in stage 1 is used in the stage 2 prompt to generate topic-aware keywords.
4.4 Choices of the Score Function
For the score function S(·,·) in Eq. 4, we propose different ways to calculate it: (1) Overlap-based, which computes the number of overlapping words between w and k, and (2) Embedding-based, which calculates the overall semantic similarity between the two word sets using pre-trained word embeddings.
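As a rough illustration, the two overlap-based variants could be instantiated as below, where `w` is the list of topical words from the topic model and `k` the LLM keywords; the paper's exact formulas may normalize differently, so this is a sketch rather than the definitive computation.

```python
# Sketch of the overlap-based scores; both return a similarity in [0, 1].
from nltk.corpus import wordnet as wn  # requires a one-off nltk.download("wordnet")


def word_overlap(w: list, k: list) -> float:
    # Fraction of topic-model words that exactly match an LLM keyword.
    return sum(1 for word in w if word in set(k)) / len(w)


def synset_overlap(w: list, k: list) -> float:
    # A word counts as matched if it shares at least one WordNet synset with any
    # keyword, which tolerates synonyms that an exact string match would miss.
    keyword_synsets = {s for word in k for s in wn.synsets(word)}
    return sum(1 for word in w if set(wn.synsets(word)) & keyword_synsets) / len(w)
```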
Word Overlap
Synset Overlap
Word Optimal Assignment
Word Optimal Transport
Comparing our OA and OT formulations for WALM, the two are similar in that both construct the cost matrix C using the cosine distance between pre-trained word embeddings. However, they differ in the following ways: (1) OA treats all words in a set equally, while OT takes the probability mass of each word into account. (2) OA can be viewed as a "hard" assignment problem between two word sets because the entries of A are binary, whereas OT can be regarded as a "soft" assignment because of the spread of probability mass in P.
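The following sketch contrasts the two formulations using off-the-shelf solvers (SciPy's Hungarian algorithm for OA and the POT library for OT); it illustrates the hard-versus-soft distinction above rather than reproducing the paper's exact equations. The `emb` lookup is assumed to map words to pre-trained vectors such as GloVe, and the OT weight vectors are assumed to each sum to one.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def cost_matrix(words_w, words_k, emb):
    W = np.stack([emb[w] for w in words_w])
    K = np.stack([emb[k] for k in words_k])
    return cdist(W, K, metric="cosine")  # pairwise cosine distances


def oa_score(words_w, words_k, emb):
    C = cost_matrix(words_w, words_k, emb)
    rows, cols = linear_sum_assignment(C)  # "hard", binary one-to-one assignment
    return C[rows, cols].mean()


def ot_score(words_w, weights_w, words_k, weights_k, emb):
    C = cost_matrix(words_w, words_k, emb)
    # "Soft" assignment: probability mass can be spread across several keywords.
    return ot.emd2(np.asarray(weights_w, float), np.asarray(weights_k, float), C)
```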
5 Experiments
5.1 Experimental Setup
Datasets
Evaluated Models
We conduct experiments on seven popular topic models, ranging from traditional probabilistic models to recent neural topic models. (1) Latent Dirichlet Allocation (LDA) (Blei et al., 2003), the most popular probabilistic topic model, which assumes a document is generated by a mixture of topics. (2) LDA with Products of Experts (PLDA) (Srivastava and Sutton, 2017), an early NTM that applies a product of experts instead of the mixture of multinomials in LDA. (3) Neural Variational Document Model (NVDM) (Miao et al., 2017), a pioneering NTM that uses a Gaussian as the prior distribution over topic proportions of documents. (4) Embedded Topic Model (ETM) (Dieng et al., 2020), an NTM that incorporates word and topic embeddings in the generative process. (5) Neural Topic Model with Covariates, Supervision, and Sparsity (SCHOLAR) (Card et al., 2018), an NTM that applies a logistic normal prior on topic proportions and leverages extra information from metadata. (6) Neural Sinkhorn Topic Model (NSTM) (Zhao et al., 2021b), a recent NTM based on an optimal transport framework. (7) Contrastive Learning Neural Topic Model (CLNTM) (Nguyen and Luu, 2021), a recent NTM that uses contrastive learning to regularize document representations. We keep the default settings suggested in each model's implementation. All experiments are conducted five times with different random seeds; mean and standard deviation values are reported.
Settings of WALM
For the WALM settings, we use GloVe word embeddings pre-trained on Wikipedia (Pennington et al., 2014) in our embedding-based metrics. For the LLM generation settings, we use LLAMA3-8B-Instruct for our main experiments. We employ greedy decoding during LLM generation to ensure deterministic outputs, setting the maximum number of generated tokens to 300. When prompting the LLM, we limit the number of generated keywords to 5. For the topical words from the topic model, we select the top 10 weighted words from the document-word distribution for each given document.
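As a sketch of the topic-model side (our own illustration, not the released code), the document-word distribution can be formed from the inferred topic proportions and the topic-word distributions, with the top 10 weighted words taken as the document's topical words:

```python
import numpy as np


def topical_words(theta_d, phi, vocab, top_n=10):
    # theta_d: (K,) topic proportions of a document; phi: (K, V) topic-word
    # distributions; vocab: list of V vocabulary words.
    doc_word_dist = theta_d @ phi                       # (V,) document-word distribution
    top_ids = np.argsort(doc_word_dist)[::-1][:top_n]   # indices of the top-weighted words
    return [vocab[i] for i in top_ids], doc_word_dist[top_ids]
```

On the LLM side, greedy decoding with a 300-token cap corresponds, for example, to calling `model.generate(**inputs, do_sample=False, max_new_tokens=300)` in Hugging Face transformers.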
Settings of Existing Metrics
We also evaluate topic models with existing, commonly used metrics for comparison with ours. (1) Topic Coherence and Diversity: We apply NPMI to evaluate topic coherence using Wikipedia as the reference corpus, computed with the Palmetto package (Röder et al., 2015). Following standard protocol, we consider the top 10 words of each topic and report the average NPMI score over the top 50% most coherent topics. As for Topic Diversity (TD), we compute the percentage of unique words in the top 25 words of all topics, as defined in Dieng et al. (2020). (2) Document Representation Quality: We conduct document classification and clustering to evaluate the representation capability of topic models. For classification, we train a Random Forest classifier on the training documents' representations and report the accuracy (ACC) on the test documents, as in previous works such as Nguyen and Luu (2021). For clustering, we run K-Means on the test documents' representations and report the Purity (KM-Purity) and Normalized Mutual Information (KM-NMI), as in previous works such as Zhao et al. (2021b). (3) Perplexity: We use document completion perplexity (Wallach et al., 2009) to evaluate the predictive ability of topic models. We randomly split each test document into two equal-length folds, then compute the Document Completion Perplexity (DC-PPL) on the second fold based on the topic proportions inferred from the first fold, as in previous works such as Dieng et al. (2020).
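For reference, two of these metrics are straightforward to reproduce; the sketch below computes TD as defined in Dieng et al. (2020) and the K-Means purity/NMI scores with scikit-learn (our own utility code, not the evaluation scripts used in the paper).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score


def topic_diversity(topics_top_words, top_n=25):
    # Percentage of unique words among the top_n words of all topics.
    words = [w for topic in topics_top_words for w in topic[:top_n]]
    return len(set(words)) / len(words)


def kmeans_purity_nmi(doc_reprs, labels, n_clusters, seed=0):
    # Cluster test-document representations and compare clusters to class labels
    # (labels are assumed to be non-negative integer class ids).
    labels = np.asarray(labels)
    preds = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(doc_reprs)
    purity = sum(np.bincount(labels[preds == c]).max() for c in np.unique(preds)) / len(labels)
    nmi = normalized_mutual_info_score(labels, preds)
    return purity, nmi
```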
5.2 Results and Analysis
Topic Model Evaluation with WALM
We assess topic models' performance under our evaluation metrics on both 20News and DBpedia. Based on the results illustrated in Figure 4, we make the following observations: (1) The WALM values of most models indicate better performance on DBpedia than on 20News, suggesting that it is easier for topic models to generate informative topical words for short documents than for long ones. (2) The performance rankings indicated by overlap-based metrics (e.g., Soverlap and Ssynset) and embedding-based metrics (e.g., Soa and Sot) differ slightly. The reason is that embedding-based metrics consider the semantic distance between words, which is more flexible than the exact matching used by overlap-based metrics. (3) There is little improvement from recent NTMs over LDA and NVDM in terms of our joint metrics. A potential reason is that most contemporary NTMs focus primarily on enhancing topic coherence while neglecting the generation of documents, and thus perform poorly at generating topical words for documents, as indicated by WALM. (4) When topic-aware keyword suggestion is applied in WALM (Figure 5), LDA's ranking surpasses NVDM's as the number of topics increases on the long-document dataset (i.e., 20News). This suggests that LDA benefits more than NVDM from an increased number of topics when generating topic-aware keywords for documents.
Topic models' performance in terms of WALM with keyword suggestion by the LLM on 20News (top row) and DBpedia (bottom row). Error bars represent the standard deviation (omitted for values smaller than the symbol size).
Topic models' performance in terms of WALM with topic-aware keyword suggestion by the LLM on 20News (top row) and DBpedia (bottom row). Error bars represent the standard deviation (omitted for values smaller than the symbol size).
Learning Curves of WALM
In Figure 6, we illustrate the learning curves of topic models in terms of WALM, clearly showing how each metric changes throughout the training process. We observe that most topic models improve with training and eventually converge to a stable state. However, NVDM exhibits overfitting in the later stages of training, as indicated by its WALM scores. Additionally, WALM approaches based on keyword suggestions and topic-aware keyword suggestions exhibit slightly different trends in their learning curves. For instance, LDA surpasses NVDM in the later training stages when topic-aware keywords are used. This suggests that NVDM prioritizes document-level generation while LDA shows stronger awareness of collection-level topics.
Learning curves of topic models in terms of WALM with keyword suggestions (top row) and topic-aware keyword suggestions (bottom row) from the LLM on the 20News test set, with the number of topics set to 50. The area within the error bands represents the standard deviation.
Qualitative Analysis on Topical Words for Documents
We qualitatively investigate the topical words produced by topic models for documents at different training stages in Table 1, where we randomly sample one document each from 20News and DBpedia. We make the following observations: (1) The topical words at the beginning phase contain fewer semantically related words about the documents than those at convergence, which aligns with the learning status indicated by WALM (as in Figure 6). (2) The topical words of NVDM include more words that reveal the documents' main messages than those of LDA, which aligns with the ranking suggested by WALM (as in Figure 5). (3) The keywords generated by the LLM are similar to those provided by human annotators for the example documents. (4) With topic-aware keyword suggestion in WALM, the LLM tends to provide keywords that convey the high-level concepts of the topics. For instance, "troubleshooting" is identified for the first example document and "entertainment" for the second, which offers higher-level topical information beyond the individual document.
Documents’ topical words from topic models at the beginning phase (e.g., NVDM_B, LDA_B) and convergence phase (e.g., NVDM_C, LDA_C) according to WALM, where the number of topics is set to 50.
| Document | Model | Topical Words |
|---|---|---|
| It’s my understanding that, when you format a magneto-optical disc, (1) the formatting software installs a driver on the disc, (2) if you insert the disc in a different drive, then this driver is loaded into the computer’s memory and then controls the drive, and (3) if this driver is incompatible with the drive, then the disc can not be mounted and/or properly read/written. Is that correct? | LDA_B | drive, disk, card, controller, hard, mb, file, scsi, bios, power |
| | LDA_C | drive, disk, scsi, hard, card, controller, mb, floppy, ide, sale |
| | NVDM_B | driver, drive, problem, card, time, file, thanks, need, email, work |
| | NVDM_C | drive, driver, hard, scsi, window, cd, mb, floppy, disc, work |
| | LLM | formatting, magneto-optical, driver, disc, incompatible |
| | LLM (Topic-Aware) | troubleshooting, formatting, incompatibility, magneto-optical, driver, disc, mounting |
| | Human | driver, disc, computer, hardware, software, memory, formatting, incompatible |
| Wrong World. Wrong World is a 1985 Australian film directed by Ian Pringle. It was filmed in Nhill and Melbourne in Victoria Australia. | LDA_B | film, american, released, directed, football, album, summer, played, team, hospital |
| | LDA_C | film, played, directed, baseball, league, australian, major, drama, football, award |
| | NVDM_B | specie, album, school, known, located, north, film, directed, american, released |
| | NVDM_C | film, album, released, second, south, new, directed, american, australian, known |
| | LLM | world, film, australian, directed, victoria |
| | LLM (Topic-Aware) | film, industry, production, cinema, entertainment |
| | Human | film, movie, directed, director, australian, melbourne, victoria |
Correlation to Other Metrics
We compute Pearson's correlation coefficients among existing metrics and the WALM series, similar to the correlation analyses in previous works such as Doogan and Buntine (2021) and Rahimi et al. (2024). The coefficients are plotted as a heatmap in Figure 7. Based on the results, we observe that: (1) WALM variants are highly correlated with each other, since they originate from the same mechanism. (2) WALM shows weak correlations with perplexity, which also evaluates the entire model based on documents, suggesting that WALM constitutes a new family of evaluation metrics. (3) Compared with other types of evaluation, WALM has moderate correlations with document representation metrics (e.g., KM-Purity, KM-NMI, and ACC) and weak correlations with topic quality metrics (e.g., NPMI and TD). This indicates that our joint evaluation metrics take both components into account without relying solely on either one. These observations suggest that WALM can serve as a complementary evaluation method to existing approaches.
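A minimal sketch of this correlation analysis, assuming the scores for each metric are aligned across the same set of runs or model checkpoints:

```python
import numpy as np
import matplotlib.pyplot as plt


def correlation_heatmap(scores):
    # scores: dict mapping metric name -> 1D array of values from the same runs/checkpoints.
    names = list(scores)
    data = np.stack([np.asarray(scores[n], dtype=float) for n in names])
    corr = np.corrcoef(data)                  # pairwise Pearson correlation coefficients
    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=90)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(im, ax=ax)
    return corr
```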
5.3 Contextualized Embeddings for WALM
Obtaining Contextualized Embeddings
Recall that in Eq. 10 and Eq. 12, the cost matrix C is constructed using cosine distances between word embeddings. Here, we change the construction of C from using static word embeddings from GloVe (Pennington et al., 2014) to using contextualized word embeddings from the LLM, considering that the same word may have different meanings in different contexts. We obtain the contextualized embedding of a word given a document in one of two ways: (1) When the target word appears in the context document, we take the average of the embeddings of its occurrences as the contextualized embedding. (2) When the target word does not occur in the given document, we append an auxiliary sentence to the document in the following format:
“<Given Document>. This document is talking about <Target Word>.”
Then, we obtain the contextualized embedding of the target word from the document with the auxiliary sentence appended. By replacing the global word embeddings with contextualized word embeddings, we obtain contextualized variants of our embedding-based WALM metrics (i.e., of Soa and Sot).
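A sketch of this procedure with Hugging Face transformers is shown below. The model name is a placeholder for whichever LLM is used (the paper uses LLAMA3-8B-Instruct), any model with a fast tokenizer works similarly, and the choice of the last hidden layer is our assumption.

```python
import re
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model id
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)


@torch.no_grad()
def contextual_embedding(word: str, document: str) -> torch.Tensor:
    pattern = rf"\b{re.escape(word)}\b"
    text = document
    if not re.search(pattern, document, flags=re.IGNORECASE):
        # Case 2: the word does not occur, so append the auxiliary sentence.
        text = f"{document}. This document is talking about {word}."
    enc = tok(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    hidden = model(**enc).last_hidden_state[0]        # (seq_len, dim), last hidden layer
    # Average the hidden states of all tokens falling inside an occurrence of the word
    # (case 1), or inside its mention in the auxiliary sentence (case 2).
    spans = [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
    mask = [b > a and any(s <= a and b <= e for s, e in spans) for a, b in offsets]
    return hidden[torch.tensor(mask, dtype=torch.bool)].mean(dim=0)
```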
Observations
Since the cost of obtaining contextualized embeddings from LLMs is high, we compute the contextualized variants of Soa and Sot in a case study, testing NVDM on 100 documents randomly sampled from the test sets of 20News and DBpedia, respectively. We plot the learning curves on the test documents in Figure 8. We observe that using static word embeddings or contextualized embeddings in our embedding-based scores exhibits similar trends, but with different values, on both datasets.
Learning curves of NVDM in terms of embedding-based metrics and their contextualized variants on 20News (top row) and DBpedia (bottom row). The area within the error bands represents the standard deviation.
5.4 Sensitivity Study
Here, we examine two factors that can influence WALM scores: the number of keywords generated by the LLM and the choice of LLM. To investigate the effect of the number of keywords, we vary it from 3 to 10 and plot the performance ranking of topic models in Figure 9 (top row). We observe that, although the values of the WALM metrics vary with the number of keywords, the overall performance ranking of the topic models remains largely unaffected, especially for the overlap-based metrics. To investigate the effect of the LLM, we use several recent LLMs for keyword generation in addition to LLAMA3-8B-Instruct, including Mistral-7B-Instruct-v0.3 (Jiang et al., 2023), Phi-3-Mini-128K-Instruct (Abdin et al., 2024), and Yi-1.5-9B-Chat (Young et al., 2024). From the results illustrated in Figure 9 (bottom row), we observe that the overlap-based metrics show minimal variation across LLMs, and the performance ranking of the topic models is unaffected in most cases. These observations suggest that the overlap-based metrics are less sensitive to the number of keywords and the choice of LLM.
Sensitivity study. Top row: Performance of topic models in terms of WALM with varying numbers of keywords. Bottom row: Performance of topic models in terms of WALM with different LLMs. Experiments are conducted on the 20News dataset with the number of topics set to 50. Error bars represent the standard deviation (omitted for values smaller than the symbol size).
5.5 Comparisons with Human Annotation
Evaluation Gap with Human Annotation
As human annotation is expensive for large-scale investigation, we randomly sample 200 test documents from 20News and DBpedia as a case study. We engaged three English speakers as annotators, trained with a few examples, to provide keywords that capture the main points of each document. Then, given a trained topic model, we compute the gap between using the words from the LLM and from the human annotators in our metrics, using Eq. 13.
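Since Eq. 13 is not reproduced here, the sketch below assumes the gap is the absolute difference between a WALM score computed against the LLM keywords and the same score computed against the human keywords, averaged over the sampled documents; the paper's exact formulation is the one given by Eq. 13.

```python
import numpy as np


def evaluation_gap(score_fn, model_words, llm_keywords, human_keywords):
    # Each argument after score_fn is a list over the sampled documents.
    gaps = [abs(score_fn(w, k_llm) - score_fn(w, k_hum))
            for w, k_llm, k_hum in zip(model_words, llm_keywords, human_keywords)]
    return float(np.mean(gaps)), float(np.std(gaps))
```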
The results are illustrated in Figure 10, where the evaluated model is NVDM with K = 50, trained on 20News and DBpedia, respectively. We make the following observations: (1) Comparing the datasets, the gap between using human judgment and the LLM is lower on 20News than on DBpedia in most cases. This indicates that for long documents, such as those in 20News, the topical words generated by the LLM are closer to human judgment than for the short documents in DBpedia. (2) Comparing the metrics, Soa exhibits the lowest gap among the WALM metrics, with gap values of 0.03 and 0.15 on 20News and DBpedia, respectively. This shows the effectiveness of using the LLM as a proxy for human judgment when applied in Soa. (3) Comparing the embeddings, using contextualized embeddings from the LLM can further narrow the evaluation gap for Soa and Sot on short documents.
Evaluation gap between using the LLM and human judgment as the “true” topical words. Error bars represent the standard deviation.
Correlation with Human Annotation
We use an existing annotated dataset, 500N-KPCrowd (Marujo et al., 2012), built for the keyphrase extraction task (Hasan and Ng, 2014), where each test document is paired with labeled keywords. We run LDA on the training documents and infer topical words for the test documents, then compute Pearson's correlation coefficient between the WALM scores computed with the LLM-generated keywords and those computed with the labeled keyphrases as the ground truth. The results are shown in Table 2. We observe that (1) using keywords from the LLM in WALM scores correlates with using the labeled keyphrases, and (2) the correlation can potentially improve when more keywords are included in the LLM's suggestions.
Pearson’s correlation coefficient between WALM using LLM-generated keywords and human annotations as the ground truth on the 500N-KPCrowd dataset.
| | Soverlap | Ssynset | Soa | Sot |
|---|---|---|---|---|
| 5-word suggestion | 0.55 | 0.50 | 0.57 | 0.63 |
| 10-word suggestion | 0.52 | 0.58 | 0.56 | 0.68 |
6 Conclusion
In this work, we propose WALM for topic model evaluation, which jointly takes both topic and document representation quality into account. WALM measures the agreement between the topical words generated by topic models and those produced by an LLM for given documents. The topical words from the LLM are obtained through keyword prompting or topic-aware keyword prompting, with the latter tending to capture higher-level information. To quantify the agreement between word sets, we propose different calculations, including overlap-based and embedding-based metrics. Our experiments demonstrate that the WALM metrics effectively reflect the capability of topic models to provide semantic summaries of documents. We show that WALM aligns with human judgment and can serve as an informative complementary method for topic model evaluation. Our analysis suggests that overlap-based metrics are more robust to the number of suggested keywords and the choice of LLM, while embedding-based metrics show a smaller gap to human judgment. A potential risk of WALM is that models optimized solely for this metric may inherit the biases of LLMs; to mitigate this risk, we suggest using WALM together with other metrics.
Acknowledgments
We thank the anonymous reviewers and the action editor, Michael Elhadad, for their valuable feedback, which has significantly strengthened this work.
Notes
We omit topics in Eq. 3 as they are considered part of the parameters of the generative process ϕ.
We assume μ(k, k) is a uniform distribution over the keywords k from the LLM. Thus, k is a uniform probability vector.
References