Abstract
Topic analysis studies topic evolution and trends to help researchers understand the process of knowledge evolution and creation. This paper develops a novel topic evolution analysis framework, which we use to demonstrate, forecast, and explain topic evolution from the perspective of the geometrical motion of topic embeddings generated by pretrained language models. Our data set comprises approximately 15 million papers in the computer science field, with roughly 7,000 “fields of study” representing the topics. First, we demonstrate that over 80% of topics have undergone obvious motion in the semantic vector space, based on the hyperplane and its normal vector generated by a support vector machine. Subsequently, we verify the predictability of this motion by predicting topic embeddings with three vector regression models. Finally, we employ a decoder to explain the predicted motion; the forecast embeddings capture about 50% of unseen topics. Our research framework shows that topic evolution can be analyzed via the geometrical motion of topic embeddings and that the semantic motion of old topics nurtures new topics. The current study opens new research pathways in topic analysis and sheds light on the mechanism of topic evolution from a novel geometric perspective.
1. INTRODUCTION
In the science of science, topic analysis has been widely studied to understand and track how knowledge changes by analyzing topic evolution and topic trends (Qian, Liu, & Sheng, 2020). Such analysis not only helps researchers to identify and track emerging topics, hot topics, and knowledge transfer (Xie, Zhang et al., 2020) but also assists governments, funding agencies, and corporations in effectively capturing the full picture of a scientific domain (Qian et al., 2020; Yu & Xiang, 2023).
Related studies have mainly conducted topic analysis by examining the external representation and internal content of a scientific topic (Zhang, Zhang et al., 2016). External representation analysis typically involves scrutinizing the term frequency, number of papers, and volume of citations associated with a specific topic (Bornmann & Haunschild, 2022; Cheng, Wang et al., 2020; Lu, Huang et al., 2021), while internal content analysis focuses on the semantic essence of a topic via topic models and language models (Gozuacik, Sakar, & Ozcan, 2023; Hu, Qi et al., 2018; Yu & Xiang, 2023). However, to the best of our knowledge, existing studies have not examined topic evolution by analyzing the geometrical motion of topic embeddings. Because embedding spaces can represent semantic space (Xie et al., 2020), and a change in a keyword’s embedding indicates a corresponding change in its meaning, the semantic evolution of a topic may be encoded in the geometrical motion of its embeddings. To verify this idea, this study makes a first attempt to demonstrate, predict, and explain the geometrical motion of topic embeddings, thereby presenting a novel research perspective in topic analysis.
The specific research questions of this study are as follows:
As a topic develops, do its embeddings move in semantic vector space? If so, how can we quantify and visualize the direction of motion?
To what extent is this direction predictable? Can machine learning methods be used to effectively predict this motion?
Can the predicted motion be used to explain topic evolution and, if so, how?
To address these questions, we propose a novel topic evolution analysis framework, with which we re-examine topic evolution by demonstrating, predicting, and explaining the motion of topics in semantic vector space. Specifically, we employed approximately 15 million papers in the computer science (CS) field extracted from the Microsoft Academic Graph (MAG) (Zhang, Liu et al., 2019) as our data set and used approximately 7,000 FoSL2 (Level 2 fields of study) generated by MAG (Shen, Ma, & Wang, 2018) to represent topics. We first used pretrained language models to create context-based embeddings that encode each topic through its topic words and the title field of its papers; we denote these as topic embeddings. Then, using a support vector machine (SVM), we verified that over 80% of topics experienced obvious motion in semantic vector space and displayed this motion in 2D and 3D space. Next, inspired by the work of Gozuacik et al. (2023), we employed three vector regression models to predict topic embeddings, demonstrating the predictability of the motion. Finally, we trained a predictive decoder to explain the predicted topic embeddings by decoding the forecast vectors into topic words. Our findings revealed that the evolution of old topics nurtures new topics, and our framework can predict unseen topics. Our code is openly accessible.
The current study has the following theoretical and practical implications. First, we found that topic embeddings experienced obvious semantic motion in semantic vector space, which represents the process of topic evolution from a geometric perspective. Second, we verified the predictability of the motion of topic embeddings by treating it as a vector regression task, which provides a promising topic detection method. Third, we explained the predicted embeddings via a text generation model and creatively predicted unseen topics. In conclusion, our research provides a new explanatory framework for topic evolution analysis and consequently broadens the research path in topic analysis.
The rest of the paper is organized as follows: Section 2 reviews related studies; Section 3 introduces our problem definition and research method; Section 4 presents our data set; Section 5 details the experimental settings and results; and Section 6 concludes the paper with its implications, limitations, and directions for future work.
2. BACKGROUND
Topic analysis mainly investigates the evolutionary patterns of topics based on their external and internal characteristics. The former typically include term frequency, number of papers, and number of citations, while the latter generally indicate research content.
The external features of topics have been widely studied. For example, Zhang et al. (2016) presented a general research framework for topic analysis and forecasting, in which a k-means-based clustering method is proposed to identify topics and a similarity measure function is used to detect topic relationships and trends. Effendy and Yap (2017) proposed a FoS score to quantify the trend in a topic, by which they analyzed the publication and citation trends along with the evolution of topics in the CS field. Qin, Zeng, and Ma (2021) identified topics in over 10 million CS papers and conducted trend and correlation analysis based on the number of papers. Bornmann and Haunschild (2022) analyzed the dynamics of topics from the annual number of publications in chemistry from 2014 to 2020; they found that optical phenomena and electrochemical technologies are emerging topics. Lu, Yang, and Wang (2022) examined the characteristics of topics in the biomedical field and found that topic category, clinical significance, and the narrow terms of a topic impact the popularity of a new topic.
Related studies have also discovered and analyzed topics from keyword co-occurrence and citation networks. For instance, Jensen, Liu et al. (2016) constructed a heterogeneous network consisting of papers, venues, authors, and topics, and used restricted metapaths of the network to build a topic evolution tree. Yu and Fang (2023) likewise considered a heterogeneous network consisting of papers, journals, authors, and topics, and proposed a four-entity reinforced ranking model to evaluate topic impact. Cheng et al. (2020) developed a keyword-citation-keyword network and utilized the PageRank algorithm to disclose important topics. The MatrixSim model proposed by Wang, He et al. (2022) uses the local structure of topic communities in coword networks to detect topic evolution paths.
Moreover, machine learning and deep learning models have recently yielded fruitful results in detecting topic trends. Lu et al. (2021) adopted author-defined keywords to represent topics, and employed four keyword features to predict topic trends. Taheri and Aliakbary (2022) employed FoS to represent topics, and predicted the topic trends by using a long short-term memory neural network (LSTM). Xu, Du et al. (2022) proposed a topic trend prediction model based on multi-LSTM and graph convolutional networks, which integrates the interactive influence among topics from different publications. Although these studies have made substantial progress in topic analysis, they generally lack analysis of the topics’ research content.
With advances in natural language processing (NLP) technology, researchers have conducted semantic analysis of topic evolution through topic and language models. Latent Dirichlet allocation (LDA) (Blei, Ng, & Jordan, 2003) and its variants are frequently used to identify and analyze topics. For example, Zhang, Chen et al. (2017) identified topics in the journal Knowledge-Based Systems using an LDA-based topic model, then predicted topic trends via a probability-based weighting approach. Qian et al. (2020) presented a hierarchical topic model with which they visually investigated topic trees in AI from 2009 to 2018. Yu and Xiang (2023) collected 177,204 articles in AI research from 1990 to 2021 and used an LDA model to identify 40 topics from the abstracts; they analyzed the research characteristics and content of papers from different publication sources. In addition, some studies have applied semantic embeddings such as BERT (Devlin, Chang et al., 2018) and ELMo (Peters, Neumann et al., 2018) to topic analysis. Hu et al. (2018) represented the semantic information of keywords with static embeddings generated by the word2vec model (Mikolov, Chen et al., 2013); they reduced these high-dimensional vectors to two dimensions using the t-SNE algorithm and presented the “Ghost City.” Hu, Luo et al. (2019) also analyzed topic evolution geographically using spatial autocorrelation based on word2vec embeddings, finding that the impact of a keyword can only affect its surrounding keywords. Xie et al. (2020) identified topics in Chinese- and English-language publications in library and information science based on LDA, utilizing the LDA probability value and multilingual as well as monolingual BERT to compute multilingual topic similarity. Gozuacik et al. (2023) used a neural network model to predict a word-embedding matrix and identified topics based on the co-similarity matrix computed from the predicted embeddings; their approach can successfully detect emerging and disappearing topics.
Taken together, these studies reveal an active and methodologically diverse research landscape surrounding topic evolution. However, to the best of our knowledge, no existing studies have attempted to analyze topic evolution geometrically by examining the motion of context-based embeddings in semantic vector space. To fill this gap, we aim to demonstrate, predict, and explain the semantic motion of topic embeddings, thereby broadening the research path in topic analysis.
3. METHODOLOGY
3.1. Problem Definition
This study makes a preliminary attempt to re-examine topic evolution by analyzing the geometrical motion of topic embeddings generated by pretrained language models in semantic vector space. As shown in Figure 1, our framework includes three steps:
Demonstrate the motion of topic embeddings (blue);
Investigate the predictability of the motion (orange); and
Explain the predicted motion (red).
To generate topic embeddings, we employed pretrained language models to encode a topic i through context-based embeddings Xi(t), where the topic word and the title field of a paper on that topic are used as the input: Pθ(Xi(t)∣Zi(t)), with Zi(t) denoting the topic words of i at time t. The embedding space is then used to represent semantic space, and Xi(t) is utilized to encode the topic.
As shown in the blue text in Figure 1, we analyzed the semantic evolution of a topic by studying the geometrical motion of its embeddings in semantic vector space, thus re-examining the process of topic evolution from the viewpoint of geometry. Specifically, we employed a support vector machine (SVM) to classify embeddings of a topic from two adjacent time periods, where the hyperplane obtained by the SVM and its normal vector are utilized to visualize the direction of the topic’s motion.
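As a minimal sketch of this step, the following uses synthetic 2D points standing in for PCA-reduced topic embeddings from two adjacent periods; the toy data and variable names are our own illustration, not the authors’ code:

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_early = rng.normal(0.0, 1.0, size=(200, 2))  # embeddings from period 1
X_late = rng.normal(1.5, 1.0, size=(200, 2))   # embeddings from period 2

X = np.vstack([X_early, X_late])
y = np.array([0] * 200 + [1] * 200)

clf = SVC(kernel="linear").fit(X, y)
print("F1:", f1_score(y, clf.predict(X)))

# For a linear SVM, coef_ is the hyperplane normal; normalized, it points
# from the early class toward the late class, i.e., the direction of motion.
direction = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
print("motion direction:", direction)
```

A high F1 score indicates that the two periods are separable, and the normal vector then gives a well-defined motion direction.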
The orange text in the same diagram describes our use of vector regression models to forecast topic embeddings based on past motion trajectory, Pψ(Xi(t)∣Xi(0:t − 1)), which is actually a time series prediction problem. To the best of our knowledge, we make the first attempt to detect topic evolution in this way.
The red text at the right of Figure 1 indicates the process of explaining the predicted topic embeddings based on a text generation model (i.e., decoder). Specifically, we decoded the predicted embeddings into topic words, Pϕ(Zi(t)∣Xi(t)), and scrutinized whether our method could creatively predict unseen topics that then appear in the future. Figuratively speaking, the knowledge distribution in a specific field can be understood as a picture puzzle in semantic space. The motion of topic embeddings intuitively reflects topic evolution, and the missing pieces in the puzzle are filled in by new topics generated during the evolution of old topics. Overall, our research framework is akin to an encoder-decoder framework in which the hidden embeddings are predicted.
3.2. Demonstrating the Motion of Topic Embeddings via SVM
First, we explain how knowledge related to a topic may be encoded through context-based embeddings created by a pretrained language model. As shown in Figure 2, we concatenated a topic and the title field of a paper on that topic as the model input. The topic word represents explicit knowledge related to the topic, while the title contains the context-based knowledge. Departing from previous works that encoded a topic with one or a few static vectors (Hu et al., 2018, 2019; Xie et al., 2020), we represented a topic with numerous context-based embeddings. Hence, rich semantic knowledge of a topic is encoded by its embeddings (hereafter “topic embeddings”). In this study, the FoS (fields of study) generated by MAG are taken as topics (Shen et al., 2018). Three pretrained language models (i.e., “all-MiniLM-L6-v2,” “all-mpnet-base-v2,” and “bert-base-uncased”) are used to generate embeddings. Of these, “bert-base-uncased” (Devlin et al., 2018) is implemented through the Hugging Face library, while “all-mpnet-base-v2” and “all-MiniLM-L6-v2” (Reimers & Gurevych, 2019) are implemented through the Sentence-Transformers library.
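A minimal sketch of the encoding step with the Sentence-Transformers library named above; the exact way topic word and title are concatenated, and the example records, are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # or "all-mpnet-base-v2"

# Hypothetical (topic word, paper title) records for one topic.
records = [
    ("ontology", "A survey of ontology learning from free text"),
    ("ontology", "Ontology matching approaches for the semantic web"),
]

# Concatenate topic word and title as the model input, then encode;
# each paper contributes one context-based embedding of the topic.
texts = [f"{topic} {title}" for topic, title in records]
embeddings = model.encode(texts)  # shape (n_papers, 384) for MiniLM
```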
3.3. Forecasting Motion via Vector Regression Models
After establishing the motion of topic embeddings, it is natural to ask whether this motion can be forecast—and if so, how. In contrast to Gozuacik et al.’s (2023) research focusing on predicting the static embeddings of target words, we undertake a preliminary exploration to predict the motion of context-based embeddings of a topic. Our task is more challenging but may provide a novel path for detecting topic evolution by forecasting the motion of topic embeddings.
Figure 3 presents a three-dimensional schematic diagram illustrating the motion of the embeddings of a specific topic i within semantic vector space from t1 to t2. Forecasting the motion of i can be treated as a time series prediction task. Given the embeddings of topic i from t0 to t1 (blue dots), our goal is to predict the embeddings from t1 + 1 to t2 (orange dots); that is, to find the complex functional relationship Xi(t1 + 1:t2) = F(Xi(t0:t1)). Thus, this is fundamentally a vector regression task. However, existing research focuses more on predicting univariate time series (Lim, Arık et al., 2021; Shchur, Turkmen et al., 2023). To simplify the problem, we predicted the motion of the centroid x̄i(t) = (1/∣Xi(:t)∣) ∑x∈Xi(:t) x, where Xi(:t) denotes the embeddings of topic i up until t. There are two families of forecasting methods for large panels of time series: the first fits local models to each individual time series; the second fits expressive models globally to all time series at once (Shchur et al., 2023). We employed a linear model and a vector autoregression model (local) as well as a neural network model (global) to fulfill our prediction task.
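A minimal sketch of the centroid-forecasting step with a VAR model from statsmodels; the toy trajectory, the low dimensionality (real sentence embeddings have hundreds of dimensions), and the lag order are assumptions:

```python
import numpy as np
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
# Annual centroids of one topic: 21 years x 3 dimensions, drifting slowly.
drift = np.linspace(0.0, 1.0, 21)[:, None]
centroids = drift + 0.05 * rng.normal(size=(21, 3))

train, test = centroids[:11], centroids[11:]  # e.g., 1980-1990 vs. 1991-2000
res = VAR(train).fit(maxlags=2)
forecast = res.forecast(train[-res.k_ar:], steps=len(test))

print("RMSE:", np.sqrt(np.mean((forecast - test) ** 2)))
```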
3.4. Explaining Forecast Motion via a Text Generation Model
Once the motion of topic embeddings is predictable, another key issue is how to explain the predicted embeddings, which may encode the coordinates of new knowledge created during topic evolution. Because these dense vectors cannot be directly understood by humans, we explained the predicted embeddings through a decoder 𝒟, by which predicted embeddings can be decoded into topic words.
Specifically, we adopted a neural network with gated recurrent units (GRU) (Chung, Gulcehre et al., 2014) and a position-wise feed-forward network (FFN) (Vaswani, Shazeer et al., 2017) to build the decoder 𝒟, as shown in Figure 4. The input of 𝒟 is an embedding of a topic i (x ∈ Xi(t)), while the output is the corresponding topic word. The word embedding matrix of the pretrained language model is replicated into 𝒟. <SOS> and <EOS> respectively indicate the start and end tokens of the decoding process. The FFN and GRU are described in Eqs. 5–8, where ReLU(x) = max(0, x) and σ(·) is the sigmoid function. 𝒟 was implemented in PyTorch.
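Equations 5–8 are not reproduced in this version; for reference, the standard formulations of the position-wise FFN (Vaswani et al., 2017) and GRU (Chung et al., 2014), which the decoder’s layers presumably follow, are (with ⊙ the elementwise product):

```latex
\begin{aligned}
\mathrm{FFN}(x) &= \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2\\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z)\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big)
\end{aligned}
```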
4. DATA COLLECTION AND PREPROCESSING
The Microsoft Academic Graph (MAG) generated in 2020 (Zhang et al., 2019) includes over 200 million scientific papers in 19 top-level fields. A series of “fields of study” with six levels (i.e., FoSL0, …, FoSL5) represent the research topics (Shen et al., 2018), allowing researchers to conduct topic analysis at different granularities. Many studies have used MAG for scientific investigations (e.g., De Domenico, Omodei, & Arenas, 2016; Huang, Lu et al., 2022; Taheri & Aliakbary, 2022).
As we are most familiar with the computer science (CS) field, we extracted papers and FoSL2 in this field from MAG for our experiments. Specifically, we extracted about 15 million papers studying at least one FoSL2. The annual number of papers, which increases exponentially, is shown in Figure 5(a). Compared with FoSL0 and FoSL1, FoSL2 is more specific (Effendy & Yap, 2017) and is therefore used here to represent topics; there are 13,825 FoSL2 in total (e.g., “linguistic model,” “diffusion network”). In Figure 5(b), the x-axis denotes the number of papers studying an FoSL2, while the y-axis denotes the number of FoSL2. There are 7,093 FoSL2 that were adopted at least 100 times (FoSL2 ≥ 100), but only 3,920 FoSL2 used 500 times or more (FoSL2 ≥ 500). We take the year in which an FoSL2 was first adopted 10 times as the time it appears (tborn), the distribution of which is shown in Figure 5(c). We next introduce the data sets used for our three research questions.
First, we used a topic i and the title field of a paper on that topic to create topic embeddings, where i ∈ S (S = FoSL2 ≥ 100). The embeddings of the topic are Xi(t), t ∈ [tborn, 2020]. Xi(t) is an Ni(t) × K matrix, where Ni(t) and K indicate the number of papers studying i published at t and the dimensionality of the embeddings, respectively. We demonstrate the motion of topic embeddings with Xi(t), i ∈ S, t ∈ [1990, 2010].
Second, we selected 2,496 topics i ∈ S3 (S3 = FoSL2 ≥ 500 with tborn ≤ 1985) and the time span t ∈ [1980, 2000] to check the predictability of the motion. Specifically, Xi(t), t ∈ [1980, 1990], is taken as the training data, and Xi(t), t ∈ [1991, 2000], as the test data. The training data are used to fit the prediction models, while the test data are used to evaluate prediction accuracy. We repeated our experiments with prediction durations of five, seven, and 10 years. Notably, only topics i ∈ S3 are exposed to the prediction model F; topics with tborn > 1985 have never been known to F.
Third, to explain the predicted embeddings, we employed the topics i ∈ S and their corresponding embeddings to train a predictive decoder 𝒟, by which we can check whether embeddings predicted by F based on i ∈ S3 do encode unseen topics in S4 or S5. As S3 ⊂ S, 𝒟 has more comprehensive topic information than F and therefore serves as a “prophet” for F. Figure 6 summarizes the inclusion relationships among these sets.
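A sketch of the topic-set construction; the column names are our own, and the definition of S5 as all topics in S born after 1985 is an inference consistent with the reported sizes (∣S3∣ + ∣S5∣ = 2,496 + 4,597 = 7,093 = ∣S∣):

```python
import pandas as pd

# Hypothetical FoS-L2 records: adoption count and year of 10th adoption.
topics = pd.DataFrame({
    "fos": ["ontology", "bicycle sharing", "vehicle dynamics"],
    "n_papers": [5000, 150, 800],
    "t_born": [1984, 1999, 1990],
})

S = topics[topics.n_papers >= 100]                 # 7,093 topics in the paper
S3 = S[(S.n_papers >= 500) & (S.t_born <= 1985)]   # exposed to F (2,496)
S4 = S[(S.n_papers >= 500) & (S.t_born > 1985)]    # unseen, frequent (1,424)
S5 = S[S.t_born > 1985]                            # all unseen topics (4,597)
```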
Statistics on the number of computer science publications and topics.
5. EXPERIMENTS AND RESULTS
5.1. Evaluation Metrics
We utilized a series of machine learning models in our experiments. Specifically, the SVM was used to classify topic embeddings from two adjacent time periods, which is fundamentally a binary classification task; hence, precision, recall, and the F1 score were used to evaluate classification performance. Three vector regression models were adopted to predict topic embeddings, which is a time series prediction task; thus, the root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) were used to evaluate prediction performance. Finally, a text generation model was trained to fulfill a multiclass classification task by decoding predicted embeddings into topic words; its accuracy was used to measure classification performance.
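For reference, denoting by x̂t the predicted and xt the actual value (averaged over embedding dimensions in our vector setting), the standard definitions of the three regression metrics are:

```latex
\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{t=1}^{n}(\hat{x}_t-x_t)^2},\qquad
\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\lvert\hat{x}_t-x_t\rvert,\qquad
\mathrm{MAPE}=\frac{100\%}{n}\sum_{t=1}^{n}\left\lvert\frac{\hat{x}_t-x_t}{x_t}\right\rvert
```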
5.2. Topic Embeddings Undergo Motion in Semantic Vector Space
We re-examined topic evolution by analyzing the geometrical motion of a topic’s embeddings in semantic vector space (Xi(t), i ∈ S, t ∈ [1990, 2010]). Ethayarajh, Duvenaud, and Hirst (2018) and Toubia, Berger, and Eliashberg (2021) showed that the Euclidean distance between two embeddings reflects their semantic difference. Therefore, we first reported the distribution of the Euclidean distance between a topic’s centroids in 1990 and 2010 (i.e., ‖x̄i(2010) − x̄i(1990)‖2). In Figure 7, the distributions of this distance based on S (FoSL2 ≥ 100) and on S3 ∪ S4 (FoSL2 ≥ 500) are plotted as red and blue bar charts, where the corresponding black lines represent kernel density estimates. The title of each subfigure indicates which pretrained language model generated the embeddings. For embeddings created by all three language models, the blue and red distributions are bell-shaped with means greater than 0, showing that these topics underwent obvious motion in semantic vector space during the observed period.
To further verify and illustrate this motion, we trained an SVM with a linear kernel to classify X̃i(t), t ∈ [1990, 2000], against X̃i(t), t ∈ [2001, 2010], where X̃i(t) is the set of 2D vectors obtained from Xi(t) by PCA. The cumulative distribution of the F1 score is shown in Figure 8. For embeddings created by the three language models, the blue and red cumulative distribution functions (CDFs) both remain below 0.2 when F1 < 0.5, which suggests that fewer than 20% of topics cannot be effectively classified. In other words, for the majority of topics (over 80%), X̃i(t), t ∈ [1990, 2000], and X̃i(t), t ∈ [2001, 2010], were distributed on the two sides of a hyperplane generated by the SVM; these topics experienced semantic evolution, and their embeddings underwent motion in semantic vector space.
Subsequently, we randomly selected nine cases for visualization analysis. Figure 9 presents scatterplots of the 2D vectors of a topic i obtained by PCA, where the blue and red dots respectively indicate embeddings generated from papers on i published up to 2000 and from 2001 onward. The dotted black line is the hyperplane generated by the SVM, while the black arrow denotes its normal vector (and hence the direction of motion). For all cases, the red and blue points are roughly distributed on the two sides of the black line. For instance, the vectors for the topic “ontology” move upward from the bottom, which suggests the exploration of new knowledge in semantic space.
Nine case topics illustrating semantic evolution in 2D semantic space.
Figure 10 presents scatterplots of 3D vectors for the same nine topics, again obtained by PCA. The settings are similar to those in Figure 9: the gray plane is the hyperplane, the black arrow denotes its normal vector, and the blue points gradually cross the plane toward the red points. The 3D vectors for the topic “body of knowledge,” for example, move from the interior to the exterior of the region bounded by the plane. In summary, over 80% of topics in the CS field have undergone obvious motion in the semantic vector space, which suggests that they have experienced semantic evolution.
Nine case topics illustrating semantic evolution in 3D semantic space.
We also explored the relationship between the motion distance of a topic i and the number of topics that have been studied together with i. If two topics are studied in the same paper, they are recorded as co-occurring once. As shown in the first row of Figure 11, as the number of topics studied with i increases, the motion distance decreases. The Pearson correlation coefficient is shown in the corner of each subfigure, and the red solid line and blue dashed line represent linear fits with negative slopes. To avoid the influence of topics studied only infrequently with i, we repeated the experiments after filtering out topics whose co-occurrence frequency was below the median. As shown in the second row of Figure 11, we obtained similar results with a more pronounced negative correlation.
Relationship between movement distance and number of topics studied with i.
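A minimal sketch of the co-occurrence count described above; the paper records are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Each set holds the FoS-L2 topics studied by one (hypothetical) paper.
papers_topics = [
    {"vehicle dynamics", "simulation system"},
    {"vehicle dynamics", "simulation system", "linear combination"},
    {"linear combination", "ontology"},
]

cooccur = Counter()
for topics in papers_topics:
    for a, b in combinations(sorted(topics), 2):
        cooccur[(a, b)] += 1  # one co-occurrence per shared paper

# Number of distinct topics ever studied together with topic i.
i = "vehicle dynamics"
partners = {t for pair in cooccur if i in pair for t in pair if t != i}
print(len(partners), dict(cooccur))
```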
This result is surprising yet reasonable. If a topic i is studied mainly together with a specific topic j, the semantic content of i tends to evolve toward j, suggesting that their research content is closely related. However, if i is studied with diverse topics, the changes in its semantic structure tend to be diffuse. In Figure 12, we randomly selected seven topics to illustrate this phenomenon visually; their embeddings are projected into 2D space via PCA. For example, “vehicle dynamics” and “simulation system” are often studied together, and thus the former gradually moves toward the latter. In contrast, “linear combination” has been studied frequently with multiple topics, and therefore its distribution pattern is relatively loose and its motion less obvious.
A case to clarify the negative correlation between motion distance and co-occurrence.
5.3. The Motion of Topic Embeddings Is Predictable
After demonstrating that topics undergo motion in semantic vector space, we predicted this motion using three vector regression models: linear regression (LR), vector autoregression (VAR), and a long short-term memory network (LSTM). A total of 2,496 topics (S3) were selected for the prediction experiments, where Xi(t), t ∈ [1980, 1990], and Xi(t), t ∈ [1991, 2000], were used to build the training set and test set, respectively.
Specifically, for the LR and VAR, we trained a local model for each topic, using the training data to fit model parameters and the test data to evaluate prediction accuracy. For the LSTM, we followed Gozuacik et al.’s (2023) experimental settings and trained a global model for all topics. In the LSTM’s training phase, Xi(t), t ∈ [1980, 1985], was used to forecast Xi(t), t ∈ [1986, 1990], while in its test phase the entire training set was used as the input. We set the prediction duration to five, seven, and 10 years and repeated the experiments. Figure 13 reports the prediction performance on the 2,496 topics in terms of RMSE, MAE, and MAPE. The x-axis of each subfigure indicates the model name, while the y-axis indicates the evaluation metric; different colors denote different prediction durations. For embeddings generated by the three pretrained language models and across all prediction durations, the VAR model almost always achieves the best accuracy on all evaluation metrics. The LR model achieves the second-best accuracy, and the neural network exhibits the worst results, possibly due to insufficient training data. Table 1 further reports the median of the prediction accuracy distribution over the 2,496 topics, with optimal results shown in boldface. The small values of RMSE, MAE, and MAPE suggest that the VAR produces, to some extent, satisfactory predictions. For example, for embeddings created by the BERT model, the median RMSE, MAE, and MAPE are 0.0130, 0.0094, and 20.27% for T = 10, respectively. Hence, the motion of topic embeddings is somewhat predictable. As the prediction duration increases, the accuracy of all three models declines, suggesting that our prediction task is challenging and that accuracy requires further improvement in the future.
Table 1. Median prediction accuracy of the three vector regression models (MiniLM = all-MiniLM-L6-v2, mpnet = all-mpnet-base-v2, BERT = bert-base-uncased; best results in bold)

| Metric | T | MiniLM LR | MiniLM VAR | MiniLM LSTM | mpnet LR | mpnet VAR | mpnet LSTM | BERT LR | BERT VAR | BERT LSTM |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | 5 | 0.0035 | **0.0020** | 0.0063 | 0.0026 | **0.0014** | 0.0046 | 0.0137 | **0.0087** | 0.0505 |
| RMSE | 7 | 0.0042 | **0.0024** | 0.0066 | 0.0031 | **0.0017** | 0.0048 | 0.0167 | **0.0105** | 0.0508 |
| RMSE | 10 | 0.0053 | **0.0030** | 0.0071 | 0.0038 | **0.0021** | 0.0053 | 0.0214 | **0.0130** | 0.0515 |
| MAE | 5 | 0.0027 | **0.0016** | 0.0051 | 0.0019 | **0.0011** | 0.0036 | 0.0104 | **0.0064** | 0.0403 |
| MAE | 7 | 0.0032 | **0.0018** | 0.0052 | 0.0022 | **0.0013** | 0.0038 | 0.0124 | **0.0077** | 0.0405 |
| MAE | 10 | 0.0039 | **0.0022** | 0.0056 | 0.0027 | **0.0016** | 0.0042 | 0.0155 | **0.0094** | 0.0410 |
| MAPE | 5 | 52.86% | **29.41%** | 98.81% | 59.09% | **32.91%** | 142.10% | 20.95% | **13.04%** | 80.56% |
| MAPE | 7 | 65.35% | **36.77%** | 107.71% | 71.18% | **39.66%** | 151.06% | 25.61% | **16.11%** | 82.94% |
| MAPE | 10 | 83.40% | **45.62%** | 120.84% | 91.97% | **49.95%** | 166.24% | 33.60% | **20.27%** | 88.16% |
To visually examine the forecast effect, we randomly selected 12 topics as case studies. For ease of visualization, for each topic i we randomly selected a specific dimension k along which to compare the predicted and actual values. In Figure 14, the x-axis indicates time, while the y-axis indicates the value of dimension k of the centroid of topic i. The gray shaded zone represents the training set, and the blue and red lines respectively denote the actual and predicted values generated by the VAR model. The evolutionary trends of the red and blue curves are similar, which further indicates that the motion of a topic can be predicted to some extent.
Notably, RMSE, MAE, and MAPE cannot directly evaluate whether the predicted embeddings are meaningful; that is, the small values of RMSE, MAE, and MAPE cannot guarantee that the predicted embeddings do encode meaningful topics. Therefore, in the following subsection, we also evaluated the predicted embeddings through the decoder 𝒟, by which we can directly check the quality of predicted embeddings by decoding them into topic words.
5.4. Forecast Topic Embeddings Encode Unseen Topics
To explain the predicted motion of a topic, we trained a decoder 𝒟 to decode the forecast centroid into human-readable topic words. Specifically, for embeddings created by “all-MiniLM-L6-v2,” “all-mpnet-base-v2,” and “bert-base-uncased,” we trained decoders whose classification accuracy on S was 76.61%, 82.97%, and 98.84%, respectively. Considering that the text generation model performs multiclass classification over thousands of topics, this accuracy is acceptable. The VAR model trained on S3 is used to forecast embeddings.
In the training process, the embeddings of S3 (∣S3∣ = 2,496) were used to train the prediction model F, meaning that only these topics were exposed to F. The topic sets S4 and S5 (∣S4∣ = 1,424, ∣S5∣ = 4,597) were never exposed to F. Subsequently, we employed F to predict the centroid movement over 300 time steps and used 𝒟 to decode the predicted embeddings, obtaining a set of generated topic words (SF). If SF ∩ S4 ≠ ∅ or SF ∩ S5 ≠ ∅, the predicted embeddings encode new topics that were never exposed to F. We computed the ratios ∣SF ∩ S4∣/∣S4∣ and ∣SF ∩ S5∣/∣S5∣, which indicate to what extent the predicted embeddings can explain new topics created after 1985 (tborn > 1985).
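The coverage ratio itself is a simple set operation; a sketch with hypothetical set contents:

```python
def coverage(predicted: set, unseen: set) -> float:
    """Share of unseen topics recovered: |S_F ∩ S_unseen| / |S_unseen|."""
    return len(predicted & unseen) / len(unseen)

S_F = {"bicycle sharing", "big data", "ontology"}  # decoded topic words
S4 = {"bicycle sharing", "naturalistic driving"}   # unseen topic set
print(f"{coverage(S_F, S4):.2%}")  # -> 50.00%
```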
As shown in Figure 15, we observed similar results for embeddings generated from the three language models. The red and blue lines respectively indicate ∣SF ∩ S4∣/∣S4∣ and ∣SF ∩ S5∣/∣S5∣; the latter is consistently lower than the former, because S4 ⊂ S5. As the prediction duration (T) increases, the ratios gradually increase and then converge. The red and blue lines have risen to approximately 6% and 4%, respectively, which means that these predicted embeddings explain about 100 topics that appear after 1985 and 150 topics that are unexposed to F. In other words, some predicted vectors can be decoded into new knowledge, which suggests that the semantic evolution of old topics nurtures new topics. In addition, we also created a baseline (shown in gray) by using the embeddings of exposed topics (S3) as the input to 𝒟, and generated topics are denoted as SB. The two gray lines always obtain zero values, because these embeddings can only be decoded into old topics that belong to S3; this further supports the above conclusion. However, there are still many new topics “born” after 1985 that cannot be predicted by the current method.
The research process of scientists is inevitably influenced by random factors, which affect scientists’ topic selection (Huang, Huang et al., 2023; Jia, Wang, & Szymanski, 2017). Therefore, instead of directly using the predicted centroids as the input to 𝒟, we sampled one input vector from Normal(μ, σ), with the predicted centroid as μ and the predicted standard deviation as σ. This captures the intuition that the motion of a topic may be affected by random oscillations. In Figure 16, the trends of the red and blue lines are consistent with those in Figure 15, but the ratios improve greatly for “all-MiniLM-L6-v2” and “all-mpnet-base-v2”: the red and blue lines now converge to around 50% and 30%, respectively. This means that these predicted embeddings explain about 500 topics that appear after 1985 and 1,000 topics unexposed to F. Moreover, the gray baseline curves remain lower than the red and blue lines. This verifies the above conclusions and further underscores the predictability of the motion. Table 2 reports the exact values of ∣SF ∩ S3∣/∣S3∣, ∣SF ∩ S4∣/∣S4∣, and ∣SF ∩ S5∣/∣S5∣; the large value of ∣SF ∩ S3∣/∣S3∣ indicates that many predicted vectors still represent the old topics exposed to F, because those topics were used to train F.
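A sketch of this stochastic decoding input; the dimensionality and noise scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.zeros(384)           # predicted centroid for one time step
sigma = 0.05 * np.ones(384)  # predicted per-dimension standard deviation

# One draw from Normal(mu, sigma) replaces the centroid as decoder input.
decoder_input = rng.normal(mu, sigma)
```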
Ratio of predicted topics that appear after 1985, accounting for random factors.
Table 2. Ratio of created topics that appear after 1985 when T = 300 (MiniLM = all-MiniLM-L6-v2, mpnet = all-mpnet-base-v2, BERT = bert-base-uncased; Mean = predicted centroid used directly as decoder input, Sample = vector sampled from Normal(μ, σ))

| Ratio | MiniLM Mean (%) | MiniLM Sample (%) | mpnet Mean (%) | mpnet Sample (%) | BERT Mean (%) | BERT Sample (%) |
|---|---|---|---|---|---|---|
| ∣SF ∩ S3∣/∣S3∣ | 95.19 | 99.48 | 89.30 | 99.71 | 98.52 | 98.84 |
| ∣SF ∩ S4∣/∣S4∣ | 7.09 | 36.59 | 4.99 | 54.42 | 5.48 | 6.18 |
| ∣SF ∩ S5∣/∣S5∣ | 3.76 | 22.34 | 2.89 | 37.93 | 4.05 | 4.42 |
| ∣SB ∩ S3∣/∣S3∣ | 90.22 | 97.24 | 81.45 | 98.48 | 98.44 | 98.52 |
| ∣SB ∩ S4∣/∣S4∣ | 0.14 | 0.63 | 0.00 | 1.47 | 0.28 | 0.35 |
| ∣SB ∩ S5∣/∣S5∣ | 0.13 | 0.41 | 0.04 | 0.89 | 0.22 | 0.22 |
Subsequently, we randomly selected four cases to qualitatively examine the semantic relationships between an exposed topic j and the meaningful created topics i ∈ SF ∩ S encoded by the predicted embeddings. In Figure 17, the x-axis indicates the prediction duration (T), and the y-axis denotes the time when a topic was adopted (tborn). The heading of each subplot is a case topic whose embeddings are predicted through a vector regression model, and the topics within the plot area are predicted topics generated by the decoder from the predicted centroids. Predicted topics that belong to S3 are marked in black, while those that belong to S5 are marked in red. The embeddings created by “all-MiniLM-L6-v2” are used.
As expected, when topic embeddings of j move in semantic space (T increases), its semantics gradually enter the territory of other related topics, and predicted embeddings are decoded into other old topics (e.g., the black dots). Importantly, some new topics are also created during the process (e.g., red dots). The old topics tend to be the theoretical and technical foundation of new topics; for example, the “travelling salesman problem” is a classic combinatorial optimization problem for computing an optimal path, which is an important theoretical foundation of “bicycle sharing” services. Likewise, “database query” is closely related to “query analysis” and “big data”; “naturalistic driving” research aims to study the behavior and performance of drivers and is a research frontier in “vehicle dynamics”; and “sentiment analysis” is a more delicate procedure within “language analysis.” In short, our framework makes a preliminary attempt to predict and explain the geometrical motion of topic embeddings, creatively detecting unseen topics in the process.
6. DISCUSSION AND CONCLUSION
Unlike previous topic analysis research, which has mainly focused on adoption frequency, scientific impact, and static embeddings of topics (Bu, Li et al., 2021; Hu et al., 2019; Lu et al., 2021, 2022), we introduced a novel topic evolution analysis framework in which the process of a topic’s evolution is regarded as the motion of its embeddings in semantic vector space. We answered three meaningful research questions and verified the feasibility of our framework, using a series of pretrained language models and machine learning models in our experiments. In the following subsections, we highlight the theoretical and practical implications of our study, acknowledge its limitations, and propose directions for future work.
6.1. Theoretical Implications
This study has several theoretical implications of note. First, we presented a novel topic evolution analysis framework in which we re-examined topic evolution as the motion of topic embeddings in semantic vector space. Specifically, instead of analyzing the external features of a topic, our framework uses context-based embeddings to encode its knowledge and demonstrates, predicts, and explains the geometrical motion of topic embeddings. Consequently, our research framework provides a fresh perspective for topic evolution analysis.
Second, this study illustrated that most topics within the CS field have moved appreciably in semantic vector space, thereby geometrically mirroring the topics’ semantic evolution. The Euclidean distance between the centroids of a topic at two distinct time points was used to measure its movement distance, demonstrating that the position of a topic in semantic space does not remain fixed. An SVM was used to further investigate the movement of topics; the classification accuracy of the SVM with a linear kernel significantly exceeded 0.5, revealing that the topic embeddings from two consecutive periods fell on either side of the hyperplane. The hyperplane created by the SVM and its normal vector were utilized to visualize the movement direction of a topic in 2D (3D) space, by which the trajectory of topic embeddings crossing the hyperplane along the normal direction can be displayed intuitively and accessibly. Moreover, we discovered a negative correlation between the movement distance of a topic and its number of co-occurring topics, showing that combinatorial innovation considerably influences the semantic evolution of topics. To the best of our knowledge, we are the first to explore the semantic evolution of topics in this way, providing a theoretical contribution that probes the mechanism of topic evolution.
Third, this study further verified that the motion of topic embeddings is somewhat predictable, though there remains room for improvement in prediction accuracy. Unlike previous works on topic detection, which have focused on forecasting the frequency and/or impact of a topic (Liang, Mao et al., 2021; Lu et al., 2021; Taheri & Aliakbary, 2022), we detect the semantic evolution of a topic by predicting the movement direction of its embeddings. Compared with Gozuacik et al.’s (2023) research on predicting the static embeddings of words, our task is to predict the motion trajectory of centroids of the distribution of topic embeddings generated by numerous papers. Hence, we propose a novel semantic evolution prediction task focused on forecasting the motion of topics. To fulfill this task, we employed three common vector regression models (LR, VAR, and LSTM). Of these, the VAR consistently attained the best prediction accuracy across prediction durations, the LR exhibited the second-best performance, and the LSTM achieved the poorest outcomes, possibly because of insufficient training data. The low values of RMSE and MAE, along with the individual cases presented above, support the predictability of the geometrical motion of a topic to some extent. Our study indicates that detecting topic trends from the movement of semantic vectors is a feasible and promising research path, enriching the current set of approaches to detecting topic evolution. However, the rapid increase in MAPE with prediction duration suggests that prediction performance should be further improved in the future.
Fourth, we made a preliminary effort to explain the predicted motion by training a text generation model, which decodes the predicted embeddings into human-readable topic words. We found that, as the prediction duration increases, about 6% of new topics can be predicted; that is, these created topics do appear in the actual list of FoSL2. We further introduced randomness into the predicted centroid by sampling from a normal distribution with the predicted mean and standard deviation; in this way, nearly 50% of new topics can be creatively predicted. Our findings show that the semantic evolution of old topics nurtures new topics, and our research framework effectively demonstrates, predicts, and explains this process. To the best of our knowledge, the current study represents the first attempt to predict unseen topics and provide explanatory results. However, this research is still at a preliminary stage; in the future, we will focus on improving prediction accuracy and the interpretability of the prediction results.
6.2. Practical Implications
This study also holds practical implications for researchers interested in a variety of topic analysis problems. First, our proposed topic evolution analysis framework may be used to analyze the characteristics of emerging topics as well as disruptive topics and then assist in detecting them. To be specific, a topic generally undergoes geometrical motion in semantic space, which suggests the process of creating new knowledge and new topics. Emerging and disruptive topics encompass prospective and innovative knowledge (Choi & Park, 2019; Lee, Kwon et al., 2018; Xu, Hao et al., 2019) that have the potential to influence or disrupt the evolution of other topics and even the development of an entire field; thus, such topics may exhibit a more pronounced motion trend in semantic vector space. Consequently, researchers may employ our framework to investigate the motion of these topics, using the predicted embeddings to detect and elucidate their development. Our framework has the ability to detect the semantic evolution of a topic, while uncovering its evolutionary relationship with other topics, though it must be noted that the framework requires further exploration and improvement before it is formally applied to practical scenarios. Emerging/disruptive topic detection represents one promising research direction that may be furthered by our framework.
Second, our topic evolution analysis framework may also be employed to thoroughly examine a field’s overall development by visually displaying the distribution of topic embeddings across that field. As mentioned, some topics experience semantic evolution during their development, and these may approach or separate in semantic space. Hence, from a macro perspective, it is meaningful to comprehensively grasp and examine the motion of these topics for a deeper understanding of the development of the field. In the future, our framework may assist in simulating and deducing scientific development in a specific field by detecting the movement as well as the structure of topics in semantic space.
6.3. Limitations and Future Work
This study also has some limitations that may be addressed in future research. First, we examined the geometrical motion of topic embeddings via an SVM, using the normal vector of the hyperplane obtained by the SVM to display the motion direction. We treated this simply as a binary classification task, but in the future, it will be necessary to divide time periods more finely to observe the continuous motion of a topic in semantic vector space. This, in turn, may help researchers to further understand the topic evolution process.
Second, we make the first attempt to predict the semantic evolution of a topic by forecasting context-based embeddings. Although the VAR attains small values of RMSE and MAE, the large value of MAPE on a 10-year horizon suggests that there is considerable room for improvement in prediction accuracy. Moreover, the neural network model achieves the worst prediction performance, which may be a result of insufficient training data. In the future, sophisticated models must be designed to fit the complicated, high-dimensional data distribution, and the training data must be expanded.
Third, we make a preliminary attempt to train a predictive decoder to explain predicted topic embeddings. However, the decoder is trained based on all existing knowledge and therefore is a prophet for the prediction model, which is only exposed to some old topics. This pseudo-prediction approach simplifies our complex research problems and facilitates the gradual exploration of the feasibility of our research framework. In the future, a truly predictive decoder with the same knowledge exposed to the prediction model needs to be created. Moreover, instead of generating phrases, the decoder may need to generate sentences to clarify the predicted knowledge, and large language models may be employed to integrate with our framework.
AUTHOR CONTRIBUTIONS
Shengzhi Huang: Conceptualization, Formal analysis, Methodology, Writing—original draft. Wei Lu: Methodology, Supervision, Validation. Qikai Cheng: Data curation, Investigation. Yong Huang: Data curation, Funding acquisition, Writing—review & editing. Fan Yi: Validation, Writing—review & editing. Liang Zhu: Investigation, Validation.
COMPETING INTERESTS
The authors declare no competing interests.
FUNDING INFORMATION
This work was supported by the Postdoctoral Fellowship Program of CPSF under grant number GZB20240565 and Youth Science Foundation of the National Natural Science Foundation of China (grant no. 72004168).