Abstract
The Chinese Software Developer Network (CSDN) is one of the largest information technology communities and service platforms in China. This paper describes our work on user profiling for CSDN, an evaluation track of SMP Cup 2017. The track contains three tasks: (1) user document keyphrase extraction, (2) user tagging and (3) user growth value prediction. In the first task, we treat keyphrase extraction as a classification problem and train a Gradient Boosting Decision Tree model with comprehensive features. In the second task, to deal with class imbalance and capture the interdependency between classes, we propose a two-stage framework: (1) for each class, we train a binary classifier to model that class against all of the other classes independently; (2) we feed the outputs of the trained classifiers into a softmax classifier, tagging each user with multiple labels. In the third task, we propose a comprehensive architecture to predict user growth value. Our contributions are summarized as follows: (1) we extract various types of features to identify the key factors in user value growth; (2) we use a semi-supervised method and the stacking technique to extend the labeled data set and increase the generality of the trained model, resulting in impressive performance in our experiments. In the competition, we achieved first place out of 329 teams.
1. INTRODUCTION
User profiling is very important for precise recommendation and personalized services in the Internet era. In this paper, we focus on three key tasks of user profiling for the Chinese Software Developer Network (CSDN), an evaluation track of SMP Cup 2017 [1]. CSDN is one of the largest information technology communities and service platforms in China, with about 50 million registered users, and over 100,000 users communicate and share knowledge and information via its Web forums every day [1]. The three tasks of user profiling for CSDN are user document keyphrase extraction, user tagging and user growth value prediction.
Keyphrase extraction aims to automatically extract the top k important phrases that represent the main idea of a document. We use a supervised method for this task. First, we generate 301,076 candidate phrases for the whole corpus, covering more than 90% of the annotated keyphrases in the training set. Then we extract statistical and semantic features of the input examples and compare their effectiveness. Finally, by incorporating both the statistical and semantic features, we achieved the highest score among all competitors.
The second task is to label users' interests, which raises several challenges. First, users may have more than one interest. This can be handled by training several independent binary classifiers, but the correlations between interests are then ignored; for instance, users labeled with "machine learning" are also likely to be interested in "deep learning." Second, labeled data are insufficient because manual labeling is costly, so a model with too many parameters may overfit. To address these challenges, we propose a two-phase model. In the first phase, independent binary classifiers are trained and each predicts the probability of one user label; this encodes the high-dimensional features into a low-dimensional representation and decreases the number of parameters. In the second phase, to capture the correlations between labels, a neural network with softmax loss takes the probabilities from the first phase as input. The labels with the highest probabilities predicted by the neural network are the users' final labels.
In task 3, we need to predict the growth value of CSDN users during the next year based on their posts and behaviors. After formulating this task as a regression problem, we explore various types of features to incorporate the key factors driving the growth of user value. Besides, semi-supervised regression and a stacking technique are introduced into our prediction architecture to overcome the lack of labeled data and enhance the generality of the prediction model. Experimental results show that our proposed architecture can effectively predict user growth value.
The rest of this paper introduces our work on each task in detail, and the last section presents conclusions and future work.
2. KEYPHRASE EXTRACTION
Keyphrases are single or multi-word expressions that represent the main topics of a document. They have proved useful in many areas, e.g., information retrieval [2], text summarization [3], text clustering [4], text categorization [5], text indexing [6] and opinion mining [7]. Moreover, keyphrases help users grasp the main topics of a Web page [8].
2.1 Related Work
Keyphrase extraction methods can be divided into two categories, unsupervised and supervised. In both, the extraction process consists of two steps: candidate phrase generation and selection of the top k phrases.
Candidate phrase generation extracts a set of phrases or words using heuristic rules. These rules should be (1) general: a rule should not depend only on the training data; (2) high coverage: the generated candidate set should include as many of the keyphrases annotated as "labels" in the training data as possible; (3) minimal: rules should avoid spurious instances and keep the number of candidates small. Typical heuristic rules include removing stop words [9], allowing n-grams that appear in Wikipedia article titles to be candidates [10], extracting noun phrases [11], and extracting n-grams or noun phrases that satisfy pre-defined lexico-syntactic patterns [12]. In this paper, we generate candidate phrases with a frequency threshold rule.
Extracting the top k keyphrases means finding the most important phrases in a document. Typical unsupervised methods can be categorized into ranking-based methods such as TF-IDF and TextRank [13] and clustering-based methods. TF-IDF, short for term frequency-inverse document frequency, is a numerical statistic that reflects how important a phrase is to a document in a collection or corpus. The TextRank algorithm is an extension of PageRank, a graph-based ranking method initially used to rank the importance of Web pages [14]. The idea of TextRank is that the importance of a candidate in a document depends on its relations with the other candidate phrases in that document: the more candidates a phrase is related to, and the more important those related candidates are, the more important the phrase is. Researchers have computed relatedness between candidates using co-occurrence counts [15] and semantic relatedness [16], and used TextRank to generate the top k important keyphrases [13].
The two important issues of supervised keyphrase extraction are feature design and modeling. Berend and Farkas [17] developed the SZTERGAK system to extract features, classifying them into five categories: standard features, phrase-level features, document-level features, corpus-level features and external knowledge-based features. These features perform well on their tasks, but other features remain to be explored, such as semantic embedding features for phrases and documents.
Keyphrase extraction is treated as a binary classification task. For the supervised method, the goal of modeling is to train a classifier that determines whether a candidate phrase is a keyphrase; the annotated keyphrases in a document and the other non-keyphrase candidates are used to generate positive and negative examples, respectively. Researchers have tried different learning algorithms to train the binary classifier: Frank et al. used naïve Bayes [18], Turney tried decision trees in 1999 [19], and other methods such as maximum entropy [20], multi-layer perceptrons [21] and support vector machines [22] have also been explored. Gradient boosting is a machine learning technique for regression and classification problems, and eXtreme Gradient Boosting (XGBoost) is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable [23]. In the keyphrase extraction task, we choose the XGBoost tool, and the inputs are features of <doc, phrase> pairs, where doc is a document and phrase is a candidate phrase extracted from that document.
2.2 Candidate Phrase Generation
In the keyphrase extraction task of SMP Cup 2017, the organizers provided a total of 1,022 documents, each with five annotated keyphrases. The annotated keyphrases were used as positive examples, and the other candidate phrases were taken as negative examples.
2.2.1 Procedure
Existing methods for candidate phrase generation, as introduced in Section 2.1, work well for English documents, but are not suitable for Chinese documents due to the challenges of Chinese word segmentation. Moreover, the formats of the annotated keyphrases in the training document set are quite diverse. Specifically, a keyphrase can be a single Chinese character, a Chinese phrase with multiple characters (mostly 2 to 7 characters), an English word, an English phrase with multiple words, a phrase consisting of both English and Chinese, or even an English phrase containing special characters, like c++, node.js and utf-8. This high diversity of keyphrase formats greatly increases the difficulty of candidate phrase generation.
To combat the aforementioned challenges, we adopt a heuristic procedure to automatically generate the candidate phrases (Figure 1):
1. Take the total 1 million documents as a corpus.
2. Extract the Chinese content of each document in the corpus.
3. Use the seg-for-search-engine mode in Jieba to obtain the segmentation results of the Chinese content.
4. Construct Chinese phrases with 2 to 7 characters from the segmented words.
5. Use phrase frequency to filter the phrases extracted in step 4.
6. Traverse the whole corpus to extract English phrases and combination phrases, which consist of both English words and Chinese characters, and order them by frequency to generate the non-Chinese candidates.
7. Combine the candidate phrases extracted in steps 5 and 6.
Since most keyphrases in the training set and most documents in the corpus are Chinese, we deal with Chinese and English phrases separately starting from step 2. For Chinese phrases, the difficulty is balancing the number of candidate phrases against the coverage of annotated keyphrases in the training set. Initially, we simply used the open source library Jieba for word segmentation, but only a small proportion of annotated keyphrases were included in the resulting candidates. We then extracted every possible character sequence with a length between 2 and 7, which led to an explosively growing number of candidates. Guided by the heuristic that the minimum unit of a Chinese phrase is a word rather than a character, we finally use the seg-for-search-engine mode in Jieba to extract every possible word and its position in the document, and construct every possible phrase with a length between 2 and 7 characters (steps 3 and 4). In step 5, we calculate phrase frequency over the whole corpus and keep the high-frequency phrases as the Chinese candidates. We then traverse the corpus again to extract English-only phrases and combination phrases (step 6). Finally, we merge all candidate phrases generated in steps 5 and 6 as our final candidate set.
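As an illustration of steps 3 to 5, the following is a minimal sketch in Python, assuming Jieba's tokenize interface in search mode and a hypothetical frequency threshold; it is not the exact competition code.

```python
from collections import Counter
import jieba

def chinese_candidates(docs, min_len=2, max_len=7, min_freq=50):
    """Steps 3-5: segment, build 2-7 character phrases, filter by frequency."""
    counter = Counter()
    for doc in docs:
        # Seg-for-search-engine mode yields every possible word with its span.
        tokens = list(jieba.tokenize(doc, mode='search'))
        for i in range(len(tokens)):
            phrase, end = '', tokens[i][1]
            for word, start, stop in tokens[i:]:
                if start != end:            # keep only contiguous words
                    break
                phrase, end = phrase + word, stop
                if len(phrase) > max_len:
                    break
                if len(phrase) >= min_len:
                    counter[phrase] += 1
    # Step 5: keep only frequent phrases as the Chinese candidates.
    return {p for p, c in counter.items() if c >= min_freq}
```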
2.2.2 Results
In total, we extract 301,076 candidate phrases. Note that phrases containing special characters are excluded at the corpus level, because there are a great number of such phrases in the whole corpus; as the number of documents we need to process is small, we extract phrases with special characters per document instead. Finally, we generate 239,405 training examples from the 1,022 documents in the training set and 233,580 validation examples from the 1,022 documents in the validation set. We analyze the relation between candidate phrases in training and validation examples, with the results shown in Table 1. Our candidate phrases cover more than 90% of the annotated keyphrases. The difference between row 1 and row 2 comes from noise in the data, as some annotated keyphrases do not appear in the corresponding document. The examples "not in candidate phrases" in rows 2 and 3 are phrases with special characters.
Table 1. Candidate phrases.

| | In candidate phrases | Not in candidate phrases |
| --- | --- | --- |
| Annotated keyphrases | 4,711 | 399 |
| Positive training examples | 4,454 | 14 |
| Negative training examples | 229,986 | 4,951 |
| Validation examples | 233,580 | 4,983 |
2.3 Feature Engineering
2.3.1 Statistical Features
Similar to Berend and Farkas' work [17], we divide the statistical features into four categories: phrase-level features, document-level features, corpus-level features and external knowledge-based features.
Phrase-level features: We generate 31-dimensional features at the phrase level in terms of phrase length and format: whether the phrase is Chinese or English, whether it contains special characters and where they are located, whether it is a noun or contains a noun, how many Chinese characters, letters or special characters it has, whether it is a subset of other candidate phrases (if phrase A is a subset of phrase B, we call B a super-phrase of A), etc.
Document-level features: These are calculated from a candidate phrase and the document it belongs to. The intuitive ones are computed directly between a phrase and the document, such as phrase frequency, whether the phrase appears in the title, and location features. Features of the document itself and of the sentence in which a candidate phrase appears are also included. Another important type of document-level feature is the relation between a candidate phrase and results generated from the document: for example, we extract the top 10 keywords, top 10 keyphrases and top 3 key sentences of a document using TextRank, and check whether a candidate phrase is included in these results. In total, we calculate 27-dimensional document-level features.
Corpus-level features: We extract 4-dimensional features: the phrase frequency in the whole corpus, the TF-IDF value, the IDF value, and the number of titles in which a candidate phrase appears.
External knowledge-based features: We use two external knowledge sources, the Wikipedia entry data and the Baidu Encyclopedia entry data. Specifically, we extract 4-dimensional features: whether a candidate phrase is a Wikipedia entry or is a Baidu Encyclopedia entry, whether a candidate phrase has a super-phrase that is a Wikipedia entry or a Baidu Encyclopedia entry.
2.3.2 Semantic Features
Semantic features refer to the word embedding features. In this task, the semantic features we extract include the embedding vectors of each candidate phrase, statistical features of the embedding vector and some distance features between a candidate phrase and the document it belongs to. The main procedures are summarized as follows.
Generate the embedding vector for every word in the corpus. There are two frequently used open source tools, word2vec and GloVe; GloVe focuses on global co-occurrence information, while word2vec focuses on local context information. The dimension is also an important hyper-parameter in word embedding, and we set it to 128 and 256. In total, we generate four types of embedding vectors (two tools × two dimensions) for each word in the corpus.
Generate the embedding vectors for phrases and documents. We represent the embedding vector of a candidate phrase or a document simply by summing the embedding vectors of the words it is composed of. Note that the embedding vectors for candidate phrases and documents can be normalized.
Calculate the distance features and the distribution features. Since a document's title often conveys its main idea, we use the embedding vector of the title instead of the whole document to calculate the distance features of a candidate phrase, i.e., the distance between the title and the candidate phrase. We generate eight types of distance features: the cosine similarity, the Cityblock distance, the Jaccard coefficient, the Canberra distance, the Euclidean distance, the Minkowski distance, the Bray-Curtis distance and the word mover's distance [24]. The distribution features reflect the distribution of the values in the embedding vectors; we calculate the skewness and the kurtosis as the distribution features.
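The following is a minimal sketch of these vector-distance and distribution features, assuming summed word vectors as input; the Jaccard coefficient and word mover's distance are omitted here since they require set representations and the underlying word-embedding model, respectively, and the Minkowski order p is an assumption.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import skew, kurtosis

def semantic_features(phrase_vec, title_vec):
    """Distances between a phrase vector and the title vector, plus
    distribution statistics of the phrase vector itself."""
    return np.array([
        distance.cosine(phrase_vec, title_vec),
        distance.cityblock(phrase_vec, title_vec),
        distance.canberra(phrase_vec, title_vec),
        distance.euclidean(phrase_vec, title_vec),
        distance.minkowski(phrase_vec, title_vec, p=3),  # order p assumed
        distance.braycurtis(phrase_vec, title_vec),
        skew(phrase_vec),
        kurtosis(phrase_vec),
    ])

phrase_vec, title_vec = np.random.rand(128), np.random.rand(128)  # 128-d vectors
print(semantic_features(phrase_vec, title_vec))
```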
2.4 Experimental Results
We finally extract 66-dimensional statistical features and 400-dimensional semantic features, and compare their effectiveness using XGBoost. The parameters of the model are: eta = 0.03, max_depth = 8, subsample = 0.8, colsample_bytree = 1.0, alpha = 0, lambda = 1 and gamma = 0 (a sketch of this setup is given below). The performances on the validation set for different types of features are illustrated in Table 2.
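The following is a minimal sketch of this XGBoost setup with the parameters listed above; the feature matrices are random placeholders and the number of boosting rounds is an assumption.

```python
import numpy as np
import xgboost as xgb

# Placeholders: 466 = 66 statistical + 400 semantic feature dimensions.
X_train, y_train = np.random.rand(1000, 466), np.random.randint(0, 2, 1000)
X_valid = np.random.rand(200, 466)

params = {
    'objective': 'binary:logistic',   # keyphrase vs. non-keyphrase
    'eta': 0.03, 'max_depth': 8, 'subsample': 0.8,
    'colsample_bytree': 1.0, 'alpha': 0, 'lambda': 1, 'gamma': 0,
}
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=500)          # round count assumed
scores = booster.predict(xgb.DMatrix(X_valid))    # rank candidates per document
```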
Table 2. Performance of different types of features.

| Features | Score |
| --- | --- |
| Semantic features | 0.533 |
| Statistical features | 0.631 |
| Both | 0.676 |
Table 2 shows that the statistical features perform better than the semantic features. When combining the two types of features, the performance is significantly enhanced, which indicates that the semantic features carry information complementary to the statistical features. We therefore use the combined features as our model input, and our score is the best on both the validation set and the test set.
3. USER TAGGING
It is challenging to label users' interests in our data set due to two main reasons: (1) users may have several interests, and (2) the labeled users are insufficient. To tackle these challenges, we propose a two-phase model to assign labels to users.
3.1 Related Work
Boutell et al. [25] proposed binary relevance, which independently trains one binary classifier for each label. For an unseen instance x, the labels assigned to x are those whose predicted probabilities exceed a threshold. However, the correlations between labels are not considered in this model. Read et al. [26] leveraged the correlations and proposed a classifier chains model, whose basic idea is to form a chain of binary classifiers in which each classifier's prediction becomes a feature for the next. Unfortunately, when the cardinality of the label space is large, this chain method is not scalable. Another heuristic approach is to learn a neural network with softmax loss and assign the labels with the top r highest probabilities to users. However, when the training set is small, such a model with high-dimensional features is likely to overfit.
3.2 Problem Formulation
Let Dtrain = {(U1, L1), (U2, L2), …, (UN, LN)} denote the training set containing |Dtrain| = N users, and Dtest = {(U1, L1), (U2, L2), …, (UM, LM)} denote the test set containing |Dtest| = M users, where Ui = (Ui1, Ui2, …, Uid) denotes the d features of user Ui, and Li = (Li1, Li2, …, Lir) denotes the r labels of user Ui. Let LS = {ls1, ls2, …, lsk} denote the label space containing k labels; for each (Ui, Li), Li ⊆ LS. Table 3 summarizes the notations used in this paper. For simplicity, we denote both Dtrain and Dtest as D. Given this, we define the input and output of our problem as follows:
Input: a set of N users in Dtrain, each with its corresponding r labels.
Output: r labels from LS assigned to each of the M users in Dtest.
Table 3. Notations.

| Symbol | Description |
| --- | --- |
| Dtrain | The training set |
| Dtest | The test set |
| N, M | The numbers of users in the training set and test set |
| Ui = (Ui1, Ui2, …, Uid) | The d features of user Ui |
| Li = (Li1, Li2, …, Lir) | The r labels that Ui has |
| LS = {ls1, ls2, …, lsk} | The label space that contains k labels |
3.3 Two-phase Model
In Section 3.2, we formulate the problem of labeling users as a multi-label classification problem, where our goal is to tag each user with labels from given label space. Compared with the traditional multi-label classification problem, where the number of labels assigned to each sample is not necessarily the same, our task requires us to assign the same number of labels to each user. In this section, we take into account both insufficient samples and the correlation among labels.
Figure 2 illustrates the framework of our two-phase model, where the first phase consists of independent binary classifiers (blue rectangle) and the second phase of a neural network (green rectangle). In the first phase, we formulate each label as a binary classification problem, obtaining k classifiers after training. In the second phase, we train a neural network whose inputs are the probabilities predicted in the first phase and whose loss function is the softmax loss. Finally, the r labels with the highest probabilities are chosen as a user's interests.
3.3.1 Independent Binary Classifications
We first transform the training set and test set as follows: for each lsi in label space LS and each (Uj, Lj) in D, (Uj, 1) is a training sample for label lsi if lsi ∈ Lj, and (Uj, 0) otherwise. Denoting by Ďi the set of training samples for label lsi, we have

$$\check{D}_i = \{(U_j, 1) \mid ls_i \in L_j\} \cup \{(U_j, 0) \mid ls_i \notin L_j\}, \quad (U_j, L_j) \in D.$$
Then we train k binary classifiers on the Ďi independently. In addition, we leverage a bagging strategy when learning from Ďi to reduce the generalization error.
Bagging strategy We randomly divide Ďi into c folds; c−1 folds are used for training and the remaining fold for validation. Rotating the validation fold, we obtain c classifiers for Ďi.
After training, k * c classifiers are learned, where k is the number of labels in the label space and c is the number of classifiers per label. When predicting unlabeled users in Dtest, each user's features are fed to the c classifiers of every label, and the mean of the c predicted probabilities is used as the input to the second phase.
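A minimal sketch of this first phase follows, assuming a feature matrix U, a binary label matrix L, and logistic regression as a stand-in base learner (the paper does not specify the base classifier).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def train_first_phase(U, L, c=5):
    """U: (n, d) features; L: (n, k) binary label matrix.
    Returns c bagged classifiers for each of the k labels."""
    classifiers = []
    for i in range(L.shape[1]):
        y = L[:, i]                              # one-vs-rest target for ls_i
        folds = KFold(n_splits=c, shuffle=True, random_state=0)
        classifiers.append([
            LogisticRegression(max_iter=1000).fit(U[tr], y[tr])
            for tr, _ in folds.split(U)          # train on c-1 folds each time
        ])
    return classifiers

def predict_first_phase(classifiers, U_test):
    """Mean probability over the c fold models, per label: the (m, k)
    matrix fed into the second-phase softmax network."""
    return np.column_stack([
        np.mean([m.predict_proba(U_test)[:, 1] for m in models], axis=0)
        for models in classifiers
    ])
```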
3.3.2 Softmax Classification
The labels corresponding to the highest r probabilities predicted by the independent binary classifiers could be assigned to users heuristically. However, this has two main weaknesses. First, the probabilities of different labels are not comparable, since they are not predicted by the same model; a label with a higher probability is therefore not necessarily the user's label. Second, the relevance between labels is not considered, which is a strong prior that the first phase does not leverage.
Here we introduce the model of the second phase, a neural network with softmax loss that resolves the above weaknesses. Figure 3 illustrates the three-layer architecture of the network. The input layer contains k units holding the k probabilities predicted by the independent binary classifiers, and the output layer contains k units. In the hidden layer, we leverage dropout to avoid overfitting. The softmax loss is defined as

$$loss = -\sum_{i=1}^{N} \sum_{j=1}^{k} L_{ij} \log P_{ij},$$

where Pij is the probability that user i has label lsj as predicted by the network, Li is the ground truth of user i, and

$$P_{ij} = \frac{\exp(z_{ij})}{\sum_{t=1}^{k} \exp(z_{it})},$$

with zij the pre-softmax output of the last layer for label lsj.
Similarly, we adopt the bagging strategy to reduce the loss in generalization performance. When predicting unlabeled users, the labels corresponding to the highest probabilities become the users' labels.
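A minimal sketch of the second-phase network is given below, assuming TensorFlow/Keras; the hidden-layer size and dropout rate are assumptions, as the paper does not report them.

```python
import tensorflow as tf

def build_second_phase(k, hidden=64, dropout_rate=0.5):
    """Three-layer network: k phase-one probabilities in, softmax over
    the k labels out; the top r outputs become the user's tags."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation='relu', input_shape=(k,)),
        tf.keras.layers.Dropout(dropout_rate),   # regularizes the hidden layer
        tf.keras.layers.Dense(k, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model
```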
3.4 Experiments Results
3.4.1 Data Collection
In this task, given the blog content, the users' behaviors, and the structural and chatting relationships between users in CSDN, we need to tag each user with three labels from a label space of 42 labels, e.g., machine learning, Web development, etc. Table 4 lists the statistics of the data collection.
Table 4. The statistics of data sets.

| Item | Volume |
| --- | --- |
| #blogs | 1,000,000 |
| #record of posting | 1,000,000 |
| #record of browsing | 3,536,444 |
| #record of commenting | 182,273 |
| #record of voting up | 95,668 |
| #record of voting down | 9,326 |
| #record of favorite | 104,723 |
| #structure relationship | 667,037 |
| #chatting relationship | 46,572 |
| #users in training set | 1,055 |
| #users in test set | 1,055 |

Note: "Voting up" means users click a "like" button, while "voting down" means users click a "dislike" button.
3.4.2 Feature Definition
It turns out that content features are the most helpful, the basic idea being that users with similar interests tend to post (browse, comment on, add to favorites, vote down and vote up) similar blogs.
Doc2Vec [27]: Doc2vec is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, embedding documents into distributed vectors. We believe that users associated with similar blogs tend to have similar interests, so a user's features are defined by the representations of the blogs involved in each of the six types of behaviors (browse, post, comment, add favorites, vote down and vote up). In our experiments, each blog is represented by a 200-dimensional vector, so each user is represented by a 1,200-dimensional vector (200 dimensions × six types of behaviors).
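A minimal sketch of these content features follows, assuming gensim's Doc2Vec and per-behavior averaging of blog vectors (the exact aggregation is not specified in the paper).

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

BEHAVIORS = ['browse', 'post', 'comment', 'favorite', 'vote_down', 'vote_up']

def train_blog_vectors(blogs):
    """blogs: {blog_id: token list}. Returns a model with 200-d blog vectors."""
    corpus = [TaggedDocument(words, [bid]) for bid, words in blogs.items()]
    return Doc2Vec(corpus, vector_size=200, min_count=2, epochs=20)

def user_vector(model, user_logs):
    """user_logs: {behavior: [blog_id, ...]}. Concatenates one averaged
    blog vector per behavior type into a 6 x 200 = 1,200-d user vector."""
    parts = []
    for b in BEHAVIORS:
        vecs = [model.dv[i] for i in user_logs.get(b, []) if i in model.dv]
        parts.append(np.mean(vecs, axis=0) if vecs else np.zeros(200))
    return np.concatenate(parts)
```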
3.4.3 Evaluation Metrics and Baselines
To evaluate our proposed model, we consider the following performance metric:

$$score = \frac{1}{M} \sum_{i=1}^{M} \frac{|L_i \cap L_i^*|}{3},$$

where M is the cardinality of the data set, Li is the set of predicted labels and L*i is the ground truth. For each user, the score is 1, 2/3 or 1/3 if 3, 2 or 1 label(s) are predicted correctly. The higher the score, the better the performance.
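This metric translates directly into code; the sketch below assumes each user's labels are given as a set of three labels.

```python
def tagging_score(predicted, truth):
    """predicted, truth: lists of 3-label sets, one pair per user."""
    return sum(len(p & t) / 3 for p, t in zip(predicted, truth)) / len(truth)

# Example: one user fully correct, one user with a single correct label.
print(tagging_score([{'ml', 'dl', 'web'}, {'db', 'os', 'net'}],
                    [{'ml', 'dl', 'web'}, {'db', 'ml', 'dl'}]))  # (1 + 1/3) / 2
```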
We compare the following methods for user labeling:
Top r labels (TL): This method tags each user with the r most popular labels in the training set. In our data set, r = 3.
Independent binary classifications (IBC): This is the first phase in our model, which tags the unlabeled users with the labels corresponding to the highest r probabilities predicted by independent binary classifications.
Softmax classification (SC): This is the second phase in our model, where the inputs are the features defined in Section 3.4.2 instead of probabilities, and the neural network with softmax loss is leveraged. Finally, the top r labels are assigned to each user.
3.4.4 Performance Analysis
We perform the experiments on the data set described in Section 3.4.1, and the results are based on the test set with 1,055 users.
Table 5 shows that the proposed method achieves better performance than the other methods. Since IBC does not consider the correlations between labels, the probabilities predicted by the independent classifiers are not comparable, while these correlations are captured in the second phase of our model. SC has far more parameters than the number of training samples, causing it to overfit the training data.
4. USER GROWTH VALUE PREDICTION
In this task, we need to predict the growth value of users for the next year based on their posts and behaviors in the past year. The growth value of each user has been normalized to [0, 1], where 0 represents the loss of the user.
4.1 Architecture
The task of user growth value prediction can be naturally formalized as a typical regression task in machine learning: given features xi containing the statistics of behaviors and posts of user i, predict his/her future growth value yi. The whole architecture for this task is shown in Figure 4, consisting of a feature layer, primary models and a model ensemble.
4.2 Features
4.2.1 Statistics of User Behaviors
Frequency of behavior: We count the number of behaviors triggered by a user during a certain period, i.e., the number of posts the user published, viewed, commented on, collected, voted up and voted down, the number of private messages the user received, and the number of private messages the user sent to others. These features reflect how active a user is during a certain period, e.g., one quarter, and have also been used in customer lifetime value prediction [28].
Change of behavior frequency: We calculate the change of behavior frequency. In particular, we subtract the frequency of the user's posts during the last two quarters from the frequency during the most recent quarter.
The last active time: We find the last time of a certain behavior of a user, such as writing posts, commenting on posts, viewing posts and so on. These features also reflect the activities of the user.
The proportion of a certain type of behavior: We calculate the proportion of a certain type of behavior over all behaviors during a certain period, i.e., one quarter. These features characterize the preference of a user, e.g., prefer writing articles to browsing others' articles.
Number of retweeted posts: We notice that the posts created by a user and the posts retweeted from others have quite different influences, and thus should be treated differently when studying user activities and post quality. We distinguish original posts from retweeted posts based on the MinHash algorithm [29], under the assumption that among similar posts, the oldest is the original and the others are retweets. Specifically, we calculate a 64-byte hash value for each post. Comparing all pairs of posts would be costly, so we use the first 16 bytes of the hash value as a key and only compare hash values when the keys of two posts are equal (a sketch of this heuristic is given at the end of this subsection). We count the number of original posts published by the user during one quarter as a feature, which reflects the quality and creativity of the user to a certain extent.
User points in the past year: We calculate a user's points over the past year according to the CSDN user point rules published online. This feature also reflects the quality of the user.
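Below is a minimal sketch of the deduplication heuristic mentioned above, with a simple MinHash signature built from per-token hashes; the signature construction is an assumption, as the paper only specifies the 64-byte signature and the 16-byte bucketing key.

```python
import hashlib

def signature(tokens, num_perm=64):
    """One min-hash byte per seeded hash function -> 64-byte signature."""
    sig = bytearray()
    for seed in range(num_perm):
        sig.append(min(hashlib.md5(f'{seed}:{t}'.encode()).digest()[0]
                       for t in tokens))
    return bytes(sig)

def bucket_posts(posts):
    """posts: {post_id: token list}. Buckets posts by the 16-byte key so
    full signatures are compared only within a bucket; within a bucket,
    matching signatures mark near-duplicates, and the oldest post (by
    posting time) would be kept as the original."""
    buckets = {}
    for pid, tokens in posts.items():
        sig = signature(tokens)
        buckets.setdefault(sig[:16], []).append((pid, sig))
    return buckets
```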
4.2.2 Statistics of Documents
Length of documents: We calculate the average length and variance of documents written by a user. These features reflect the characteristics of documents written by the user.
Feedback received by documents: We count the feedback a user received for his or her posts, such as browses, comments, votes up and votes down, during a certain period, e.g., one quarter. These features reflect the quality of the documents written by the user.
4.2.3 Representation of User-document
We also want to extract useful information from the user-document interaction matrix. We construct six types of interaction matrices, i.e., the user-document browse, comment, vote-up, vote-down and collection matrices and the user-document all-behaviors matrix. Each element of a matrix is the frequency of a user's corresponding behavior during the past year. We decompose these matrices with non-negative matrix factorization (NMF) [30] and use the user representations as features, which reflect the similarity of user behaviors in a transformed space.
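A minimal sketch of this factorization with scikit-learn follows; the matrix is a random placeholder and the rank of 50 is an assumption.

```python
import numpy as np
from sklearn.decomposition import NMF

# Placeholder user-document frequency matrix (non-negative counts).
interactions = np.random.poisson(0.1, size=(1000, 5000)).astype(float)

nmf = NMF(n_components=50, init='nndsvd', max_iter=300)
W = nmf.fit_transform(interactions)   # (users x 50): per-user features
H = nmf.components_                   # (50 x documents)
```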
4.2.4 Representation of User-user
Matrix factorization Similar to representation of user-document, we construct two types of user-user matrices, i.e., matrix of following relations and matrix of sending private messages. We also use NMF to decompose these two matrices (i.e., each row corresponds to a user vector) and use the user vectors as features.
PageRank We construct a user-user interaction network to reflect the influence between users: if user A browses an article written by user B, there is a directed edge from A to B. We run the PageRank algorithm [14] on this network to calculate the importance of each user.
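A minimal sketch with networkx, using a toy set of browse events; the damping factor of 0.85 is PageRank's common default, assumed here.

```python
import networkx as nx

g = nx.DiGraph()
# Toy browse events: each edge points from the reader to the author.
g.add_edges_from([('A', 'B'), ('C', 'B'), ('B', 'A'), ('C', 'A')])

influence = nx.pagerank(g, alpha=0.85)   # {user: importance score}
print(influence)
```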
4.3 Primary Model
We adopt four primary models with significant differences, i.e., Multi-layer Perceptron (MLP), Support Vector Regression (SVR), XGBoost and Random Forest, to predict user growth value. For each primary model, we employ a sampling strategy to preserve the model's generality (Figure 5). Specifically, after sampling k features K times and sampling n examples N times, we obtain N * K different training data sets for each primary model, resulting in N * K different trained models. When performing the final prediction, we aggregate the N * K results by taking their median. Next, we give a brief description of each primary model and its performance, with the evaluation score defined in Equation (6), where vi is the true user growth value, vi* is the predicted user growth value and N is the number of users to be predicted.
4.3.1 Multi-layer Perceptron (MLP)
We implement this model with TensorFlow [31], which makes it easy to set the optimized loss to match the evaluation score. Since we only have 1,000 samples, which can easily cause overfitting, our MLP model only contains the input layer and the output layer (a sketch of this shallow model is given below). This primary model achieves a prediction score of 0.736. To ensure the effectiveness of the extracted features, we calculate the variance of each feature weight among the N * K trained models and filter out the features with higher weight variance (average weight/variance > 30%), assuming that such features are not truly effective. We finally preserve 41 features in this primary model, and the final prediction score reaches 0.752 (Table 6), a significant improvement over the model without feature filtering.
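A minimal sketch of this shallow network is shown below; with only input and output layers it reduces to a logistic-linear model. The data are placeholders, and MAE is used as a stand-in loss since the exact evaluation score is the one given in Equation (6).

```python
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 41).astype('float32')  # the 41 preserved features
y = np.random.rand(1000, 1).astype('float32')   # growth values in [0, 1]

model = tf.keras.Sequential([
    # No hidden layer: the input feeds the single output unit directly.
    tf.keras.layers.Dense(1, activation='sigmoid', input_shape=(41,)),
])
model.compile(optimizer='adam', loss='mae')      # stand-in for Equation (6)
model.fit(X, y, epochs=50, batch_size=64, verbose=0)
```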
Table 6. Performance of models.

| Model | Score |
| --- | --- |
| Random Forest | 0.737 |
| XGBoost | 0.750 |
| MLP | 0.752 |
| SVR | 0.754 |
| Semi-supervised regression-MLP | 0.753 |
| Semi-supervised regression-SVR | 0.755 |
| Stacking | 0.764 |
4.3.2 Support Vector Regression (SVR)
Since the evaluation score does not have a second-order derivative, the next three primary models cannot optimize the evaluation score directly. We therefore adopt an approximated loss function, the ε-insensitive loss of SVR [32]:

$$loss = \sum_{i} \max(0, |v_i - v_i^*| - \epsilon),$$

where vi is the true user growth value and vi* is the predicted user growth value; SVR counts the loss only when

$$|v_i - v_i^*| > \epsilon,$$

where ε is a hyper-parameter that needs to be manually adjusted. We use grid search to find the best parameters for this primary model, resulting in a prediction score of 0.754 (Table 6), similar to that of the MLP model.
4.3.3 XGBoost
XGBoost [23] is an optimized variant of Gradient Boosting Decision Tree that is much faster than the traditional implementation. We also use grid search to find the best hyper-parameters, reaching a score of 0.750.
4.3.4 Random Forest
The Random Forest regression model is composed of many decision trees and uses the average of the trees' outputs as the final result [33]. Similarly, we use grid search to find the best hyper-parameters and reach a prediction score of 0.737. This primary model's performance is less competitive, possibly because the model's complexity makes parameter tuning difficult.
4.4 Model Ensemble
Since we only have 1,000 labeled samples, learning a reliable model is a great challenge. We propose two ways to overcome it: using semi-supervised regression to extend the set of labeled samples, and adopting the stacking algorithm to ensure the generality of our model.
4.4.1 Semi-Supervised Regression
Following the idea of co-training for regression problems [34, 35], we choose the two best models, i.e., MLP and SVR, to extend the set of labeled samples. The proposed semi-supervised regression algorithm is described in Algorithm 1. When running it, we set the maximum number of iterations T to 500 and the maximum number of samples K that the cache pool can hold to 200. However, the semi-supervised regression runs quite slowly, adding on average 10 labeled samples per day. Because of time constraints, we only extended the data set to 1,066 labeled samples. After extending the data set, the performance of both MLP and SVR improved slightly, with each evaluation score increasing by 0.001 (Table 6).
Algorithm 1. Semi-supervised regression algorithm.
Input:
MLP_label: samples with labels for MLP; MLP_unlabel: samples without labels for MLP
SVR_label: samples with labels for SVR; SVR_unlabel: samples without labels for SVR
T: maximum number of iterations; K: maximum number of samples the cache pool can hold
Process:
Repeat for T iterations:
    # train MLP with MLP_label
    MLP = Train_model(MLP_label)
    # predict the labels of SVR_unlabel with MLP
    SVR_unlabel_pre = Predict(MLP, SVR_unlabel)
    # select K samples from SVR_unlabel_pre as the cache pool for SVR
    SVR_unlabel_cache = Sampling(SVR_unlabel_pre, K)
    # train SVR with SVR_label
    SVR = Train_model(SVR_label)
    # predict the labels of MLP_unlabel with SVR
    MLP_unlabel_pre = Predict(SVR, MLP_unlabel)
    # select K samples from MLP_unlabel_pre as the cache pool for MLP
    MLP_unlabel_cache = Sampling(MLP_unlabel_pre, K)
    # pick the cached sample that yields the lowest retraining loss for MLP
    for sample_i in MLP_unlabel_cache:
        new_MLP_i = Train_model(MLP_label + sample_i)
        Loss_i = Model_loss(new_MLP_i, MLP_label)
    selected_MLP = argmin_i(Loss_i)
    MLP_label = MLP_label + selected_MLP
    # pick the cached sample that yields the lowest retraining loss for SVR
    for sample_i in SVR_unlabel_cache:
        new_SVR_i = Train_model(SVR_label + sample_i)
        Loss_i = Model_loss(new_SVR_i, SVR_label)
    selected_SVR = argmin_i(Loss_i)
    SVR_label = SVR_label + selected_SVR
4.4.2 Stacking
Stacking [36] is a type of ensemble learning that combines primary models to generate more robust results. We adopt the four single models mentioned above, i.e., MLP, SVR, Random Forest and XGBoost, together with the Semi-Supervised Regression-MLP and Semi-Supervised Regression-SVR models trained on the extended training set, as our primary models, and train another SVR that takes these models' outputs as features. The final stacked model achieves the best performance, with a prediction score of 0.764.
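A minimal sketch of this kind of ensemble with scikit-learn's StackingRegressor is shown below; the data are placeholders, XGBoost is omitted to keep the dependencies minimal, and the semi-supervised variants would enter as additional base estimators in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

X = np.random.rand(1000, 60)   # placeholder features
y = np.random.rand(1000)       # placeholder growth values in [0, 1]

stack = StackingRegressor(
    estimators=[
        ('mlp', MLPRegressor(max_iter=1000)),
        ('svr', SVR(epsilon=0.05)),
        ('rf', RandomForestRegressor(n_estimators=200)),
    ],
    final_estimator=SVR(),     # meta-model trained on base predictions
    cv=5,
)
stack.fit(X, y)
preds = stack.predict(X[:5])
```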
5. CONCLUSIONS AND FUTURE WORK
This paper describes our work for the User Profiling Technology Evaluation Campaign of SMP Cup 2017. In the keyphrase extraction task, we treat keyphrase extraction as a classification problem and use an XGBoost model to predict the top three keyphrases, obtaining the best score among all participants. In the user tagging task, two challenges are tackled by our two-phase model. The first phase, consisting of independent binary classifiers, encodes the high-dimensional features into low-dimensional representations in a supervised way; because the labeled data are insufficient, this gives our model fewer parameters than a traditional softmax-classifier model and avoids overfitting to some extent. The second phase, a neural network with softmax loss, assigns the most likely labels to users; it models the correlations between labels implicitly and makes the probabilities predicted by the independent binary classifiers comparable. In the user growth value prediction task, we overcome the lack of training data by introducing a semi-supervised regression model and the stacking technique. Experiments demonstrate the effectiveness of our framework for user growth value prediction, with a score of 0.764.
In the future, we will further improve the performance of user profiling. For keyphrase extraction, more state-of-the-art models of document representation, e.g., LSTMs and hierarchical attention models, can be used to extract more features. For user tagging, we can further exploit the structures between classes of users explicitly, e.g., the taxonomy of labels, to improve the performance of multi-label classification. For user growth value prediction, methods based on point processes can be employed to capture the characteristics of user growth value over time.
AUTHOR CONTRIBUTIONS
All of the authors contributed equally to the work. H. Shen (shenhuawei@ict.ac.cn, corresponding author) is the coordinator of the research group, and designs the whole framework for user profiling. G. Xing (xingguoliang@ict.ac.cn) and X. Yue (yuexinyu@ict.ac.cn) are the co-contributors for the task of keyphrase extraction. H. Gao (gaohao@ict.ac.cn) and B. Xu (xubingbing@ict.ac.cn) are the co-contributors for the task of user tagging. Q. Cao (caoqi@ict.ac.cn) and K. Cen (cenketing@ict.ac.cn) are the co-contributors for the task of user growth value prediction. All authors contributed to writing and revising this manuscript.
ACKNOWLEDGEMENTS
The work is supported by the National Natural Science Foundation of China (NSFC) under grant numbers 61472400, 91746301 and 61802371. H. Shen is also funded by K.C. Wong Education Foundation and the Youth Innovation Promotion Association of the Chinese Academy of Sciences.