AOL4PS: A Large-scale Data Set for Personalized Search

Personalized search is a promising way to improve the quality of Websearch, and it has attracted much attention from both academic and industrial communities. Much of the current related research is based on commercial search engine data, which can not be released publicly for such reasons as privacy protection and information security. This leads to a serious lack of accessible public data sets in this field. The few publicly available data sets have not become widely used in academia because of the complexity of the processing process required to study personalized search methods. The lack of data sets together with the difficulties of data processing has brought obstacles to fair comparison and evaluation of personalized search models. In this paper, we constructed a large-scale data set AOL4PS to evaluate personalized search methods, collected and processed from AOL query logs. We present the complete and detailed data processing and construction process. Specifically, to address the challenges of processing time and storage space demands brought by massive data volumes, we optimized the process of data set construction and proposed an improved BM25 algorithm. Experiments are performed on AOL4PS with some classic and state-of-the-art personalized search methods, and the experiment results demonstrate that AOL4PS can measure the effect of personalized search models.


INTRODUCTION
The search engine is one of the primary ways that people obtain useful information from the Internet. When given a query, search engine ranks documents according to the matching degree between the query and the document. Generally, the search engine without personalization always returns the same results for the same query from different users and ignores their varied hidden interests. However, the fact is that for the same query, the real intentions of different users are often different. This is especially obvious when the query is ambiguous. For example, for the query "Giant", some users want to find the information about the Giant Bike, while some other users want to access the content related to a film named Giant, and still some other users want to learn about the English word giant. That is to say, non-personalized search engines cannot distinguish such queries.
Personalized search, designed to return a personalized list of documents for users, is one of the approaches to the query ambiguity problem as described above. With the popularity of the Internet and big data, personalized search has become an important technology in search engines.
Although personalized search is receiving increasing attention and interest from researchers, accessing data sets for personalized search is not easy for many of them. Moreover, despite a large volume of data sets available in the field of information retrieval, most of them are not suitable for the study of personalized search. The reasons for this include: (1) the data sets from commercial search engines are not public; (2) the data sets lack such necessary information as that of long-term click behaviors of users, unified identifier for users or raw text of queries and documents; (3) the data sets have not been processed in a uniform way.
The first case in point is LETOR [1], a benchmark data collection for information retrieval, which lacks the time information of user historical behaviors. The second is SogouQ [2], query logs for short, from Sogou search engine, which lacks unified identifiers for users. Some available public data sets, e.g., Yandex and SEARCH17 [3], have no raw text of queries or documents. For another famous data collection, AOL query logs [4,5], there are no publicly available personalized search data sets.
In this paper, based on the work of [6], we proposed a complete and detailed data set construction process to construct a data set named AOL4PS in the field of personalized search from AOL query logs. When generating candidate documents for queries, we proposed an improved BM25 algorithm to improve computing efficiency and reduce storage space in the process. In addition, the statistics of AOL4PS are provided and the experiments are conducted on the AOL4PS to test the validity.
The remainder of the paper is structured as follows. Section 1 introduces the meaning of personalized search and the lack of data sets in personalized search. Section 2 reviews the current status of data sets in personalized search. Section 3 describes the complete process of data set construction, and the content and statistics of the proposed data set named AOL4PS. Section 4 analyzes the distribution of AOL4PS and describes how experiments are conducted to test its validity. Finally, we conclude our work in Section 5.
https://www.kaggle.com/c/yandex-personalized-web-search-challenge Data sets without raw text. In 2013, Yandex released a large-scale anonymized data set for "Yandex Personalized Web Search Challenge". Similar with Yandex, Nguyen et al. [3] exposed a data set named SEARCH17 in 2019. Yandex data set consists of information on anonymized user identifiers, queries, query terms, URLs, URL domains, and clicks. Although Yandex provides a large-scale data set, its anonymous identifier processing of queries and URLs prevents researchers from accessing the raw text. Because of the lack of raw text of queries and documents, machine learning models which are not based on text information are widely used on the Yandex data set [15,16]. However, the Yandex data set is not suitable in the era of deep learning, where text rich in semantic information is significantly needed to enhance the performance of various tasks in natural language processing through representation learning.
AOL query logs. In 2006, American Online (AOL) released the AOL query logs which are suitable for information retrieval, query recommendation, and personalized search. One of the most important advantages of AOL query logs is that they contain the original corpus of queries and documents. Compared with Yandex and SEARCH17 both without raw text, the AOL query logs are more suitable for deep learning based-methods. Carman et al. [17,18] improved topic models to handle the personalized search task. They did not re-rank the list of candidate documents but generated a new ranked list for users. Tyler et al. [19] proposed a re-ranking method which utilizes re-finding behaviors recognition for personalized information retrieval. Each query from the AOL query logs is resubmitted to the AOL search engine and the returned documents are crawled to generate candidate documents. Ahmad et al. [6] used the AOL query logs for document ranking and query suggestion. To address the problem that AOL query logs do not contain candidate documents from the search engine, they crawled the titles of documents and generated candidate documents using the BM25 algorithm.
In previous studies, the processing of AOL query logs is not sufficiently clear for the construction of personalized search data sets. The existing personalized search data sets based on AOL query logs in different scales are almost publicly unavailable [4,5]. It is because there is no complete and clear data set

AOL4PS: A Large-scale Data Set for Personalized Search
construction process and no publicly available personalized search data set, and there is a gap to fill in the research field of personalized search. Therefore, a unified approach to the construction of personalized search data set based on AOL query logs is necessary and significant. This paper presents the processing of AOL query logs to construct a personalized search data set which can be available to the public and test its validity on several classic and state-of-the-art personalized search models.

Content of AOL Query Logs
In 2006, AOL Search released a collection of user query logs that include a large number of queries from 657,426 users over a three-month period [20]. The original content of AOL query logs is shown in Table 1.
In particular, AnonID is an anonymous user ID for privacy protection. For example, the following is an entry taken from the AOL query logs: "142 westchester.gov 2006-03-20 03:55:57 1 http://www. westchestergov.com", which denotes the user with ID 142 submitted a query of "westchester.gov" on 2006-03-20 and clicked on http://www. westchestergov.com, which was the result ranked #1. The query issued by the user, case shifted with most punctuations removed The time at which the query was submitted for search ClickURL If the user clicked on a search result, the rank of the items on which he or she clicked is listed. If the user clicked on a search result, the domain portion of the URL in the clicked result is listed.
The statistical results of AOL query logs are shown in Table 2.

Data Construction Method
AOL query logs only contain the records of user clicked documents for the queries, and do not have the records of the candidate documents returned by the search engine. As a result, we generated the list of candidate documents for each query and annotated the documents that users are satisfied with. Since the AOL query logs are very large in scale and thus consume much time and space in the process of data set

AOL4PS: A Large-scale Data Set for Personalized Search
construction, in the case of limited hardware resources, it is very unlikely to meet the time requirements of computation through traditional data processing. Therefore, we proposed an improved BM25 algorithm to solve this bottleneck problem of time consumption in data processing.

Data Preprocessing
Data preprocessing is to remove the wrong data and redundant data. Although the logs are structured, the log file is not a database, and the logs were not checked for completeness when they were generated, so there is no guarantee that the logs generated by the server is correct and complete. In addition, data redundancy is caused when log files are merged. A commercial search engine usually has many servers to handle a large number of query requests and the same record may appear twice or more times in the final merged query logs.

Data Crawling
We crawled the text of documents since the AOL query logs contain only the URLs of documents but not the text content. The purposes of crawling the text content are two-folded: first, in the process of data set construction, the relevance scores are calculated based on the matching degree of the text between queries and documents; second, in personalized search models, text content of both queries and documents is an important kind of feature.
Following [6], we crawled the title of Web pages that correspond to documents. For example, the corresponding title of "www.orlandosentinel.com" is "Orlando news, weather, sports, business | Orlando Sentinel". Crawling titles of the documents has two benefits: first, titles are more similar to queries than body texts [21]; second, using titles instead of body texts can significantly reduce the storage capacity.
The titles crawled from Web are divided into the following four categories by their text content ( Table 3). The first category includes the titles that contain valid information; the second category contains those with no crawled text, being NAN or the white space character for example; the third category is composed of the titles with only invalid information, such as "404 Not Found", "403 Forbidden", and "502 Bad Gateway"; the fourth category comprises the titles in non-English characters such as Chinese, Japanese, and Russian.

AOL4PS: A Large-scale Data Set for Personalized Search
Since the AOL query logs were released in 2006, many of the URLs no longer exist or the content has been updated. About 35.72% of the documents appear to have the same domain name with the title, e.g., "http://www.clanboyd.info" has the same domain name with the title "clanboyd.info". Considering that the textual content is missing in many of the documents, we take the domain names of the documents as their textual content.

Data Cleaning
The goal of data cleaning is to preserve the English and numeric content of texts (here texts include queries and documents). The text cleaning methods we use include normalization, tokenization, word segmentation, lemmatization and stemming.
Data Cleaning Process. The flow of data cleaning is divided into two stages: in the first stage, normalization, tokenization, and word segmentation are performed in sequence; in the second stage, according to the needs of different tasks, lemmatization or stemming is performed. For example, when calculating the similarity between queries and documents, stemming is performed to minimize the number of words and make it easier to match queries with documents. When calculating the representation of queries and documents, lemmatization is performed to reduce the memory and retain as much semantic information as possible.
Data Cleaning Effect. We use word coverage rate to evaluate the effect of data cleaning. We define the word coverage rate as Equation (1): where N represents the number of words in queries and documents of AOL query logs, and n represents the number of words in GloVe6b , an open source data set that includes 400,000 words in total. For the words extracted by stemming and lemmatization, the word coverage rate is 72.32% and 93.52%, respectively. Calculating the word coverage rate can be helpful to judge the effect of data cleaning. The rates over 70% prove that the data cleaning is effective. Stemming can reduce the number of words, but tends to generate tokens instead of words and might lose the original meaning of words. Lemmatization retains more different forms of words and thus has richer semantic information.

Document Similarity Calculation
Introduction to BM25. BM25 [22] is a classic information retrieval algorithm which is widely used by many mature search engines, such as Lucene and Elasticsearch . BM25 is proposed based on probabilistic retrieval model and used to evaluate the similarity between queries and documents with the relevance Data Intelligence

AOL4PS: A Large-scale Data Set for Personalized Search
scores, which are obtained through calculation by measuring the matching degree of the words between queries and documents. We generated a series of candidate documents for queries by ranking the relevance scores. BM25 scores are calculated through Equation (2): where Q stands for the query to be retrieved, the words of query Q are {q 1 , q 2 , …}, n represents the number of words in query Q, d is a document that search engine returned, f i (q i , d) represents the frequency of word q i in d, dl is the length of d, and avgdl is the average length of all documents. The function of parameters k and b is to adjust the effect of document length on the similarity between queries and documents, and k = 2 and b = 0.75. IDF(q i ) stands for the inverse document frequency of q i , calculated through Equation (3): where N is the number of documents in AOL query logs and n(q i ) is the number of documents containing the retrieval word q i . Improved BM25 Algorithm. BM25 is an efficient and stable algorithm in information retrieval.
Following [6], we used BM25 to compute the relevance scores between queries and documents, and then we generated the candidate documents for all queries after ranking the relevance scores. Although many open source tools of natural language processing have provided interface of BM25, its calculation time does not meet the requirements for large-scale data. As shown in Table 2, AOL query logs have millions of queries and documents, and the magnitude of the calculation of the relevance scores is at the level of billions. In order to solve the problem of too long calculation time, we proposed an improved BM25 algorithm using matrix to improve the efficiency of computation. The improved BM25 algorithm accelerate the original algorithm by at least 6 times, where the relevance scores are calculated through Equation (4): represents the length of all documents. avgdl is the average length of all documents, and ql, wl and dl represent the length of query, length of word and length of document, respectively.
Block Matrix Multiplication. The improved BM25 algorithm we proposed above is a matrix implementation of the original BM25 algorithm. Although the use of matrices for computing can save time, the storage of matrices is a critical problem demanding solution. As shown in Table 2, the number of queries exceeds 3 million, the number of documents exceeds 1 million and the number of the words is about

AOL4PS: A Large-scale Data Set for Personalized Search
80,000. Such large matrix requires hundreds of GB memory to store. The average length of queries is about 4-word-long and the length of documents is about 7-word-long, which indicate that the above matrices are very sparse and sparse matrices can be used to reduce the storage space. Only the matrices X T , IDF and F are sparse matrices, but the relevance scores that are obtained by the improved BM25 algorithm is a dense matrix. The problem of insufficient memory still occurs in calculation. As a result, we used block matrix multiplication to reduce the memory usage. In addition, we sorted the relevance scores for extracting related documents, which was also optimized via matrix multiplication.

Extract Related Documents.
After calculating the relevance scores by the improved BM25 algorithm, we extracted related documents from all documents for queries. Following [23], we treated the top 1,000 documents as relevant documents and the remaining documents as irrelevant. Under conditions when only relevant document of a query is clicked by users, the query is categorized as valid; otherwise the query is removed as invalid. This is mainly because we want to focus on the queries which have related documents. In this way, we deleted the queries that do not have related documents and generated related documents for the queries we saved. We put the clicked documents in the center of the window and selected the documents within the window as candidate documents. In order to reduce the data storage and speed up the training process, we set the window size to 10.

SAT-click.
A SAT-click is simply defined as a click whose dwelling time on the document extends over a predefined time threshold or the last click on a search session [24,25,26]. Usually the time threshold is set as 30s. The SAT-clicks of users on documents are used as an indicator of user true interests.
Session Partition. Session refers to a series of queries issued by the same user over a short period of time. It is believed that the queries submitted by the user within a period of time and query intention within a session are generally related. Following [27], we first used 30 minute as the time interval for session partition. We next used the cosine similarity of successive query to partition sessions. The queries are represented by the TF-IDF weighted sum of the embedding vectors of words in each query. Following [6], the threshold of cosine similarity for session partition is set as 0.5.

Data Division
We divided AOL query logs into four categories: historical data, training data, validation data and test data. The historical data are used to create the user profile. The training data, validation data and test data are used to train, validate and test the personalized search models. Based on [28], we divided the historical and other data according to the query time. For the data of 12 weeks, we chose the data of the first 9 weeks as historical data. The data in the last 3 weeks are divided into training data, validation data and test data with a ratio of 4:1:1. In addition, we filtered out the users who did not click in the first 9 weeks or who had less than 6 records in the last 3 weeks.

Contents and Statistics of AOL4PS
The contents of the processed data set named AOL4PS are shown in Table 4. The Statistics of AOL4PS are shown in Tables 5 and 6. Table 4. Field and description of AOL4PS.

Field Description
AnonID An anonymous user ID number Query The query issued by the user QueryTime The time at which the query was submitted ClickPos The position in which the user clicks the candidate documents ClickDoc The document clicked by the user CandidateList A list of candidate documents returned by the search engine for the query SessionNo The session number of a user search action  Table 7 shows the comparison of AOL4PS with existing data sets Yandex and SEARCH17. The advantage of AOL4PS is that it has the original text information, such as the content of queries and the URL of documents, while there are no original text of queries and documents in Yandex and SEARCH17. In personalized search, a wider range of query time and more query records can help personalized search models to learn more accurate user interest features. At the same time, the moderate number of users and query records make the consumption of hardware resources and time resources in the process of model training more controllable. AOL4PS has advantages in personalized search due to its rich original content, a large number of users, long query time of users, and a large number of user query records.

Usage of AOL4PS
As the current query moves backward in user query sequence, all queries that appear before the current query are regarded as the historical queries of the current query. Therefore, for a user query sequence, multiple samples can be constructed. As shown in Figure 1, it is assumed that a user has 15 query records, and the number of historical queries is 9, the number of training queries is 4, and the number of validation queries and test queries is 1. Therefore, 4 training samples, 1 validation sample and 1 test sample can be constructed. In each sample, as the current query moves back in the user query sequence, the number of historical queries used to generate the user interest increases. In AOL4PS, each user query sequence can satisfy the construction of training, validation and test samples.

Distribution of Queries
In AOL4PS, more than 60% of the distinct queries are issued only once in the 3-month period and about 87% of the distinct queries are single-user queries. The statistics are similar with those given in [9] where the commercial query logs are used. About 46% of the queries in the test set appear in the training set. Furthermore, 35.75% of the repeated queries in the test set are repeatedly submitted by the same user.

Distribution of Query Click Entropy
In this paper, we utilized click entropy of query [29] to measure the ambiguity of query and the degree which the query needs to be optimized in personalization, as given in Equation (5): where q represents the query issued by a user, d represents the document returned from search engine, and ClickEntroy(q) is the click entropy of query q. P(d) is the collection of documents clicked on query q. P(q|d) is the percentage of the clicks on document d among all the clicks on query q, as given in Equation (6):

Clicks q d P p d
Clicks q where ( ) , , Clicks q d ⋅ represents the number of clicks on document d of query q and ( ) , , Clicks q ⋅ ⋅ represents the total number of clicks on query q.
As shown in Figure 2, we calculated click entropy of the queries that are asked by more than one person. The higher click entropy indicates that the more queries need to be optimized by personalized search in AOL4PS. It can be seen that the click entropy of more than 30% of the queries is larger than 0. Therefore, it is obvious that AOL4PS requires a personalized search optimization.  Figure 3 shows the distribution of the number of queries per session. More than 25% of the sessions contain at least two queries. This indicates that users sometimes submit several queries to fulfill an information need. This observation is consistent with that given in [9].

Distribution of Users
Figure 4(a) shows the distribution of historical days and Figure 4(b) shows the query times for users in search history. We found that the history of 68% of the users in test set over 25 days and about 68% of the users submit more than 50 queries during the historical period of users. Figure 5(a) shows the distribution of user query times in the 3-month period. More than 94% of the users sent at least 50 queries. Figure 5(b) shows the distribution of session numbers of users in the 3-month period. More than 70% of the users have at least 50 sessions. This indicates that AOL4PS which is built based on the long-term historical behaviors can provide sufficient data to learn the user interest features.

Evaluation Measures
We measured the personalized search accuracy by mean reciprocal rank, P@K and average click position.

MRR
Mean reciprocal rank (MRR) is the average of reciprocal ranks of all SAT-clicks, which is defined as Equation (7): where rank i is the rank of the first relevant document in the ranking list, and N is the number of documents.

P@K
The second evaluation metric used in this paper is P@K, which measures the accuracy of the top K documents for a given query. As we all know, users are usually interested in the first few documents in the list of candidate documents. The computation formula is as Equation (8): where n represents that there are n satisfied documents whose rank in candidate documents are in the first K, N is the number of documents, and in this paper we set K as 1.

Avg.Click
We further use the average click position (Avg.Click) [30] of the SAT-click to evaluate the effect of re-ranking. Lower Avg.Click represents better personalized search ranking, which is defined as Equation (9): where ClickPosition i is the position of SAT-click i in the ranking list, and N is the number of documents.

Comparison Methods and Experiment Settings
Personalized search is a typical application of learning to rank in information retrieval. Learning to rank methods can be generally classified into three categories, which are pointwise methods, pairwise methods, and listwise methods [31]. These methods transform the ranking task into regression task, classification task and so on. In AOL4PS, we labeled whether the user clicked on a document and the rank of the clicked document in the list of candidate documents. In personalized search, the ground truth is the relative ranks of two documents. In our experiments, we employed five classic or state-of-the-art personalized search models to test the effectiveness of AOL4PS. The following five methods are statistics-based methods (BM25, P-Click), listwise method (SLTB), pairwise methods (HRNN, GRADP), respectively.

BM25.
As we mentioned before, BM25 proposed by Robertson et al. [22] is a document ranking algorithm without personalization, which determines the relevance between queries and documents by the matching degree of text. The original ranking of all candidate documents in AOL4PS is derived from BM25. We set adjustment factor k as 1.5, b as 0.75. P-Click. Dou et al. [9] proposed the P-Click, which re-ranks the documents based on the number of clicks which a user makes under the same query. P-Click is bias to the repeated queries. We set the smoothing parameter b in the score calculation function as 0.5.

SLTB.
Bennett et al. [11] analyzed user interests by extracting more than 100 features from the short-term and long-term historical behaviors. All features are used to train a LambdaMart [32] model to generate a personalized ranking list. SLTB is one of the best machine learning-based models in personalized search. The following parameters are used to implement the model: number of leaves = 70, minimum instances in a leaf node = 2000, learning rate = 0.3, and number of trees = 50.

HRNN.
Ge et al. [28] used the hierarchical recursive neural network with query-aware attention to build the short-term and long-term user profile according to the current query dynamically. Then, documents are re-ranked based on the user profile and other relevant features. The following parameters are used to implement the model: word embedding size = 200, size of short-term interest vector = 200, size of long-term interest vector = 600, number of hidden units in attention MLP = 512, and learning rates = 1e -3 .

AOL4PS: A Large-scale Data Set for Personalized Search
GRADP. Zhou et al. [33] used the recurrent neural network and attention mechanism to capture the dynamicity and randomness of user interests. To build effective user profile, they extracted several query features including click entropy, topic entropy and so on. The following parameters are used to implement the model: word embedding size = 200, hidden size of GRU = 600, number of hidden units in attention MLP = 512, n in nCS = 2, n in nRS = 3 and learning rates = 1e -3 .

Experiment Results and Analysis
We split the whole AOL4PS into four sub data sets according to the number of query records of users and each sub data set contains over 3,000 users. Besides, to reduce the storage and time in experiments, we chose only 5 candidate documents from the whole 10 candidate documents. We evaluated the performances of different models on the four sub data sets of AOL4PS and the results are shown in Table 8.  Table 8, we observed that all personalized models (P-Click, SLTB, HRNN, and GRADP) perform well on the four sub data sets. Specifically, all personalized models significantly outperform non-personalized model (BM25) on MRR, P@1 and Avg.Click, showing the effect of personalization. SLTB outperforms P-Click, which means that the complex machine learning models are promising. P-Click is designed with only one feature, but SLTB is designed with more than 100 features. Machine learning models rely heavily on the quality and quantity of artificial designed features.
HRNN and GRADP outperform traditional machine learning models including P-Click and SLTB especially on data3 and data4, which are similar to how these models perform on another personalized search data set [28]. HRNN and GRADP are similar in structure, both of them are based on recurrent neural network and attention mechanism. HRNN is proposed to model the impact of short-and long-term historical behaviors on the current query. GRADP is proposed to solve the problem of dynamicity and randomness of queries. HRNN and GRADP can learn accurate user interest features from long-term historical behaviors, which shows the superiority of deep learning model in personalized search. Deep learning models rely on the modeling of user query sequences, while traditional machine learning models are heavily influenced by artificial designed features. From the experimental results, it can be seen that deep learning models have more advantages in the case of long query records, and have become the mainstream models of personalized search. Figure 6 shows the MRR statistics of different models on AOL4PS. The results of the BM25 algorithm are consistent on the four sub data sets of AOL4PS. SLTB steadily improves the original ranking but does not show an incremental trend on the four sub data sets. The results of P-Click, HRNN and GRADP on the four sub data sets are incremental, because the total number of user queries increases sequentially across the four sub data sets, and these models can capture user interests accurately based on more historical data. Therefore, for personalized search methods based on deep learning, more data help to obtain more accurate information of user interests. AOL4PS has a large number of users and query records, which can effectively test the effect of personalized search models. In summary, AOL4PS is a large data set for personalized search. First, the data set has more than 10,000 users and more than 1 million query records. Second, the history of user is long enough which can support personalized search models to learn accurate user interests. In conclusion, AOL4PS can effectively test the effectiveness of personalized search models.

CONCLUSION AND FUTURE WORK
This paper presents a solution to the lack of public data sets in the research field of personalized search. We proposed a complete and detailed data processing process based on AOL query logs. We constructed a large-scale data set with high quality, AOL4PS, for the personalized search. AOL4PS contains the data of large amount of users with long-term historical behaviors. In addition, we examined the performance of several typical personalized search models on AOL4PS, demonstrating the applicability and superiority of AOL4PS for personalized search task.
However, there are still many areas for improvement in our future work. First, when crawling document content, many documents were not crawled because their content had been updated or were not available, so we removed such documents from AOL4 PS. In the future, we can get document content through historical website information stored at the Internet Archive or from existing information retrieval data sets. Second, we constructed a list of candidate documents for each user click. In a real retrieval scenario, a