Abstract
Since it is often impractical to ask agents to provide linear orders over all alternatives, preference completion is needed for such partial rankings. Specifically, the personalized preference of each agent over all alternatives can be estimated from the partial rankings that neighboring agents provide over subsets of alternatives. However, because agents' rankings are nondeterministic and may contain noise, it is necessary and important to conduct certainty-based preference completion. Hence, in this paper, for alternative pairs with an observed ranking set, a bijection is first built from the ranking space to the preference space, and the certainty and conflict of alternative pairs are evaluated with a well-built statistical measurement, the Probability-Certainty Density Function, on subjective probability. Then, a certainty-based voting algorithm based on the certainty and conflict is applied to conduct the certainty-based preference completion. Moreover, the properties of the proposed certainty and conflict are studied empirically, and the proposed approach for certainty-based preference completion with partial rankings is experimentally validated against state-of-the-art approaches on several datasets.
1. INTRODUCTION
In a preference completion problem, with a set of agents (users) and a set of alternatives (items), each agent (user) has his/her partial ranking over a subset of alternatives (items), and the goal is to infer each agent's personalized ranking or preference over all the alternatives, including those the agent has not yet handled. Clearly, it is often impractical to ask agents to provide linear orders over all alternatives, especially in big data environments [1]. For example, an agent may not know the status of some alternatives because there are too many of them, which makes it hard to rank all of them; or some alternatives may be incomparable for a certain agent. All these situations result in partial rankings, and it is therefore necessary to introduce preference completion.
The preference completion problem has been applied in many areas, such as social choice and recommender systems [2], and can be very useful in community detection [3, 4] or graph anomaly detection [5]. For example, in social choice, each voter (agent) can cast a ballot as a ranking over all candidates (alternatives), or as a partial ranking over some candidates. For such partial rankings, it is necessary to form a ranking over all candidates by a certain voting rule. In a recommender system, each user can rate some items, and the task of the system is to predict the ratings of the items that have not been rated by that user. To satisfy this requirement, two common approaches, the matrix factorization approach and the neighborhood-based approach, are introduced to handle preference completion. Traditional algorithms in both approaches are usually rating-oriented, while a recent line of work focuses on ranking-oriented algorithms [6, 7] due to the drawbacks of the rating-oriented ones. In this paper, we focus on the ranking-oriented neighborhood-based approach.
Traditionally, neighborhood-based preference completion first finds the near neighbors of each agent and then aggregates these neighbors' rankings to produce the predicted preference by a certain voting rule [6]. However, this task has some inevitable issues. For example, an agent may exhibit irrational behaviors or provide rankings in a noisy setting. To address this issue, many rating-oriented trust-based approaches have been proposed that rely on additional contextual information, while the ranking-oriented approach still leaves much room for further research. Liu et al. [8] proposed an anchor-based algorithm that leverages many other agents' ranking information to reduce the effect of randomness.
In this paper, a certainty-based preference completion algorithm is proposed on the basis of Liu's work [8]. More precisely, after finding the k-nearest neighbors with the anchor-kNN algorithm that Liu proposed, we use the certainty-based voting algorithm introduced in this paper to complete the preference (ranking) instead of the traditional majority voting rule. The traditional majority voting rule tends to cause wrong judgments, especially when both sides have close votes; in this case, even slight randomness can change the outcome of majority voting. For this reason, this paper introduces a certainty-based voting algorithm to deal with this problem. Importantly, when we take a vote on two alternatives, we should introduce the certainty, which measures the degree to which the two alternatives can be preferred or are comparable. Only when the certainty value satisfies a defined threshold do we go further and make a three-way preference decision, instead of simply assigning 0 or 1 to the two alternatives. Hence, the certainty-based voting algorithm avoids wrong judgments when both sides have close scores or when rankings are made in a noisy setting. Before formulating the certainty and presenting the certainty-based preference completion algorithm, we first consider the certainty and the preference space in order to introduce the three-way preference between two alternatives.
Technically, in a ranking pool gathered from agents, the rankings that include both alternatives A and B can be aggregated to form the preference between A and B. Mathematically, a bijection can be built from the ranking space to the preference space for the alternative pair A and B. Here, the ranking space consists of all the partial rankings on A and B from the agents, while the preference space consists of the three-way preference between A and B, which includes
preference (prefer A to B),
dispreference (prefer B to A), and
uncertainty (no preference between A and B),
according to the trisecting and acting models of human cognitive behaviors [1, 9]. Thus, the following three situations are distinguished:
The agents prefer alternative A to alternative B, which can be confirmed by high preference, low dispreference, and low uncertainty.
The agents prefer alternative B to alternative A, which can be confirmed by low preference, high dispreference, and low uncertainty.
The agents are uncertain about the preference between the alternative pair A and B, i.e., A and B are unpreferred, which can be confirmed by low preference, low dispreference, and high uncertainty.
It is obvious that when the uncertainty is low, the preference between A and B can be determined, i.e., A and B are preferable. Hence, the certainty of preference can be introduced to describe the trustworthiness of the preference, and it can be calculated as certainty = 1 - uncertainty. The certainty of preference can be taken as the subjective probability of the preference, following the proposition that the certainty is the degree of belief that an individual has in the preference [10]. Hence, in this paper, the certainty is evaluated based on a well-built statistical measurement, which defines a bijection from the ranking space to the preference space, enabling the estimation of pairwise preferences from neighbors' partial rankings by mapping them to triples of
(preference, dispreference, uncertainty).
Our definition on certainty should capture the following key properties:
- Property 1: Certainty increases as the number of rankings over the alternative pair A and B increases, for a fixed ratio between rankings preferring A to B and rankings preferring B to A.
- Property 2: Certainty decreases as the extent of conflict increases in the partial rankings between alternative pair A and B.
Our main contributions in this paper can be summarized as follows:
As pointed out in [11], it is necessary and important to introduce the certainty and conflict of the preference between alternative pairs, and sometimes the certainty and conflict of the preference are more important than the preference itself. In this paper, probability-based measures of certainty and conflict satisfying Properties 1 & 2 are introduced to describe the trustworthiness of the preference.
A certainty-based voting algorithm using the certainty and conflict is proposed for conducting the certainty-based preference completion in nondeterministic settings.
We empirically study the properties of the proposed approach, and experimentally validate the proposed approach compared to the state-of-the-art approaches with several datasets.
This paper is organized as follows. Section 2 reviews existing work on the Plackett-Luce model, the Kendall-Tau distance, and the anchor-kNN algorithm. In Section 3, a bijection is built from the ranking space to the preference space, and the certainty and conflict of alternative pairs are evaluated based on a well-built statistical measurement. In Section 4, a certainty-based voting algorithm is applied to conduct the preference completion with the certainty and conflict. Section 5 empirically studies the properties of the proposed certainty and conflict. Section 6 experimentally validates the proposed approach against state-of-the-art approaches on several datasets. Finally, Section 7 summarizes this paper and presents future work.
2. BACKGROUND
2.1 Plackett-Luce Model
Given a set of m alternatives and a set of n agents, let y = (y1, y2, …, ym) denote the latent features of the alternatives and x = (x1, x2, …, xn) denote the latent features of the agents. Agent i's ranking Ri is determined by a statistical model for ranking data. Hence, as a widely-used statistical model, the Plackett-Luce model [12, 13] is adopted to generate the rankings of agents. In this paper, each alternative is assigned a positive value named utility. The greater this utility is, the more likely its corresponding alternative is ranked at a higher position [14]. In [14], the realized utility of every alternative j for agent i is determined by
ui,j = θ(xi, yj) + εi,j, where θ(xi, yj) is agent i's expected utility on alternative j, determined by the closeness of the latent features xi and yj and measured by θ(xi, yj) = exp(-||xi - yj||²), and εi,j is a zero-mean independent random variable that follows a Gumbel distribution. When the realized utilities ui = (ui,1, ui,2, …, ui,m) of agent i are obtained, agent i ranks the alternatives in decreasing order of the realized utilities. After repeating this n times, synthetic datasets for all the agents can be generated for the experiments. For more details, please refer to Algorithm 1.
Algorithm 1: Sampling from the Plackett-Luce model.
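As an illustration of this sampling process, the following minimal sketch perturbs the expected utilities θ(xi, yj) = exp(-||xi - yj||²) with Gumbel noise and sorts the alternatives in decreasing order of realized utility; the latent dimension, the population sizes, and the random seed are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sample_plackett_luce(n_agents=100, n_alternatives=20, dim=2, seed=0):
    """Sample one ranking per agent from a Plackett-Luce model via Gumbel noise.

    Returns a list of rankings; each ranking lists alternative indices from
    the most preferred to the least preferred.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_agents, dim))        # latent features of agents
    y = rng.normal(size=(n_alternatives, dim))  # latent features of alternatives
    rankings = []
    for i in range(n_agents):
        # Expected utility theta(x_i, y_j) = exp(-||x_i - y_j||^2)
        theta = np.exp(-np.sum((x[i] - y) ** 2, axis=1))
        # Realized utility: expected utility plus (shifted, zero-mean) Gumbel noise;
        # the constant shift does not change the induced ordering.
        u = theta + rng.gumbel(size=n_alternatives) - np.euler_gamma
        # Rank the alternatives in decreasing order of realized utility
        rankings.append(list(np.argsort(-u)))
    return rankings
```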
2.2 Kendall-Tau Distance
Given two agents' rankings R1 and R2 over the same alternatives, the Kendall-Tau distance can be introduced to measure the similarity of R1 and R2; it is the total number of disagreements in pairwise comparisons between alternatives in the linear rankings. For alternative j in Ri, Ri(j) represents j's position in Ri. For example, if j is the top-ranked alternative in Ri, then Ri(j) = 1. The normalized Kendall-Tau distance between R1 and R2 is
NK(R1, R2) = (2 / (m(m - 1))) Σ_{j<k} I((R1(j) - R1(k)) · (R2(j) - R2(k)) < 0),
where I(v) is an indicator that is set to be 1 if the argument v is true; otherwise, it is set to be 0.
Moreover, if the two rankings do not share exactly the same set of alternatives, the intersection of the two alternative sets can be taken for computing the normalized Kendall-Tau distance.
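The following sketch illustrates the normalized Kendall-Tau distance as described above, restricted to the alternatives shared by the two rankings; the function and variable names are illustrative.

```python
from itertools import combinations

def normalized_kendall_tau(r1, r2):
    """Normalized Kendall-Tau distance between two (possibly partial) rankings.

    r1, r2: lists of alternatives ordered from the most to the least preferred.
    Only alternatives appearing in both rankings are compared.
    """
    shared = [a for a in r1 if a in set(r2)]
    if len(shared) < 2:
        return 0.0
    pos1 = {a: r1.index(a) for a in shared}   # position of each shared alternative in r1
    pos2 = {a: r2.index(a) for a in shared}   # position of each shared alternative in r2
    disagreements = sum(
        1 for a, b in combinations(shared, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    n_pairs = len(shared) * (len(shared) - 1) // 2
    return disagreements / n_pairs
```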
2.3 Anchor-kNN Algorithm
Before introducing the anchor-kNN algorithm proposed in [8], we first present the idea of KT-kNN, which simply uses the Kendall-Tau distance to find an agent's neighbors. If the Kendall-Tau distance between two rankings Ri and Rj is small, the latent features of the agents, xi and xj, should be close, i.e., the two agents have similar opinions on the alternatives.
Since the KT-kNN algorithm does not consider that agents' preferences may be nondeterministic or that agents' rankings may be made in a noisy setting, the anchor-kNN algorithm, unlike KT-kNN, uses other agents' (named anchors) ranking data to determine the closeness of two agents rather than considering only the two agents' own rankings. The anchor-kNN algorithm develops a feature Fi,j for agents i and j to represent the Kendall-Tau distance between Ri and Rj, i.e., Fi,j = NK(Ri, Rj). Then, as the closeness of two agents, denoted Di,j, the sum of the differences between Fi,t and Fj,t is used to find the k-nearest neighbors, where t ranges over all agents other than i and j.
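The sketch below illustrates one straightforward reading of this anchor-based closeness, reusing the normalized_kendall_tau function from the previous sketch; the exact form of the summed differences (absolute values here) is an assumption.

```python
def anchor_knn_neighbors(rankings, i, k):
    """Return the k agents closest to agent i under the anchor-based closeness D_{i,j}."""
    n = len(rankings)
    # F[a][t]: Kendall-Tau feature between agent a and anchor t
    F = [[normalized_kendall_tau(rankings[a], rankings[t]) for t in range(n)]
         for a in range(n)]

    def closeness(j):
        # Sum of feature differences over all anchors t other than i and j
        return sum(abs(F[i][t] - F[j][t]) for t in range(n) if t not in (i, j))

    others = [j for j in range(n) if j != i]
    return sorted(others, key=closeness)[:k]
```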
3. CERTAINTY AND PREFERENCE SPACE
In this section, let us present some preliminary definitions first. For an arbitrary alternative pair A and B, the certainty can be adopted to describe the trustworthiness of the preference between A and B. Technically, following [15], a Probability-Certainty Density Function (PCDF) can be introduced to capture the subjective probability of the ranking. However, unlike [15] and following [16] and [17], in this paper the certainty is defined based on the PCDF so as to satisfy Properties 1 & 2.
3.1 Ranking Space
The ranking space consists of all the weighted partial rankings on the alternative pair A and B from agents, including
the rankings in which A is ranked ahead of B, each with a weight assigned to that ranking, where nAB denotes the accumulated weight of such rankings;
the rankings in which B is ranked ahead of A, each with a weight assigned to that ranking, where nBA denotes the accumulated weight of such rankings; and
the unordered rankings in which A and B are not comparable, each with a weight assigned to that ranking, where nN denotes the accumulated weight of such rankings.
Moreover, the weight of a ranking reflects the quality of that ranking. Without additional knowledge, each weight is assigned to be 1.
Definition 1. Ranking space
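For concreteness, the following sketch accumulates the weights nAB, nBA, and nN for one alternative pair from a pool of weighted partial rankings; treating a ranking that does not contain both A and B as unordered for this pair, and the default weight of 1, are illustrative assumptions.

```python
def accumulate_pair_weights(rankings, A, B, weights=None):
    """Accumulate the weights n_AB, n_BA and n_N for the alternative pair (A, B).

    rankings: list of partial rankings, each ordered from the most preferred.
    weights: optional list of ranking weights; without extra knowledge each weight is 1.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    n_ab = n_ba = n_n = 0.0
    for r, w in zip(rankings, weights):
        if A in r and B in r:
            if r.index(A) < r.index(B):
                n_ab += w       # A ranked ahead of B
            else:
                n_ba += w       # B ranked ahead of A
        else:
            n_n += w            # A and B not comparable in this ranking
    return n_ab, n_ba, n_n
```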
3.2 Preference Space
Traditionally, the uncertainty is usually ignored, and sometimes the dispreference is not taken into account either, which leads to some disturbing results, as shown in the empirical study section. According to the trisecting and acting models of human cognitive behaviors [9, 18], the preference space consists of the three-way preference between alternatives, which includes
preference (prefer A to B),
dispreference (prefer B to A), and
uncertainty (no preference between A and B).
Definition 2. Preference space
3.3 Certainty of Rankings in Alternative Pairs
Bayesian inference [19, 20] is adopted here to update the probability with the available contextual information about the rankings of alternative pairs, i.e., to update the prior distribution to the posterior distribution [21, 22]. In this paper, offline Bayesian inference is utilized; Bayesian inference can also be applied to online/streaming scenarios [23, 24].
Let xAB, xBA, and xN be the probabilities of the rankings in which A is ranked ahead of B, in which B is ranked ahead of A, and in which A and B are not comparable, respectively, where xAB, xBA, xN ∈ [0, 1]. In addition, xAB + xBA + xN = 1, and thus xN = 1 - xAB - xBA.
Without any additional information, the prior distribution f(X|O) is a uniform distribution. As the cumulative probability of a distribution within [0,1] equals 1, the density of a PCDF has the mean value 1 within [0,1], and this makes f(X|O) = 1.
Then, the certainty can be determined by the deviations of posterior distribution from the prior distribution, i.e., uniform distribution. Hence, we have the following definition about certainty.
Definition 3. The certainty of rankings can be estimated as
where the coefficient 1/2 removes the double counting of the deviations.
From this definition, the certainty lies within [0, 1].
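The following sketch illustrates a PCDF-based certainty in the spirit of Definition 3, under the assumption (not the paper's exact formula) that the posterior over xAB restricted to the comparable rankings is a Beta(nAB + 1, nBA + 1) density; the certainty is then half the L1 deviation of this posterior from the uniform prior, which satisfies Properties 1 and 2.

```python
import numpy as np
from scipy.stats import beta

def certainty(n_ab, n_ba, grid=10000):
    """Illustrative PCDF-based certainty for an alternative pair.

    Assumption (not the paper's exact Definition 3): the posterior density of
    x_AB over the comparable rankings is Beta(n_AB + 1, n_BA + 1); the certainty
    is half the L1 deviation of this posterior from the uniform prior f(x) = 1.
    """
    edges = np.linspace(0.0, 1.0, grid + 1)
    mids = (edges[:-1] + edges[1:]) / 2.0          # midpoint rule on [0, 1]
    posterior = beta.pdf(mids, n_ab + 1.0, n_ba + 1.0)
    return 0.5 * np.sum(np.abs(posterior - 1.0)) / grid
```

For example, certainty(0, 0) returns 0 (no evidence, uniform posterior), and the value grows toward 1 as consistent evidence accumulates.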
3.4 Conflict of Rankings in Alternative Pairs
The conflict can be determined by the relative difference between weighted rankings nAB and nBA, as in [17]. More specifically,
there is the largest conflict, when weighted rankings nAB = nBA;
there is the smallest conflict, when weighted rankings nAB = 0 or nBA = 0.
Hence, we have the following definition about conflict.
Definition 4. The conflict cAB of rankings can be estimated as
From this definition, we have cAB = cBA.
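As an illustration of a conflict measure consistent with the two requirements above (largest at nAB = nBA, smallest when either weight is 0, and symmetric), the sketch below uses the relative overlap 2·min(nAB, nBA)/(nAB + nBA); this particular formula is an assumption rather than the paper's exact Definition 4.

```python
def conflict(n_ab, n_ba):
    """Illustrative conflict measure for an alternative pair.

    Assumption (not the paper's exact Definition 4): the conflict is the
    relative overlap 2 * min(n_AB, n_BA) / (n_AB + n_BA), which is largest (1)
    when n_AB = n_BA, smallest (0) when n_AB = 0 or n_BA = 0, and symmetric,
    i.e., conflict(n_AB, n_BA) = conflict(n_BA, n_AB).
    """
    total = n_ab + n_ba
    if total == 0:
        return 0.0
    return 2.0 * min(n_ab, n_ba) / total
```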
3.5 Bijection from Ranking Space to Preference Space
With Definitions 1, 2, 3 and 4, the following definition can be introduced.
Definition 5. The bijection from ranking space to preference space can be estimated as
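The sketch below illustrates one possible mapping from the accumulated weights to the preference space that is consistent with the constraints stated earlier (uncertainty = 1 - certainty, and the three components summing to 1); the proportional split between preference and dispreference is an assumption, not the paper's exact Definition 5. It reuses the certainty sketch given after Definition 3.

```python
def ranking_to_preference(n_ab, n_ba):
    """Illustrative mapping from accumulated ranking weights to the preference space.

    Assumption (not the paper's exact Definition 5): uncertainty = 1 - certainty,
    and the remaining mass is split between preference and dispreference in
    proportion to n_AB and n_BA, so the three components sum to 1.
    """
    c = certainty(n_ab, n_ba)          # certainty sketch given after Definition 3
    u = 1.0 - c                        # uncertainty = 1 - certainty
    comparable = n_ab + n_ba
    if comparable == 0:
        return 0.0, 0.0, 1.0           # no comparable evidence: fully uncertain
    return c * n_ab / comparable, c * n_ba / comparable, u
```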
4. CERTAINTY-BASED PREFERENCE COMPLETION
This section proposes the certainty-based preference completion approach. The framework of our approach is shown in Figure 1. It includes two processes. One is to find the k-nearest neighbors of user i with the anchor-kNN algorithm proposed by Liu [8]. The other is to produce a linear ranking for user i over all alternatives. In this section, we focus on the latter: with the neighbors' partial rankings, a certainty-based voting algorithm is introduced to estimate the pairwise preference for every alternative pair, and these pairwise preferences are then combined into a linear ranking for user i.
Figure 1: Certainty-based preference completion process.
4.1 Certainty-based Voting Algorithm
First, let us introduce a definition.
Definition 6. With the preference space (preference, dispreference, uncertainty), the following conclusions can be obtained:
if the uncertainty exceeds the threshold ε1, alternatives A and B are unpreferred;
if the uncertainty does not exceed ε1,
- if the preference exceeds the dispreference by more than ε2, user i prefers A to B;
- if the dispreference exceeds the preference by more than ε2, user i prefers B to A;
- otherwise, A and B are unpreferred;
where ε1 and ε2 are thresholds to rule out the fuzziness of the comparison.
In the existing work, with the rankings of neighbors obtained by k-nearest neighbors algorithm, common voting rules①, such as majority voting, can be taken to estimate pairwise preference for conducting the preference completion.
In contrast, in this paper we use a certainty-based voting rule with certainty and conflict to obtain pairwise preferences. The certainty and conflict measure the trustworthiness that the pair of alternatives can be preferred or are comparable. If the certainty satisfies a defined threshold, we then evaluate the degrees to which user i prefers one alternative to the other, i.e., the preference and the dispreference. Then, only if the difference between these two one-way preferences reaches a threshold can we make a preference decision on the two alternatives. Technically, for an alternative pair A and B whose uncertainty is below ε1 and whose preference difference exceeds ε2, a preference decision between A and B can be made. The process of estimating the pairwise preference is also shown in Algorithm 2. We apply this algorithm to all alternative pairs to obtain all the pairwise preferences.
Algorithm 2: Certainty-based voting algorithm for estimating pairwise preference.
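A minimal sketch of the three-way decision rule of Definition 6 as used in Algorithm 2; the threshold values and the exact comparison operators are illustrative assumptions.

```python
def certainty_based_vote(pref, dispref, uncert, eps1=0.3, eps2=0.1):
    """Illustrative three-way decision for an alternative pair (A, B), following Definition 6.

    eps1 and eps2 are the thresholds of Definition 6; the values used here are
    arbitrary placeholders. Returns 'A', 'B', or 'unpreferred'.
    """
    if uncert >= eps1:          # certainty too low: make no preference decision
        return 'unpreferred'
    if pref - dispref > eps2:   # A is sufficiently preferred to B
        return 'A'
    if dispref - pref > eps2:   # B is sufficiently preferred to A
        return 'B'
    return 'unpreferred'        # votes too close: avoid a possibly wrong judgment
```

For a single pair, the decision could be obtained as certainty_based_vote(*ranking_to_preference(n_ab, n_ba)), combining the sketches above.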
4.2 Greedy Order Algorithm
Next, let us combine all the pairwise preferences to form a linear ranking over all alternatives. One possible approach is the greedy order algorithm [25]. This algorithm follows a greedy idea: it always picks the alternative that currently has the maximum potential value in the alternatives pool I and ranks it above all the remaining alternatives. Here, for alternative j, the potential value vj aggregates all the pairwise preferences involving j obtained in the previous subsection and represents how strongly j is preferred among all the neighbors' rankings. The algorithm then deletes the picked alternative from the pool and updates the potential values of the remaining alternatives by removing the effect of the picked one. This picking process is repeated until the pool is empty, and a linear ranking for user i is produced. See Algorithm 3.
Algorithm 3: Greedy order algorithm.
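A minimal sketch of the greedy order algorithm; the potential value used here (outgoing minus incoming pairwise preference over the remaining pool) follows the standard greedy-order formulation of [25] and is an assumption about the paper's exact definition of vj.

```python
def greedy_order(alternatives, pref):
    """Greedy order algorithm: form a linear ranking from pairwise preferences.

    pref[(a, b)]: estimated degree to which a is preferred to b (missing pairs count as 0).
    """
    remaining = set(alternatives)
    # Potential value of each alternative over the current pool
    v = {a: sum(pref.get((a, b), 0.0) - pref.get((b, a), 0.0)
                for b in remaining if b != a)
         for a in remaining}
    ranking = []
    while remaining:
        best = max(remaining, key=lambda a: v[a])   # pick the highest potential
        ranking.append(best)
        remaining.remove(best)
        for a in remaining:                          # remove the picked one's effect
            v[a] -= pref.get((a, best), 0.0) - pref.get((best, a), 0.0)
    return ranking
```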
5. EMPIRICAL STUDIES ON PROPERTIES OF CERTAINTY
In this section, we study the properties of certainty and conflict in our proposed model.
5.1 Increasing Rankings with Fixed Conflict
Figure 2 plots how the certainty varies with the weighted rankings nAB and nBA under a fixed conflict cAB.
Figure 2: Certainty increases with nAB + nBA when the conflict cAB is fixed.
This should confirm Property 1.
Theorem 1. For a fixed conflict cAB (and fixed nN), the certainty increases with nAB + nBA.
Proof: Let , and
Then we have
As in [17], x1, x2, x3, x4 can be defined, such that f(x1) = f(x2) = f(x3) = f(x4) = 1 and
where x1, x2, x3, and x4 are functions of β. Then
where
Following Lemma 9 in [17], we have
With Equation (13), we have
This confirms the results of Theorem 1.
5.2 Increasing Conflict with Fixed Rankings
Figure 3 plots how the certainty varies with the weighted rankings nAB and nBA under a fixed summation nAB + nBA (and fixed nN). This should confirm Property 2.
Figure 3: Certainty reaches its minimum at nAB = nBA when nAB + nBA is fixed.
Theorem 2. For a fixed nAB + nBA, the certainty is decreasing for nAB ≤ nBA and increasing for nAB ≥ nBA.
Proof: The details of the validation process are omitted here, as it is similar to the proof of Theorem 1. More specifically, after removing the absolute-value sign and differentiating, it can be proved that the derivative is negative for nAB ≤ nBA and positive for nAB ≥ nBA.
6. EXPERIMENTS
In this section, we examine the empirical performance of the certainty-based preference completion algorithm. In the experiments, we compare our certainty-based preference completion algorithm with the common majority voting algorithm [8] and the classic collaborative filtering algorithm (CF) [26]. Both our certainty-based preference completion algorithm and the majority voting algorithm use the anchor-kNN algorithm to find the k-nearest neighbors' rankings and utilize these rankings to conduct the preference completion of the target user. The collaborative filtering algorithm, in contrast, is a rating-oriented algorithm different from the other two: it computes user similarity to find a user's neighbors and uses their ratings to generate item predictions.
6.1 Datasets
The experiments adopt two types of datasets to evaluate the algorithms' performance.
One type of dataset is the synthetic one created by the sampler using the Plackett-Luce model with Algorithm 1. The produced synthetic dataset has over 20,000 rankings from agents over a set of 20 alternatives, generated with Gumbel-distributed noise on the realized utilities.
The other type of dataset is the Flixster dataset, which collects movie ratings from users together with social trust relations. It has over 8,000,000 ratings on over 2,000 movies. For the experiments, we convert the ratings to rankings and select over 9,000 rankings on over 50 movies.
6.2 Evaluation Metrics
We evaluate the performance with three metrics: (a) prediction error, (b) Spearman correlation coefficient, and (c) Kendall rank correlation coefficient. The first one measures the quality of the predicted ranking, and the others measure the degree of correlation of the predicted ranking with the original one. Please refer to Pearson [27] and Liu et al. [2] for more details.
Evaluation Metric 1: This evaluation metric estimates the accuracy of the predicted ranking against the original true one.
where M is the maximum of the pairwise error, Yi,j,k = 1 means that, in the predicted ranking, user i prefers alternative j to alternative k, Xi,j,k = 1 means that user i prefers alternative j to alternative k in the original ranking, and I-(v) equals 1 when v < 0, and equals 0 otherwise.
Evaluation Metric 2: The Spearman correlation coefficient measures the difference in position of every alternative between the predicted ranking and the original one to evaluate the similarity of the two rankings. The greater its value, the more precise our predicted ranking.
To simplify, we have
ρ = 1 - 6 Σ_i d_i² / (m(m² - 1)),
where di represents the difference in the position of alternative i between the predicted ranking and the original one.
Evaluation Metric 3: The Kendall rank correlation coefficient is very similar to evaluation Metric 2 above, except that it uses the Kendall distance to measure the correlation:
where the indicator symbol in Equation (20) has the same meaning as in evaluation Metric 1, Ix represents the alternative set of the original ranking, and Iy represents the alternative set of the predicted ranking.
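For reference, the sketch below computes the Spearman and Kendall rank correlation coefficients for two rankings over the same alternatives using the standard formulas, which are assumed to match evaluation Metrics 2 and 3.

```python
from itertools import combinations

def spearman_cc(pred, true):
    """Spearman correlation via the simplified formula 1 - 6*sum(d_i^2) / (m(m^2 - 1))."""
    m = len(true)
    pos_p = {a: i for i, a in enumerate(pred)}
    pos_t = {a: i for i, a in enumerate(true)}
    d2 = sum((pos_p[a] - pos_t[a]) ** 2 for a in true)
    return 1.0 - 6.0 * d2 / (m * (m * m - 1))

def kendall_cc(pred, true):
    """Kendall rank correlation: (concordant - discordant) / number of pairs."""
    pos_p = {a: i for i, a in enumerate(pred)}
    pos_t = {a: i for i, a in enumerate(true)}
    concordant = discordant = 0
    for a, b in combinations(true, 2):
        s = (pos_p[a] - pos_p[b]) * (pos_t[a] - pos_t[b])
        concordant += s > 0
        discordant += s < 0
    pairs = len(true) * (len(true) - 1) // 2
    return (concordant - discordant) / pairs
```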
6.3 Experimental Results on Synthetic Dataset and Flixster Dataset
In this section, we conduct experiments on the synthetic dataset and the Flixster dataset, and present the comparison results of the different approaches under each evaluation metric. The prediction error measures the difference in pairwise preference between the predicted ranking and the original ranking; the goal is to reduce the prediction error as far as possible. The Spearman correlation coefficient and the Kendall rank correlation coefficient measure the similarity between the predicted ranking and the original ranking; we expect the values of these two evaluation metrics to be as high as possible.
(a) Synthetic dataset
As shown in Figure 4, it is clear that the prediction error tends to be smaller with the certainty-based algorithm than with the CF algorithm or the majority voting algorithm. In addition, the two ranking-oriented approaches outperform the rating-oriented approach. For one thing, a ranking contains more preference-relation information over alternatives than a rating score, and thus it may be easier and more accurate to find a user's neighbors and complete the preference; as a result, the ranking-oriented approaches have a lower prediction error. For another, the comparison between the certainty-based voting algorithm and the majority voting algorithm shows the superiority of the certainty-based one: the preference completion algorithm that takes certainty into account does reduce the effect of randomness.
Figure 5(a) shows the performance on the Spearman correlation coefficient. On this evaluation metric, the certainty-based voting algorithm performs better than the other two algorithms. This is because our approach, which takes the preference space and the certainty into account, can filter out those pairwise preferences that have close votes and low certainty, which makes the predicted ranking much more trustworthy.
Figure 5(b) shows the performance on the Kendall rank correlation coefficient. We can draw a similar conclusion to that for the Spearman correlation coefficient in Figure 5(a), so the explanation is not repeated here.
Figure 4: Prediction error on the synthetic dataset. The x-axis denotes the number of neighbors; the plots show the prediction error. For this evaluation metric, smaller values are better.
Figure 5: Performance on the synthetic dataset. The x-axis denotes the number of neighbors; the plots show the Spearman correlation coefficient (Spearman CC) and the Kendall rank correlation coefficient (Kendall CC). For both evaluation metrics, higher values are better.
Roughly speaking, from the experiments on the synthetic dataset, we verify the effectiveness of our proposed certainty-based preference completion algorithm.
(b) Flixster dataset
The performance of the three approaches is examined on a real-world dataset, the Flixster dataset, which contains rating information. Because the proposed algorithm and the majority voting algorithm both use the anchor-kNN algorithm, which requires ranking data instead of rating data, we need to convert the rating data to ranking data first.
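One straightforward way to convert a user's ratings into a partial ranking, sketched below, is to sort the rated items by rating in decreasing order; the tie-breaking rule used here is an arbitrary illustrative choice, not necessarily the one used in the experiments.

```python
def ratings_to_ranking(ratings):
    """Convert one user's item ratings into a partial ranking.

    ratings: dict mapping item id -> rating score. Items with higher ratings
    are ranked higher; ties are broken by item id.
    """
    return [item for item, _ in sorted(ratings.items(), key=lambda kv: (-kv[1], kv[0]))]
```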
As shown in Figure 6, when the number of neighbors k > 300, our approach outperforms the other two, and the ranking-oriented methods still perform better than the rating-oriented method. When k < 300, however, the results are not as expected. A possible reason is that the process of converting rating data to ranking data inevitably introduces errors into the pairwise preferences. With more neighbors considered, our proposed algorithm shows its superiority; thus, the prediction error descends as the number of neighbors grows.
In Figure 7(a), as we can observe, the certainty-based approach significantly outperforms the other two approaches, which is consistent with the results of the experiments on the synthetic dataset.
Figure 7(b) shows a similar performance to Figure 7(a).
Figure 6: Prediction error on the Flixster dataset. The x-axis denotes the number of neighbors; the plots show the prediction error. For this evaluation metric, smaller values are better.
Figure 7: Performance on the Flixster dataset. The x-axis denotes the number of neighbors; the plots show the Spearman correlation coefficient (Spearman CC) and the Kendall rank correlation coefficient (Kendall CC). For both evaluation metrics, higher values are better.
In general, with the experiments on the synthetic dataset and Flixster dataset, we can come to a conclusion that the experiments validate our proposed certainty-based preference completion algorithm.
7. CONCLUSION AND FUTURE WORK
Because agents' rankings are nondeterministic and may be provided under noisy environments, it is necessary and important to conduct certainty-based preference completion. Hence, in this paper, a bijection is first built from the ranking space to the preference space for alternative pairs, and their certainty and conflict are evaluated based on a well-built statistical measurement, the Probability-Certainty Density Function. Then, a certainty-based voting algorithm based on the certainty and conflict is applied to conduct the preference completion. More specifically, with the proposed algorithm, rankings with high certainty and low conflict can be obtained to conduct the preference completion. Moreover, the properties of the proposed certainty and conflict have been studied empirically, and the proposed approach has been experimentally validated against state-of-the-art approaches on several datasets.
In real applications, the data are usually unbalanced [28], i.e., some alternative pairs have many rankings, while others have only a few. In our future work, we will propose algorithms to handle unbalanced preference completion both effectively and efficiently.
AUTHOR CONTRIBUTIONS
All authors, including L. Li ([email protected]), M.H. Xue ([email protected]), Z. Zhang ([email protected]), H.H. Chen ([email protected]), and X.D. Wu ([email protected]), took part in writing the paper. In addition, L. Li designed the algorithm and experiments and provided the funding; M.H. Xue designed and conducted the experiments and analyzed the data; Z. Zhang analyzed the data.
ACKNOWLEDGEMENTS
This work has been supported by the National Natural Science Foundation of China (No. 62076087, No. 61906059 & No. 62120106008) and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under grant IRT17R32.
The first author would like to thank his wife Jun Zhang, his parents and friends during his fight with lung adenocarcinoma. “I leave no trace of wings in the air, but I am glad I have had my flight.”
① Common voting rules may include positional scoring rules, maximin, and Bucklin. For more details, please refer to [21].