Abstract

The Bradley-Terry model is a statistical representation of preference or ranking data based on pairwise comparison results of items. For estimation of the model, several methods based on the sum of weighted Kullback-Leibler divergences have been proposed in various contexts. The purpose of this letter is to interpret the estimation mechanism of the Bradley-Terry model from the viewpoint of flatness, a fundamental notion used in information geometry. Based on this point of view, a new estimation method is proposed within the framework of the em algorithm. The proposed method differs in its objective function from conventional methods, especially in the treatment of unobserved comparisons, and it is consistently interpreted in a probability simplex. An estimation method with weight adaptation is also proposed from the viewpoint of sensitivity. Experimental results show that the proposed method works appropriately and that weight adaptation improves the accuracy of the estimates.

1.  Introduction

The Bradley-Terry (BT) model (Bradley & Terry, 1952) is a basic tool for representing user preference or ranking data from pairwise comparisons of many items. Let us consider, for example, rating a large number of wine brands. In general, it is difficult to establish a complete ranking of all the brands; however, for any given pair, we may choose a preferred one after comparatively tasting them. In such a situation, results are obtained as pairwise comparisons, as shown in Table 1. For instance, in a comparison of brands A and B, six tasters prefer brand A, and eight tasters prefer brand B. We can interpret this result stochastically: A is chosen 6 times in 14 comparisons. The BT model aims at estimating the overall ranking of items (such as wine brands) based on the pairwise preference data. Statistical ranking models related to the BT model are useful for studying the mechanism of human preference in the brain and have attracted the attention of researchers over many years (Plackett, 1975; Huang, Weng, & Lin, 2006; Hino, Fujimoto, & Murata, 2010). BT models have also been used for formulating classification problems (Hastie & Tibshirani, 1998; Zadrozny, 2002). Estimation methods for the BT model have been discussed in several contexts, most of them based on the sum of weighted Kullback-Leibler (KL) divergences (Hastie & Tibshirani, 1998; Huang et al., 2006). However, their objective functions are somewhat contrived, because they simultaneously focus on different domains of probability distributions.

Table 1:
Pairwise Comparison Results of Wine Brands.
Rows: preferred brand (A–E); columns: not preferred brand (A–E). Each entry gives the number of tasters who prefer the row brand over the column brand (e.g., A over B: 6; B over A: 8).

We first interpret an estimation mechanism of the BT model from the viewpoint of information geometry (Amari & Nagaoka, 2000). In our formulation, observations for the BT model form a kind of flat data manifold; flatness and data manifold are fundamental notions used in information geometry. With these notions, we reorganize the KL-based estimation method within a framework of the em algorithm (Amari, 1995), an information-geometrical alternative of the EM algorithm (McLachlan & Krishnan, 1996). Then an estimation process of the BT model is geometrically interpreted as a sequence of projections in a probability simplex. An estimation result based on our formulation is intuitive and tractable because it is derived consistently in a discussion of a single space of probability distributions given by a simplex. In the context of a classification problem, a similar attempt has been proposed by Ikeda (2006) to integrate binary classifiers. Our formulation is applicable not only to binary models but also to other multiclass models associated with similar flat data manifolds.

In addition, we discuss the selection of a set of weights for the divergence, which is very important for avoiding overfitting. Based on sensitivity, an important concept in robust statistics (Hampel, Ronchetti, Rousseeuw, & Stahel, 1986), we introduce an estimation method with weight adaptation. In general, the divergence used for estimation is deeply related to the behavior of the estimator under low-quality observations, for example, an extremely small data set or a data set with outliers (Beran, 1977; Basu, Harris, Hjort, & Jones, 1998; Murata, Takenouchi, Kanamori, & Eguchi, 2004; Fujimoto & Murata, 2007). Thus, another interpretation of weight adaptation is given from the perspective of the η-divergence (Takenouchi & Eguchi, 2004; Takenouchi, Eguchi, Murata, & Kanamori, 2008), a type of Bregman divergence (Bregman, 1967) that is known to be robust against small sample sets and outliers. We experimentally show that our estimation method with weight adaptation works appropriately and improves the accuracy of the estimates.

This letter is composed as follows. In section 2, we briefly explain the BT model and show its geometrical interpretation. In section 3, we introduce important ideas used in information geometry and propose an estimation method based on flatness and projections. We address the problem of weight selection and propose a modified estimation algorithm with weight adaptation in section 4. In section 5, we present experimental results for our estimation procedure. Our concluding remarks are given in section 6.

2.  Bradley-Terry Model and Its Geometrical Interpretation

In this section, we introduce the Bradley-Terry (BT) model and show its geometrical structure in a probability simplex.

2.1.  The Bradley-Terry Model.

Let Λ = {i} be a finite index set of items and let Q be a parameter set:
Q = \{\theta_i\}_{i \in \Lambda}, \qquad \theta_i > 0, \quad \sum_{i \in \Lambda} \theta_i = 1.
Note that Q is also identified as a categorical distribution. We will use the symbol |Λ| to denote the cardinality of the set Λ, that is, the number of its elements. Consider a trial in which a user evaluates two items i, j ∈ Λ and chooses a preferred one from i and j according to Q. Let i≻j denote the event that item i is chosen in this pairwise comparison of i and j. In the conventional BT model, the probability of the event i≻j is given by
p(i \succ j) = \frac{\theta_i}{\theta_i + \theta_j},
2.1
Now assume that multiple users individually compare i and j. Let nij and nji be the numbers of observed events i≻j and j≻i, respectively. If the nij + nji comparisons are independent, equation 2.1 suggests that the empirical ratio of θi to θj is given by
\frac{\theta_i}{\theta_j} \approx \frac{n_{ij}}{n_{ji}}.
Assume that evaluations for different pairs are also independent. Then the log likelihood of the BT model for pairwise comparisons by multiple users is given as
L(Q) = \sum_{i \neq j} n_{ij} \log \frac{\theta_i}{\theta_i + \theta_j},
2.2
where ∑ij is the sum over pairs i, j ∈ Λ except i = j. Generally the estimate is obtained as a solution of the following optimization problem,
\hat{Q} = \mathop{\mathrm{argmax}}_{Q} \; L(Q) \quad \text{subject to} \quad \theta_i > 0, \ \sum_{i \in \Lambda} \theta_i = 1,
2.3
and fast algorithms are given in, for example, Hastie and Tibshirani (1998) and Huang et al. (2006).
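For concreteness, the maximizer of equation 2.3 can be computed with the classical minorization (fixed-point) update, in which each θi is set to its number of wins divided by ∑j≠i(nij + nji)/(θi + θj) and the result is renormalized. The sketch below illustrates this standard scheme only; it is not the specific algorithm of Hastie and Tibshirani (1998) or Huang et al. (2006), and the count matrix is a toy example.

```python
import numpy as np

def bt_mle(n, num_iter=1000, tol=1e-10):
    """Maximum likelihood estimation of Bradley-Terry parameters from a
    count matrix n, where n[i, j] is the number of times item i was
    preferred to item j.  Classical minorization (fixed-point) update,
    normalized so that sum(theta) = 1."""
    K = n.shape[0]
    theta = np.full(K, 1.0 / K)            # start from the uniform distribution
    wins = n.sum(axis=1)                   # total number of wins of each item
    for _ in range(num_iter):
        denom = np.zeros(K)
        for i in range(K):
            for j in range(K):
                if i != j:
                    denom[i] += (n[i, j] + n[j, i]) / (theta[i] + theta[j])
        new_theta = wins / denom
        new_theta /= new_theta.sum()       # project back onto the simplex
        if np.max(np.abs(new_theta - theta)) < tol:
            return new_theta
        theta = new_theta
    return theta

# toy counts for three items (values are illustrative only)
n = np.array([[0, 6, 9],
              [8, 0, 4],
              [3, 5, 0]])
print(bt_mle(n))
```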

2.2.  An Estimation of the BT Model on a Probability Simplex from the Geometrical Viewpoint.

We reformulate the BT estimation from the viewpoint of information geometry. Let be a space of categorical distributions that is given by
formula
Figure 1a shows a geometrical interpretation of for Λ = {1, 2, 3}, that is, a 2-simplex in the three-dimensional space equivalent to a regular triangle. In this letter, the simplex is also called the model manifold.
Figure 1:

Two- and 3-simplices for probability. (a) Two-simplex of Λ = {1, 2, 3}. (b) Observed ratio for the BT model (Λn0 = {3}, Λn1 = {1} and Λn2 = {2}). (c) Three-simplex of Λ = {1, 2, 3, 4}. (d) Observed ratio of the BT type (Λn0 = {2, 3}, Λn1 = {4}, Λn2 = {1}).


Next, we consider a set of probabilities P = {πi}i∈Λ for expressing pairwise comparisons in . A pairwise comparison indicates the observed ratio; for example, a comparison result (n12, n21) for 1≻2 and 2≻1 indicates a restriction for probabilities in the BT model, that is,
formula
2.4
We let be a subspace of , which satisfies the observed ratio, that is,
formula
2.5
A constrained submanifold is uniquely defined with a pairwise comparison of any i and j, and in the following, is called the data manifold.

As shown in Figure 1, the model manifold and the data manifold are visualized as a |Λ| − 1 simplex and a hyperplane in it. Figure 1b shows an example of equation 2.4 for Λ = {1, 2, 3}: is shown as a line in the triangle that is drawn from (0, 0, 1) onto the edge between (1, 0, 0) and (0, 1, 0). In the case of |Λ|>3, we need a careful treatment for visualization. For example, Figure 1c shows the model manifold in the case of |Λ| = 4, which is given as a 3-simplex, and Figure 1d shows an observed ratio π14 = n14:n41, which is given as a cutting surface of a regular tetrahedron. Although Figure 1c no longer shows coordinates of the parameter set Q directly, Figure 1d shows an observed ratio by distances between vertices. Henceforth, we use the 3-simplex and its cutting plane for a representation of |Λ| = 4 case when it is necessary.

In the BT model, comparisons are obtained from all pairs, and each pairwise comparison gives a corresponding data manifold. In the following discussion, let n = 1, …, N be the renumbering of pairs {(i, j)}, and let be hyperplanes. If the observed data set has consistency,
formula
then the estimate is obtained as a unique intersection point of all the hyperplanes in where all the observed ratios exactly hold. Figure 2a shows an example of the case of |Λ| = 3. In practical situations, however, the observed ratios lack strict consistency in general, that is,
formula
2.6
Equation 2.6 suggests that all the hyperplanes do not intersect at a single point as in Figure 2b. In these inconsistent cases, the optimal model should satisfy constraints of pairwise comparisons as much as possible; a good estimate for the BT model is obtained as the nearest point from N hyperplanes in the simplex. To determine an estimate , we need a measure of closeness between a hyperplane and a point Q on the simplex . Here, we introduce the Kullback-Leibler (KL) divergence as a measure. Since elements of are identified as probability distributions, the KL divergence between points P = {πi} and Q = {θi} is given as
D_{\mathrm{KL}}(P, Q) = \sum_{i \in \Lambda} \pi_i \log \frac{\pi_i}{\theta_i}.
Then we define the KL divergence between a hyperplane and a point Q as
formula
2.7
Figure 2:

Intuitive interpretation of an estimator. (a) Consistent estimate in the ideal case: all the hyperplanes intersect at a unique point. (b) Typical case of inconsistent observed ratios: all the hyperplanes do not intersect at a unique point. (c) A segment of iterative estimation. A dotted arrow shows the nearest point on each data manifold from the current estimate, and solid arrows show the updated estimate.


Now we illustrate an estimate for multiple pairwise comparisons in a geometrical approach. Let wn be a weight for , defined as
formula
Intuitively speaking, wn indicates the relative importance of the data manifold in the estimation.1 There are some plausible candidates for a specific value of wn, for example, the ratio of observations used for constructing the nth data manifold, or the uniform weight wn = 1/N for all n. In section 4, we discuss the meanings of weights in detail. An objective function for the parameter estimation on the probability simplex is expressed as a weighted average of the KL divergences between and Q as follows:
formula
2.8
The minimizer of equation 2.8 with respect to Q is obtained by solving the following nested optimization problem:
formula
2.9
The relationship between equations 2.3 and 2.9 is discussed in section 2.3.

Since the minimizer of the nested optimization problem is hard to obtain directly, we estimate it by an iterative process, as shown in Figure 2c. Intuitively, we find the closest point on each data manifold from the current estimate on the simplex (shown as dotted arrows in the figure), update the estimate by combining those closest points (shown as solid arrows), and repeat these steps until convergence. In the next section, we generalize the problem and provide a formal definition of this method for solving the nested optimization given by equation 2.9.

2.3.  Comparison with a Conventional Estimation Setup.

Before describing optimization steps, we discuss the difference between the conventional setup and ours, which are given by equations 2.3 and 2.9, respectively.

Table 2 shows an example of pairwise comparisons found in Hastie and Tibshirani (1998). The table also shows the results derived with the majority rule; i≻j means that the majority of comparisons prefer i to j. Apparently there is no unique ranking that satisfies all majority rules. Here, we try to obtain a unique ranking through a relative relation in with the BT setting. Hastie and Tibshirani (1998) give the estimate of equation 2.2 with as . The estimate of equation 2.9 with the same weights is . In Table 2, we show whether majority rules are satisfied or violated by and . In this case, the ranking in is more consistent than that in , because the former supports 1≻2 but the latter does not. The example shows that our method works better than the conventional one.

Table 2:
Example of Pairwise Comparisons.
p̂ij    p̂ji    Majority Rule    Estimate of eq. 2.3    Estimate of eq. 2.9
0.56   0.44   1≻2              ×                      √
0.51   0.49   1≻3              √                      √
0.60   0.40   1≻4              √                      √
0.96   0.04   2≻3              √                      √
0.44   0.56   2≺4              ×                      ×
0.59   0.41   3≻4              ×                      ×

Notes: √ and × indicate whether the relation i≻j given by the majority rule holds under the conventional estimate of equation 2.3 and the proposed estimate of equation 2.9, respectively.

Differences of estimation results are geometrically explained by differences of objective functions. In the conventional setup estimating the BT model, an objective function is given as equation 2.2. Let Pn(b) and Qn(b) be Bernoulli distributions for an index pair n = (i, j) associated with the observed data and the model Q:
P_n^{(b)} = \left\{ \frac{n_{ij}}{n_{ij}+n_{ji}},\ \frac{n_{ji}}{n_{ij}+n_{ji}} \right\}, \qquad Q_n^{(b)} = \left\{ \frac{\theta_i}{\theta_i+\theta_j},\ \frac{\theta_j}{\theta_i+\theta_j} \right\}.
These distributions give ratios between i and j. Hence, they are indicated as points on the corresponding edge of the simplex as shown in Figure 3. Figure 3a shows the distribution Pn(b), which is uniquely determined as a point on the edge between i and j based on . In Figure 3b, Qn(b) is shown as a similar point on the edge based on a corresponding gray line that passes through the given point Q. Assume that the weight wn for a comparison of a pair n = (i, j) is given by the number of observations as ; the maximizer of equation 2.2 is equivalent to the minimizer of the sum of weighted KL divergences between Pn(b) and Qn(b), given as
\sum_{n=1}^{N} w_n D_{\mathrm{KL}}\left(P_n^{(b)}, Q_n^{(b)}\right),
2.10
The conventional method roughly evaluates the consistency of the parameter only at the edges of the probability simplex and does not exploit a property that the parameter set Q is identified as a |Λ|-dimensional categorical distribution. Figure 3c depicts this geometrical intuition. An arrow on an edge of the simplex indicates the KL divergence from Pn(b) to Qn(b), and equation 2.10 minimizes the weighted sum of these arrow lengths.
Figure 3:

Intuitive difference of objective functions. Objective functions are expressed as the weighted sum of arrow lengths. (a) Pn(b) is shown as a point on the edge derived with . (b) Qn(b) is shown as a point on the edge derived with Q. (c) Arrows on the edges of the simplex show DKL(Pn(b), Qn(b)). (d) Arrows from lines to a point show .


Figure 3d shows an interpretation of our objective function given by equation 2.8. Our method minimizes the weighted sum of KL divergences from all the data manifolds to the point Q. These interpretations show that our proposed setup naturally evaluates the point Q as a |Λ|-dimensional categorical distribution, whereas the conventional setup evaluates only the distributions {Qn(b)}Nn=1 projected onto the corresponding edges. Therefore, our method is expected to be more appropriate for problems in which the estimate itself is a |Λ|-dimensional categorical distribution, such as ranking models and multiclass categorization using binary classifiers.

After introducing some additional notations, we discuss their difference again in section 6.

3.  Estimation Based on Flat Data Manifolds

In this section, we generalize the setup of the BT model and propose an estimation method based on geometrical interpretation. To achieve these goals, we introduce information geometrical notions such as flatness and projection.

3.1.  Generalization of Data Manifolds.

We generalize a data manifold for a one-versus-one pairwise comparison defined by equation 2.5 in order to describe more complex comparisons in which ratios between groups of probabilities are constrained. Let us assume that N different comparisons are observed, and in the nth comparison, items are divided into Mn groups and compared by these groups. Let Λnm ⊆ Λ be a set of observed indices in the mth group of the nth comparison, which satisfies
formula
and let be a set of all the observed indices. We also assume uniqueness of the order of index subsets; that is, given two subsets Λnm and , m < m′ holds for m, m′ = 1, …, Mn when holds. Let Λn0 be a set of unobserved indices, namely, Λn0 = Λ − Λn+; hence, a set satisfies . For example, a comparison between any i and j>i in the original BT model is described as Λn1 = {i}, Λn2 = {j}, and Λn0 = Λ − {i, j} in this expression.
Let {αnm} be an Mn-dimensional vector representing observed ratios, given by
formula
Under the condition that total probabilities in index sets Λn+ are proportional to {αnm},
formula
categorical distributions compose a generalized data manifold, which is written as
formula
3.1
Note that here we introduce a parameter ψn to represent the proportion of observed items in .
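To make the notation concrete, a single generalized comparison can be recorded as the index groups {Λnm} (m ⩾ 1) together with the observed ratios {αnm}; indices outside all groups form Λn0. The container below is only an illustrative data structure with names of our own choosing, not notation from this letter.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Comparison:
    """One generalized comparison: groups[m] holds the item indices of the
    observed group Lambda_{n,m+1}, ratios[m] the corresponding ratio
    alpha_{n,m+1}; items absent from all groups form Lambda_{n,0}."""
    groups: List[List[int]]   # observed index subsets, m = 1, ..., M_n
    ratios: List[float]       # alpha_{n,m}, normalized to sum to 1

# one-versus-one BT comparison of items i = 0 and j = 1 in Lambda = {0, 1, 2}:
# Lambda_{n1} = {0}, Lambda_{n2} = {1}, Lambda_{n0} = {2}
c = Comparison(groups=[[0], [1]], ratios=[6 / 14, 8 / 14])
```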

The generalized data manifold introduced by equation 3.1 has the following property:

Theorem 1.
m-Flatness of generalized data manifold: Let P = {πi}i∈Λ and P′ = {π′i}i∈Λ be two points in a |Λ| − 1 simplex, and let R(t) be an internally dividing point of P and P′, given as
R(t) = \{(1-t)\,\pi_i + t\,\pi'_i\}_{i \in \Lambda}.
If P, P′ belong to , then R(t) also belongs to for 0 ⩽ t ⩽ 1.
Proof.
P, P′, and R(t) have the following property with the definition of given by equation 3.1:
formula

This property is called m-flatness in the literature on information geometry (Amari & Nagaoka, 2000; Murata et al., 2004); thus, generalized data manifolds are called m-flat in this letter.

With these notations, we can extend the setup for the BT model to other types of m-flat data manifolds as follows:

Example 1:
One-Versus-One Type. For the one-versus-one type pairwise comparison used in the BT model, an observed ratio is defined by a set of index subsets {Λnm} and a ratio set {αnm}. In the case of equation 2.4 with |Λ| = 3, {Λnm} and {αnm} are given as
formula
Its geometrical interpretation is shown in Figure 1b.
Example 2:
Group-Versus-Group Type. The group-versus-group type of comparison proposed by Huang et al. (2006) is expressed by grouping some indices into one category of . Let Λni≻Λnj be an event that the item set Λni is chosen in the comparison of two item sets Λni and Λnj. Figure 4a is a typical data manifold for this type of comparison, defined as
formula
where nij is the number of observed event Λni≻Λnj. Figure 4b is another example of group-versus-group comparison in the case of |Λ| = 4, defined as
formula
For example, if one plays tennis in many doubles with several partners, the results of the matches are observed as group-versus-group types of comparisons. We can evaluate relative individual skills in playing tennis by estimating Q = {θi}.

In Ikeda (2006), the estimation algorithm is proposed to handle these data manifolds where Λn0 = ∅ for n = 1, …, N. Our setup permits having Λn0 ≠ ∅ and gives a generalized viewpoint of Ikeda (2006).

Figure 4:

Examples of m-flat manifolds. (a) An m-flat manifold defined by Λn0 = ∅, Λn1 = {1}, and Λn2 = {2, 3}. (b) An m-flat manifold defined by Λn0 = ∅, Λn1 = {1, 2}, and Λn2 = {3, 4}. (c) An m-flat manifold defined by Λn0 = {1}, Λn1 = {2}, Λn2 = {3}, and Λn3 = {4}. (d) Two m-flat manifolds defined by Λn0 = {3}, Λn1 = {1}, and Λn2 = {2}.


Example 3:
Multiple Groups. Typical models for comparison between multiple groups, including the so-called Pendergrass model (Pendergrass & Bradley, 1960) or the Plackett-Luce model (Plackett, 1975; Luce, 1959), can also be explained with m-flat data manifolds. For example, the Pendergrass model for a triplet is formulated as a multistage BT model as
formula
The first term indicates the pairwise comparison result of i≻{j, k}, the second term indicates that of j≻k, and the equation shows that the data manifold of i≻j≻k is given as a submanifold of the data manifold of i≻{j, k}. For example, in the case of Λ = {1, 2, 3, 4} and (i, j, k) = (2, 3, 4), {Λnm} and {αnm} for 2≻3≻4 are given as follows,
formula
where nijk is the number of observed events i≻j≻k. Figure 4c shows an intuitive interpretation of this observation.
Example 4:
Home Field Advantage. The home field advantage model, given in Agresti (2002), is a generalization of the BT model for representing a home field advantage. Here, we introduce another idea for representing the difference between “home” and “away.” In our formulation, the results of home teams and those of away teams can be expressed as different m-flat manifolds for a common {Λnm}. Assume that Λ = {1, 2, 3}, and {Λnm} is given as
formula
Then two manifolds that represent the results of home and away for {Λnm} are defined with and , given as
formula
where nhij is the number of observed events i≻j when i is the home team, and naij is the corresponding number when i is the away team. Their geometrical interpretations are shown in Figure 4d.

We stress that these models have been proposed based on equation 2.2 and are also naturally handled in our framework.

3.2.  Pythagorean Relation on a Probability Simplex.

To discuss the optimization problem in equation 2.9, we first show a property of the KL divergence called the Pythagorean relation.

Theorem 2.
Pythagorean theorem (Amari & Nagaoka, 2000): Let P = {πi}i∈Λ, Q = {θi}i∈Λ, and R = {ρi}i∈Λ be in . If the inner product of the two |Λ|-dimensional vectors R − P and log(Q) − log(P) is zero, that is,
\sum_{i \in \Lambda} (\rho_i - \pi_i)\,(\log \theta_i - \log \pi_i) = 0,
3.2
then the following relation holds:
D_{\mathrm{KL}}(R, Q) = D_{\mathrm{KL}}(R, P) + D_{\mathrm{KL}}(P, Q).
3.3
Proof.
For any ,
formula
holds. Then we have equation 3.3 if equation 3.2 holds.

The Pythagorean relation defines an orthogonality at the point P in a simplex. Here, we introduce another type of subspace in a probability simplex based on the orthogonality:

Definition 1.
e-flat subspace in a probability simplex: Let be an m-flat manifold in . Assume that a subspace includes a point and is orthogonal to , that is,
formula
Then the subspace is called e-flat.
Note that and are orthogonal at P, and for any and , the orthogonal relation equation 3.3 holds. An intuitive interpretation of is shown in Figure 5. For a given m-flat data manifold , a set of e-flat subspaces constructs the following structure,
formula
which is called the orthogonal foliation (Murata et al., 2004). The structure indicates that there exists an e-flat subspace in for an arbitrary point . Conversely, it indicates that an arbitrary point has a unique point such that holds.
Figure 5:

Pythagorean relation in 2- and 3-simplices. The relation defines an orthogonality at . Note that we can consider an e-flat subspace for any in any dimensional simplex .


Now we discuss the inner minimization in equation 2.7. We introduce a minimization problem called the e-projection.

Definition 2.
e-projection: Let be a subspace of . The minimization problem,
formula
3.4
is called the e-projection (Amari, 1995; Amari & Nagaoka, 2000).

In general, the existence of a unique solution for minimization problem 3.4 is not guaranteed; however, if is m-flat, the following property holds:

Theorem 3.

e-projection onto an m-flat manifold (Amari, 1995): If the manifold is m-flat, the minimization problem 3.4 has a unique solution.

Proof.

Let P be a point in that satisfies . Such a point P is uniquely determined from the orthogonal foliation structure. For any point , DKL(R, Q) ⩾ DKL(P, Q) holds because of the Pythagorean theorem and the positivity of the KL divergence DKL(R, P) ⩾ 0. This proves the theorem.

By using m-flat data manifolds given by equation 3.1, we can obtain a unique point for each manifold by the e-projection from a certain point .

3.3.  Decomposition of Distributions.

To derive a concrete form of the e-projection on a simplex , we first introduce the decomposition of distributions and . Since we discuss a specific data manifold henceforth, we omit the subscript n in {Λnm}, {αnm}, Mn, and ψn as
formula
for notational simplicity. Let
formula
be the sums of probabilities over the given index subset Λm and
formula
3.5
be sums over the observed index subsets Λ+ = {Λm}Mm=1. Note that p0 + p+ = 1 and q0 + q+ = 1 hold. Here we define the following decomposed distributions:
Definition 3.
Decomposed distributions on : Given {Λm}, is decomposed as
formula
where
formula
3.6
formula
3.7
formula
3.8
The within-group distribution Q(w)m is a |Λm|-dimensional vector for an index subset Λm where holds, the between-group distribution Q(b) is an M-dimensional vector for the set of index subsets {Λm} except m = 0 where ∑Mm=1θ(b)m = 1 holds, and the rated or unrated distribution Q(r) is a Bernoulli distribution for Λ0 and the rest, where θ(r)0 + θ(r)+ = 1 holds. Since the degrees of freedom of Q, Q(w)m, Q(b), and Q(r) are |Λ| − 1, |Λm| − 1, M − 1, and 1, respectively, and ∑Mm=0|Λm| = |Λ| holds, the decomposed distributions {Q(w)m}, Q(b), and Q(r) are a nonredundant expression of Q. Thus, these decomposed distributions are uniquely defined by any and a set of index subsets {Λm} and have the following relation:
formula
3.9
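The decomposition of definition 3 is straightforward to compute from Q and the index subsets. The sketch below follows equations 3.6 to 3.8 (within-group, between-group, and rated or unrated parts) as described above; the function and variable names are ours and are given only for illustration.

```python
import numpy as np

def decompose(theta, groups):
    """Decompose a categorical distribution theta over Lambda into
    within-group, between-group, and rated/unrated parts, given the
    observed index subsets groups = [Lambda_1, ..., Lambda_M].
    Indices not listed in groups form the unobserved set Lambda_0."""
    theta = np.asarray(theta, dtype=float)
    observed = set().union(*[set(g) for g in groups])
    lam0 = sorted(set(range(len(theta))) - observed)

    q_m = [theta[g].sum() for g in groups]            # group masses q_1, ..., q_M
    q0 = theta[lam0].sum()                            # unobserved mass q_0
    q_plus = 1.0 - q0                                 # observed mass q_+

    within = [theta[g] / qm for g, qm in zip(groups, q_m)]   # Q^(w)_m, m >= 1
    within0 = theta[lam0] / q0 if q0 > 0 else np.array([])   # Q^(w)_0
    between = np.array(q_m) / q_plus                  # Q^(b), sums to 1
    rated = np.array([q0, q_plus])                    # Q^(r) = {q_0, q_+}
    return within0, within, between, rated

# example: Lambda = {0, 1, 2, 3}, observed groups {0} and {1, 2}
within0, within, between, rated = decompose([0.1, 0.2, 0.3, 0.4],
                                            groups=[[0], [1, 2]])
```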

Furthermore, we define similar decomposition for as follows:

Definition 4.
Decomposed distributions on : Given {Λm}, is decomposed as
formula
where
formula
3.10
formula
3.11
formula
3.12
In this case, the between-group distribution P(b) and the rated or unrated distribution P(r) are denoted with {αm} and ψ. Note that the degree of freedom of P(b) is 0 because {αm} is given and that of P is |Λ| − M, so that these decomposed distributions {P(w)m}, P(b), and P(r) are also nonredundant. And πiP is expressed with elements of decomposed distributions equations 3.10 to 3.12 as
formula
3.13
With decomposed distributions given by equations 3.6 to 3.8 and 3.10 to 3.12, the KL divergence DKL(P, Q) is decomposed as follows:
Theorem 4.
Decomposition of KL divergence: Let be |Λ|-dimensional probability vectors where P = {πi}i∈Λ and Q = {θi}i∈Λ. Given {Λm}, the KL divergence between P and Q is decomposed as
formula
Proof.
With equations 3.9 and 3.13, DKL(P, Q) is deformed as
formula
3.14

With theorem 4, the following corollary is obtained in the case of :

Corollary 1.
Decomposed KL divergence for an m-flat data manifold: Let and be |Λ|-dimensional probability vectors. Given {Λm}, the KL divergence between P and Q is decomposed as
formula
3.15
Proof.

With equations 3.5, 3.7, and 3.11, equation 3.14 is deformed as equation 3.15.

3.4.  Projections in Probability Simplex.

In this section, we discuss concrete forms of optimization steps for our objectives.

First, we focus on the inner minimization of equation 2.9, that is, the minimization of DKL(P, Q) with respect to for fixed Q. The second term of equation 3.15, DKL(P(w)m, Q(w)m), can be zero because arbitrary {π(w)i|m} can be chosen for any given Λm, and DKL({αm}, {θ(b)m}) = DKL(P(b), Q(b)) in equation 3.15 is a constant because P(b) = {αm} is fixed. Then, knowing that DKL(P(r), Q(r)) = DKL({1 − ψ, ψ}, {q0, q+}), the estimate ψ is given by the solution of
formula
namely,
formula
3.16
where δ = exp(−DKL(P(b), Q(b))). With equations 3.1 and 3.16, the e-projection is given by
formula
3.17

Figure 6a shows an example of m-flat data manifolds on |Λ| = 3 and a point Q. Theorem 3 shows that e-projections from Q onto determine the N points shown in Figure 6b.

Figure 6:

Intuitive interpretation of projections. (a) An example of m-flat manifolds and Q. (b) e-projection from Q onto . (c) is estimated based on N points.


Next we focus on the outer minimization of equation 2.9 with a set of N fixed points . We introduce another minimization problem as follows:

Definition 5.
m-projection: The minimization problem,
formula
3.18
is called the m-projection (Genest & Zidek, 1986; Amari, 2007; Murata & Fujimoto, 2009).
The solution of equation 3.18 is calculated by minimizing equation 2.8 with respect to Q under the constraint of ∑i∈Λθi = 1, which is given as a saddle point of the following Lagrangian,
formula
where λ is a Lagrange multiplier. Because ∑Nn=1wn = 1, the saddle point is uniquely calculated as
\hat{\theta}_i = \sum_{n=1}^{N} w_n \pi_{n,i}, \qquad i \in \Lambda,
which means in equation 2.9 is obtained as a weighted average of N points in . Figure 6c shows an intuitive image.

3.5.  The em Estimation.

The estimate for the nested optimization problem 2.9 is not directly obtained; therefore, we use two projections introduced in equations 3.4 and 3.18 iteratively. The procedure is known as the em algorithm (Amari, 1995).
formula

The detailed e-step and m-step are described as follows:

In the e-step, Pn(t + 1) is derived by the e-projection, which is given as
formula
where δ(t) = exp(−DKL(Pn(b), Qn(b)(t))). In the m-step, Q(t + 1) is calculated by the m-projection, and it is given as
formula
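The e-step and m-step can be sketched in code as follows. The e-projection here assumes the concrete form suggested by equations 3.16 and 3.17: within-group proportions are copied from the current Q, the between-group ratio is fixed to the observed αn, and the observed/unobserved split ψn is set with δ = exp(−DKL(Pn(b), Qn(b))); the m-step takes the weighted average of the projected points. This is our reading of algorithm 1 for illustration, not a verbatim transcription, and all names are ours.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def e_projection(theta, groups, alpha):
    """e-project the current model theta onto one m-flat data manifold
    given by observed index subsets `groups` and ratios `alpha`
    (assumed concrete form of equations 3.16 and 3.17)."""
    K = len(theta)
    observed = set().union(*[set(g) for g in groups])
    lam0 = [i for i in range(K) if i not in observed]
    q_m = np.array([theta[g].sum() for g in groups])
    q0, q_plus = theta[lam0].sum(), q_m.sum()
    delta = np.exp(-kl(alpha, q_m / q_plus))        # between-group mismatch
    psi = delta * q_plus / (q0 + delta * q_plus)    # observed mass after projection
    p = np.empty(K)
    for i in lam0:                                  # keep unobserved proportions
        p[i] = (1.0 - psi) * theta[i] / q0
    for m, g in enumerate(groups):                  # match the observed ratios alpha_m
        for i in g:
            p[i] = psi * alpha[m] * theta[i] / q_m[m]
    return p

def em_estimate(manifolds, weights, K, num_iter=200):
    """Alternate e-projections onto the data manifolds (e-step) and the
    weighted average of the projected points (m-step)."""
    theta = np.full(K, 1.0 / K)
    for _ in range(num_iter):
        points = [e_projection(theta, g, a) for g, a in manifolds]
        theta = np.sum([w * p for w, p in zip(weights, points)], axis=0)
    return theta

# three one-versus-one comparisons over Lambda = {0, 1, 2}
manifolds = [([[0], [1]], [6 / 14, 8 / 14]),
             ([[0], [2]], [0.7, 0.3]),
             ([[1], [2]], [0.6, 0.4])]
print(em_estimate(manifolds, weights=[1 / 3, 1 / 3, 1 / 3], K=3))
```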

Next, we give the convergence property of the proposed algorithm.

3.6.  Convergence Property.

In general, the em estimation converges only to a local optimum depending on an initial parameter set, which is called the local minima problem. Hence, we have to choose an appropriate initial parameter set to obtain the global minimum. However, the solution obtained with algorithm 1 converges to a unique and global minimum under a mild condition because the objective function F(Q) is convex, as shown below.

First, we define parallel m-flat subspaces as follows:

Definition 6.

Parallel m-flat subspace: Let and be m-flat subspaces in with {Λm}Mm=0. Then subspaces and are called mutually parallel.

We also introduce an extended parameter set ,
formula
for the following assumption.
Assumption 1.

Data manifolds hold the following three conditions:

  1. At least two elements in are not mutually parallel.

  2. , where | · | indicates the number of components in a set.

  3. for all where is the closure of .

Next, we show the convexity of the KL divergence:

Theorem 5.
Convexity of the KL divergence: Given two points and in , let , , P* = {π*i} be three points in an m-flat subspace , given as
formula
where 0 < t < 1. For any two points , the KL divergence has the following convexity:
formula
3.19
Equality holds iff both and are in an m-flat subspace , which is parallel to .
Proof.
The left-hand side of equation 3.19 has the following inequality,
formula
3.20
formula
3.21
In equation 3.20, equality holds iff holds. And the relation given in equation 3.21 has been proved in Cover and Thomas (2006) with the log sum inequality, and equality holds iff holds for all i ∈ Λ.
Now, let and be values given by between-group distributions, defined as
formula
where
formula
With equation 3.17, the equality condition in equation 3.21 is given as follows:
formula
Since and , the following equations are derived:
formula
These equations indicate that the equality in equation 3.21 holds iff and for m = 0, …, M simultaneously hold. These conditions mean that both and are in an m-flat manifold with the same {Λm} as . Under the same conditions, equality in equation 3.20 also holds because they satisfy for all i ∈ Λ. Therefore, the condition of equality in equation 3.19 is given with a parallel manifold: both and are in an m-flat subspace , which is parallel to .

Now we give the following lemma with regard to F(Q):

Lemma 1.

Convexity of F(Q): Under conditions 1 and 2 in assumption 1, F(Q) given by equation 2.8 is strictly convex with regard to Q for any {Pn}.

Proof.
With the convexity of the KL divergence and positivity of wn, we have the following convexity for any different points ;
formula
where 0 < t < 1. From theorem 5, the equality holds iff Q and Q′ are in an m-flat subspace that is parallel to all the elements in . Note that points also satisfy the equality, and so the strict convexity is ensured by eliminating cases given by the condition . This completes the proof.

Finally, we give the following theorem:

Theorem 6.

Convergence property of algorithm 1: If assumption 1 holds, the converged result of algorithm 1 is the unique and global minimizer of F(Q).

Proof.

From lemma 1, conditions 1 and 2 in assumption 1 ensure the strict convexity of F(Q). And condition 3 in assumption 1 apparently guarantees that the minimizer is in . With the convergence property of the em algorithm, the proof is complete.

Therefore, we assume the conditions given in assumption 1 in this letter.

4.  Weight Adaptation

In this section, we discuss the role of weights {wn} and a way to adapt them in our proposed framework.

4.1.  Weight Adaptation.

Figure 2 shows intuitive examples of consistent and inconsistent observations. In the usual case of Figure 2b, selection of a set of weights {wn} is important because it dominates the quality of the estimate . Typical examples of a weight wn for comparison between i and j are:
  • wn = 1/N: a uniform weight

  • wn = c(nij + nji): the number of observations

  • the weight used in Hastie and Tibshirani (1998),

where c is a normalization constant to ensure ∑Nn=1wn = 1. Now we discuss the tuning of weights for reliable estimation. For example, we can formulate the weight tuning problem as follows:
formula
4.1
Since simultaneous optimization with regard to {Pn}, Q, and {wn} is difficult, we consider minimizing with respect to {wn} for given {Pn} and Q in this problem. This type of minimization is implemented by using linear programming with the em estimation result. An iterative sequence of equations 2.8 and 4.1 will minimize the objective function ∑Nn=1wnDKL(Pn, Q). However, the obtained weight set {wn} will be sparse in this case; only a few elements in {wn} will be large, and the rest will be close to zero. With such a weight set, the estimate Q has low reliability because it is sensitive to the noise in constructing . Therefore, we propose an adaptive weight tuning method for our em estimation from the viewpoint of sensitivity.
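To see why the linear program drives the weights toward sparsity, note that for fixed dn = DKL(Pn, Q), and assuming equation 4.1 imposes only the normalization constraint on {wn}, the minimizer concentrates all mass on the smallest dn:

```latex
\min_{\{w_n\}} \ \sum_{n=1}^{N} w_n\, d_n
\quad \text{s.t.} \quad \sum_{n=1}^{N} w_n = 1,\ w_n \ge 0
\qquad \Longrightarrow \qquad
w_{n^{*}} = 1 \ \text{for } n^{*} = \arg\min_n d_n,
\quad w_n = 0 \ \text{otherwise}.
```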
Suppose that a vector , which is used for defining a data manifold , is contaminated with an Mn-dimensional unit vector Δm, whose components are all zero except the mth element, as follows,
formula
where 0 < ε ≪ 1. Given any weights (e.g., {wn = 1/N}Nn=1), converged points and are obtained after a sequence of em steps. We consider an influence of contamination Δm at point , which measures an impact of contamination on the estimator and is intuitively given as
formula
where
formula
Since has |Λ| − 1 nonredundant parameters and πn|Λ| = 1 − ∑i∈Λ−{|Λ|}πni holds, we define a |Λ| − 1 dimensional column vector IF(m) for approximating the influence by using equation 3.17 with fixed δ:
formula
4.2
With equation 4.2, we can calculate the information-standardized sensitivity (Hampel et al., 1986), given by
formula
4.3
where is a (|Λ| − 1) × (|Λ| − 1) Fisher information matrix calculated at on , whose components are given by
formula
To calculate the concrete form of the inverse matrix, let D be a (|Λ| − 1) × (|Λ| − 1) diagonal matrix with in the (i, i) element and u be a |Λ| − 1 dimensional column vector such that all the elements are one. With these notations, we calculate the concrete form of by using the Woodbury identity:
formula
4.4
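For reference, if D is taken to have 1/π̂n,i in its (i, i) element (our assumption for the elided entry above) and Gn denotes the Fisher information matrix of equation 4.3, then the Sherman-Morrison form of the Woodbury identity gives the familiar expressions

```latex
G_n = D + \frac{1}{\hat{\pi}_{n,|\Lambda|}}\, u u^{\top},
\qquad
G_n^{-1}
= D^{-1} - \frac{D^{-1} u u^{\top} D^{-1}}{\hat{\pi}_{n,|\Lambda|} + u^{\top} D^{-1} u}
= \operatorname{diag}(\hat{\pi}_{n,i}) - \hat{\pi}_n \hat{\pi}_n^{\top},
```

where π̂n is the (|Λ| − 1)-dimensional column vector with entries π̂n,i; the last equality uses u⊤D−1u = 1 − π̂n,|Λ|.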
With equation 4.4, we obtain the following:
formula
Intuitively the sensitivity shows unreliability derived from the worst-case scenario. For example, IF(m)i tends to be large when the data manifold is very close to a boundary of the probability simplex. In this case, some of the components in the vector αn are extremely small values, and or in equation 4.2 could be largely affected by the location of . Our idea is that for stable estimation, weights {wn} should be determined so as to make overall sensitivity of as small as possible. We define the sensitivity at as a function of {wn}:
formula
4.5

Our final goal is to obtain minimizers {Pn}, Q, and {wn} for the multiple criteria of equations 2.8 and 4.5 simultaneously. This type of optimization is known as a multiobjective optimization problem, and it has several Pareto solutions. Since we currently do not have any criterion to choose one of the Pareto solutions, we attempt to obtain a preferable weight set {wn} from the viewpoint of equation 4.5 together with minimizers {Pn} and Q of equation 2.8.

In our method, we first obtain em estimates {Pn} and Q with initial weights. Then we minimize equation 4.5 subject to the constraint ∑Nn=1wn = 1 with these estimates. The optimal point is given as a saddle point of the following Lagrangian,
formula
4.6
which is a solution of the following equation,
formula
4.7
where λ is a Lagrange multiplier. From equation 4.7, the minimizer is given by
formula
4.8
where . Then we update {Pn}, Q by the em estimation with a new weight set {wn} again. With equation 4.8, we formulate a modified estimation method, weight adaptation:
formula

At each weight adaptation step in algorithm 2, the updated weight {wn} is uniquely obtained. Also at each em estimation step, the estimates {Pn} and Q are uniquely obtained. It is difficult to discuss the convergence property of this type of algorithm for a multiobjective optimization problem; however, in our experiments, weights stably converge to unique values from various initial weights within several iterations. In the numerical experiments shown in the next section, we use uniform weights as initial weights.
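Structurally, algorithm 2 alternates em estimation with a weight update. In the sketch below, the minimizer of equation 4.8 is assumed to set wn inversely proportional to ISSn (our reading of the elided formula, consistent with the discussion of sensitivity above); e_projection refers to the em sketch of section 3.5, and the callable sensitivity stands for an implementation of equation 4.3. The code is illustrative only.

```python
import numpy as np

def weight_adaptation(manifolds, K, sensitivity, num_rounds=5, em_iter=200):
    """Structure of algorithm 2: alternate em estimation (section 3.5)
    and a sensitivity-based weight update.  `sensitivity(point, manifold)`
    must return the information-standardized sensitivity ISS_n of
    equation 4.3; the update w_n proportional to 1/ISS_n is an assumed
    reading of equation 4.8.  Reuses e_projection() from the em sketch."""
    N = len(manifolds)
    w = np.full(N, 1.0 / N)                          # initial uniform weights
    theta = np.full(K, 1.0 / K)
    for _ in range(num_rounds):
        for _ in range(em_iter):                     # em estimation with current weights
            points = [e_projection(theta, g, a) for g, a in manifolds]
            theta = np.sum([wn * p for wn, p in zip(w, points)], axis=0)
        iss = np.array([sensitivity(p, m) for p, m in zip(points, manifolds)])
        w = (1.0 / iss) / np.sum(1.0 / iss)          # assumed form of equation 4.8
    return theta, w
```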

4.2.  Weight Adaptation from a Viewpoint of Divergence.

Now we consider an effect of the adapted weights from the viewpoint of employing other types of divergences instead of KL divergence. In section 4.1, we derived an influence of Δm in , denoted as IF(m), for calculating sensitivity. The sensitivity of the estimate is known to be deeply related to the divergence used for estimation. Here, we introduce a type of the Bregman divergence (Murata et al., 2004) called the η-divergence (Takenouchi & Eguchi, 2004; Takenouchi et al., 2008). The η-divergence between and is given by
formula
Let ρη(P) and ρη(Q) be two distributions derived from P and Q; both approach the uniform distribution as η goes to one. Then the η-divergence has the following property:
formula
4.9
With this property, we can interpret adapted weights in the objective function equation 2.9 by substituting values {ηn = 1 − wn}Nn=1 as follows:
formula
In Takenouchi and Eguchi (2004), the empirical distribution ρη(P) is assumed to be a probability distribution P that is contaminated with 0/1 noises with an error probability η, and a probability distribution Q is estimated with ρη(P). In our method, the distributions ρη(P) and ρη(Q) do not have explicit meanings like the effect of errors. However, we can give them another interpretation, as shown in Figure 7. Assume that Q′ = {(1 − η)θi}i∈Λ is a point on a shrunk simplex , which is defined as
formula
and P′ = {(1 − η)πi}i∈Λ is a point on a manifold in , defined as
formula
Figure 7:

Interpretation of weight from a viewpoint of the η-divergence. (a) An arrow from P to Q indicates DKL(P, Q) with weight 1 − η. (b) An arrow from P′ to Q′ indicates DKL(P′, Q′) in a shrunk simplex. Note that (1 − η)DKL(P, Q) = DKL(P′, Q′) holds for these figures.


From equation 4.9, the weighted KL divergence (1 − η)DKL(P, Q) in the original simplex is equivalent to the KL divergence DKL(P′, Q′) in a shrunk simplex . Therefore, the proposed weight adaptation method has an aspect of selecting an appropriate divergence for each data manifold from the viewpoint of its sensitivity. Since the elements of IF(m) for Dη(ρη(P), ρη(Q)) are obviously (1 − η) times those for DKL(P, Q) from equation 4.9, an unreliable Pn that has a large ISSn value is evaluated by the η-divergence with a small (1 − η) value, and vice versa.
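The equivalence used here (and illustrated in Figure 7) can be checked directly: with P′ = {(1 − η)πi}i∈Λ and Q′ = {(1 − η)θi}i∈Λ, the common factor cancels inside the logarithm,

```latex
D_{\mathrm{KL}}(P', Q')
= \sum_{i \in \Lambda} (1-\eta)\,\pi_i \log \frac{(1-\eta)\,\pi_i}{(1-\eta)\,\theta_i}
= (1-\eta) \sum_{i \in \Lambda} \pi_i \log \frac{\pi_i}{\theta_i}
= (1-\eta)\, D_{\mathrm{KL}}(P, Q).
```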

5.  Experimental Results

In this section, we compare the proposed methods with the conventional method and discuss some candidates for weights in the BT estimation experiments on typical pairwise comparison settings.

We prepare three types of probability distributions P* used in Hastie and Tibshirani (1998):
formula
For each distribution, we set m-flat manifolds where , which are constructed based on all the one-versus-one observations, with αn for comparison of i and j as
formula
where zij is a standard normal variate.
First, we show the results of our em estimation methods, given by algorithms 1 and 2. For algorithm 1, we prepare three types of weights for the comparison between i and j, defined as
formula
where cu, co, and cv are constants chosen so that ∑Nn=1wn = 1. Note that wnu gives the uniform weight, wno gives an ideal ratio related to i and j in all the items, and wnv gives a weight introduced in Hastie and Tibshirani (1998). Figure 8 shows estimation results with medians and quartiles of for 500 trials with |Λ| = 3, 5, 10, 15, 20, respectively. As shown in the figure, the quality of the estimates based on our proposed method with the weights {wno} tends to be low. The results indicate that estimation with invalid weights causes an overfitting problem, which means excessive emphasis on unreliable data manifolds, and it easily leads to inaccurate results. The figure indicates, however, that the weight adaptation method works appropriately in most cases, especially in the case of P*C, where a few items with significantly large probabilities dominate observations and the other items have very low probabilities. Huang et al. (2006) mentioned that a typical situation in practical problems resembles the case of P*C. Our experimental results suggest that the weight adaptation method gives better estimates in such a case. This is because the adapted weights tend to have a lower value for a sensitive data manifold, which compares an item with a very large probability to one with a small probability.
Figure 8:

Plots of em estimation results with typical weights. Medians and quartiles of for 500 trials with various |Λ| are plotted. A2: algorithm 2 with S = 5; A1,u, A1,o, and A1,v: algorithm 1 with {wnu}, {wno}, and {wnv}, respectively.


Second, we show the comparison results of the conventional estimation methods with {wnu}, {wno}, {wnv}, and algorithm 2 in the same setting. As shown in Figure 9, the conventional method with {wno} ends up with low-quality estimates, as does our algorithm 1. The results show that invalid weights also cause overfitting in the conventional methods. In this respect, the weight adaptation method appears able to avoid overfitting, and the estimation accuracy of our proposed method is generally better than that of the conventional methods.

Figure 9:

Plots of estimation results obtained by conventional methods and our proposed method. and for 500 trials are plotted. conv,u, conv,o and conv,v: conventional estimation with {wnu}, {wno}, and {wnv}, respectively.


Finally, we discuss our proposed method with weight adaptation from the viewpoint of robustness. Figure 10 shows a relation between estimation performance at |Λ| = 20 and a missing ratio of comparisons, that is, a certain ratio of manifolds in is randomly chosen and discarded. As shown in the figure, our weight adaptation seems a little sensitive to missing observations, but it is still better than the conventional methods, especially in the case of P*C.

Figure 10:

Plots of estimation results when some ratios of comparisons are randomly chosen and missed. and for 500 trials where |Λ| = 20 are plotted.


6.  Discussion and Concluding Remarks

In this letter, we reinterpreted an estimation with the Bradley-Terry model and its related models in the space of categorical distributions and then proposed a new estimation method based on a framework of the information-geometric em algorithm. Our proposed method is different in its objective function from conventional methods. With the KL decomposition given by equation 3.15, the objective function equation 2.8 is rewritten as
formula
6.1
which is the essential part of equation 3.15. There are two important differences between equations 2.10 and 6.1. The first one is ψn in the first term of equation 6.1. Weight wn for the nth m-flat data manifold is adjusted by a hidden variable ψn, which is implicitly estimated in the em steps. The other difference is the second term of equation 6.1. This term evaluates the weighted sum of KL divergences between {Pn(r)} and {Qn(r)}. In this term, DKL(Pn(r), Qn(r)) indicates the difference between the ratio (1 − ψn):ψn defined by the e-projection from Q onto and the ratio of directly summed parameters . A geometrical interpretation of DKL(Pn(r), Qn(r)) in this term is shown in Figure 11. Intuitively, the KL divergence between categorical distributions is naturally decomposed into two different divergences: the divergence between distributions of compared items and the divergence between distributions associated with the ratio of uncompared items. Equation 6.1 evaluates these two kinds of consistency of the parameter set. The difference between equation 2.10 and 6.1 becomes large when point Q is far from . Therefore, point Q tends to be close to not only in the conventional way, but also from the viewpoint of estimating a set of hidden variables {ψn}.
Figure 11:

Intuitive interpretation of DKL(Pn(r), Qn(r)) in equation 6.1. (a) Pn(r) is given by the e-projection. (b) Qn(r) is a point in (precisely, a submanifold of ) that satisfies the ratio . (c) Intuitive representation of DKL(Pn(r), Qn(r)).


In addition, we discussed the selection of weights for divergences in the objective function and proposed an estimation method with weight adaptation from a viewpoint of sensitivity in the simplex. We also considered the relation between weights and divergences with a concrete type of the Bregman divergence. Experimental results showed that the proposed method with weight adaptation works appropriately and improves the accuracy of the estimates. As Rätsch, Smola, and Mika (2002) noted, the BT-like models generally have a problem with selecting reliable data manifolds, a code optimization problem in the context of multiclass classification. We believe that the proposed weight adaptation method shows another way to alleviate this problem.

An important topic for extending our method is a formulation for observations of ties. When there are ties in pairwise comparisons, observed events in the comparison between i and j are denoted as i≻j (i is chosen), j≻i (j is chosen), and i = j (tie in the comparison of i and j). In the conventional framework, there are some extensions of the BT model that treat ties by introducing new tie parameters (Rao & Kupper, 1967; Davidson & Beaver, 1977; Kuk, 1995). Here we sketch a possible way to treat ties in our framework. Let Q′ = {θ′i>0} be a modified parameter set and τ>0 be a tie parameter where ∑i∈Λθ′i + τ = 1 holds. We can represent them in a |Λ|-dimensional probability simplex with a dummy item T as shown in Figure 12a. In this simplex, {Λnm} and {αnm} for the comparison of two items i, j ∈ Λ with tie observations are given as follows:
formula
With this type of data manifold, we can obtain estimates and for τ and Q. If we need to be a probability distribution, a projection onto the reduced-dimensional simplex will serve this purpose, as shown in Figure 12b. The conventional extensions should also be compared with our extension, though this remains future work.
Figure 12:

A simple way to represent ties. (a) Representation of a data manifold defined by Λn0 = {2}, Λn1 = {T}, Λn2 = {1}, and Λn3 = {3} in the extended simplex. (b) Projection of the estimate onto the original |Λ| − 1-dimensional simplex.


There are some other ways to improve our method. The first is the convergence speed of the em estimation. The speed of convergence in the proposed method is slower than the conventional one, which is implemented with the fixed-point method used in Huang et al. (2006). It could be a problem when online estimation is needed for large |Λ|. To improve convergence speed, we need to devise an acceleration for the em algorithm, like the acceleration methods featured in McLachlan and Krishnan (1996) and Watanabe and Yamaguchi (2003) for the EM algorithm. The second is selection of weights and divergences. As we stated in section 4, if we have a criterion for choosing a Pareto solution, we should optimize the given multiobjective problem in the proper way. In addition, we also noted that weights and divergences in the objective function are mutually connected and deeply related to robustness. We expect that a more appropriate objective function can be designed in terms of robustness with, for example, the Bregman divergence, which is known to be a convenient class for m-flat data manifolds (Murata et al., 2004).

Note

1

A weight value itself does not have an essential meaning, so we can use wn ∈ (0, ∞) in the same framework without loss of generality. We use the normalized weights only for convenience.

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.
Amari, S. (1995). Information geometry of the EM and em algorithms for neural networks. Neural Networks, 8(9), 1379–1408.
Amari, S. (2007). Integration of stochastic models by minimizing α-divergence. Neural Computation, 19, 2780–2796.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. Annals of Statistics, 5(3), 445–463.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 200–217.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: Wiley.
Davidson, R. R., & Beaver, R. J. (1977). On extending the Bradley-Terry model to incorporate within-pair order effects. Biometrics, 33(4), 693–702.
Fujimoto, Y., & Murata, N. (2007). A modified EM algorithm for mixture models based on Bregman divergence. Annals of the Institute of Statistical Mathematics, 59(1), 3–25.
Genest, C., & Zidek, J. V. (1986). Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1(1), 114–148.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics. Hoboken, NJ: Wiley.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. Annals of Statistics, 26(2), 451–471.
Hino, H., Fujimoto, Y., & Murata, N. (2010). A grouped ranking model for item preference parameter. Neural Computation, 22(9), 2417–2451.
Huang, T.-K., Weng, R. C., & Lin, C.-J. (2006). Generalized Bradley-Terry models and multi-class probability estimates. Journal of Machine Learning Research, 7, 85–115.
Ikeda, S. (2006). Learning binary classifiers for multi-class problem (Tech. Rep. 1010). Hiroo, Japan: Institute of Statistical Mathematics.
Kuk, A. Y. C. (1995). Modeling paired comparison data with large numbers of draws and large variability of draw percentages among players. Statistician, 44(4), 523–528.
Luce, R. D. (1959). Individual choice behavior. Hoboken, NJ: Wiley.
McLachlan, G. J., & Krishnan, T. (1996). The EM algorithm and extensions. Hoboken, NJ: Wiley-Interscience.
Murata, N., & Fujimoto, Y. (2009). Bregman divergence and density integration. Journal of Math-for-Industry, 1, 97–104.
Murata, N., Takenouchi, T., Kanamori, T., & Eguchi, S. (2004). Information geometry of U-boost and Bregman divergence. Neural Computation, 16, 1437–1481.
Pendergrass, R., & Bradley, R. (1960). Ranking in triple comparisons. In I. Olkin, S. Ghurye, W. Hoeffding, W. Madow, & H. Mann (Eds.), Contributions to probability and statistics (pp. 331–351). Palo Alto, CA: Stanford University Press.
Plackett, R. L. (1975). The analysis of permutations. Applied Statistics, 24(2), 193–202.
Rao, P. V., & Kupper, L. L. (1967). Ties in paired-comparison experiments: A generalization of the Bradley-Terry model. Journal of the American Statistical Association, 62(317), 195–204.
Rätsch, G., Smola, A. J., & Mika, S. (2002). Adapting codes and embeddings for polychotomies. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 513–520). Cambridge, MA: MIT Press.
Takenouchi, T., & Eguchi, S. (2004). Robustifying AdaBoost by adding the naive error rate. Neural Computation, 16(4), 767–787.
Takenouchi, T., Eguchi, S., Murata, N., & Kanamori, T. (2008). Robust boosting algorithm against mislabeling in multi-class problems. Neural Computation, 20, 1596–1630.
Watanabe, M., & Yamaguchi, K. (Eds.). (2003). The EM algorithm and related statistical models. New York: Marcel Dekker.
Zadrozny, B. (2002). Reducing multiclass to binary by coupling probability estimates. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 1041–1048). Cambridge, MA: MIT Press.