Human evaluation plays an important role in NLP, often in the form of preference judgments. Although there has been some use of classical non-parametric and bespoke approaches to evaluating these sorts of judgments, there is an entire body of work on this in the context of sensory discrimination testing and the human judgments that are central to it, backed by rigorous statistical theory and freely available software, that NLP can draw on. We investigate one approach, Log-Linear Bradley-Terry models, and apply it to sample NLP data.
1. Introduction
Human evaluation is a key aspect of many NLP technologies. Automatic metrics that correlate with human judgments have been developed, especially in Machine Translation, to relieve some of the burden. Nevertheless, Callison-Burch et al. (2007) note in their meta-evaluation that in MT they still “consider the human evaluation to be primary.” Whereas MT has traditionally used a Likert scale score for the criteria of adequacy and fluency, this meta-evaluation noted that these are “seemingly difficult things for judges to agree on”; consequently, asking judges to express a preference between alternative translations is increasingly used on the grounds of ease and intuitiveness. Further, where the major empirical results of a paper are from automatic metrics, it is still useful to supplement them: As two examples, Collins, Koehn, and Kucerova (2005) and Lewis and Steedman (2013), in addition to a metric-based evaluation, present human judgments of preferences for their systems with respect to a baseline (Fig. 1).

For results in published work, the reader is typically left to draw inferences from the numbers. For the data in Figure 1, is there a strong preference for the non-baseline system overall, or do null preferences count against that? Is anything about the results statistically significant?

There has been work in various areas of NLP in assessing statistical significance of human judgment results. However, to our knowledge, the field has not taken advantage of a body of work dedicated to analyzing human preferences (predominantly in the context of sensory discrimination testing and consequent consumer behavior), which is supported by a great deal of statistical theory. It is linked to the mixed-effect models that are increasingly prominent in psycholinguistics and elsewhere, it has associated freely available R software, and it permits questions like the following to be asked:

Can we say that the judges are expressing a preference at all, as opposed to no preference?
Is there an effect from judge disagreement or inconsistency?
We describe our sample data (Section 2), sketch a classical non-parametric approach (Section 3) and discuss the issues that arise from this, and look at some of the approaches used in MT (Section 4). We then (Section 5) introduce ideas from human sensory preference testing, where we review log-linear Bradley-Terry models of preferences, and apply this to our data, including discussion of ties, of subject effects, and of multiple pairwise comparisons.
2. Two Data Sets
Single Pairwise Comparison. Our basic single pairwise comparisons are those presented in Figure 1(a) and (b). Figure 1(c) contains the counts we will be using in later analysis: We refer to counts in favor of the new system by n+, those in favor of the baseline by n−, and those reflecting no preference as n0; the Lewis and Steedman results were over 87 pairwise judgments. We add some further artificial data to illustrate how the Log-Linear Bradley-Terry (LLBT) models of Section 5 behave in accordance with intuition for data where the conclusion should be clear. These comprise a distribution with a moderate preference for + over − and not too many null preferences (ModPref), a distribution of equal preferences over all three categories (EqualPref), a distribution with mostly null preferences and equal n+ and n− (NoPref), and a distribution with very few null preferences (StrongPref).
Multiple Pairwise Comparison. As noted in Section 1, there has been a trend toward using human preference judgments, particularly in the Workshops on Statistical Machine Translation from Callison-Burch et al. (2007) onwards. Schemes have included asking humans to rank random samples of five translations, each from a different system. Vilar et al. (2007) propose using binary rather than n-ary rankings, arguing that this is a natural and easy task for judges. Here we present some artificial data of pairwise (binary) rankings to illustrate the techniques we discuss in Section 5, although these techniques can be extended to n-ary comparisons. In our example, there are four systems A, B, C, D and four judges J1, J2, J3, J4. The judges have pairwise ranked 240 translation pairs from systems x and y, indicating whether the translation of x is better than y (x ≻ y), worse than y (x ≺ y), or similar in quality to y (x = y); see Table 1. An overall impression, totalling all pairwise first preferences for each system (Section 4), gives a ranking of systems A–D–B–C. It can also be seen that there is little undecidedness, and that judge J3 differs from the general judge opinion in the pairwise rankings of AD and BC.
3. Classical Non-Parametric Methods
A classical approach to evaluating preferences is the non-parametric sign test (Sprent and Smeeton 2007). The first issue in applying this test here is ties, or expressions of no preference. These are often ignored when the proportion of ties is small, but that is not the case for our typical examples in Figure 1. Randles (2001) observes, regarding the approach most widely recommended by textbooks of just ignoring ties, that “the constrained number of possible p values and its ‘elimination of zeroes’ has caused concern and controversy through the years.” Randles (2001) and Rayner and Best (2001, chapter 2), reviewing several approaches to handling ties, both advocate splitting ties in various ways depending on the problem setting, for (in Randles's characterization) “it is desirable that zeros have a conservative influence on declaring preference, but not to the same degree as negative responses.” The key point is that modeling ties explicitly can be important, although there is no consensus on how this should be done; no approach apart from ignoring ties appears to be in widespread use. The second issue with the sign test is that of multiple judges, where data points are related (e.g., the same items are given to all judges). The Friedman test (Sprent and Smeeton 2007, Section 7.3.1) can be viewed as an extension that can be applied to multiple subjects ranking multiple items (see Bi 2006, Section 5.1.3, for an example). However, Francis, Dittrich, and Hatzinger (2010) note that
[the Friedman test] simply examines the null hypothesis that the median ranks for all items are equal, and does not consider any differences in ranking between respondents. . . . Moreover, if the Friedman test rejects the null hypothesis, no quantitative interpretation, such as the odds of preferring one item over another, is provided. [Further, this] fail[s] both to consider the underlying psychological mechanism for ranking, and to formulate correct statistical models for this mechanism.
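To make the first issue concrete, the following is a minimal sketch of the sign test in its textbook form, where ties are simply discarded before an exact two-sided binomial test. The function name and the counts are hypothetical illustrations and are not taken from Figure 1.

```python
from math import comb


def sign_test_ignoring_ties(n_plus: int, n_minus: int) -> float:
    """Exact two-sided sign test p-value; ties must be dropped by the caller."""
    n = n_plus + n_minus
    if n == 0:
        return 1.0
    k = max(n_plus, n_minus)
    # Upper tail P(X >= k) for X ~ Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the tail for a two-sided test, capping the result at 1.
    return min(1.0, 2.0 * tail)


# Hypothetical counts, for illustration only.
n_plus, n_minus, n_zero = 15, 5, 10
p_value = sign_test_ignoring_ties(n_plus, n_minus)
print(f"p = {p_value:.4f} (the {n_zero} ties played no part in the test)")
```

Note that the p-value is unchanged whether there are 10 ties or 1,000, which is exactly the behavior that the authors quoted above find unsatisfactory.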
4. Methods in Machine Translation
Human evaluation in NLP is a pervasive issue, but here we focus on MT and its shared tasks. The 2007 shared task (Callison-Burch et al. 2007) was the first to investigate a range of approaches that specifically included ranking of n translations, from best to worst, allowing ties (which were ignored); from this they defined an aggregate “rank,” “the average number of times that a system was judged to be better than any other system in the sentence ranking evaluation.” They assessed inter-annotator agreement and, with a key goal of the meta-evaluation being to find the automatic evaluation metric that best matched human evaluations, calculated Spearman's rank correlation coefficient between the two types of assessment. The 2008 shared task (Callison-Burch et al. 2008) took the same approach, but noted that in ranking, “[h]ow best to treat these is an open discussion, and certainly warrants further thought,” in particular because of ties “further complicating matters.” Pado et al. (2009) modified the system-level predictions approach to become “tie-aware,” and noted that this “makes a considerable practical difference, improving correlation figures by 5–10 points.” At around the same time Vilar et al. (2007) examined the use of pairwise comparisons in MT evaluation. They pose the problem as one where, given an order relationship is-better-than between pairs of systems, the goal is to find an ordering of all the systems: They see this as the fundamental computer science problem of sorting. They define an aggregate evaluation score for comparing systems, estimating expected value and standard error for hypothesis testing. However, in aggregating this way information about ties is lost.
Bojar et al. (2011) critique the earlier WMT evaluations, citing issues with the ignoring of non-top ranks (noted in Section 3 herein also), with ties and also with interannotator agreement. Lopez (2012) extends the analysis of Bojar et al. and casts the problem as “finding the minimum feedback arc set in a tournament, a well-known NP-complete problem.” He advocates using the pairwise rankings themselves, rather than aggregate statistics like Vilar et al. (2007), and aims to minimize the number of violations among these. Koehn (2012) evaluates empirically the approaches of both Bojar et al. (2011) and Lopez (2012), with a focus on determining which systems are statistically distinguishable in terms of performance, defining confidence bounds for this purpose.
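The idea of ordering systems so as to minimize violated pairwise judgments can be sketched by exhaustive search, which is feasible only for a handful of systems. This is an illustrative brute-force stand-in, not the algorithm of Lopez (2012); the `wins` counts and function names are hypothetical.

```python
from itertools import permutations

# Hypothetical counts: wins[(x, y)] = number of times x was judged better than y.
wins = {
    ("A", "B"): 30, ("B", "A"): 8,
    ("A", "C"): 32, ("C", "A"): 6,
    ("A", "D"): 22, ("D", "A"): 16,
    ("B", "C"): 24, ("C", "B"): 14,
    ("D", "B"): 26, ("B", "D"): 12,
    ("D", "C"): 29, ("C", "D"): 9,
}


def violations(order, wins):
    """Count judgments that contradict a proposed best-to-worst ordering."""
    total = 0
    for i, higher in enumerate(order):
        for lower in order[i + 1:]:
            # A judgment "lower beats higher" violates this ordering.
            total += wins.get((lower, higher), 0)
    return total


def best_ordering(systems, wins):
    """Try every permutation and keep the one with the fewest violations."""
    return min(permutations(systems), key=lambda order: violations(order, wins))


systems = ["A", "B", "C", "D"]
best = best_ordering(systems, wins)
print(best, violations(best, wins))
```

Tied judgments do not appear in `wins` at all, so this sketch, like the aggregate scores discussed above, says nothing about undecidedness.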
Hopkins and May (2013) recently advocated a focus on finding the extent to which particular rankings could be trusted. They proposed a model based on Item Response Theory (IRT), which underlies many standardized tests. They draw an analogy with judges assessing students on the basis of an underlying distribution of the student's ability, with items authored by students having a quality drawn from the student's ability distribution. They note in passing that a Gaussian parameterization of their IRT models resembles Thurstone and Bradley-Terry models; this leads us to the topic of Section 5.
Overall, then, there are ongoing discussions about what kind of analysis is appropriate for preference judgments. Some of this involves moderately heavy-duty computation for bootstrapping; this is suitable for large-scale WMT evaluations with dozens of competing systems, but perhaps less so for the scenarios we envisage in Section 1. Moreover, examining what techniques other fields have developed could be useful, especially when they come with ready-made, easy-to-use tools for smaller-scale evaluation.
5. Preferences and Log-Linear Bradley-Terry Methods
The statistical analysis of human perception and preferences dates back at least to the psychophysics work of German physiologist E. H. Weber in the nineteenth century. A progression from the way humans perceive differences between physical stimuli to more general analysis of human preferences has occurred particularly in the context of investigating consumer behavior (dealing with questions like whether there is a definite preference for a food with a particular type of ingredient, for example), and this is now a fully fledged area of research. Sources like Lawless and Heymann (2010) give overviews of the field and relevant statistical techniques. The earliest generally cited models for pairwise comparisons are the Thurstone model (Thurstone 1927) and the closely related Bradley-Terry (BT) model (Bradley and Terry 1952); these have connections to the IRT models, widely used in analyzing responses to questionnaires, which Hopkins and May (2013) drew on. Here we only look at BT models.
In a basic BT model, the probability that object j ($O_j$) is preferred to object k ($O_k$) from a set of J objects in a particular pairwise comparison jk is given by

$$p(O_j \succ O_k \mid \pi_j, \pi_k) = \frac{\pi_j}{\pi_j + \pi_k}$$

for all j ≠ k, where $\pi_j$ and $\pi_k$ are non-negative “worth” parameters describing the location of the object on the scale of preferences for some attribute. For n objects, there will be $\binom{n}{2} = n(n-1)/2$ pairwise comparisons.
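As a small illustration of the basic BT model, the sketch below computes the preference probability as the worth of one object divided by the sum of the two worths, along with the number of distinct pairs among n objects. The function names and worth values are hypothetical.

```python
from math import comb


def bt_preference_probability(worth_j: float, worth_k: float) -> float:
    """Probability that object j is preferred to object k under a basic BT model."""
    if worth_j < 0 or worth_k < 0 or worth_j + worth_k == 0:
        raise ValueError("worths must be non-negative and not both zero")
    return worth_j / (worth_j + worth_k)


def number_of_pairwise_comparisons(n_objects: int) -> int:
    """Number of distinct unordered pairs among n objects."""
    return comb(n_objects, 2)


# Hypothetical worths: A is three times as "worthy" as B.
worths = {"A": 3.0, "B": 1.0}
print(bt_preference_probability(worths["A"], worths["B"]))
print(number_of_pairwise_comparisons(4))
```

Only the ratio of the worths matters: multiplying both by the same positive constant leaves the probability unchanged, which is why software fixes one reference value.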
Log-Linear Models. It is now standard to fit BT models as log-linear models (see, for example, Agresti 2007), which allows them to be treated in a uniform way with much of modern statistical analysis. Log-linear models are a variety of generalized linear models (GLM), as is, for example, the logistic regression used throughout NLP. GLMs consist of a random component that identifies the response variable Y and selects a probability distribution for it; a systematic component that specifies some linear combination of the explanatory variables xi; and a link function g(·) applied to the mean μ of Y relating μ to this linear combination. They thus have the form g(μ) = α + β1x1 + … + βkxk. For log-linear models, the response variables are counts that are assumed to follow a Poisson distribution, and the link function is g(μ) = log(μ) (compare logistic regression's g(μ) = log(μ / (1 − μ))). As an example, Y might be counts of people who hold some belief, and the various xi might be gender, socioeconomic status, and so forth. GLMs are a key tool for modern categorical data analysis, Agresti (2007, p. 65) noting that using models rather than the non-parametric approaches of Section 3 has several benefits:
The structural form of the model describes the patterns of association and interaction. The sizes of the model parameters determine the strength and importance of the effects. Inferences about the parameters evaluate which explanatory variables affect the response variable Y, while controlling effects of possible confounding variables. Finally, the model's predicted values smooth the data and provide improved estimates of the mean of Y at possible explanatory variable values.
In a log-linear model, intuitive log-odds interpretations of making one response relative to another can be derived from the parameters. (Typically, software chooses a reference parameter and other parameter values are relative to that.) Statistical significance scores and standard errors can be calculated for these parameters. In addition, GLMs allow for testing of model fit. There are various model choices (e.g., should we include ties? should we include terms representing interactions?) and goodness-of-fit tests can assess the alternatives (see, e.g., Agresti 2007, Section 7.2.1). The model with a separate parameter for each cell in the associated contingency table is called the saturated model, and fits the data perfectly, making it a suitable comparator for alternatives. Deviance is a likelihood ratio statistic comparing a proposed model to the saturated one, allowing a test of the hypothesis that parameters not included in the model are zero, via goodness of fit tests; large test statistics and small p-values provide evidence of model lack of fit.
In addition to the theoretical reasons for using LLBTs for modeling pairwise comparisons, a key benefit is the availability of packages in R for doing the modeling. Two candidates allowing a variety of sophisticated models are by Turner and Firth (2012) and Hatzinger and Dittrich (2012); we use the latter as the current version of the former does not handle ties. We first apply the model described by Equations (1) to the single pairwise data with ties from Section 2 using R. We refer the reader to the associated data bundle for the full output; we only excerpt it in the discussion below. Immediately following is a snippet of the R output for the ModPref data from Figure 1. o1 is the variable for the + category, o2 for the – category, g1 for the null preferences.
In the R output, o2 is the reference object, with parameter value set to zero; the negative value of the estimate for g1 combined with its statistical significance says that there is a strong tendency for an expression of preference. The positive value of the o1 parameter and its significance indicate that the + group is strongly preferred: The odds in favor of this group with respect to the – group are exp(2 × 0.2778) = 1.74 to 1. Relating this to the description of the data in Section 2, then, there is a strong preference for translations by the proposed system relative to the baseline, even taking into account null preferences. The LLBT model confirms that even small data sets like this can produce meaningful and statistically significant results. For the other artificial preference data of Figure 1, the parameters behave as expected: for EqualPref, parameter estimates are all zero, signifying that they all have the same odds; for NoPref, the positive g1 indicates a strong tendency towards no preference; for StrongPref, the negative g1 indicates a strong tendency towards some preference, but with + or – equally likely. Note that all of these are saturated models: there are three objects and three parameters, so the model fits perfectly (indicated also by zero residual deviance). When we apply them to the real count data of Figure 1(c), the results indicate that for the Collins et al. data there is a weak to moderate tendency not to choose (g1 estimate 0.303, p = 0.0432), but, given that, there is a significant (p = 0.0001) preference in favor of the reordered system. For the Lewis and Steedman results, the model gives similar results, albeit with a much stronger disposition to null preferences. In the data bundle we also carry out the sign test ignoring ties for each data set for comparison; it gives the same results in each case for the relation of + to –, but does not allow an evaluation of the effect of ties.
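The relationship between the estimates and the odds can be sketched in a few lines. The sketch assumes a common LLBT parameterization for a single comparison with ties, in which the log expected counts are μ + λ for "+", μ − λ for "–" (the reference object at zero), and μ + γ for the undecided category. That form is an assumption here rather than something reproduced from this article, and the counts are hypothetical. Under it, the saturated three-cell model has closed-form estimates.

```python
from math import exp, log


def saturated_llbt(n_plus: int, n_minus: int, n_zero: int):
    """Closed-form estimates under the ASSUMED parameterization:
    log m(+) = mu + lam, log m(-) = mu - lam, log m(0) = mu + gamma."""
    lam = 0.5 * (log(n_plus) - log(n_minus))   # object parameter for "+"
    mu = 0.5 * (log(n_plus) + log(n_minus))    # nuisance parameter
    gamma = log(n_zero) - mu                   # undecided (tie) parameter
    return lam, gamma


def odds_from_lambda(lam_difference: float) -> float:
    """Odds of preferring one object over another, exp(2 * difference)."""
    return exp(2.0 * lam_difference)


# The estimate quoted in the text for ModPref.
print(round(odds_from_lambda(0.2778), 2))

# Hypothetical counts mimicking the four artificial scenarios in spirit only.
for name, counts in {
    "equal": (20, 20, 20),
    "mostly ties": (5, 5, 50),
    "few ties": (30, 30, 3),
    "moderate": (35, 20, 10),
}.items():
    lam, gamma = saturated_llbt(*counts)
    print(f"{name}: lambda = {lam:.3f}, gamma = {gamma:.3f}")
```

In this saturated setting the odds exp(2λ) equal the raw ratio n+ / n−; the value of fitting the model lies in the standard errors, significance tests, and the tie parameter, which the sketch does not compute.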
We now apply the model described by Equations (1) to the multiple pairwise data of Table 1. In the R output, the four systems A, B, C, D correspond to objects o1, o2, o3, o4, and g1 again to null preferences. As per the overview of the MT data in Section 2, there is little undecidedness (large negative g1). The coefficients show that object o1 (system A) is most preferred, followed by o4 (D), then o2 (B) and o3 (C). Note also that in this case, the model is not saturated: There is a non-zero residual deviance. As mentioned, log-linear models can be compared in terms of goodness of fit: Dittrich, Hatzinger, and Katzenbeisser (1998) and Dittrich and Hatzinger (2009) discuss this in some detail for LLBT models. Chi-squared statistics can be used to assess goodness of fit based on the residual deviance; the degrees of freedom (d.f.) equal the number of cell counts minus the number of model parameters; both deviance and d.f. are given in the R output. For this data deviance is 30.646 on 8 d.f., whereas by contrast if the ties (g1) are left out, it is 221.22 on 9 d.f. A chi-squared test would establish the goodness of fit for each model; but even without consulting the test it can be seen that leaving out the one parameter related to ties (1 d.f.) gives a seven-fold increase in deviance, so clearly inclusion of ties produces a much better model.
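The chi-squared computations mentioned above can be sketched with the standard library alone, using the deviances and degrees of freedom quoted in the text. The function `chi2_sf` is a hypothetical helper written for this illustration (an upper-tail probability built from the incomplete gamma recurrence), not part of any package used in the article.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-squared variable, integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)                 # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))      # Q(1/2, h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1).
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Residual deviances and degrees of freedom quoted in the text.
dev_ties, df_ties = 30.646, 8
dev_no_ties, df_no_ties = 221.22, 9

print("with ties:    p =", chi2_sf(dev_ties, df_ties))
print("without ties: p =", chi2_sf(dev_no_ties, df_no_ties))

# Nested-model comparison: the drop in deviance from adding the tie
# parameter, referred to chi-squared on the difference in d.f.
drop = dev_no_ties - dev_ties
print("deviance drop:", round(drop, 3), "p =", chi2_sf(drop, df_no_ties - df_ties))
```

The last comparison is the usual likelihood ratio test for nested models; it asks directly whether the single tie parameter earns its place in the model.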
As do Dittrich and Hatzinger (2009), we define a reference group, with the $\lambda^{O}$'s representing the ordering for that group; the orderings for other groups are obtained by adding the $\lambda^{OS}$'s specific to group l to the $\lambda^{O}$'s for the reference group. $\mu_{(jk)l}$ and $\lambda^{S}_{l}$ are again “nuisance” parameters, the latter representing the main effect of the subject covariate measured on the lth level; the $\lambda^{OS}_{jl}$'s are the (useful) subject-object interaction parameters describing the effect of the subject covariate on the preference for object j (similarly $\lambda^{OS}_{kl}$ and object k). We apply the model described by Equations (2) to the multiple pairwise data, with the subject covariate SUBJ with four levels (one per judge Ji of Table 1). There are a few complexities in interpreting the output, beyond the scope of this article to discuss but covered in Dittrich, Hatzinger, and Katzenbeisser (1998). The broad interpretations to draw from the output are that interactions o1:SUBJ3 and o2:SUBJ3 are large and significant, and contribute to the model, unlike any others. These correspond to the different pairwise rankings given by judge J3 to system A (relative to D) and to B (relative to C): This is how subject effects are indicated in these LLBT models.
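The way group-specific orderings arise from the reference group can be sketched as simple addition: each judge's object parameters are the reference-group object parameters plus that judge's subject-object interaction terms. All numbers below are hypothetical and are not taken from the R output; the assumption that judges without listed interactions share the reference ordering is also part of the illustration.

```python
# Hypothetical object parameters for the reference group.
reference = {"A": 0.9, "B": 0.2, "C": 0.0, "D": 0.6}

# Hypothetical subject-object interactions; judges not listed are
# assumed to have interactions of zero.
interaction = {"J3": {"A": -0.5, "B": -0.4}}


def group_lambdas(reference, interaction, judge):
    """Add a judge's interaction terms to the reference object parameters."""
    shift = interaction.get(judge, {})
    return {obj: lam + shift.get(obj, 0.0) for obj, lam in reference.items()}


def ordering(lambdas):
    """Objects sorted from most to least preferred."""
    return sorted(lambdas, key=lambdas.get, reverse=True)


for judge in ("J1", "J3"):
    print(judge, ordering(group_lambdas(reference, interaction, judge)))
```

With these invented values, the interactions on A and B are enough to reverse the AD and BC pairs for J3 while leaving the other judges' ordering intact, which mirrors the kind of pattern described above.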
There are many other extensions to these models. Cattelan (2012) gives a state-of-the-art overview of such extensions across a range of approaches, with an emphasis on dependent data. We only note two extensions here that are incorporated into prefmod and relevant to NLP. With categorical object covariates, items can be grouped as well, to investigate effects of grouping there, for example, different origins for translation sources. With non-pairwise rankings, judges can rank over more than two elements, as in the standard WMT evaluations, although this needs a special treatment in the models.
6. Conclusion
We have looked at the sort of (pairwise) preference data that is encountered often in NLP. A particular characteristic of NLP data is that ties or undecided results may be frequent, and there is often a concern with inter-judge consistency. Reviewing classical non-parametric approaches, we note the opinion that it is important to model ties, and also note that approaches to looking at subject (judge) effects have several issues, such as a lack of quantitative interpretation of results. Among NLP approaches, especially within MT, new techniques are still being derived, which could benefit from views from outside the field. What we present are techniques from the field of sensory preference evaluation, where there has been a long history of development by statistics researchers. Recently, log-linear models have attracted attention. Applying them to sample data, we find that they provide the sort of information and uniform framework for analysis that NLP researchers could find useful. Given both extensive theoretical underpinnings and freely available statistical software, we recommend LLBT models as a potential tool.
Department of Computing, Macquarie University, NSW 2109, Australia. E-mail: firstname.lastname@example.org.