Comparing Bayesian Models of Annotation

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for (1), and coefficients of agreement for (2) and (3). Lately, model-based analysis of corpus annotations has proven better at all three tasks. But there has been relatively little work comparing the models on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.


Introduction
The standard methodology for analyzing crowdsourced data in NLP is based on majority voting (selecting the label chosen by the majority of coders) and inter-annotator coefficients of agreement, such as Cohen's κ (Artstein and Poesio, 2008). However, aggregation by majority vote implicitly assumes equal expertise among the annotators. This assumption has been repeatedly shown to be false in annotation practice (Poesio and Artstein, 2005; Passonneau and Carpenter, 2014; Plank et al., 2014b). Chance-adjusted coefficients of agreement also have many shortcomings: for example, they count agreements in error as genuine agreement, overestimate chance agreement in datasets with skewed classes, and apply no correction for annotator bias (Feinstein and Cicchetti, 1990; Passonneau and Carpenter, 2014).
Research suggests that models of annotation can solve these problems of standard practice when applied to crowdsourcing (Dawid and Skene, 1979; Smyth et al., 1995; Raykar et al., 2010; Hovy et al., 2013; Passonneau and Carpenter, 2014). Such probabilistic approaches allow us to characterize the accuracy of the annotators and correct for their bias, as well as to account for item-level effects. They have been shown to perform better than non-probabilistic alternatives based on heuristic analysis or adjudication (Quoc Viet Hung et al., 2013). But even though a large number of such models have been proposed (Carpenter, 2008; Whitehill et al., 2009; Raykar et al., 2010; Hovy et al., 2013; Passonneau and Carpenter, 2014; Felt et al., 2015a; Kamar et al., 2015; Moreno et al., 2015, inter alia), it is not immediately obvious to potential users how these models differ or, in fact, how they should be applied at all. To our knowledge, the literature comparing models of annotation is limited, focused exclusively on synthetic data (Quoc Viet Hung et al., 2013) or using publicly available implementations that constrain the experiments almost exclusively to binary annotations (Sheshadri and Lease, 2013).

Contributions
• Our selection of six widely used models (Dawid and Skene, 1979; Carpenter, 2008; Hovy et al., 2013) covers models with varying degrees of complexity: pooled models, which assume all annotators share the same ability; unpooled models, which model individual annotator parameters; and partially pooled models, which use a hierarchical structure to let the level of pooling be dictated by the data.
• We carry out the evaluation on four datasets with varying degrees of sparsity and annotator accuracy in both gold-standard dependent and independent settings.
• We use fully Bayesian posterior inference to quantify the uncertainty in parameter estimates.
• We provide guidelines for both model selection and implementation.
Our findings indicate that models which include annotator structure generally outperform other models, though unpooled models can overfit. Open-source implementations of all the models evaluated here are made available to users (§3).

Bayesian Annotation Models
All Bayesian models of annotation that we describe are generative: They provide a mechanism to generate parameters θ characterizing the process (annotator accuracies and biases, prevalence, etc.) from the prior p(θ), then generate the observed labels y from the parameters according to the sampling distribution p(y|θ). Bayesian inference allows us to condition on some observed data y to draw inferences about the parameters θ; this is done through the posterior, p(θ|y).
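In symbols, the posterior follows from the prior and the sampling distribution by Bayes' rule:

\[ p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} \propto p(y \mid \theta)\, p(\theta). \]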
The uncertainty in such inferences may then be used in applications such as jointly training classifiers (Smyth et al., 1995;Raykar et al., 2010), comparing crowdsourcing systems (Lease and Kazai, 2011), or characterizing corpus accuracy (Passonneau and Carpenter, 2014).
This section describes the six models we evaluate. These models are drawn from the literature, but some had to be generalized from binary to multiclass annotations. The generalization naturally comes with parameterization changes, although these do not alter the fundamentals of the models. (One aspect tied to the model parameterization is the choice of priors. The guideline we followed was to avoid injecting any class preferences a priori and let the data uncover this information; see more in §3.)

Implementation of the Models
We implemented all models in this paper in Stan (Carpenter et al., 2017), a tool for Bayesian inference based on Hamiltonian Monte Carlo. Although the non-hierarchical models we present can be fit with (penalized) maximum likelihood (Dawid and Skene, 1979; Passonneau and Carpenter, 2014), there are several advantages to a Bayesian approach. First and foremost, it provides a means of measuring predictive calibration for forecasting future results. For a well-specified model that matches the generative process, Bayesian inference provides optimally calibrated inferences (Bernardo and Smith, 2001); for only roughly accurate models, calibration may be measured for model comparison (Gneiting et al., 2007). Calibrated inference is critical for making optimal decisions, as well as for forecasting (Berger, 2013).

A second major benefit of Bayesian inference is its flexibility in combining submodels in a computationally tractable manner. For example, predictors or features might be available to allow the simple categorical prevalence model to be replaced with a multi-logistic regression (Raykar et al., 2010), features of the annotators may be used to convert that to a regression model, or semi-supervised training might be carried out by adding known gold-standard labels (Van Pelt and Sorokin, 2012). Each model can be implemented straightforwardly and fit exactly (up to some degree of arithmetic precision) using Markov chain Monte Carlo methods, allowing a wide range of models to be evaluated. This is largely because posteriors are much better behaved than point estimates for hierarchical models, which require custom solutions on a per-model basis for fitting with classical approaches (Rabe-Hesketh and Skrondal, 2008). Both of these benefits make Bayesian inference much simpler and more useful than classical point estimates and standard errors.

Convergence is assessed in a standard fashion using the approach proposed by Gelman and Rubin (1992): For each model we run four chains with diffuse initializations and verify that they converge to the same means and variances (using the criterion R̂ < 1.1).
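As an illustration of the diagnostic (not the implementation Stan uses internally, which reports a split-chain R̂ directly), a minimal sketch of the Gelman-Rubin statistic for a single scalar parameter might look as follows; the function name is ours:

```python
import numpy as np

def gelman_rubin_rhat(chains):
    """Potential scale reduction factor (Gelman and Rubin, 1992).

    chains: array of shape (M, N) -- M chains, N post-warmup draws each,
    for a single scalar parameter.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    b = n * chain_means.var(ddof=1)          # between-chain variance
    w = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_plus = (n - 1) / n * w + b / n       # marginal posterior variance estimate
    return np.sqrt(var_plus / w)

# Example: four well-mixed chains should give R-hat close to 1.
rng = np.random.default_rng(0)
draws = rng.normal(size=(4, 1000))
assert gelman_rubin_rhat(draws) < 1.1
```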
Hierarchical priors, when jointly fit with the rest of the parameters, will be only as strong, and thus support only as much pooling, as is evidenced by the data. For fixed priors on simplexes (probability parameters that must be non-negative and sum to 1.0), we use uniform distributions (i.e., Dirichlet(1_K)). For location and scale parameters, we use weakly informative normal and half-normal priors that inform the scale of the results but are not otherwise sensitive. As with all priors, they trade some bias for variance and stabilize inferences when there is not much data. The exception is MACE, for which we used the originally recommended priors, to conform with the authors' motivation.
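To make these defaults concrete, here is a small sketch of draws from such priors; the variable names and the unit half-normal scale are our illustrative assumptions, not values prescribed by the models:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of classes

# Uniform prior over the K-simplex: Dirichlet with unit concentrations.
prevalence = rng.dirichlet(np.ones(K))

# Weakly informative priors for location and scale parameters:
# a normal for locations, a half-normal for (non-negative) scales.
location = rng.normal(loc=0.0, scale=1.0)
scale = abs(rng.normal(loc=0.0, scale=1.0))  # half-normal via folding
```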
All model implementations are available to readers online at http://dali.eecs.qmul.ac.uk/papers/supplementary_material.zip.

Evaluation
The models of annotation discussed in this paper find their application in multiple tasks: to label items, characterize the annotators, or flag especially difficult items. This section lays out the metrics used in the evaluation of each of these tasks.

Table 1: General statistics (I items, N observations, J annotators, K classes) together with summary statistics for the number of annotators per item (J/I) and the number of items per annotator (I/J) (i.e., Min, 1st Quartile, Median, Mean, 3rd Quartile, and Max).

Datasets
We evaluate on a collection of datasets reflecting a variety of use cases and conditions: binary vs. multi-class classification; small vs. large numbers of annotators; sparse vs. abundant numbers of items per annotator / annotators per item; and varying degrees of annotator quality (statistics presented in Table 1). Three of the datasets (WSD, RTE, and TEMP, created by Snow et al., 2008) are widely used in the literature on annotation models (Carpenter, 2008; Hovy et al., 2013). In addition, we include the Phrase Detectives 1.0 (PD) corpus (Chamberlain et al., 2016), which differs in a number of key ways from the Snow et al. (2008) datasets: It has a much larger number of items and annotations, greater sparsity, and a much greater likelihood of spamming due to its collection in a game-with-a-purpose setting. This dataset is also less artificial than the datasets in Snow et al. (2008), which were created with the express purpose of testing crowdsourcing. The data consist of anaphoric annotations, which we reduce to four general classes (DN/DO = discourse new/old, PR = property, and NR = non-referring).
To ensure similarity with the Snow et al. (2008) datasets, we also limit the coders to one annotation per item (discarded data were mostly redundant annotations). Furthermore, this corpus allows us to evaluate on meta-data not usually available in traditional crowdsourcing platforms, namely, information about confessed spammers and good, established players.

Comparison Against a Gold Standard
The first aspect we assess is how accurately the models identify the correct ("true") label of the items. The simplest way to do this is by comparing the inferred labels against a gold standard, using standard metrics such as Precision / Recall / F-measure, as done, for example, for the evaluation of MACE in Hovy et al. (2013). We check whether the reported differences are statistically significant, using bootstrapping (the shift method), a non-parametric two-sided test (Wilbur, 1994; Smucker et al., 2007). We use a significance threshold of 0.05 and further report whether the significance still holds after applying the Bonferroni correction for type 1 errors. This type of evaluation, however, presupposes that a gold standard can be obtained. This assumption has been questioned by studies showing the extent of disagreement on annotation even among experts (Poesio and Artstein, 2005; Passonneau and Carpenter, 2014; Plank et al., 2014b). This motivates exploring complementary evaluation methods.
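A minimal sketch of such a test, under the assumption of paired per-item 0/1 correctness scores for two models (function and variable names are ours):

```python
import numpy as np

def bootstrap_shift_test(scores_a, scores_b, n_boot=10_000, seed=0):
    """Two-sided bootstrap significance test (shift method) for the
    difference in mean per-item scores of two systems.

    scores_a, scores_b: per-item correctness indicators (0/1) or scores,
    aligned on the same items.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    d_obs = diffs.mean()
    shifted = diffs - d_obs  # shift to enforce the null of no difference
    resamples = rng.choice(shifted, size=(n_boot, len(shifted)), replace=True)
    d_null = resamples.mean(axis=1)
    return np.mean(np.abs(d_null) >= abs(d_obs))  # two-sided p-value

# Reject at the 0.05 level; with Bonferroni, at 0.05 / n_comparisons.
```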

Predictive Accuracy
In the statistical analysis literature, posterior predictions are a standard assessment method for Bayesian models (Gelman et al., 2013). We measure the predictive performance of each model using the log predictive density (lpd), that is, log p(ỹ|y), in a Bayesian K-fold cross-validation setting (Piironen and Vehtari, 2017). The set-up is straightforward: we partition the data into K subsets, each subset formed by splitting the annotations of each annotator into K random folds (we choose K = 5). The splitting strategy ensures that models that cannot handle predictions for new annotators (i.e., unpooled models like D&S and MACE) are nevertheless included in the comparison. Concretely, we compute

\[ \mathrm{lpd} = \sum_{k=1}^{K} \log p(\tilde{y}_k \mid y^{(-k)}) \approx \sum_{k=1}^{K} \log \frac{1}{M} \sum_{m=1}^{M} p(\tilde{y}_k \mid \theta^{(k,m)}). \quad (1) \]

In Equation (1), y^{(-k)} and ỹ_k represent the items from the train and test data, respectively, for iteration k of the cross-validation, while θ^{(k,m)} is one of M draws from the posterior p(θ | y^{(-k)}).
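For instance, assuming each fold's fit yields the log-likelihood of the whole held-out fold under each posterior draw, the inner term of Equation (1) can be computed stably with a log-sum-exp (a sketch; names are ours):

```python
import numpy as np

def fold_lpd(log_lik_draws):
    """Log predictive density of one held-out fold.

    log_lik_draws: array of shape (M,), where entry m is
    log p(y_tilde_k | theta^(k,m)) for posterior draw m.
    """
    m = len(log_lik_draws)
    # log (1/M) * sum_m exp(log_lik_m), via the log-sum-exp trick
    a = np.max(log_lik_draws)
    return a + np.log(np.sum(np.exp(log_lik_draws - a))) - np.log(m)

# Total lpd: sum over the K cross-validation folds, e.g.,
# lpd = sum(fold_lpd(draws_k) for draws_k in all_folds)
```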

Annotators' Characterization
A key property of most of these models is that they provide a characterization of coder ability. In the D&S model, for instance, each annotator is modeled with a confusion matrix; Passonneau and Carpenter (2014) showed how different types of annotators (biased, spamming, adversarial) can be identified by examining this matrix. The same information is available in HIERD&S and LOGRNDEFF, whereas MACE characterizes coders by their level of credibility and spamming preference. We discuss these parameters with the help of the metadata provided by the PD corpus.
Some of the models (e.g., MULTINOM or ITEMDIFF) do not explicitly model annotators. However, an estimate of annotator accuracy can be derived post-inference for all the models. Concretely, we define the accuracy of an annotator as the proportion of their annotations that match the inferred item classes. This follows the calculation of gold-annotator accuracy (Hovy et al., 2013), which is computed with respect to the gold standard. Similar to Hovy et al. (2013), we report the correlation between estimated and gold annotator accuracies.
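A sketch of this post-inference computation (names are ours; Pearson correlation is shown for illustration, as the coefficient is not specified above):

```python
import numpy as np

def annotator_accuracy(annotations, inferred_labels):
    """Post-inference accuracy estimate for one annotator.

    annotations: dict mapping item id -> label chosen by this annotator
    inferred_labels: dict mapping item id -> model's inferred item class
    """
    matches = [annotations[i] == inferred_labels[i] for i in annotations]
    return float(np.mean(matches))

# Correlate estimated accuracies with gold-derived accuracies, e.g.:
# r = np.corrcoef(estimated_accs, gold_accs)[0, 1]
```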

Item Difficulty
Finally, the LOGRNDEFF model also provides an estimate that can be used to assess item difficulty. This parameter has an effect on the correctness of the annotators: namely, there is a subtractive relationship between the ability of an annotator and the item-difficulty parameter. The "difficulty" name is thus appropriate, although an examination of this parameter alone does not explicitly mark an item as difficult or easy. The ITEMDIFF model does not model annotators and only uses the difficulty parameter, but the name is slightly misleading because its probabilistic role changes in the absence of the other parameter (i.e., it now shows the most likely annotation classes for an item). These observations motivate an independent measure of item difficulty, but there is no agreement on what such a measure could be.
One approach is to relate the difficulty of an item to the confidence a model has in assigning it a label. This way, the difficulty of the items is judged under the subjectivity of the models, which in turn is influenced by their sets of assumptions and their fit to the data. As in Hovy et al. (2013), we measure a model's confidence via entropy, filter out the items the models are least confident in (i.e., the more difficult ones), and report accuracy trends.


Results

This section assesses the six models along the different dimensions discussed above. The results are compared with those obtained with a simple majority vote (MAJVOTE) baseline. We do not compare the results with non-probabilistic baselines, as it has already been shown (see, e.g., Quoc Viet Hung et al., 2013) that they underperform compared with a model of annotation.
We follow the evaluation tasks and metrics discussed in §4 and briefly summarized next. A core task for which models of annotation are used is to infer the correct interpretations from a crowdsourced dataset of annotations. This evaluation is conducted first and consists of a comparison against a gold standard. One problem with this assessment is ambiguity: previous studies indicate disagreement even among experts. Because obtaining a true gold standard is therefore questionable, we further explore a complementary evaluation, assessing the predictive performance of the models, a standard evaluation approach in the literature on Bayesian models. Another core task models of annotation are used for is to characterize the accuracy of the annotators and their error patterns. This is the third objective of this evaluation. Finally, we conclude this section by assessing the ability of the models to correctly diagnose the items for which potentially incorrect labels have been inferred.
The PD data are too sparse to fit the models with item-level difficulties (i.e., ITEMDIFF and LOGRNDEFF). These models are therefore not present in the evaluations conducted on the PD corpus.

Comparison Against a Gold Standard
A core task models of annotation are used for is to infer the correct interpretations from crowd-annotated datasets. This section compares the inferred interpretations with a gold standard.
Tables 2, 3, and 4 present the results. On the WSD and TEMP datasets (see Table 4), characterized by a small number of items and annotators (statistics in Table 1), the different model complexities result in no gains, all the models performing equivalently. (The results we report for MAJVOTE, HIERD&S, and LOGRNDEFF match or slightly outperform those reported by Carpenter (2008) on the RTE dataset; similarly for MACE, across the WSD, RTE, and TEMP datasets, Hovy et al., 2013.) Statistically significant differences (0.05 threshold, plus Bonferroni correction for type 1 errors; see §4.2 for details) are, however, very much present in Tables 2 (RTE dataset) and 3 (PD dataset). Here the results are dominated by the unpooled (D&S and MACE) and partially pooled models (LOGRNDEFF and HIERD&S, except for PD, as discussed later in §6.1), which assume some form of annotator structure. Furthermore, modeling the full annotator response matrix leads in general to better results (e.g., D&S vs. MACE on the PD dataset). Completely ignoring annotator structure is rarely appropriate, such models failing to capture the coders' different levels of expertise; see the poor performance of the pooled MULTINOM model and of the partially pooled ITEMDIFF model. Similarly, the MAJVOTE baseline implicitly assumes equal expertise among coders, leading to poor performance.

Predictive Accuracy
Ambiguity causes disagreement even among experts, affecting the reliability of existing gold standards. This section presents a complementary evaluation, namely, predictive accuracy. In a similar spirit to the results obtained in the comparison against the gold standard, modeling the ability of the annotators was also found to be essential for good predictive performance (results presented in Table 5). However, in this type of evaluation, the unpooled models can overfit, affecting their performance (e.g., a model of higher complexity like D&S, on a small dataset like WSD). The partially pooled models avoid overfitting through their hierarchical structure, obtaining the best predictive accuracy. Ignoring the annotator structure (ITEMDIFF and MULTINOM) leads to poor performance on all datasets except WSD, where this assumption is roughly appropriate because all the annotators have very high proficiency (above 95%).

Annotators' Characterization
Another core task models of annotation are used for is to characterize the accuracy and bias of the annotators. We first assess the correlation between the estimated and gold accuracies of the annotators. The results, presented in Table 6, follow the same pattern as those obtained in §5.1: a better performance of the unpooled (D&S and MACE) and partially pooled models (LOGRNDEFF and HIERD&S, except for PD, as discussed later in §6.1). (The results of our MACE reimplementation match the published ones; Hovy et al., 2013.) The results are intuitive: A model that is accurate with respect to the gold standard should also obtain high correlation at the annotator level.

The PD corpus also comes with a list of self-confessed spammers and one of good, established players (see Table 7 for a few details). Continuing with the correlation analysis, an inspection of the second-to-last column of Table 6 shows largely accurate results for the list of spammers. On the second category, however, the non-spammers (the last column), we see large differences between models, following the same pattern as the previous correlation results. An inspection of the spammers' annotations shows an almost exclusive use of the DN (discourse new) class, which is highly prevalent in PD and easy for the models to infer; the non-spammers, on the other hand, make use of all the classes, making it more difficult to capture their behavior.

We further examine some useful parameter estimates for each player type. We chose one spammer and one non-spammer and discuss the confusion matrix inferred by D&S, together with the credibility and spamming preference given by MACE. The two annotators were chosen to be representative of their type; the selection of the models was guided by their two different approaches to capturing the behavior of the annotators. Table 8 presents the estimates for the annotator selected from the list of spammers. Again, inspection of the confusion matrix shows that, irrespective of the true class, the spammer almost always produces the DN label. The MACE estimates are similar, allocating zero credibility to this annotator and full spamming preference for the DN class.
In Table 9 we show the estimates for the annotator chosen from the non-spammers list. Their response matrix indicates an overall good performance (see the matrix diagonal), albeit with a confusion of PR (property) for DN (discourse new), which is not surprising given that indefinite NPs (e.g., a policeman) are the most common type of mention in both classes. MACE allocates large credibility to this annotator and shows a similar spamming preference for the DN class.
This discussion, as well as the quantiles from Table 7, shows that poor accuracy is not by itself a good indicator of spamming. A spammer like the one discussed in this section can obtain good performance by always choosing a class with high frequency in the gold standard. At the same time, a non-spammer may fail to recognize some true classes correctly, but be very good on others. Bayesian models of annotation allow capturing and exploiting these observations. For a model like D&S, such a spammer presents no harm, as their contribution towards any potential true class of the item is the same and therefore cancels out.

Filtering Using Model Confidence
This section assesses the ability of the models to correctly diagnose the items for which potentially incorrect labels have been inferred. Concretely, we identify the items that the models are least confident in (measured using the entropy of the posterior distribution over the true class) and present the accuracy trends as we vary the proportion of filtered-out items. Overall, the trends (Figures 7, 8, and 9) indicate that filtering out the items with low confidence improves the accuracy of all the models and across all datasets.

Figure 7: Effect of filtering on RTE: accuracy (y-axis) vs. proportion of data with lowest entropy (x-axis).
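The filtering step itself is straightforward; a minimal sketch (names are ours), ranking items by the entropy of their posterior class distribution:

```python
import numpy as np

def entropy_filter(class_posteriors, keep_fraction):
    """Rank items by posterior entropy and keep the most confident ones.

    class_posteriors: array of shape (I, K), each row a model's posterior
    distribution over the K classes for one item.
    keep_fraction: fraction of items to keep (lowest entropy first).
    """
    p = np.asarray(class_posteriors, dtype=float)
    # Clip to avoid log(0); zero-probability classes contribute nothing.
    entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=1)
    n_keep = int(len(p) * keep_fraction)
    return np.argsort(entropy)[:n_keep]  # indices of the kept items

# Accuracy on the kept items can then be traced as keep_fraction varies.
```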

Discussion
We found significant differences across a number of dimensions, both among the annotation models themselves and between the models and MAJVOTE.

Observations and Guidelines
The completely pooled model (MULTINOM) underperforms in almost all types of evaluation and all datasets. Its weakness derives from its core assumption: It is rarely appropriate in crowdsourcing to assume that all annotators have the same ability.
The unpooled models (D&S and MACE) assume each annotator has their own response parameter. These models can capture the accuracy and bias of annotators, and perform well in all evaluations against the gold standard. Lower performance is obtained, however, on posterior predictions: The higher complexity of unpooled models results in overfitting, which affects their predictive performance.
The partially pooled models (ITEMDIFF, HIERD&S, and LOGRNDEFF) assume both individual and hierarchical structure (capturing population behavior). These models achieve the best of both worlds, letting the data determine the level of pooling that is required: They asymptote to the unpooled models if there is a lot of variance among the individuals in the population, or to the fully pooled models when the variance is very low. This flexibility ensures good performance both in the evaluations against the gold standard and in terms of their predictive performance.
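Schematically (our notation), with per-annotator parameters β_j drawn from a hierarchical prior

\[ \beta_j \sim \mathrm{Normal}(\mu, \sigma^2), \]

the fitted scale σ controls the pooling: σ → 0 recovers the fully pooled model (all annotators share one value), σ → ∞ the unpooled one (each β_j estimated independently), and jointly fitting μ and σ lets the data choose a point in between.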
Across the different types of pooling, the models that assume some form of annotator structure (D&S, MACE, LOGRNDEFF, and HIERD&S) came out on top in all evaluations. The unpooled models (D&S and MACE) perform on par with the partially pooled ones (LOGRNDEFF and HIERD&S, except for the PD dataset, as discussed later in this section) in the evaluations against the gold standard, but, as previously mentioned, can overfit, affecting their predictive performance. Ignoring any annotator structure (the pooled MULTINOM model, the partially pooled ITEMDIFF model, or the MAJVOTE baseline) generally leads to poor performance.
The approach we took in this paper is domain-independent; that is, we did not assess and compare models that use features extracted from the data, even though it is known that when such features are available, they are likely to help (Raykar et al., 2010; Felt et al., 2015a; Kamar et al., 2015). This is because a proper assessment of such models would also require a careful selection of the features and of how to include them in a model of annotation. A bad (i.e., misspecified in the statistical sense) domain model is going to hurt more than help, as it will bias the other estimates. Providing guidelines for this feature-based analysis would have excessively expanded the scope of this paper. But feature-based models of annotation are extensions of the standard annotation-only models; thus, this paper can serve as a foundation for the development of such models. A few examples of feature-based extensions of standard models of annotation are given in §7 to guide readers who may want to try them out for their specific task or domain.
The domain-independent approach we took in this paper further implies that there are no differences between applying these models to corpus annotation or to other crowdsourcing tasks. This paper is focused on resource creation and does not propose to investigate the performance of the models in downstream tasks. However, previous work has already used such models of annotation for NLP (Plank et al., 2014a; Sabou et al., 2014; Habernal and Gurevych, 2016), image labeling (Smyth et al., 1995; Kamar et al., 2015), and medical (Albert and Dodd, 2004; Raykar et al., 2010) tasks.
Although HIERD&S normally achieves the best performance in all evaluations on the Snow et al. (2008) datasets, on the PD data it is outperformed by the unpooled models (MACE and D&S). To understand this discrepancy, note that the datasets from Snow et al. (2008) were produced using Amazon Mechanical Turk, mainly by highly skilled annotators, whereas the PD dataset was produced in a game-with-a-purpose setting, where most of the annotations were made by only a handful of high-quality coders, the rest being produced by a large number of annotators with much lower abilities. These observations point to a single population of annotators in the former datasets, and to two groups in the latter case. The reason why the unpooled models (MACE and D&S) outperform the partially pooled HIERD&S model on the PD data is that this class of models assumes no population structure, and hence there is no hierarchical influence; a multi-modal hierarchical prior in HIERD&S might be better suited to the PD data. This further suggests that results depend to some extent on dataset specifics, though this does not alter the general guidelines offered in this paper.

Technical Notes
Posterior Curvature. In hierarchical models, complicated posterior curvature increases the difficulty of the sampling process, affecting convergence. This may happen when the data are sparse or when there are large inter-group variances. One way to overcome this problem is to use a non-centered parameterization (Betancourt and Girolami, 2015). This approach separates the local parameters from their parents, easing the sampling process. This often improves the effective sample size and, ultimately, the convergence (i.e., a lower R̂). The non-centered parameterization offers an alternative but equivalent implementation of a model. We found this essential to ensure a robust implementation of the partially pooled models.

Label Switching. The label switching problem that occurs in mixture models is due to the likelihood's invariance under permutation of the labels. This makes the models non-identifiable. Convergence cannot be directly assessed, because the chains will no longer overlap. We use a general solution to this problem from Gelman et al. (2013): re-label the parameters, post-inference, based on a permutation that minimizes some loss function. For this survey, we used a small random sample of the gold data (e.g., five items per class) to find the permutation that maximizes model accuracy for every chain-fit. We then relabeled the parameters of each chain according to the chain-specific permutation before combining them for convergence assessment. This ensures model identifiability and gold alignment.
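A minimal sketch of the chain-wise relabeling step (function names are ours; exhaustive search is feasible because the number of classes K is small here):

```python
import itertools
import numpy as np

def best_permutation(inferred_labels, gold_labels, num_classes):
    """Find the class permutation that maximizes accuracy against a small
    gold sample (e.g., five items per class), used to resolve label
    switching post-inference for each chain.

    inferred_labels, gold_labels: aligned 1-D integer arrays over the
    gold sample, with classes coded 0..num_classes-1.
    """
    inferred = np.asarray(inferred_labels)
    gold = np.asarray(gold_labels)
    best, best_acc = None, -1.0
    for perm in itertools.permutations(range(num_classes)):
        mapping = np.asarray(perm)
        acc = np.mean(mapping[inferred] == gold)  # relabel, then score
        if acc > best_acc:
            best, best_acc = mapping, acc
    return best  # apply to each chain's parameters before merging chains
```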

Related Work
Bayesian models of annotation share many characteristics with so-called item-response and ideal-point models. A popular application of these models is to analyze data associated with individuals and test items. A classic example is the Rasch model (Rasch, 1993), which assumes that the probability of a person answering a test item correctly is based on a subtractive relationship between their ability and the difficulty of the item. The model takes a supervised approach to jointly estimating the ability of the individuals and the difficulty of the test items based on the correctness of their responses. The models of annotation we discussed in this paper are completely unsupervised and infer, in addition to annotator ability and/or item difficulty, the correct labels. More details on item-response models are given in Skrondal and Rabe-Hesketh (2004) and Gelman and Hill (2007). Item-response theory has also recently been applied to NLP (Lalor et al., 2016; Martínez-Plumed et al., 2016; Lalor et al., 2017).
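In its standard form, the Rasch model expresses this subtractive relationship as

\[ \Pr(y_{ij} = 1) = \operatorname{logit}^{-1}(\theta_j - b_i), \]

where θ_j is the ability of person j and b_i the difficulty of test item i.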
The models considered so far take into account only the annotations. There is work, however, that further exploits the features that can accompany the items. A popular example is the model introduced by Raykar et al. (2010), where the true class of an item is made to depend both on the annotations and on a logistic regression model, the two being jointly fit; essentially, the logistic regression replaces the simple categorical model of prevalence. Felt et al. (2014, 2015b) introduced similar models that also model the predictors (features) and compared them to other approaches (Felt et al., 2015a). Kamar et al. (2015) account for task-specific feature effects on the annotations.
In §6.2, we discussed the label switching problem (Stephens, 2000) that many models of annotation suffer from. Other solutions proposed in the literature include utilizing class-informative priors, imposing ordering constraints (obvious for univariate parameters; less so in multivariate cases) (Gelman et al., 2013), or applying different post-inference relabeling techniques (Felt et al., 2014).

Conclusions
This study aims to promote the use of Bayesian models of annotation by the NLP community. These models offer substantial advantages over both agreement statistics (used to judge coding standards) and majority-voting aggregation to generate gold standards (even when used with heuristic censoring or adjudication). To provide assistance in this direction, we compared six existing models of annotation with distinct prior and likelihood structures (e.g., pooled, unpooled, and partially pooled) and a diverse set of effects (annotator ability, item difficulty, or a subtractive relationship between the two). We used various evaluation settings on four datasets, with different levels of sparsity and annotator accuracy, and report significant differences both among the models and between the models and majority voting. As importantly, we provide guidelines to both aid users in the selection of the models and raise awareness of the technical aspects essential to their implementation. We release all models evaluated here as Stan implementations at http://dali.eecs.qmul.ac.uk/paper/supplementary_material.zip.