## Abstract

In typical machine learning applications such as information retrieval, precision and recall are two commonly used measures for assessing an algorithm's performance. Symmetrical confidence intervals based on K-fold cross-validated t distributions are widely used for the inference of precision and recall measures. As we confirmed through simulated experiments, however, these confidence intervals often exhibit lower degrees of confidence, which may easily lead to liberal inference results. Thus, it is crucial to construct faithful confidence (credible) intervals for precision and recall with a high degree of confidence and a short interval length. In this study, we propose two posterior credible intervals for precision and recall based on K-fold cross-validated beta distributions. The first credible interval for precision (or recall) is constructed based on the beta posterior distribution inferred by all K data sets corresponding to K confusion matrices from a K-fold cross-validation. Second, considering that each data set corresponding to a confusion matrix from a K-fold cross-validation can be used to infer a beta posterior distribution of precision (or recall), the second proposed credible interval for precision (or recall) is constructed based on the average of K beta posterior distributions. Experimental results on simulated and real data sets demonstrate that the first credible interval proposed in this study almost always resulted in degrees of confidence greater than 95%. With an acceptable degree of confidence, both of our two proposed credible intervals have shorter interval lengths than those based on a corrected K-fold cross-validated t distribution. 
Meanwhile, the average ranks of these two credible intervals are superior to that of the confidence interval based on a K-fold cross-validated t distribution for the degree of confidence and are superior to that of the confidence interval based on a corrected K-fold cross-validated t distribution for the interval length in all 27 cases of simulated and real data experiments. However, the confidence intervals based on the K-fold and corrected K-fold cross-validated t distributions are in the two extremes. Thus, when focusing on the reliability of the inference for precision and recall, the proposed methods are preferable, especially for the first credible interval.

## 1  Introduction

There are multiple candidate models (i.e., algorithms) for a typical machine learning application and we need to choose one or several among many. In classification tasks with two classes of supervised learning, this is done by comparing the misclassification error, which is the sum of false positives and false negatives. However, as Yildiz, Aslan, and Alpaydin (2011) pointed out, misclassification error does not make a distinction between false positives and false negatives. Thus, many other performance measures have been proposed to evaluate candidate models, such as precision and recall. Precision and recall that are based on a binary contingency table are two measures that are commonly used in machine learning applications such as information retrieval (see Tables 1 and 2).

Table 1: Contingency Table for a Two-Class Classification Problem.

| True class | Predicted positive | Predicted negative | Sum |
|---|---|---|---|
| Positive | TP | FN | P |
| Negative | FP | TN | N |
| Sum | | | |

Note: TP (resp. TN) is the number of true positives (resp. negatives) and FP (resp. FN) the number of false positives (resp. negatives).

Table 2: Performance Measures.

| Name | Formula |
|---|---|
| Error | (FP + FN)/(TP + FP + TN + FN) |
| Precision | TP/(TP + FP) |
| Recall | TP/(TP + FN) |
| F1 score | 2TP/(2TP + FP + FN) |
| Sensitivity | TP/(TP + FN) |
| Specificity | TN/(FP + TN) |
| True positive rate | TP/(TP + FN) |
| False positive rate | FP/(FP + TN) |
| Matthews correlation coefficient | $(TP \cdot TN - FP \cdot FN)/\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$ |

In practice, to be able to eliminate the effect by chance (e.g., variance due to small changes in the training set), one typically does training and validation a number of times, possibly by various resampling methods such as cross-validation and bootstrap (Alpaydin, 1999; Bengio & Grandvalet, 2004; Dietterich, 1998; Efron & Tibshirani, 1993; Hastie, Tibshirani, & Friedman, 2001; Markatou, Tian, Biswas, & Hripcsak, 2005; Nadeau & Bengio, 2003; Wang, Wang, Jia, & Li, 2014; Yildiz, 2013). For example, after deriving K training and validation sets, classification algorithms are trained with the K training sets, and K confusion matrices are subsequently obtained based on the validation sets (Bengio & Grandvalet, 2004; Markatou et al., 2005; Moreno-Torres, Saez, & Herrera, 2012). Then the precision and recall values can be calculated based on the K confusion matrices from K-fold cross-validation, and these are commonly evaluated with two measures: the microaverage and the macroaverage. The so-called microaveraged precision (or recall) is computed based on the average of the corresponding elements of K confusion matrices, while macroaveraged precision (or recall) is the average of K precisions (or recalls) computed by each confusion matrix.

Traditionally, when applying a learning algorithm in machine learning, the focus is typically directed at the single-point micro- and macroaveraged precision and recall values of the algorithm's performance from a K-fold cross-validation. However, as Wang, Li, Li, Wang, and Yang (2015) pointed out, point estimations are rather trivial and do not consider variations of the estimation. In response to this, symmetrical confidence intervals based on K-fold cross-validated t distributions have been proposed. As we confirmed through simulated experiments, however, these confidence intervals often exhibit lower degrees of confidence and short interval lengths (see section 4). This may easily lead to liberal inference results. When confidence intervals are used to compare the performance of two algorithms, for example, the results can be misleading insofar as they can imply that two algorithms are significantly different when in fact they are not.

Furthermore, a theoretical analysis of the posterior distributions of precision and recall in Goutte and Gaussier (2005) shows that they follow a beta distribution. As such, these distributions are generally nonsymmetrical, owing to the occurrence of two different parameters in the beta distribution, as shown in Figure 1. Of course, when these two parameters are the same, a beta distribution is symmetrical, but this rarely occurs because there will almost always be an unequal number of true positives (TPs) and false positives (FPs) (or false negatives, FNs) in practical applications. (See Goutte & Gaussier, 2005, and Wang et al., 2015.) Consider case finding for rare diseases as a practical example. In case finding, a good case-finding model may always have $FP \gg TP$ (due to class imbalance) and $FN \ll TP$. Meanwhile, symmetrical confidence intervals may significantly affect the estimation accuracy of the confidence interval in some cases. This is because the values of precision and recall range between 0 and 1, whereas the symmetrical confidence interval can exceed the range of $[0, 1]$ (Wang et al., 2015). Thus, the use of a symmetrical distribution, such as the commonly used t distribution, may be inappropriate for approximating the distribution of precision and recall, and this can result in large bias and a critically false conclusion.

Figure 1:

Density curves of beta distribution B(a, b) with different parameter combinations of a and b.

To effectively measure the performance of an algorithm, it is crucial to construct faithful confidence (credible) intervals for precision and recall—that is, intervals with a high degree of confidence and a short interval length. In Bayesian statistics, credible intervals are analogous to confidence intervals in frequentist statistics. The degree of confidence of a credible interval is the probability of the inclusion of the true value in the credible interval. Interval length indicates the accuracy of the credible interval. Thus, in this study, two posterior credible intervals for precision and recall are constructed based on a K-fold cross-validated beta distribution.

The remainder of this study is organized as follows. Section 2 defines the standard precision and recall measures of an algorithm's performance and then gives their (single-point) estimations based on a K-fold cross-validation. Section 3 describes the two credible intervals based on K-fold cross-validated beta distributions proposed in this letter, together with the confidence intervals based on K-fold and corrected K-fold cross-validated t distributions. Section 4 discusses the simulated and real data experiments that show how the confidence (credible) intervals compare. Section 5 concludes the study.

## 2  Precision and Recall Measures of an Algorithm's Performance

In studies on a two-class classification problem of machine learning, the performance of the learning algorithm is always assessed with empirical measures, based on the TP, FP, true negative (TN), and FN values of a confusion matrix. In practice, a number of such measures have been developed depending on the type of error under consideration, including the precision value, the recall value, the F1 score, sensitivity, specificity, the TP rate, the FP rate, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the Matthews correlation coefficient as shown in Table 2 (Powers, 2011; Fawcett, 2006; Flach, 2003; Goutte & Gaussier, 2005; Lobo, Jimenez, & Real, 2008; Nadeau & Bengio, 2003; Wang et al., 2015; Yang & Liu, 1999). In this study, we focus on two important performance indicators in machine learning: precision and recall values.

Strictly speaking, the precision and recall values are estimations of the theoretical precision and recall measures for a specific practical application. Thus, we first discuss theoretical precision and recall measures.

### 2.1  Theoretical Precision and Recall Measures

Without loss of generality, in this study, we consider only the following two-class classification problems: each class is associated with a binary label $l \in \{+, -\}$, which accounts for the correctness of the class with respect to the task considered, and the classification algorithm produces a prediction $z$ indicating whether it believes the class to be correct. Then precision may be defined as the probability that a class is positive (+) given that it is returned by the classification algorithm, while the recall is the probability that a positive class is returned (Goutte & Gaussier, 2005; Wang et al., 2015):

$$p = P(l = + \mid z = +), \tag{2.1}$$

$$r = P(z = + \mid l = +). \tag{2.2}$$

### 2.2  Precision and Recall Values Based on a Confusion Matrix

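For a specific two-class classification problem, the experimental outcome may be conveniently summarized in a confusion matrix:

$$\begin{array}{c|cc} & z = + & z = - \\ \hline l = + & TP & FN \\ l = - & FP & TN \end{array}$$

From these counts, one can obtain the empirical precision and recall values shown in Table 2:

$$\hat{p} = \frac{TP}{TP + FP}, \tag{2.3}$$

$$\hat{r} = \frac{TP}{TP + FN}. \tag{2.4}$$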

It is obvious that the precision and recall values are estimations of the theoretical precision and recall measures.
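As a minimal sketch of how the empirical values in Table 2 follow from the four counts, the snippet below wraps a confusion matrix in a small class. The class name `ConfusionMatrix`, its method names, and the example counts are hypothetical choices for illustration only, and the sketch does not guard against zero denominators (for example, when no object is returned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a two-class confusion matrix, laid out as in Table 1."""
    tp: int
    fp: int
    fn: int
    tn: int

    def precision(self) -> float:
        # TP / (TP + FP), the empirical precision of Table 2
        return self.tp / (self.tp + self.fp)

    def recall(self) -> float:
        # TP / (TP + FN), the empirical recall of Table 2
        return self.tp / (self.tp + self.fn)

    def error(self) -> float:
        # Misclassification error: (FP + FN) over all examples
        return (self.fp + self.fn) / (self.tp + self.fp + self.fn + self.tn)

    def f1(self) -> float:
        # F1 score: 2TP / (2TP + FP + FN)
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Hypothetical counts, chosen only to exercise the formulas.
cm = ConfusionMatrix(tp=40, fp=10, fn=20, tn=130)
print(cm.precision(), cm.recall(), cm.error(), cm.f1())
```

Note that the error rate treats FP and FN alike, whereas precision and recall separate them, which is the distinction the introduction draws.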

### 2.3  Microaveraged Precision and Recall Values Based on a K-Fold Cross-Validation

In practice, in order to eliminate the effect by chance (e.g., variance due to small changes in the training set), a resampling method is always used. K-fold cross-validation is probably the simplest and most widely used resampling method. It uses all available examples as training and test examples; it mimics K training and test sets by using some of the data to fit the model and some to test it.

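Formally, the data set $S$ is split into $K$ disjoint and equal-sized blocks, denoted as $T_1, \ldots, T_K$. Let $S_k$ be the training set obtained by removing the elements in $T_k$ from $S$, and let the elements of the confusion matrix returned by algorithm $A$ trained on the set $S_k$ and tested on $T_k$ be briefly denoted as $TP_k$, $FP_k$, $FN_k$, and $TN_k$. The averaged confusion matrix, based on the respective averages of the $K$ $TP_k$s, $FP_k$s, $FN_k$s, and $TN_k$s, has the following form:

$$\begin{array}{c|cc} & z = + & z = - \\ \hline l = + & \frac{1}{K}\sum_{k=1}^{K} TP_k & \frac{1}{K}\sum_{k=1}^{K} FN_k \\ l = - & \frac{1}{K}\sum_{k=1}^{K} FP_k & \frac{1}{K}\sum_{k=1}^{K} TN_k \end{array}$$

Then, from equations 2.3 and 2.4, the microaveraged precision and recall values based on a K-fold cross-validation can be obtained:

$$p_{\mathrm{Micro}} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} TP_k + \sum_{k=1}^{K} FP_k}, \tag{2.5}$$

$$r_{\mathrm{Micro}} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} TP_k + \sum_{k=1}^{K} FN_k}. \tag{2.6}$$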

### 2.4  Macroaveraged Precision and Recall Values Based on a K-Fold Cross-Validation

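The so-called macroaveraged precision (recall) value based on a K-fold cross-validation is the average of the K precisions (recalls), each computed from the confusion matrix obtained with $S_k$ and $T_k$ for $k = 1, \ldots, K$.

Denoting by $p_k$ the precision value computed from the $k$th confusion matrix $(TP_k, FP_k, FN_k, TN_k)$ and by $r_k$ the corresponding recall, the macroaveraged precision and recall values based on a K-fold cross-validation are defined as the averages of the precisions and recalls over the K groups:

$$p_{\mathrm{Macro}} = \frac{1}{K}\sum_{k=1}^{K} p_k = \frac{1}{K}\sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k}, \tag{2.7}$$

$$r_{\mathrm{Macro}} = \frac{1}{K}\sum_{k=1}^{K} r_k = \frac{1}{K}\sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k}, \tag{2.8}$$

where the $TP_k + FN_k$ are all identical for $k = 1, \ldots, K$ in $r_{\mathrm{Macro}}$. Then we have

$$r_{\mathrm{Macro}} = \frac{1}{K}\sum_{k=1}^{K} \frac{TP_k}{TP_k + FN_k} = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FN_k)} = r_{\mathrm{Micro}}.$$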
Remark 1.

From the above analysis, we can see that the macroaveraged and microaveraged recall values are identical. However, no similar conclusion holds for the macroaveraged and microaveraged precision values, because the number of returned objects, TPk + FPk, generally differs from fold to fold.
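The difference between the two averages can be illustrated with a short sketch. It pools the counts for the microaverage and averages the per-fold ratios for the macroaverage; the function name `micro_macro` and the five folds of counts are hypothetical, with the counts chosen so that TPk + FNk is the same in every fold, as the remark assumes.

```python
def micro_macro(folds):
    """Micro- and macroaveraged precision and recall.

    folds: list of (TP_k, FP_k, FN_k) tuples, one per validation block.
    """
    k = len(folds)
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    # Microaverage: pool the counts first, then take the ratio.
    p_micro = tp / (tp + fp)
    r_micro = tp / (tp + fn)
    # Macroaverage: take the ratio per fold, then average the K ratios.
    p_macro = sum(t / (t + f) for t, f, _ in folds) / k
    r_macro = sum(t / (t + n) for t, _, n in folds) / k
    return p_micro, p_macro, r_micro, r_macro


# Hypothetical folds; every fold has TP_k + FN_k = 10 positives.
folds = [(8, 2, 2), (6, 6, 4), (9, 1, 1), (7, 3, 3), (5, 5, 5)]
p_micro, p_macro, r_micro, r_macro = micro_macro(folds)
print(p_micro, p_macro, r_micro, r_macro)
```

With these counts the two recall values coincide, while the two precision values differ because TPk + FPk varies across folds.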

## 3  Credible Intervals for Precision and Recall Measures

In this section, we present four credible and confidence intervals that can be used to infer precision and recall measures. The first two are the posterior credible intervals we propose, and the third and fourth confidence intervals have already been discussed in the literature. The first credible interval for precision (or recall) measure is provided by studying the posterior distribution of the precision (or recall) inferred by all data sets corresponding to K confusion matrices from a K-fold cross-validation. The second credible interval for precision (or recall) is constructed based on the average of K beta posterior distributions, in which each beta posterior distribution is inferred by a data set corresponding to a confusion matrix from K-fold cross-validation. For convenience, we provide several useful lemmas.

Lemma 1.
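Observed $TP$, $FP$, $FN$, and $TN$ counts follow a multinomial distribution with parameters $n$ and $\pi$, denoted by

$$D = (TP, FP, FN, TN) \sim M(n; \pi),$$

where $n = TP + FP + FN + TN$, $\pi = (\pi_{TP}, \pi_{FP}, \pi_{FN}, \pi_{TN})$, and $\pi_{TP} + \pi_{FP} + \pi_{FN} + \pi_{TN} = 1$. Rewriting $D$ as $(n_1, n_2, n_3, n_4)$ and $\pi$ as $(\pi_1, \pi_2, \pi_3, \pi_4)$, we have the following properties:

• Property 1. Each component $n_i$ of $D$ follows a binomial distribution $\mathrm{Bin}(n, \pi_i)$ for $i = 1, \ldots, 4$.

• Property 2. Each component $n_i$ of $D$ conditioned on another component $n_j$ follows a binomial distribution $\mathrm{Bin}(n - n_j, \pi_i/(1 - \pi_j))$ for $i, j = 1, \ldots, 4$ and $i \neq j$.

• Property 3. The sum of $n_i$ and $n_j$ also follows a binomial distribution $\mathrm{Bin}(n, \pi_i + \pi_j)$.

• Property 4. The distribution of $n_i$ given the number of returned objects $n_i + n_j$ is a binomial with parameters $n_i + n_j$ and $\pi_i/(\pi_i + \pi_j)$ for $i, j = 1, \ldots, 4$ and $i \neq j$.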

The proof of lemma 1 and properties 1 to 4 can be found in Goutte and Gaussier (2005). Furthermore, Goutte and Gaussier (2005) revealed that the distributions of precision and recall have the following forms:

Lemma 2.
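$$P(p \mid D) \propto p^{TP + \lambda - 1}(1 - p)^{FP + \lambda - 1},$$

that is, $p \mid D \sim \mathrm{Be}(TP + \lambda, FP + \lambda)$ (beta distribution), where $\mathrm{Be}(\lambda, \lambda)$ is the prior distribution and $\lambda$, $\lambda > 0$, is the prior parameter. A similar development yields the posterior distribution for the recall: $r \mid D \sim \mathrm{Be}(TP + \lambda, FN + \lambda)$.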
Lemma 3.

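Given two independent variables with binomial distributions $X_1 \sim \mathrm{Bin}(n_1, p)$ and $X_2 \sim \mathrm{Bin}(n_2, p)$ with identical parameter $p$, the following property holds:

$$X_1 + X_2 \sim \mathrm{Bin}(n_1 + n_2, p).$$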

Lemma 4.
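Let $U_1, \ldots, U_K$ be random variables with common mean $\mu$ and the following covariance structure:

$$\mathrm{Var}(U_k) = \sigma^2, \qquad \mathrm{Cov}(U_k, U_{k'}) = \gamma, \quad k \neq k'.$$

Let $\rho = \gamma/\sigma^2$ be the correlation between $U_k$ and $U_{k'}$ and $\bar{U} = \frac{1}{K}\sum_{k=1}^{K} U_k$ the sample mean; then $\mathrm{Var}(\bar{U}) = \sigma^2\left(\rho + \frac{1 - \rho}{K}\right)$.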

Lemmas 2 to 4 can be found in Goutte and Gaussier (2005) and Dietterich (1998), respectively.
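To make the beta posterior of Goutte and Gaussier (2005) concrete, the following sketch computes its parameters and a few closed-form summaries for a single confusion matrix. It assumes the posterior form Be(TP + λ, FP + λ) for precision (with FN in place of FP for recall) and the uniform prior λ = 1; the function names and the counts are hypothetical.

```python
def beta_posterior(tp, other, lam=1.0):
    """Posterior parameters (a, b) under a Be(lam, lam) prior.

    `other` is FP for precision and FN for recall.
    """
    return tp + lam, other + lam


def beta_mean(a, b):
    return a / (a + b)


def beta_mode(a, b):
    # Closed form valid only when a > 1 and b > 1.
    return (a - 1) / (a + b - 2)


def beta_variance(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Hypothetical counts: 45 true positives and 5 false positives.
a, b = beta_posterior(tp=45, other=5)
print(a, b, beta_mean(a, b), beta_mode(a, b), beta_variance(a, b))
```

Because a and b differ, the mode (0.9) lies above the mean (about 0.885): the posterior is skewed, which is the asymmetry that a symmetrical interval cannot capture.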

### 3.1  Credible Intervals Constructed Based on Beta Posterior Distributions Inferred by K Data Sets

First, we consider the posterior distribution of precision inferred by the K data sets corresponding to K confusion matrices from a K-fold cross-validation. These data sets are denoted $D_1, D_2, \ldots$, and $D_K$, where $D_k = (TP_k, FP_k, FN_k, TN_k)$ and $k = 1, \ldots, K$. By assuming that the $D_k$s are independent for $k = 1, \ldots, K$, lemma 5 can be obtained:

Lemma 5.

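Provided that the $D_k$s are independent for $k = 1, \ldots, K$, the conditional random variables $\sum_{k=1}^{K} TP_k \mid \sum_{k=1}^{K} (TP_k + FP_k)$ and $\sum_{k=1}^{K} \big(TP_k \mid (TP_k + FP_k)\big)$ have the same distribution, and they all follow a binomial distribution $\mathrm{Bin}\big(\sum_{k=1}^{K} (TP_k + FP_k), p\big)$.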

Proof.

From lemma 1 and property 1, we know that TPk follows for . Combined with the assumption of the independence of TPks, we have follows from lemma 3. For the variable , it is obvious that its distribution is from properties 1, 2, and 3. Then we have follows .

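Thus, property 4 postulates that

$$\sum_{k=1}^{K} TP_k \,\Big|\, \sum_{k=1}^{K} (TP_k + FP_k) \sim \mathrm{Bin}\Big(\sum_{k=1}^{K} (TP_k + FP_k),\, p\Big), \tag{3.1}$$

where $p = \pi_{TP}/(\pi_{TP} + \pi_{FP})$ is the precision. Similarly, from property 4, we know that

$$TP_k \mid (TP_k + FP_k) \sim \mathrm{Bin}(TP_k + FP_k,\, p)$$

for $k = 1, \ldots, K$. Then, combining the independence assumption of the $D_k$s, we have

$$\sum_{k=1}^{K} \big(TP_k \mid (TP_k + FP_k)\big) \sim \mathrm{Bin}\Big(\sum_{k=1}^{K} (TP_k + FP_k),\, p\Big). \tag{3.2}$$

A similar conclusion can be obtained for recall. From equations 3.1 and 3.2, we can see that $\sum_{k=1}^{K} TP_k \mid \sum_{k=1}^{K} (TP_k + FP_k)$ and $\sum_{k=1}^{K} \big(TP_k \mid (TP_k + FP_k)\big)$ follow a binomial distribution $\mathrm{Bin}\big(\sum_{k=1}^{K} (TP_k + FP_k), p\big)$.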

Lemma 2 tells us that if we assume that p has a prior distribution of $\mathrm{Be}(\lambda, \lambda)$, we can infer the posterior distribution of p based on equations 3.1 and 3.2.

Proposition 1.

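Provided that the $D_k$s are independent for $k = 1, \ldots, K$, the posterior distribution of precision is a beta distribution with parameters $\sum_{k=1}^{K} TP_k + \lambda$ and $\sum_{k=1}^{K} FP_k + \lambda$, that is, $p \mid D_1, \ldots, D_K \sim \mathrm{Be}\big(\sum_{k=1}^{K} TP_k + \lambda, \sum_{k=1}^{K} FP_k + \lambda\big)$. A similar development yields the posterior distribution for the recall: $r \mid D_1, \ldots, D_K \sim \mathrm{Be}\big(\sum_{k=1}^{K} TP_k + \lambda, \sum_{k=1}^{K} FN_k + \lambda\big)$.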

Proof.
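Similar to lemma 2, based on equations 3.1 and 3.2, we can write the likelihood of p as

$$L(p) = P(D_1, \ldots, D_K \mid p) \propto p^{\sum_{k=1}^{K} TP_k}(1 - p)^{\sum_{k=1}^{K} FP_k}.$$

Inference on p can then be performed using Bayes rule:

$$P(p \mid D_1, \ldots, D_K) \propto P(D_1, \ldots, D_K \mid p)\,P(p) \propto p^{\sum_{k=1}^{K} TP_k + \lambda - 1}(1 - p)^{\sum_{k=1}^{K} FP_k + \lambda - 1}.$$

That is, $p \mid D_1, \ldots, D_K \sim \mathrm{Be}\big(\sum_{k=1}^{K} TP_k + \lambda, \sum_{k=1}^{K} FP_k + \lambda\big)$. Replacing $FP_k$ by $FN_k$, a similar conclusion can be obtained for recall.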
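The following sketch shows how percentiles of a pooled beta posterior could be turned into an equal-tailed credible interval using only the standard library. It assumes the independence-based posterior Be(ΣTPk + λ, ΣFPk + λ) with λ = 1 and does not apply any correction for the dependence between folds. The CDF is approximated by composite Simpson integration and inverted by bisection; all function names and counts are hypothetical, and the endpoint handling in `beta_pdf` presumes both parameters exceed 1.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b), evaluated in log space for stability."""
    if x <= 0.0 or x >= 1.0:
        return 0.0  # adequate when a, b > 1, where the density vanishes at 0 and 1
    log_c = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_c + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def beta_cdf(x, a, b, steps=2000):
    """Composite Simpson approximation of the Beta(a, b) CDF (steps must be even)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3


def beta_quantile(q, a, b):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pooled_interval(tps, others, lam=1.0, alpha=0.05):
    """Equal-tailed interval from the pooled posterior; `others` holds FP_k (or FN_k)."""
    a = sum(tps) + lam
    b = sum(others) + lam
    return beta_quantile(alpha / 2, a, b), beta_quantile(1 - alpha / 2, a, b)


# Hypothetical per-fold counts from a 5-fold cross-validation.
tps = [18, 20, 17, 19, 21]
fps = [5, 4, 6, 5, 3]
low, high = pooled_interval(tps, fps)
print(low, high)
```

Unlike a symmetrical interval, the two endpoints are percentiles of a distribution supported on [0, 1], so the interval cannot leave that range. In practice a library routine for the beta percentile function would replace the hand-rolled quadrature.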
Remark 2.

Obtaining proposition 1 requires that the Dks be independent. However, the training sets from any two independent partitions in a K-fold cross-validation contain common samples regardless of how the data set is split. In other words, the training sets are related. Furthermore, Bengio and Grandvalet (2004) pointed out that the correlations of training sets in a K-fold cross-validation are not negligible. Thus, the TPks, FPks, and FNks are actually not independent. This results in parameters of and that are greater than the true parameters of them; that is, the true parameters are actually smaller than . For this, the precision and recall should follow the distributions of and with . Here, the problem is that is unknown and needs to be estimated appropriately. When the correlations of the TPks, FPks, and FNks are large, the tends to be small. By contrast, when the correlations of these variables are small, the becomes large. Intuitively, using the average in the interval as the value of is a natural selection, denoted as . Indeed, the average may not be the best choice; however, it provides a solution that is close to the best with a closed form and greatly saves computational cost. (See the discussion based on the simulated experiments in the next section.)

Thus, the resulting credible intervals, defined as and , for precision and recall measures based on the percentiles of the beta distribution are
3.3
3.4
where denotes the percentiles of beta distribution.

### 3.2  Credible Intervals Based on the Average of the K Beta Posterior Distributions

From lemma 2, we know that for a data set $D_k$ corresponding to a confusion matrix from K-fold cross-validation, we have $p \mid D_k \sim \mathrm{Be}(TP_k + \lambda, FP_k + \lambda)$ for $D_k = (TP_k, FP_k, FN_k, TN_k)$ and $k = 1, \ldots, K$. However, the posterior distribution of p depends exclusively on a fractional sample set $D_k$. To use all of the samples to infer the precision and recall, we might consider implementing the average of all $p \mid D_k$. We might also seek to determine whether this average, $p_A$, similarly follows a beta distribution.

Assuming that the Dks are independent of each other, the distribution of pA can be expressed as
3.5
where
and
are the probability density and distribution functions of random variable for , respectively. denotes the density function of the beta distribution with parameters of a and b.

From equation 3.5, we can see that despite the independence of the Dks, the distribution of pA is nevertheless complex and cannot be used directly to construct a credible interval. A straightforward method, however, is to approximate this distribution with a beta distribution, given that pA is an average of the K random variables following a beta distribution. Intuitively, its distribution should be close to a beta distribution:

Proposition 2.
Recalling that and , where and follow and , respectively, for , FPk, and , the distributions of pA and rA can be approximated with and , that is,
3.6
where
Proof.
Here, , and for , FPk, and . By equating the first and second moments of pA and the random variable following beta distribution, we have
On the other hand, we have
However, the variance of pA cannot simply be expressed as the average of the variances of the pks. This is because the correlations between TPks and FPks from a K-fold cross-validation cannot be neglected, as already noted. Thus, from lemma 4, the variance of pA is written as
3.7
where , denotes the correlation of ps from different Dks:
According to the recommendation in Nadeau and Bengio (2003), the ratio of the test sample size to the total sample size should be adopted when estimating , that is, .
From this, one can show that
Similarly, we can develop the approximated distribution of rA, where
and
Based on the obtained , , , and , we have
Thus, the credible intervals based on the above beta distribution, defined as and for precision and recall measures, respectively, have the following forms:
3.8
3.9
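The moment-matching step can be sketched as follows. The sketch makes several assumptions that are not taken from the equations above: each fold contributes a posterior Be(TPk + λ, FPk + λ) with λ = 1, the mean of pA is the average of the K posterior means, and the variance of pA is the average per-fold variance multiplied by ρ + (1 − ρ)/K, the variance factor for a mean of equicorrelated variables, with ρ = 1/K as the test-to-total sample ratio. Function names and counts are hypothetical.

```python
def beta_mean_var(a, b):
    """Mean and variance of Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def match_beta(mean, var):
    """Beta(a, b) with the given first two moments.

    Requires 0 < var < mean * (1 - mean).
    """
    c = mean * (1 - mean) / var - 1
    return mean * c, (1 - mean) * c


def averaged_posterior(tps, others, lam=1.0, rho=None):
    """Approximate beta parameters for the average of K per-fold posteriors."""
    k = len(tps)
    if rho is None:
        rho = 1.0 / k  # assumed: ratio of test size to total size in K-fold CV
    stats = [beta_mean_var(t + lam, o + lam) for t, o in zip(tps, others)]
    mean = sum(m for m, _ in stats) / k
    avg_var = sum(v for _, v in stats) / k
    # Assumed simplification: the average per-fold variance plays the role
    # of the common variance of the equicorrelated variables.
    var = avg_var * (rho + (1 - rho) / k)
    return match_beta(mean, var)


# Hypothetical per-fold counts from a 5-fold cross-validation.
tps = [18, 20, 17, 19, 21]
fps = [5, 4, 6, 5, 3]
a_p, b_p = averaged_posterior(tps, fps)
print(a_p, b_p)
```

Setting `rho=0.0` recovers the independence case, which yields a smaller variance and hence a more concentrated (more liberal) approximating distribution than the correlated case.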
Remark 3.

To further validate the approximate extent of the beta distribution to the true distribution of pA, the density functions of , and the true density function of pA are compared by the following simulated experiment, where and refer to and obtained when from equation 3.7 (i.e., with independent Dks). A similar comparison is also conducted for rA.

#### 3.2.1  Simulated Experiment 1

Density Functions of the True and Approximate Distributions for pA and rA. Considering a classification problem with two classes, we have , with . Here, we take , and , where and denote the five-dimensional vector with the elements of all 0 and 1; I5 denotes the five-order identity matrix, . The sample size is 200.

First, we can obtain the observed , and for with classification tree and support vector machine classifiers. The parameters , , , , and are then computed. Thus, the approximate density functions of pA and rA can be obtained based on the distributions , , , and . Their true density function is computed by kernel density estimation with gaussian kernel.

In this experiment, we provide results from the most commonly used case (i.e., ). However, under other conditions, such as or , similar conclusions can be obtained. Next, we compare the difference of (), () and (), where f refers to the density function.

From Figures 2 and 3, we can see that each of the three density curves has a similar shape for pA and rA regardless of whether a classification tree classifier or a support vector machine classifier is used. However, the density curves of and closely approximate the true densities of pA and rA. Here, both and are based on the independence assumption and express a considerable bias at their peak points with respect to the true distributions of pA and rA. This again suggests the need to correct the parameters of , and . By not correcting these parameters, a liberal credible interval will doubtless obtain. This observation further indicates that the approximate beta distribution is relatively simple and easily adopted when constructing credible intervals compared to the complicated true distributions of pA and rA.

Figure 2:

Density curves of true and approximate distributions for pA and rA with classification tree classifier.

Figure 3:

Density curves of true and approximate distributions for pA and rA with support vector machine classifier.

### 3.3  Symmetrical Confidence Intervals Based on the K-Fold Cross-Validated t Distribution

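Symmetrical confidence intervals (statistical tests of significance) based on the normal or t distribution are widely used in the literature (Bisani & Ney, 2004; Keller, Bengio, & Wong, 2006; Nadeau & Bengio, 2003; Yang & Liu, 1999). The symmetrical confidence intervals based on the K-fold cross-validated t distribution at confidence level $1 - \alpha$ will look like

$$\left[\hat{\mu} - c\sqrt{\hat{\sigma}^2},\ \hat{\mu} + c\sqrt{\hat{\sigma}^2}\right],$$

where $\hat{\mu}$ is a mean estimator based on the average of the K-fold cross-validated estimators, $\hat{\sigma}^2$ is a variance estimator, and $c$ is a percentile from Student's t distribution with $K - 1$ degrees of freedom. Then the confidence intervals of precision and recall are written as

$$\left[p_{\mathrm{Macro}} - c\sqrt{S_p^2/K},\ p_{\mathrm{Macro}} + c\sqrt{S_p^2/K}\right], \tag{3.10}$$

$$\left[r_{\mathrm{Macro}} - c\sqrt{S_r^2/K},\ r_{\mathrm{Macro}} + c\sqrt{S_r^2/K}\right], \tag{3.11}$$

where $S_p^2 = \frac{1}{K-1}\sum_{k=1}^{K}(p_k - p_{\mathrm{Macro}})^2$, $S_r^2 = \frac{1}{K-1}\sum_{k=1}^{K}(r_k - r_{\mathrm{Macro}})^2$.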
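A short sketch of such a symmetrical interval is given below. It assumes the usual form, mean ± c·sqrt(S²/K), with S² the sample variance of the K per-fold values and c the upper percentile of Student's t distribution with K − 1 degrees of freedom; since the standard library has no t percentile function, c is passed in (2.262 for K = 10 at the 95% level). The function name and the per-fold precisions are hypothetical.

```python
import math
import statistics


def t_interval(values, c):
    """Symmetrical interval: mean +/- c * sqrt(S^2 / K)."""
    k = len(values)
    mean = statistics.mean(values)
    s2 = statistics.variance(values)  # sample variance, divisor K - 1
    half = c * math.sqrt(s2 / k)
    return mean - half, mean + half


# Hypothetical per-fold precisions from a 10-fold cross-validation.
p_k = [1.0, 1.0, 0.9, 1.0, 0.8, 1.0, 1.0, 0.95, 1.0, 0.85]
T_975_DF9 = 2.262  # 97.5th percentile of Student's t with 9 degrees of freedom
low, high = t_interval(p_k, T_975_DF9)
print(low, high)
```

With these values the upper endpoint exceeds 1 even though precision cannot, which illustrates the range problem of symmetrical intervals raised in the introduction.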

### 3.4  Symmetrical Confidence Intervals Based on the Corrected K-Fold Cross-Validated t Distribution

Bengio and Grandvalet (2004) showed that the correlation of test blocks cannot be ignored in computing the variance of K-fold cross-validation; otherwise, the variance will be grossly underestimated. Based on this, Grandvalet and Bengio (2006) obtained a corrected K-fold cross-validated t-test by correcting the variance of K-fold cross-validation. If we let be pMacro (or rMacro), be (or ), we can obtain the symmetrical confidence interval based on the corrected K-fold cross-validated t distribution:
3.12
3.13
where () is the ratio of the covariance of pks (rks) for and the variance of pMacro (rMacro). Grandvalet and Bengio (2006) suggested an empirical estimation of by conducting a large number of experiments.

## 4  Simulated Experiments for Comparison

In this section, we first demonstrate with a simulation that false conclusions can follow from using single-point micro- and macroaveraged precision and recall estimations to estimate precision and recall measures; inference based on confidence (credible) intervals may be more suitable. We then investigate the degree of confidence and the interval length of the four credible and confidence intervals based on K-fold cross-validation presented in this study for multiple classifiers on simulated data and on the real letter recognition and MAGIC gamma telescope data sets. For a given problem, we generated 1000 independent data sets to fully take into account the effect of the randomness of the training set, as well as that of the test examples.

For comparison, we took (most commonly used in the literature) in K-fold cross-validation. We chose , the uniform prior, in the beta distribution. The sample sizes were and 1000 for simulated and real data sets. The confidence level , that is, .
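The two evaluation criteria can be computed from a collection of intervals, one per replicated data set, as in the sketch below. It assumes the true value of the measure is known or has been approximated separately, which is an assumption of this illustration rather than a description of the experimental code; the function names and the four intervals are hypothetical.

```python
def degree_of_confidence(intervals, true_value):
    """Fraction of intervals that contain the true value."""
    hits = sum(1 for low, high in intervals if low <= true_value <= high)
    return hits / len(intervals)


def mean_interval_length(intervals):
    """Average length of the intervals."""
    return sum(high - low for low, high in intervals) / len(intervals)


# Hypothetical intervals from four replications; the true precision is taken as 0.80.
intervals = [(0.70, 0.90), (0.75, 0.95), (0.82, 0.98), (0.60, 0.78)]
doc = degree_of_confidence(intervals, 0.80)
length = mean_interval_length(intervals)
print(doc, length)
```

An interval method is preferable when its degree of confidence reaches the nominal level while its average length stays short; the two criteria pull in opposite directions, so both need to be reported.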

### 4.1  Single-Point Estimations of Precision and Recall Based on Micro- and Macroaverages

The simulated data were generated in a manner similar to simulated experiment 1, but we took . The classifier was classification tree. The sample size was 200.

From Table 3, it is clear that the single-point pMacro value is higher than pMicro and that the values of rMacro and rMicro are equivalent. It is always said that the macroaverage is superior to the microaverage in the literature because higher precision and recall values are blindly desirable by the authors. However, as Goutte and Gaussier (2005) and Wang et al. (2015) noted, the point estimation does not consider the variance of the estimation, and thus it is prone to false conclusions. For example, in the case of , the confidence interval for precision based on the K-fold cross-validated t distribution inferred from pMacro was , which obviously includes the values of . This implies that even with a liberal confidence interval, it was difficult to make a distinction between pMacro and pMicro. In other words, the difference between pMacro and pMicro was not statistically significant, and this difference may result from random error. The fact that the conditional random variables and and have the same distribution also validated this point from a different perspective. Thus, it may be more suitable based on confidence (credible) interval to implement the inference for precision and recall. We next compare the degree of confidence and the interval length of four credible and confidence intervals of precision and recall for multiple classifiers on simulated and real data sets.

Table 3: Single-Point Estimation of Precision and Recall in the Case of , , and with Different Combination of .

| | pMac | pMic | rMac = rMic |
|---|---|---|---|
| (1,2) | 0.768 | 0.759 | 0.758 |
| (1,0.1) | 0.919 | 0.915 | 0.929 |
| (0.2,3) | 0.689 | 0.683 | 0.682 |

### 4.2  Comparison of Credible and Confidence Intervals on Simulated Data

The experimental setup in this section was similar to that of section 4.1, in which multiple combinations of , and were considered. The classifiers were a perceptron with one hidden layer, a classification tree, and a support vector machine with gaussian kernel.

Tables 4, 5, and 6 show the simulated results of the degree of confidence and interval length of four credible and confidence intervals based on K-fold cross-validation for precision and recall. First, we see that the confidence intervals based on a K-fold cross-validated t distribution exhibited a lower degree of confidence (below 95%) in almost all cases (in 28 of the 30 cases). For example, in six cases in Table 5, the degrees of confidence for this confidence interval of precision were 90.1%, 92.1%, 90.6%, 88.9%, 92.0%, and 87.9% for the classification tree classifier. In contrast, the degrees of confidence for credible intervals constructed based on the beta posterior distribution inferred by the K data sets corresponding to K confusion matrices from K-fold cross-validation all exceeded 95%. The confidence interval based on the corrected K-fold cross-validated t distribution elevated the degrees of confidence of those based on the K-fold cross-validated t distribution by correcting the variance of the t statistic.

Table 4: Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall Based on Perceptron Classifier.

| Measure | Interval | | Case | Case | Case |
|---|---|---|---|---|---|
| Precision | Beta, K data sets (sec. 3.1) | DOC | 99.9% | 99.4% | 98.3% |
| | | IL | 0.256 | 0.099 | 0.184 |
| | Beta, average (sec. 3.2) | DOC | 99.8% | 98.7% | 63.7% |
| | | IL | 0.237 | 0.099 | 0.196 |
| | K-fold t (sec. 3.3) | DOC | 91.7% | 93.1% | 89.9% |
| | | IL | 0.172 | 0.074 | 0.144 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.5% | 99.4% | 98.2% |
| | | IL | 0.314 | 0.135 | 0.263 |
| Recall | Beta, K data sets (sec. 3.1) | DOC | 97.4% | 98.7% | 98.7% |
| | | IL | 0.254 | 0.101 | 0.234 |
| | Beta, average (sec. 3.2) | DOC | 97.7% | 97.2% | 94.3% |
| | | IL | 0.225 | 0.101 | 0.213 |
| | K-fold t (sec. 3.3) | DOC | 93.4% | 95.3% | 95.3% |
| | | IL | 0.247 | 0.094 | 0.216 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.4% | 99.7% | 99.8% |
| | | IL | 0.452 | 0.172 | 0.395 |

Note: DOC is the degree of confidence and IL the interval length.

Table 5: Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall Based on Classification Tree Classifier.

| Measure | Interval | | Case | Case | Case |
|---|---|---|---|---|---|
| Precision | Beta, K data sets (sec. 3.1) | DOC | 99.7% | 99.6% | 99.7% |
| | | IL | 0.256 | 0.116 | 0.255 |
| | Beta, average (sec. 3.2) | DOC | 99.8% | 99.6% | 99.4% |
| | | IL | 0.236 | 0.116 | 0.235 |
| | K-fold t (sec. 3.3) | DOC | 90.1% | 92.1% | 90.6% |
| | | IL | 0.164 | 0.070 | 0.165 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.1% | 99.2% | 98.8% |
| | | IL | 0.299 | 0.127 | 0.302 |
| Recall | Beta, K data sets (sec. 3.1) | DOC | 97.6% | 97.1% | 97.9% |
| | | IL | 0.254 | 0.116 | 0.253 |
| | Beta, average (sec. 3.2) | DOC | 98.1% | 97.5% | 97.5% |
| | | IL | 0.226 | 0.115 | 0.226 |
| | K-fold t (sec. 3.3) | DOC | 92.4% | 92.9% | 91.8% |
| | | IL | 0.232 | 0.107 | 0.228 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.7% | 99.5% | 99.2% |
| | | IL | 0.423 | 0.195 | 0.416 |

| Measure | Interval | | Case | Case | Case |
|---|---|---|---|---|---|
| Precision | Beta, K data sets (sec. 3.1) | DOC | 99.2% | 98.7% | 99.1% |
| | | IL | 0.219 | 0.090 | 0.221 |
| | Beta, average (sec. 3.2) | DOC | 96.5% | 99.3% | 94.7% |
| | | IL | 0.208 | 0.091 | 0.209 |
| | K-fold t (sec. 3.3) | DOC | 88.9% | 92.0% | 87.9% |
| | | IL | 0.157 | 0.067 | 0.157 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.1% | 99.2% | 99.3% |
| | | IL | 0.287 | 0.122 | 0.286 |
| Recall | Beta, K data sets (sec. 3.1) | DOC | 98.6% | 98.5% | 97.2% |
| | | IL | 0.219 | 0.094 | 0.220 |
| | Beta, average (sec. 3.2) | DOC | 93.3% | 96.2% | 91.9% |
| | | IL | 0.204 | 0.094 | 0.205 |
| | K-fold t (sec. 3.3) | DOC | 94.5% | 93.8% | 91.2% |
| | | IL | 0.203 | 0.083 | 0.194 |
| | Corrected K-fold t (sec. 3.4) | DOC | 99.6% | 99.6% | 98.9% |
| | | IL | 0.370 | 0.151 | 0.355 |

Note: DOC is the degree of confidence and IL the interval length.

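Table 6:
Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall Based on Support Vector Machine Classifier.

| Measure | Interval | Statistic | Case: | Case: | Case: |
| --- | --- | --- | --- | --- | --- |
| Precision | Beta posterior inferred by K data sets | DOC | 99.6% | 99.2% | 98.5% |
| | | IL | 0.254 | 0.073 | 0.162 |
| | Average of K beta posteriors | DOC | 99.6% | 95.2% | 72.8% |
| | | IL | 0.234 | 0.075 | 0.172 |
| | K-fold cross-validated t | DOC | 90.1% | 93.9% | 92.2% |
| | | IL | 0.163 | 0.055 | 0.117 |
| | Corrected K-fold cross-validated t | DOC | 99.1% | 99.4% | 99.2% |
| | | IL | 0.297 | 0.100 | 0.213 |
| Recall | Beta posterior inferred by K data sets | DOC | 97.9% | 98.4% | 97.4% |
| | | IL | 0.253 | 0.073 | 0.162 |
| | Average of K beta posteriors | DOC | 97.4% | 92.7% | 71.7% |
| | | IL | 0.226 | 0.076 | 0.172 |
| | K-fold cross-validated t | DOC | 91.7% | 94.9% | 91.3% |
| | | IL | 0.222 | 0.062 | 0.136 |
| | Corrected K-fold cross-validated t | DOC | 99.1% | 99.5% | 99.0% |
| | | IL | 0.405 | 0.113 | 0.248 |

| Measure | Interval | Statistic | Case: | Case: | Case: |
| --- | --- | --- | --- | --- | --- |
| Precision | Beta posterior inferred by K data sets | DOC | 99.9% | 100% | 99.8% |
| | | IL | 0.204 | 0.092 | 0.204 |
| | Average of K beta posteriors | DOC | 98.6% | 99.7% | 98.9% |
| | | IL | 0.196 | 0.092 | 0.193 |
| | K-fold cross-validated t | DOC | 91.1% | 93.5% | 91.4% |
| | | IL | 0.144 | 0.061 | 0.127 |
| | Corrected K-fold cross-validated t | DOC | 99.3% | 99.8% | 99.1% |
| | | IL | 0.262 | 0.111 | 0.230 |
| Recall | Beta posterior inferred by K data sets | DOC | 99.4% | 98.6% | 98.7% |
| | | IL | 0.194 | 0.078 | 0.147 |
| | Average of K beta posteriors | DOC | 93.9% | 95.2% | 56.7% |
| | | IL | 0.190 | 0.080 | 0.165 |
| | K-fold cross-validated t | DOC | 93.6% | 94.5% | 90.7% |
| | | IL | 0.169 | 0.066 | 0.120 |
| | Corrected K-fold cross-validated t | DOC | 99.6% | 99.7% | 98.3% |
| | | IL | 0.309 | 0.121 | 0.219 |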

However, the credible interval based on the average of the K beta posterior distributions from K-fold cross-validation returned mixed results. In 10 of the 30 cases, its degree of confidence fell below 95%. One example is the situation represented by the case of for a perceptron classifier in Table 4, where its degree of confidence was only 63.7%, far below 95%. This can be explained by the fact that this method simply averages the K per-fold results, and this average is strongly affected by a single poor result. That is, this method is less robust than a credible interval constructed based on a beta posterior distribution inferred by the K data sets corresponding to K confusion matrices.
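The contrast described above can be sketched as follows. This is only an illustration under stated assumptions: a symmetric Beta(`lam`, `lam`) prior with `lam = 1`, made-up per-fold (TP, FP) counts, pooling by summing the fold counts, and a reading of "the average of the K beta posterior distributions" as an equal-weight mixture of the per-fold posteriors. None of these details is confirmed by this section, and the quantiles are approximated by Monte Carlo draws rather than computed exactly.

```python
import random


def equal_tailed_interval(samples, level=0.95):
    """Equal-tailed interval taken from the empirical quantiles of posterior draws."""
    s = sorted(samples)
    tail = (1.0 - level) / 2.0
    lo = s[int(tail * (len(s) - 1))]
    hi = s[int((1.0 - tail) * (len(s) - 1))]
    return lo, hi


def pooled_interval(folds, lam=1.0, draws=20000, seed=0):
    """Interval from ONE beta posterior built from the summed counts of all K folds."""
    rng = random.Random(seed)
    tp = sum(t for t, _ in folds)
    fp = sum(f for _, f in folds)
    samples = [rng.betavariate(tp + lam, fp + lam) for _ in range(draws)]
    return equal_tailed_interval(samples)


def averaged_interval(folds, lam=1.0, draws=20000, seed=0):
    """Interval from the equal-weight mixture of the K per-fold beta posteriors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(draws):
        tp, fp = rng.choice(folds)  # pick one fold's posterior uniformly at random
        samples.append(rng.betavariate(tp + lam, fp + lam))
    return equal_tailed_interval(samples)


# Hypothetical (TP, FP) counts for K = 5 folds; the last fold is deliberately poor.
folds = [(18, 2), (17, 3), (19, 1), (18, 2), (6, 14)]
print("pooled  :", pooled_interval(folds))
print("averaged:", averaged_interval(folds))
```

With these made-up counts, the single poor fold drags the lower end of the averaged interval far below that of the pooled interval, which is one way an averaged posterior can end up poorly placed relative to the true value.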

Indeed, the degree of confidence is not the only important consideration when choosing a statistical confidence (credible) interval. Another measure for the confidence (credible) interval is the interval length. From Tables 4, 5, and 6, we can see that the interval length of the confidence interval based on a K-fold cross-validated t distribution was the shortest among the four credible and confidence intervals. At the same time, however, it had the lowest degree of confidence.

It is thus important to consider how these two factors should be traded off. In general, when the degrees of confidence are comparable, confidence (credible) intervals are compared by their interval lengths. That is, for a given degree of confidence, a fundamental principle for selecting the confidence (credible) interval is to select the one with the shortest interval length (Mao, Wang, & Pu, 2006; Shi, 2008; Shao, 2003). With an acceptable degree of confidence (above 95%), the credible intervals of precision based on the average of the K beta posterior distributions had interval lengths shorter than or comparable to those based on the beta posterior distribution inferred by the K data sets. Moreover, the interval lengths of these two credible intervals were both shorter than those based on the corrected K-fold cross-validated t distribution. Consider, for example, the case of in Table 6, classified using a support vector machine classifier. In this case, the interval length of the confidence interval of precision based on the corrected K-fold cross-validated t distribution was 0.262, whereas the interval lengths of the credible intervals based on the beta posterior distribution inferred by the K data sets and on the average of the K posterior distributions were 0.204 and 0.196, respectively.

In particular, when the sample size increased, there was little change to the degree of confidence for the credible and confidence intervals. However, their interval lengths decreased by approximately half.

Remark 4.

In the extreme case where precision and recall were 1, the degree of confidence was 0 with the proposed credible intervals based on a K-fold cross-validated beta distribution with a confidence level of . This was demonstrated in the case where , and for a support vector machine classifier. This situation arose because the quantile of the beta distribution is strictly less than 1 when . Thus, the credible intervals do not include the true value of 1.

In fact, in this special case, precision and recall are fixed, not random variables, and thus the credible interval has degenerated into the confidence interval of frequentist statistics. Furthermore, because the precision and recall values were all equal to 1 in all replicated experiments, the variance of the estimation was zero. Thus, the traditional symmetrical interval estimation will actually degenerate into a point estimation.
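The exclusion of the value 1 can be checked in closed form in one simple setting. The sketch below assumes a uniform Beta(1, 1) prior, an equal-tailed interval, and a 95% level, none of which is confirmed by this section. With no false positives among `n` positive predictions, the posterior is then Beta(n + 1, 1), whose CDF is x^(n+1), so its quantiles are available without any special functions. The function names are illustrative.

```python
def beta_a1_quantile(p, a):
    """Quantile of Beta(a, 1): the CDF is x**a, so the quantile is p**(1/a)."""
    return p ** (1.0 / a)


def interval_when_all_correct(n_positive_predictions, level=0.95):
    """Equal-tailed credible interval for precision when FP = 0.

    Assumes a uniform Beta(1, 1) prior, so the posterior is Beta(n + 1, 1).
    """
    a = n_positive_predictions + 1
    tail = (1.0 - level) / 2.0
    return beta_a1_quantile(tail, a), beta_a1_quantile(1.0 - tail, a)


for n in (20, 100, 500):
    lo, hi = interval_when_all_correct(n)
    # The upper endpoint approaches 1 as n grows but never reaches it.
    print(n, round(lo, 4), round(hi, 6), "covers 1:", lo <= 1.0 <= hi)
```

However large `n` becomes, the upper endpoint stays below 1, so an interval of this kind cannot cover a true precision of exactly 1.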

### 4.3  Comparison of Credible and Confidence Intervals on Real Data

Two data sets from the UCI database, the letter recognition data and the MAGIC gamma telescope data, were considered in this section (Frey & Slate, 1991; Heck, Knapp, Capdevielle, & Thouw, 1998). The letter recognition data, used for identifying the letters of the Roman alphabet, comprise 20,000 examples described by 16 features. The 26 letters represent 26 categories; as in Nadeau and Bengio (2003) and Wang et al. (2014), the task was turned into a two-class (A–M versus N–Z) classification problem. In the MAGIC gamma telescope data, depending on the energy of the primary gamma, 10 features are used to discriminate statistically the images caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). We sampled, with replacement, 200 or 1000 examples from the 20,000 examples available in the letter recognition data and from the 13,376 examples available in the MAGIC gamma telescope data. Repeating this 1000 times, we then computed the degrees of confidence and interval lengths of the four credible and confidence intervals from the 1000 data sets obtained.
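The resampling protocol above can be sketched as a short loop. This is a simplified illustration, not the experimental code: the pool is a synthetic list of 0/1 correctness flags, the "true value" is taken to be the pool mean, no classifier or K-fold split is involved, and `wald_interval` is a stand-in normal-approximation interval rather than any of the four intervals compared here. All names are hypothetical.

```python
import math
import random


def wald_interval(flags, z=1.96):
    """Stand-in interval (normal approximation); swap in any interval constructor."""
    p = sum(flags) / len(flags)
    half = z * math.sqrt(p * (1.0 - p) / len(flags))
    return max(0.0, p - half), min(1.0, p + half)


def evaluate(pool, n, replications, make_interval, seed=0):
    """Resample n examples with replacement `replications` times.

    Returns (degree of confidence, mean interval length), where the degree of
    confidence is the share of intervals that cover the pool-level value.
    """
    rng = random.Random(seed)
    true_value = sum(pool) / len(pool)
    intervals = [make_interval(rng.choices(pool, k=n)) for _ in range(replications)]
    doc = sum(lo <= true_value <= hi for lo, hi in intervals) / replications
    il = sum(hi - lo for lo, hi in intervals) / replications
    return doc, il


# Synthetic pool of 20,000 correctness flags with roughly 80% ones.
pool_rng = random.Random(1)
pool = [1 if pool_rng.random() < 0.8 else 0 for _ in range(20000)]

for n in (200, 1000):
    doc, il = evaluate(pool, n, 1000, wald_interval)
    print(n, round(doc, 3), round(il, 3))
```

Passing a different `make_interval` callable is enough to compare several interval constructions on the same resampled data.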

As with the simulated data, the credible intervals constructed based on the beta posterior distribution inferred by the K data sets corresponding to K confusion matrices from K-fold cross-validation for precision and recall achieved a high degree of confidence in almost all cases, as shown in Tables 7, 8, 9, and 10. One exception arose at the sample size of 1000 with the perceptron classifier: the degrees of confidence for recall were merely 83.5% and 87.7% for the letter recognition and the MAGIC gamma telescope data, respectively.

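Table 7:
Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall at Sample Size 200 for Letter Recognition Data.

| Classifier | Statistic | Precision: beta posterior inferred by K data sets | Precision: average of K beta posteriors | Precision: K-fold t | Precision: corrected K-fold t | Recall: beta posterior inferred by K data sets | Recall: average of K beta posteriors | Recall: K-fold t | Recall: corrected K-fold t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classification tree classifier | DOC | 99.1% | 98.0% | 90.6% | 98.9% | 98.2% | 95.4% | 93.6% | 99.5% |
| | IL | 0.238 | 0.221 | 0.168 | 0.306 | 0.237 | 0.215 | 0.220 | 0.403 |
| Perceptron classifier | DOC | 99.2% | 97.7% | 94.4% | 99.8% | 98.3% | 93.9% | 95.6% | 99.9% |
| | IL | 0.234 | 0.218 | 0.168 | 0.306 | 0.234 | 0.213 | 0.221 | 0.404 |
| Support vector machine classifier | DOC | 99.7% | 99.5% | 90.6% | 99.2% | 95.7% | 94.1% | 90.5% | 99.4% |
| | IL | 0.237 | 0.221 | 0.154 | 0.282 | 0.235 | 0.215 | 0.212 | 0.388 |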
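Table 8:
Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall at Sample Size 1000 for Letter Recognition Data.

| Classifier | Statistic | Precision: beta posterior inferred by K data sets | Precision: average of K beta posteriors | Precision: K-fold t | Precision: corrected K-fold t | Recall: beta posterior inferred by K data sets | Recall: average of K beta posteriors | Recall: K-fold t | Recall: corrected K-fold t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classification tree classifier | DOC | 97.2% | 96.6% | 90.9% | 98.7% | 98.0% | 89.7% | 85.9% | 99.1% |
| | IL | 0.100 | 0.100 | 0.077 | 0.140 | 0.103 | 0.102 | 0.102 | 0.186 |
| Perceptron classifier | DOC | 96.0% | 96.3% | 93.3% | 99.6% | 83.5% | 88.3% | 87.3% | 97.6% |
| | IL | 0.096 | 0.096 | 0.086 | 0.157 | 0.095 | 0.094 | 0.131 | 0.240 |
| Support vector machine classifier | DOC | 98.4% | 99.1% | 83.5% | 97.7% | 96.7% | 97.7% | 91.7% | 99.1% |
| | IL | 0.113 | 0.113 | 0.069 | 0.126 | 0.114 | 0.113 | 0.099 | 0.181 |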
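Table 9:
Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall at Sample Size 200 for MAGIC Gamma Telescope Data.

| Classifier | Statistic | Precision: beta posterior inferred by K data sets | Precision: average of K beta posteriors | Precision: K-fold t | Precision: corrected K-fold t | Recall: beta posterior inferred by K data sets | Recall: average of K beta posteriors | Recall: K-fold t | Recall: corrected K-fold t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classification tree classifier | DOC | 98.9% | 98.5% | 87.8% | 99.2% | 98.6% | 91.0% | 94.2% | 99.7% |
| | IL | 0.226 | 0.213 | 0.165 | 0.302 | 0.229 | 0.211 | 0.209 | 0.381 |
| Perceptron classifier | DOC | 99.3% | 98.0% | 90.9% | 99.0% | 95.3% | 84.5% | 91.1% | 99.1% |
| | IL | 0.245 | 0.229 | 0.182 | 0.332 | 0.244 | 0.217 | 0.254 | 0.463 |
| Support vector machine classifier | DOC | 97.3% | 97.9% | 81.3% | 96.2% | 98.1% | 95.2% | 92.2% | 99.2% |
| | IL | 0.240 | 0.224 | 0.164 | 0.300 | 0.241 | 0.219 | 0.208 | 0.379 |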
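Table 10:
Degrees of Confidence and Interval Lengths of Credible and Confidence Intervals for Precision and Recall at Sample Size 1000 for MAGIC Gamma Telescope Data.

| Classifier | Statistic | Precision: beta posterior inferred by K data sets | Precision: average of K beta posteriors | Precision: K-fold t | Precision: corrected K-fold t | Recall: beta posterior inferred by K data sets | Recall: average of K beta posteriors | Recall: K-fold t | Recall: corrected K-fold t |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Classification tree classifier | DOC | 97.6% | 97.0% | 88.4% | 98.6% | 95.2% | 95.5% | 91.9% | 99.2% |
| | IL | 0.096 | 0.097 | 0.074 | 0.135 | 0.103 | 0.103 | 0.095 | 0.174 |
| Perceptron classifier | DOC | 98.4% | 98.6% | 93.5% | 99.6% | 87.7% | 85.2% | 93.3% | 99.5% |
| | IL | 0.105 | 0.106 | 0.083 | 0.152 | 0.105 | 0.103 | 0.139 | 0.253 |
| Support vector machine classifier | DOC | 99.6% | 99.5% | 90.3% | 99.4% | 99.3% | 98.7% | 93.3% | 99.9% |
| | IL | 0.114 | 0.114 | 0.072 | 0.132 | 0.114 | 0.113 | 0.101 | 0.185 |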

For precision, the credible interval based on the average of the K beta posterior distributions from K-fold cross-validation had an acceptable degree of confidence at both sample sizes on both real data sets. For recall, however, its degree of confidence fell below 95% in 7 of 12 cases. Similarly, the confidence interval based on the K-fold cross-validated t distribution exhibited a degraded degree of confidence.

With an acceptable degree of confidence (above 95%), the credible interval based on the average of the K beta posterior distributions had the shortest interval length among the confidence and credible intervals. In particular, when the sample size was 1000, with an acceptable degree of confidence, the credible intervals based on the beta posterior distribution inferred by the K data sets and those based on the average of the K beta posterior distributions were of comparable length for both the letter recognition data and the MAGIC gamma telescope data. Specifically, the interval lengths of the credible intervals based on the beta distribution were 71.4%, 61.1%, 89.6%, 71.1%, 69.1%, and 86.4% of those of the confidence intervals based on the corrected K-fold cross-validated t distribution for the classification tree, perceptron, and support vector machine classifiers in the letter recognition and the MAGIC gamma telescope data, respectively.

The results in Tables 7 to 10 from the two real data sets show that the interval length again decreased by roughly half as the sample size increased from 200 to 1000. This implies that the sample size has a significant impact on the interval length of the confidence (credible) interval.
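As a rough consistency check on this halving, assume only that the interval length scales with the standard deviation of a proportion, which is approximately $\sqrt{p(1-p)/n}$ for a proportion $p$ estimated from $n$ examples. The symbols $p$, $n$, and $\mathrm{IL}_n$ are introduced here for illustration and are not notation from this letter.

```latex
\frac{\mathrm{IL}_{1000}}{\mathrm{IL}_{200}}
  \approx \sqrt{\frac{p(1-p)/1000}{p(1-p)/200}}
  = \sqrt{\frac{200}{1000}}
  = \frac{1}{\sqrt{5}}
  \approx 0.447
```

For comparison, the first precision interval lengths reported for the classification tree classifier on the letter recognition data are 0.238 at a sample size of 200 and 0.100 at a sample size of 1000, a ratio of about 0.42.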

### 4.4  Average Ranks of Four Credible and Confidence Intervals

To further investigate this problem, we compared the average ranks of the four credible and confidence intervals with regard to their degree of confidence and interval length in all 27 cases of simulated and real data sets. Table 11 shows the results based on the 15 cases from the simulated data sets and the 12 cases from the real data sets.

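Table 11:
Average Ranks of Four Credible and Confidence Intervals.

| Average rank | Beta posterior inferred by K data sets | Average of K beta posteriors | K-fold cross-validated t | Corrected K-fold cross-validated t |
| --- | --- | --- | --- | --- |
| DOC | 1.6 (1) | 2.4 (3) | 3.9 (4) | 2 (2) |
| IL | 2.5 (3) | 2.2 (2) | 1 (1) | 4 (4) |

| Average rank | Beta posterior inferred by K data sets | Average of K beta posteriors | K-fold cross-validated t | Corrected K-fold cross-validated t |
| --- | --- | --- | --- | --- |
| DOC | 2.3 (2) | 3.1 (3) | 3.6 (4) | 1.0 (1) |
| IL | 2.6 (3) | 1.8 (2) | 1.4 (1) | 4 (4) |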

The confidence interval based on the K-fold cross-validated t distribution ranked first for interval length but last among the four methods for degree of confidence. By contrast, the confidence interval based on the corrected K-fold cross-validated t distribution ranked first or second for degree of confidence but last for interval length. The two credible intervals proposed in this letter generally lay between the confidence intervals based on the K-fold and the corrected K-fold cross-validated t distributions. Among the intervals with an acceptable degree of confidence, our two methods ranked first and second for interval length, ahead of the confidence interval based on the corrected K-fold cross-validated t distribution. The K-fold cross-validated t interval is left out of this comparison because its degrees of confidence were below 95% in nearly all cases.
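An average-rank comparison of this kind can be sketched as below. The ranking rule is an assumption: a higher degree of confidence and a shorter interval length are each treated as better, and ties share the mean of their positions. The three example cases are the precision entries for the three classifiers on the letter recognition data at a sample size of 200, with the four values in each case ordered as beta posterior inferred by K data sets, average of K beta posteriors, K-fold t, and corrected K-fold t. The function names are illustrative.

```python
def ranks(values, higher_is_better):
    """Rank 1 = best; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = shared
        pos = end + 1
    return result


def average_ranks(cases, higher_is_better):
    """Average the per-case ranks of each method over all cases."""
    per_case = [ranks(case, higher_is_better) for case in cases]
    return [sum(r[j] for r in per_case) / len(per_case) for j in range(len(cases[0]))]


# Precision DOC (%) and IL for three classifiers; four methods per case.
doc = [[99.1, 98.0, 90.6, 98.9], [99.2, 97.7, 94.4, 99.8], [99.7, 99.5, 90.6, 99.2]]
il = [[0.238, 0.221, 0.168, 0.306], [0.234, 0.218, 0.168, 0.306], [0.237, 0.221, 0.154, 0.282]]

print("DOC average ranks:", average_ranks(doc, higher_is_better=True))
print("IL average ranks :", average_ranks(il, higher_is_better=False))
```

On these three cases alone, the K-fold t interval is always shortest and always has the lowest degree of confidence, which mirrors the pattern in the full ranking.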

### 4.5  Choice of

In the construction of the credible interval based on the beta posterior distribution inferred by the K data sets corresponding to K confusion matrices from K-fold cross-validation for precision and recall, the choice of is very important. A poor may affect the degree of confidence and interval length of the credible interval. Thus, in this section, we experimentally studied how the degree of confidence and interval length change as the value of changes. The experimental results are given in Table 12.

Table 12:
Changes of the Degree of Confidence and Interval Length as the Changes of the Values of the .

| Value | Letter Recognition Data, Classification Tree Classifier: DOC | IL | Letter Recognition Data, Support Vector Machine Classifier: DOC | IL | MAGIC Gamma Telescope Data, Classification Tree Classifier: DOC | IL |
| --- | --- | --- | --- | --- | --- | --- |
| = 0.10 | 100% | 0.504 | 100% | 0.501 | 100% | 0.488 |
| | 100% | 0.503 | 100% | 0.499 | 100% | 0.490 |
| = 0.45 | 99.8% | 0.262 | 99.9% | 0.261 | 99.8% | 0.250 |
| | 98.8% | 0.261 | 96.4% | 0.259 | 98.7% | 0.252 |
| = 0.55 | 99.1% | 0.238 | 99.7% | 0.237 | 98.9% | 0.226 |
| | 98.2% | 0.237 | 95.7% | 0.235 | 98.6% | 0.229 |
| 0.65 | 98.6% | 0.220 | 99.3% | 0.220 | 98.4% | 0.210 |
| | 96.8% | 0.219 | 94.0% | 0.218 | 97.4% | 0.213 |
| 0.75 | 98.2% | 0.205 | 98.5% | 0.205 | 97.5% | 0.196 |
| | 95.6% | 0.204 | 92.1% | 0.204 | 94.5% | 0.198 |
| 0.85 | 97.6% | 0.193 | 97.1% | 0.193 | 96.8% | 0.185 |
| | 92.5% | 0.192 | 89.6% | 0.191 | 91.6% | 0.187 |
| 0.95 | 96.3% | 0.183 | 97.0% | 0.183 | 95.8% | 0.175 |
| | 92.7% | 0.182 | 84.9% | 0.182 | 92.8% | 0.177 |
| | 95.5% | 0.178 | 96.6% | 0.178 | 93.8% | 0.170 |
| | 91.5% | 0.178 | 87.6% | 0.177 | 92.2% | 0.172 |

Table 12 shows that when increased, the degree of confidence and the interval length of the credible interval gradually decreased. In general, we select an such that the credible interval has an acceptable degree of confidence (larger than 95%) and a short interval length. However, the best cannot be expressed in closed form because the correlations of the s, s, and s vary across cases with different classifiers and data sets. For example, the best was for in the case of letter recognition data, , support vector machine classifier. However, the best s were 0.65 and 0.75 in the cases of letter recognition data, , classification tree classifier, and MAGIC gamma telescope data, , classification tree classifier, respectively. To determine the best , the entire interval from to 1 must be searched, which is computationally expensive. Considering this condition, we suggest computing through . Although this selection method may not select the best , it provides a closed-form solution that is close to the best and greatly reduces the computational cost.
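The exhaustive search mentioned above can be sketched as a filter over a grid of candidate values. The selection rule is an assumption: a value is kept only if both degree-of-confidence entries reported for it are at least 95%, and among the kept values the one with the shortest interval is chosen. The grid below copies the letter recognition, support vector machine columns of Table 12 for the labeled values, leaving out the final unlabeled pair of rows. The tuple layout and the name `select_value` are illustrative.

```python
# (value, DOC first row, DOC second row, IL first row, IL second row)
# copied from the letter recognition / support vector machine columns of Table 12.
grid = [
    (0.10, 100.0, 100.0, 0.501, 0.499),
    (0.45, 99.9, 96.4, 0.261, 0.259),
    (0.55, 99.7, 95.7, 0.237, 0.235),
    (0.65, 99.3, 94.0, 0.220, 0.218),
    (0.75, 98.5, 92.1, 0.205, 0.204),
    (0.85, 97.1, 89.6, 0.193, 0.191),
    (0.95, 97.0, 84.9, 0.183, 0.182),
]


def select_value(grid, min_doc=95.0):
    """Return the value with the shortest interval among those with acceptable DOC."""
    acceptable = [row for row in grid if min(row[1], row[2]) >= min_doc]
    if not acceptable:
        return None  # no candidate reaches the required degree of confidence
    # Compare candidates by their longer interval so both rows stay short.
    return min(acceptable, key=lambda row: max(row[3], row[4]))[0]


print(select_value(grid))
```

Under this particular rule the grid above yields 0.55. Every grid point requires a full set of replicated experiments to fill in its row, which is where the cost of the search comes from.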

## 5  Conclusion

Considering that the commonly used confidence interval based on a K-fold cross-validated t distribution suffers from a lower degree of confidence, we presented a novel way to construct credible intervals indirectly, based on the posterior distributions of precision and recall. Two credible intervals based on a K-fold cross-validated beta posterior distribution were thus proposed.

Furthermore, we compared our proposed credible intervals with existing confidence intervals for precision and recall through simulated and real data experiments. With an acceptable degree of confidence, our methods outperformed these existing methods. Specifically, they exhibited shorter interval lengths in all cases. The first proposed credible interval is particularly recommended, given that it displayed high degrees of confidence and short interval lengths in almost all experiments.

One of the key uses of performance metrics is model (algorithm) selection, which is traditionally straightforward to do based on point estimations, but how would this be done based on the performance intervals proposed? When credible intervals are used to choose between models A and B, if the two intervals do not overlap, the model with the higher precision (recall) should be selected. However, if the credible interval of precision of A completely contains that of B, we cannot directly provide a definitive conclusion and need further analysis. For example, we could select models by directly comparing the right or left endpoints of their intervals. However, is this appropriate? The use of the proposed credible interval in comparing models is currently being investigated.
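The decision rule described above, and only that rule, can be written down in a few lines. This is a sketch: the function name and the string labels are illustrative, and the overlapping case is deliberately left undecided because the text leaves it open.

```python
def compare_models(interval_a, interval_b):
    """Compare two models by their credible intervals, given as (lower, upper).

    Returns "A" or "B" when the intervals do not overlap, else "undecided".
    """
    lo_a, hi_a = interval_a
    lo_b, hi_b = interval_b
    if lo_a > hi_b:
        return "A"  # A's whole interval lies above B's
    if lo_b > hi_a:
        return "B"  # B's whole interval lies above A's
    return "undecided"  # overlapping or nested intervals need further analysis


print(compare_models((0.80, 0.90), (0.60, 0.75)))
print(compare_models((0.60, 0.90), (0.70, 0.80)))
```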

In practical applications, we always need to take both precision and recall into consideration. This enables the construction of a utility function that directly captures the value of true positives and negatives versus the cost of false positives and negatives. The ROC curve is a useful tool that facilitates choosing an optimal classification threshold for a given application. For quantitatively evaluating the model performance, an AUC measure obtained from the ROC curve is often used. However, the AUC measure remains a point estimation. How the credible interval of this measure can be constructed by analyzing the distribution of the AUC is a meaningful research question and a direction for our future research.

## Acknowledgments

This work was supported by the National Natural Science Fund of China (61503228, 71503151) and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund. Experiments were supported by the High Performance Computing System of Shanxi University.

## References

Alpaydin, E. (1999). Combined 5×2 cv F test for comparing supervised classification learning algorithms. Neural Computation, 11, 1885–1892.

Bengio, Y., & Grandvalet, Y. (2004). No unbiased estimator of the variance of K-fold cross-validation. Journal of Machine Learning Research, 5, 1089–1105.

Bisani, M., & Ney, H. (2004). Bootstrap estimates for confidence intervals in ASR performance evaluation. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE.

Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1924.

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman and Hall.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27, 861–874.

Flach, P. A. (2003). The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In Proceedings of 20th International Conference on Machine Learning (pp. 194–201). Menlo Park, CA: AAAI Press.

Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161.

Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of European Colloquium on IR Research (pp. 345–359). New York: Springer.

Grandvalet, Y., & Bengio, Y. (2006). Hypothesis testing for cross-validation (Technical Report). Montreal: University of Montreal.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Heck, D., Knapp, J., Capdevielle, J. N., & Thouw, T. (1998). CORSIKA: A Monte Carlo code to simulate extensive air showers. Karlsruhe: Forschungszentrum Karlsruhe GmbH.

Keller, M., Bengio, S., & Wong, S. Y. (2006). Benchmarking non-parametric statistical tests. In Y. Weiss, B. Schölkopf, & J. C. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.

Lobo, J. M., Jimenez, V. A., & Real, R. (2008). AUC: A misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17, 145–151.

Mao, S., Wang, J., & Pu, X. (2006). . Beijing: Higher Education Press.

Markatou, M., Tian, H., Biswas, S., & Hripcsak, G. (2005). Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6, 1127–1168.

Moreno-Torres, J., Saez, J., & Herrera, F. (2012). Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Transactions on Neural Networks and Learning Systems, 23, 1304–1312.

Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52, 239–281.

Powers, D. M. W. (2011). Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2, 37–63.

Shao, J. (2003). Mathematical statistics. New York: Springer.

Shi, N. (2008). Statistical test theory and method. Beijing: Science Press.

Wang, Y., Li, J., Li, Y., Wang, R., & Yang, X. (2015). Confidence interval for F1 measure of algorithm performance based on blocked 3×2 cross-validation. IEEE Transactions on Knowledge and Data Engineering, 27, 651–659.

Wang, Y., Wang, R., Jia, H., & Li, J. (2014). Blocked 3×2 cross-validated t-test for comparing supervised classification learning algorithms. Neural Computation, 26, 208–235.

Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 42–49). New York: ACM.

Yildiz, O. T. (2013). Omnivariate rule induction using a novel pairwise statistical test. IEEE Transactions on Knowledge and Data Engineering, 25, 2105–2118.

Yildiz, O. T., Aslan, O., & Alpaydin, E. (2011). Multivariate statistical tests for comparing classification algorithms. Lecture Notes in Computer Science: Vol. 6683. Learning and Intelligent Optimization (pp. 1–15). New York: Springer.