## Abstract

In typical machine learning applications such as information retrieval, precision and recall are two commonly used measures for assessing an algorithm's performance. Symmetrical confidence intervals based on *K*-fold cross-validated *t* distributions are widely used for the inference of precision and recall measures. As we confirmed through simulated experiments, however, these confidence intervals often exhibit lower degrees of confidence, which may easily lead to liberal inference results. Thus, it is crucial to construct faithful confidence (credible) intervals for precision and recall with a high degree of confidence and a short interval length. In this study, we propose two posterior credible intervals for precision and recall based on *K*-fold cross-validated beta distributions. The first credible interval for precision (or recall) is constructed based on the beta posterior distribution inferred by all *K* data sets corresponding to *K* confusion matrices from a *K*-fold cross-validation. Second, considering that each data set corresponding to a confusion matrix from a *K*-fold cross-validation can be used to infer a beta posterior distribution of precision (or recall), the second proposed credible interval for precision (or recall) is constructed based on the average of *K* beta posterior distributions. Experimental results on simulated and real data sets demonstrate that the first credible interval proposed in this study almost always resulted in degrees of confidence greater than 95%. With an acceptable degree of confidence, both of our two proposed credible intervals have shorter interval lengths than those based on a corrected *K*-fold cross-validated *t* distribution. 
Meanwhile, the average ranks of these two credible intervals are superior to that of the confidence interval based on a *K*-fold cross-validated *t* distribution for the degree of confidence and superior to that of the confidence interval based on a corrected *K*-fold cross-validated *t* distribution for the interval length in all 27 cases of the simulated and real data experiments. In contrast, the confidence intervals based on the *K*-fold and corrected *K*-fold cross-validated *t* distributions lie at the two extremes. Thus, when the reliability of the inference for precision and recall matters, the proposed methods, especially the first credible interval, are preferable.

## 1 Introduction

For a typical machine learning application there are multiple candidate models (i.e., algorithms), and we need to choose one or several among them. In two-class supervised classification tasks, this is often done by comparing the misclassification error, which is the sum of false positives and false negatives. However, as Yildiz, Aslan, and Alpaydin (2011) pointed out, the misclassification error does not distinguish between false positives and false negatives. Thus, many other performance measures have been proposed to evaluate candidate models. Among them, precision and recall, which are based on a binary contingency table, are commonly used in machine learning applications such as information retrieval (see Tables 1 and 2).

|  |  | Predicted Positive | Predicted Negative | Sum |
| --- | --- | --- | --- | --- |
| True class | Positive | TP | FN | P |
|  | Negative | FP | TN | N |
| Sum |  | P | N |  |


Note: TP (resp. TN) is the number of true positives (resp. negatives) and FP (resp. FN) the number of false positives (resp. negatives).

| Name | Formula |
| --- | --- |
| Error | (FP + FN)/(TP + FP + TN + FN) |
| Precision | TP/(TP + FP) |
| Recall | TP/(TP + FN) |
| F_{1} score | 2TP/(2TP + FP + FN) |
| Sensitivity | TP/(TP + FN) |
| Specificity | TN/(FP + TN) |
| True positive rate | TP/(TP + FN) |
| False positive rate | FP/(FP + TN) |
| Matthews correlation coefficient | (TP × TN − FP × FN)/√((TP + FP)(TP + FN)(TN + FP)(TN + FN)) |

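The measures in Table 2 can be computed directly from the four cells of a single confusion matrix. A minimal Python sketch (the function name and the example counts are ours, chosen for illustration):

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Compute the Table 2 performance measures from one confusion matrix."""
    total = tp + fp + fn + tn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "error": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # identical to sensitivity and the TP rate
        "f1": 2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (fp + tn),
        "fpr": fp / (fp + tn),
        "mcc": (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0,
    }

m = confusion_metrics(tp=40, fp=10, fn=20, tn=130)
print({k: round(v, 3) for k, v in m.items()})
```

Note that recall, sensitivity, and the true positive rate share one formula, which is why the table lists identical expressions for them.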

In practice, to eliminate chance effects (e.g., variance due to small changes in the training set), one typically performs training and validation a number of times, possibly with various resampling methods such as cross-validation and the bootstrap (Alpaydin, 1999; Bengio & Grandvalet, 2004; Dietterich, 1998; Efron & Tibshirani, 1993; Hastie, Tibshirani, & Friedman, 2001; Markatou, Tian, Biswas, & Hripcsak, 2005; Nadeau & Bengio, 2003; Wang, Wang, Jia, & Li, 2014; Yildiz, 2013). For example, after deriving *K* training and validation sets, classification algorithms are trained on the *K* training sets, and *K* confusion matrices are subsequently obtained on the validation sets (Bengio & Grandvalet, 2004; Markatou et al., 2005; Moreno-Torres, Saez, & Herrera, 2012). The precision and recall values can then be calculated from the *K* confusion matrices of the *K*-fold cross-validation, and they are commonly evaluated with two measures: the microaverage and the macroaverage. The so-called microaveraged precision (or recall) is computed from the averages of the corresponding elements of the *K* confusion matrices, while the macroaveraged precision (or recall) is the average of the *K* precisions (or recalls) computed from each confusion matrix.

Traditionally, when applying a learning algorithm in machine learning, the focus is typically on the single-point micro- and macroaveraged precision and recall values of the algorithm's performance from a *K*-fold cross-validation. However, as Wang, Li, Li, Wang, and Yang (2015) pointed out, point estimations are rather limited because they do not account for the variation of the estimate. In response, symmetrical confidence intervals based on *K*-fold cross-validated *t* distributions have been proposed. As we confirmed through simulated experiments, however, these confidence intervals often exhibit degrees of confidence below the nominal level together with short interval lengths (see section 4). This may easily lead to liberal inference results. When confidence intervals are used to compare the performance of two algorithms, for example, the results can be misleading insofar as they can imply that two algorithms are significantly different when in fact they are not.

Furthermore, a theoretical analysis of the posterior distributions of precision and recall in Goutte and Gaussier (2005) shows that they follow beta distributions. These distributions are generally nonsymmetrical, owing to the two different parameters of the beta distribution, as shown in Figure 1. Of course, when the two parameters are equal, a beta distribution is symmetrical, but this rarely occurs because in practical applications the numbers of true positives (TPs) and false positives (FPs) (or false negatives, FNs) are almost always unequal (see Goutte & Gaussier, 2005, and Wang et al., 2015). Consider case finding for rare diseases as a practical example: a good case-finding model may well have FP ≫ TP (due to class imbalance) and FN ≪ TP. Meanwhile, a symmetrical confidence interval may significantly affect the estimation accuracy in some cases, because the values of precision and recall range between 0 and 1, whereas a symmetrical confidence interval can exceed the range of [0, 1] (Wang et al., 2015). Thus, the use of a symmetrical distribution, such as the commonly used *t* distribution, may be inappropriate for approximating the distributions of precision and recall, and this can result in large bias and a critically false conclusion.

To effectively measure the performance of an algorithm, it is crucial to construct faithful confidence (credible) intervals for precision and recall—that is, intervals with a high degree of confidence and a short interval length. In Bayesian statistics, credible intervals are analogous to confidence intervals in frequentist statistics. The degree of confidence of a credible interval is the probability of the inclusion of the true value in the credible interval. Interval length indicates the accuracy of the credible interval. Thus, in this study, two posterior credible intervals for precision and recall are constructed based on a *K*-fold cross-validated beta distribution.
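As a concrete illustration of a credible interval, an equal-tailed interval from a beta posterior can be approximated with standard-library sampling alone. This is a sketch under our own assumptions (the function name, the example counts, and the Monte Carlo shortcut in place of exact beta quantiles are ours, not the letter's construction):

```python
import random

def beta_credible_interval(a, b, level=0.95, draws=100_000, seed=0):
    """Equal-tailed credible interval of a Beta(a, b) posterior via Monte Carlo."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return lo, hi

# Posterior for precision with TP = 80, FP = 20 and a uniform Beta(1, 1) prior.
lo, hi = beta_credible_interval(80 + 1, 20 + 1)
print(f"95% credible interval for precision: ({lo:.3f}, {hi:.3f})")
```

The degree of confidence of such an interval is the probability that it contains the true value, and its length measures its accuracy, which are exactly the two criteria compared in section 4.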

The remainder of this study is organized as follows. Section 2 defines the standard precision and recall measures of an algorithm's performance and then gives their (single-point) estimations based on a *K*-fold cross-validation. The two credible intervals based on *K*-fold cross-validated beta distributions proposed in this letter and the confidence intervals based on *K*-fold and corrected *K*-fold cross-validated *t* distributions are described in section 3. Section 4 discusses the simulated and real data experiments that show how the confidence (credible) intervals compare. Section 5 concludes the study.

## 2 Precision and Recall Measures of an Algorithm's Performance

In studies on two-class classification problems in machine learning, the performance of a learning algorithm is typically assessed with empirical measures based on the TP, FP, true negative (TN), and FN values of a confusion matrix. In practice, a number of such measures have been developed depending on the type of error under consideration, including the precision value, the recall value, the *F*_{1} score, sensitivity, specificity, the TP rate, the FP rate, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the Matthews correlation coefficient, as shown in Table 2 (Powers, 2011; Fawcett, 2006; Flach, 2003; Goutte & Gaussier, 2005; Lobo, Jimenez, & Real, 2008; Nadeau & Bengio, 2003; Wang et al., 2015; Yang & Liu, 1999). In this study, we focus on two important performance indicators in machine learning: the precision and recall values.

Strictly speaking, the precision and recall values are estimations of the theoretical precision and recall measures for a specific practical application. Thus, we first discuss theoretical precision and recall measures.

### 2.1 Theoretical Precision and Recall Measures

Suppose that, for each object, a classification algorithm returns a label *z* indicating whether it believes the object to belong to the positive class. Then precision may be defined as the probability that a class is positive (+) given that it is returned by the classification algorithm, while recall is the probability that a positive class is returned (Goutte & Gaussier, 2005; Wang et al., 2015):

$$p = \Pr(+ \mid \text{returned}), \qquad r = \Pr(\text{returned} \mid +).$$

### 2.2 Precision and Recall Values Based on a Confusion Matrix

Based on the counts of a confusion matrix (see Table 1), the precision and recall values are computed as

$$\hat{p} = \frac{TP}{TP + FP}, \qquad \hat{r} = \frac{TP}{TP + FN}.$$

It is obvious that the precision and recall values are estimations of the theoretical precision and recall measures.

### 2.3 Microaveraged Precision and Recall Values Based on a *K*-Fold Cross-Validation

In practice, to eliminate chance effects (e.g., variance due to small changes in the training set), a resampling method is typically used. *K*-fold cross-validation is probably the simplest and most widely used resampling method. It uses all available examples as both training and test examples; it forms *K* training and test sets by using part of the data to fit the model and the remainder to test it.

A data set *S* is split into *K* disjoint and equal-sized blocks, denoted $T_1, \ldots, T_K$. Let $S_k$ be the training set obtained by removing the elements in $T_k$ from *S*, and let $TP_k$, $FP_k$, $FN_k$, and $TN_k$ be the elements of the confusion matrix returned by algorithm *A* trained on the set $S_k$ and tested on $T_k$. The averaged confusion matrix based on the respective averages of the *K* $TP_k$s, $FP_k$s, $FN_k$s, and $TN_k$s has the following form:

$$\overline{TP} = \frac{1}{K}\sum_{k=1}^{K} TP_k, \quad \overline{FP} = \frac{1}{K}\sum_{k=1}^{K} FP_k, \quad \overline{FN} = \frac{1}{K}\sum_{k=1}^{K} FN_k, \quad \overline{TN} = \frac{1}{K}\sum_{k=1}^{K} TN_k.$$

Then, from equations 2.3 and 2.4, the microaveraged precision and recall values based on a *K*-fold cross-validation can be obtained:

$$p^{Micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}}, \qquad r^{Micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}}.$$

### 2.4 Macroaveraged Precision and Recall Values Based on a *K*-Fold Cross-Validation

The so-called macroaveraged precision (recall) value based on a *K*-fold cross-validation is the average of the *K* precisions (recalls) computed from the confusion matrices obtained on each $S_k$ and $T_k$ for $k = 1, \ldots, K$. Writing $p_k = TP_k/(TP_k + FP_k)$ for the precision value computed from the *k*th confusion matrix and $r_k = TP_k/(TP_k + FN_k)$ for the corresponding recall, the macroaveraged precision and recall values based on a *K*-fold cross-validation are defined as the averages of the precisions and recalls over the *K* groups:

$$p^{Macro} = \frac{1}{K}\sum_{k=1}^{K} p_k, \qquad r^{Macro} = \frac{1}{K}\sum_{k=1}^{K} r_k,$$

where the fold-wise numbers of positive examples $TP_k + FN_k$ are all identical for $k = 1, \ldots, K$ in $r^{Macro}$. Then we have $r^{Macro} = r^{Micro}$. From the above analysis, we can see that the macroaveraged and microaveraged recall values are identical. However, for the macroaveraged and microaveraged precision values, there is no similar conclusion.
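The micro/macro distinction above can be sketched as follows (a toy example with hypothetical fold counts). When every fold holds the same number of positives, $TP_k + FN_k$ is constant and the two recall averages coincide, while the precision averages generally differ:

```python
def micro_macro(folds):
    """folds: list of (TP, FP, FN) triples, one per cross-validation fold."""
    K = len(folds)
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    fn = sum(f[2] for f in folds)
    p_micro = tp / (tp + fp)          # from the summed (averaged) counts
    r_micro = tp / (tp + fn)
    p_macro = sum(f[0] / (f[0] + f[1]) for f in folds) / K  # average of fold values
    r_macro = sum(f[0] / (f[0] + f[2]) for f in folds) / K
    return p_micro, p_macro, r_micro, r_macro

# Every fold holds TP + FN = 20 positives, so macro and micro recall coincide.
folds = [(15, 5, 5), (12, 2, 8), (18, 9, 2)]
p_mi, p_ma, r_mi, r_ma = micro_macro(folds)
print(round(p_mi, 4), round(p_ma, 4), round(r_mi, 4), round(r_ma, 4))
```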

## 3 Credible Intervals for Precision and Recall Measures

In this section, we present four credible and confidence intervals that can be used to infer precision and recall measures. The first two are the posterior credible intervals we propose, and the third and fourth confidence intervals have already been discussed in the literature. The first credible interval for precision (or recall) measure is provided by studying the posterior distribution of the precision (or recall) inferred by all data sets corresponding to *K* confusion matrices from a *K*-fold cross-validation. The second credible interval for precision (or recall) is constructed based on the average of *K* beta posterior distributions, in which each beta posterior distribution is inferred by a data set corresponding to a confusion matrix from *K*-fold cross-validation. For convenience, we provide several useful lemmas.

*Property 1.* Each component *n _{i}* of *D* follows a binomial distribution.

*Property 2.* Each component *n _{i}* of *D* conditioned on another component *n _{j}* follows a binomial distribution.

*Property 3.* The sum of *n _{i}* and *n _{j}* also follows a binomial distribution.

*Property 4.* The distribution of *n _{i}* given the number of returned objects is a binomial distribution.

The proofs of lemma 2 and properties 1 to 4 can be found in Goutte and Gaussier (2005). Furthermore, Goutte and Gaussier (2005) revealed that the posterior distributions of precision and recall have the following forms:

Given two independent variables with binomial distributions that share an identical success parameter, the following property holds:

### 3.1 Credible Intervals Constructed Based on Beta Posterior Distributions Inferred by *K* Data Sets

First, we consider the posterior distribution of precision inferred by the *K* data sets corresponding to the *K* confusion matrices from a *K*-fold cross-validation. These data sets are denoted $D_1, D_2, \ldots, D_K$, where $D_k$ comprises the counts $TP_k$, $FP_k$, $FN_k$, and $TN_k$. By assuming that the $D_k$s are independent for $k = 1, \ldots, K$, lemma 6 can be obtained:

Provided that the $D_k$s are independent for $k = 1, \ldots, K$, the corresponding conditional random variables have the same distribution, and they all follow a binomial distribution.

From lemma 2 and property 1, we know that $TP_k$ follows a binomial distribution for $k = 1, \ldots, K$. Combined with the assumption of the independence of the $TP_k$s, lemma 4 implies that $\sum_{k=1}^{K} TP_k$ also follows a binomial distribution. For the variable $\sum_{k=1}^{K} (TP_k + FP_k)$, it is obvious from properties 1, 2, and 3 that its distribution is binomial as well. Given the independence of the $D_k$s, a similar conclusion can be obtained for recall. From equations 3.1 and 3.2, we can see that $\sum_{k=1}^{K} TP_k$ and $\sum_{k=1}^{K} (TP_k + FP_k)$ follow binomial distributions.

Lemma 3 tells us that if we assume that *p* has a beta prior distribution, we can infer the posterior distribution of *p* based on equations 3.1 and 3.2.

Provided that the $D_k$s are independent for $k = 1, \ldots, K$, the posterior distribution of precision is a beta distribution with parameters $\sum_{k=1}^{K} TP_k + \lambda$ and $\sum_{k=1}^{K} FP_k + \lambda$, that is, $p \sim \mathrm{Beta}\bigl(\sum_{k} TP_k + \lambda,\ \sum_{k} FP_k + \lambda\bigr)$. A similar development yields the posterior distribution for the recall: $r \sim \mathrm{Beta}\bigl(\sum_{k} TP_k + \lambda,\ \sum_{k} FN_k + \lambda\bigr)$.

Obtaining proposition 7 requires that the $D_k$s be independent. However, the training sets from any two partitions in a *K*-fold cross-validation contain common samples regardless of how the data set is split; in other words, the training sets are related. Furthermore, Bengio and Grandvalet (2004) pointed out that the correlations of the training sets in a *K*-fold cross-validation are not negligible. Thus, the $TP_k$s, $FP_k$s, and $FN_k$s are actually not independent, and the parameters of the beta posteriors in proposition 7 are greater than their true values. For this reason, the precision and recall should follow beta distributions whose parameters are shrunk by a correction factor, denoted here by $\gamma$ with $0 < \gamma < 1$. The problem is that $\gamma$ is unknown and needs to be estimated appropriately: when the correlations of the $TP_k$s, $FP_k$s, and $FN_k$s are large, $\gamma$ tends to be small, and when the correlations are small, $\gamma$ becomes large. Intuitively, using the average of the interval as the value of $\gamma$ is a natural selection. Indeed, the average may not be the best choice; however, it provides a closed-form solution that is close to the best and greatly reduces computational cost. (See the discussion based on the simulated experiments in the next section.)
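A sketch of this first construction under our own assumptions (the function name, example counts, and the Monte Carlo sampling in place of exact beta quantiles are ours; the shrinkage factor `gamma` stands for the correction factor discussed above, with `gamma = 1` recovering the independence assumption):

```python
import random

def pooled_beta_interval(folds, lam=1.0, gamma=0.5, level=0.95,
                         draws=100_000, seed=0):
    """Credible interval from the pooled beta posterior of precision.

    folds: (TP_k, FP_k) pairs from a K-fold cross-validation.
    gamma: correction factor in (0, 1] shrinking the pooled counts to
    compensate for the dependence among the K training sets.
    lam: prior parameter of the beta distribution (lam = 1 is uniform).
    """
    tp = sum(f[0] for f in folds)
    fp = sum(f[1] for f in folds)
    a, b = gamma * tp + lam, gamma * fp + lam
    rng = random.Random(seed)
    s = sorted(rng.betavariate(a, b) for _ in range(draws))
    return s[int((1 - level) / 2 * draws)], s[int((1 + level) / 2 * draws) - 1]

lo, hi = pooled_beta_interval([(40, 10), (35, 15), (42, 8)])
print(f"({lo:.3f}, {hi:.3f})")
```

Shrinking the counts widens the interval, which is what compensates for the optimism introduced by treating correlated folds as independent.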

### 3.2 Credible Intervals Based on the Average of the *K* Beta Posterior Distributions

From lemma 3, we know that for each data set $D_k$ corresponding to a confusion matrix from a *K*-fold cross-validation, the posterior of precision is $p_k \sim \mathrm{Beta}(TP_k + \lambda, FP_k + \lambda)$ for $k = 1, \ldots, K$. However, this posterior distribution depends exclusively on the fractional sample set $D_k$. To use all of the samples to infer the precision and recall, we might consider the average $p^A = \frac{1}{K}\sum_{k=1}^{K} p_k$ of all the $p_k$s, and then ask whether $p^A$ similarly follows a beta distribution.

_{k}From equation 3.5, we can see that despite the independence of the *D _{k}*s, the distribution of

*p*is nevertheless complex and cannot be used directly to construct a credible interval. A straightforward method, however, is to approximate this distribution with a beta distribution, given that

^{A}*p*is an average of the

^{A}*K*random variables following a beta distribution. Intuitively, its distribution should be close to a beta distribution:

*FP*, and . By equating the first and second moments of

_{k}*p*and the random variable following beta distribution, we have

^{A}*p*cannot simply be expressed as the average of the variances of the s. This is because the correlations between

^{A}*TP*s and

_{k}*FP*s from a

_{k}*K*-fold cross-validation cannot be negligible, as already noted. Thus, from lemma

^{5}, the variance of

*p*is written as where , denotes the correlation of

^{A}*p*s from different

*D*s: According to the recommendation in Nadeau and Bengio (2003), the ratio of the test sample size to the total sample size should be adopted when estimating , that is, .

_{k}To further validate the approximate extent of the beta distribution to the true distribution of *p ^{A}*, the density functions of , and the true density function of

*p*are compared by the following simulated experiment, where and refer to and obtained when from equation 3.7 (i.e., with independent

^{A}*D*s). A similar comparison is also conducted for

_{k}*r*.

#### 3.2.1 Simulated Experiment 1

*Density Functions of the True and Approximate Distributions for* $p^A$ *and* $r^A$. Consider a classification problem with two classes in which the data of each class are drawn from a five-dimensional gaussian distribution: the two class means are the five-dimensional vectors with elements all 0 and all 1, respectively, and the common covariance matrix is *I*_{5}, the five-order identity matrix. The sample size is 200.

First, we obtain the observed $TP_k$, $FP_k$, and $FN_k$ for $k = 1, \ldots, K$ with classification tree and support vector machine classifiers, and the parameters of the corrected and uncorrected approximating beta distributions are then computed. Thus, the approximate density functions of $p^A$ and $r^A$ can be obtained, while their true density functions are computed by kernel density estimation with a gaussian kernel.

In this experiment, we report results for the most commonly used setting; under other conditions, similar conclusions can be obtained. Next, we compare the differences among the approximate and true density functions, where *f* refers to a density function.

From Figures 2 and 3, we can see that each of the three density curves has a similar shape for $p^A$ and $r^A$, regardless of whether a classification tree or a support vector machine classifier is used. The density curves of the corrected beta approximations closely follow the true densities of $p^A$ and $r^A$, whereas the approximations based on the independence assumption show a considerable bias at their peak points with respect to the true distributions. This again suggests the need to correct the parameters of the approximating beta distributions; without this correction, a liberal credible interval will doubtless be obtained. This observation further indicates that the approximate beta distribution is relatively simple and easily adopted when constructing credible intervals compared with the complicated true distributions of $p^A$ and $r^A$.

### 3.3 Symmetrical Confidence Intervals Based on the *K*-Fold Cross-Validated *t* Distribution

Symmetrical confidence intervals based on the *K*-fold cross-validated *t* distribution are widely used in the literature (Bisani & Ney, 2004; Keller, Bengio, & Wong, 2006; Nadeau & Bengio, 2003; Yang & Liu, 1999). At confidence level $1 - \alpha$, such an interval has the form

$$\bigl[\hat{\mu} - c\,\hat{\sigma},\ \hat{\mu} + c\,\hat{\sigma}\bigr],$$

where $\hat{\mu}$ is a mean estimator based on the average of the *K*-fold cross-validated estimators, $\hat{\sigma}$ is the corresponding standard error estimator, and *c* is a percentile of Student's *t* distribution with $K - 1$ degrees of freedom. The confidence intervals for precision and recall are then obtained by taking $\hat{\mu}$ to be $p^{Macro}$ (or $r^{Macro}$).
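A minimal sketch of this interval (our own function name and fold values; the *t* percentile is hardcoded for $K = 10$ because the Python standard library has no *t* quantile function):

```python
import math
import statistics

def kfold_t_interval(values, t_crit=2.262):
    """Symmetric K-fold CV confidence interval: mean +/- t * s / sqrt(K).

    values: the K per-fold precision (or recall) values.
    t_crit: Student-t percentile; 2.262 is t_{0.975} with 10 - 1 = 9
    degrees of freedom.
    """
    K = len(values)
    mean = statistics.fmean(values)
    half = t_crit * statistics.stdev(values) / math.sqrt(K)
    return mean - half, mean + half

p_k = [0.78, 0.81, 0.75, 0.80, 0.79, 0.83, 0.77, 0.76, 0.82, 0.80]
lo, hi = kfold_t_interval(p_k)
print(f"({lo:.4f}, {hi:.4f})")
```

Note that this interval is symmetric around the mean by construction, which is exactly the property criticized in section 1 for measures confined to [0, 1].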

### 3.4 Symmetrical Confidence Intervals Based on the Corrected *K*-Fold Cross-Validated *t* Distribution

As noted above, the correlations of the training sets must be taken into account when estimating the variance of a *K*-fold cross-validation; otherwise, the variance will be grossly underestimated. Based on this, Grandvalet and Bengio (2006) obtained a corrected *K*-fold cross-validated *t*-test by correcting the variance of the *K*-fold cross-validation. If we let $\hat{\mu}$ be $p^{Macro}$ (or $r^{Macro}$), we can obtain the symmetrical confidence interval based on the corrected *K*-fold cross-validated *t* distribution, in which the variance estimator is inflated by a term involving $\rho$, the ratio of the covariance of the $p_k$s ($r_k$s) for $k \neq k'$ to the variance of $p^{Macro}$ ($r^{Macro}$). Grandvalet and Bengio (2006) suggested an empirical estimation of $\rho$ obtained by conducting a large number of experiments.

## 4 Simulated Experiments for Comparison

In this section, we first demonstrate with a simulation that false conclusions can proceed from using single-point micro- and macroaveraged precision and recall estimations to estimate the precision and recall measures; it may be more suitable to infer them with confidence (credible) intervals. We then investigate the degree of confidence and the interval length of the four credible and confidence intervals based on *K*-fold cross-validation presented in this study for multiple classifiers on simulated data sets and on the real letter recognition and MAGIC gamma telescope data sets. For a given problem, we generated 1000 independent data sets to fully account for the randomness of the training set as well as that of the test examples.
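The two evaluation criteria used throughout this section can be made precise in a few lines: the degree of confidence (DOC) is the fraction of replicated intervals that cover the true value, and the interval length (IL) is their average width. The intervals below are made up for illustration:

```python
def doc_and_il(intervals, true_value):
    """DOC: fraction of intervals covering the true value; IL: mean width."""
    covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
    doc = covered / len(intervals)
    il = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return doc, il

intervals = [(0.70, 0.90), (0.75, 0.95), (0.82, 0.99), (0.65, 0.85)]
doc, il = doc_and_il(intervals, true_value=0.80)
print(doc, il)
```

A faithful method keeps DOC at or above the nominal level while keeping IL as small as possible, which is the trade-off examined in Tables 4 to 6.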

For comparison, we took the value of *K* most commonly used in the literature for the *K*-fold cross-validation. We chose λ = 1, the uniform prior, for the beta distribution. The sample sizes were 200 and 1000 for the simulated and real data sets, respectively. The confidence level was 95%, that is, α = 0.05.

### 4.1 Single-Point Estimations of Precision and Recall Based on Micro- and Macroaverages

The simulated data were generated in a manner similar to simulated experiment 1, but with a different parameter setting. The classifier was a classification tree, and the sample size was 200.

From Table 3, it is clear that the single-point $p^{Macro}$ value is higher than $p^{Micro}$ and that the values of $r^{Macro}$ and $r^{Micro}$ are equivalent. It is often said in the literature that the macroaverage is superior to the microaverage because higher precision and recall values are blindly desired. However, as Goutte and Gaussier (2005) and Wang et al. (2015) noted, a point estimation does not consider the variance of the estimation, and thus it is prone to false conclusions. For example, in one case, the confidence interval for precision based on the *K*-fold cross-validated *t* distribution inferred from $p^{Macro}$ obviously included the value of $p^{Micro}$. This implies that even with a liberal confidence interval, it was difficult to make a distinction between $p^{Macro}$ and $p^{Micro}$; in other words, the difference between them was not statistically significant and may result from random error. The fact that the corresponding conditional random variables in lemma 6 have the same distribution also validates this point from a different perspective. Thus, it may be more suitable to implement the inference for precision and recall based on confidence (credible) intervals. We next compare the degree of confidence and the interval length of the four credible and confidence intervals of precision and recall for multiple classifiers on simulated and real data sets.

### 4.2 Comparison of Credible and Confidence Intervals on Simulated Data

The experimental setup in this section was similar to that of section 4.1, with multiple parameter combinations considered. The classifiers were a perceptron with one hidden layer, a classification tree, and a support vector machine with a gaussian kernel.

Tables 4, 5, and 6 show the simulated results of the degree of confidence and interval length of four credible and confidence intervals based on *K*-fold cross-validation for precision and recall. First, we see that the confidence intervals based on a *K*-fold cross-validated *t* distribution exhibited a lower degree of confidence (below 95%) in almost all cases (in 28 of the 30 cases). For example, in six cases in Table 5, the degrees of confidence for this confidence interval of precision were 90.1%, 92.1%, 90.6%, 88.9%, 92.0%, and 87.9% for the classification tree classifier. In contrast, the degrees of confidence for credible intervals constructed based on the beta posterior distribution inferred by the *K* data sets corresponding to *K* confusion matrices from *K*-fold cross-validation all exceeded 95%. The confidence interval based on the corrected *K*-fold cross-validated *t* distribution elevated the degrees of confidence of those based on the *K*-fold cross-validated *t* distribution by correcting the variance of the *t* statistic.

|  | Case 1 | Case 2 | Case 3 |
| --- | --- | --- | --- |
| DOC | 99.9% | 99.4% | 98.3% |
| IL | 0.256 | 0.099 | 0.184 |
| DOC | 99.8% | 98.7% | 63.7% |
| IL | 0.237 | 0.099 | 0.196 |
| DOC | 91.7% | 93.1% | 89.9% |
| IL | 0.172 | 0.074 | 0.144 |
| DOC | 99.5% | 99.4% | 98.2% |
| IL | 0.314 | 0.135 | 0.263 |
| DOC | 97.4% | 98.7% | 98.7% |
| IL | 0.254 | 0.101 | 0.234 |
| DOC | 97.7% | 97.2% | 94.3% |
| IL | 0.225 | 0.101 | 0.213 |
| DOC | 93.4% | 95.3% | 95.3% |
| IL | 0.247 | 0.094 | 0.216 |
| DOC | 99.4% | 99.7% | 99.8% |
| IL | 0.452 | 0.172 | 0.395 |


|  | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
| --- | --- | --- | --- | --- | --- | --- |
| DOC | 99.7% | 99.6% | 99.7% | 99.2% | 98.7% | 99.1% |
| IL | 0.256 | 0.116 | 0.255 | 0.219 | 0.090 | 0.221 |
| DOC | 99.8% | 99.6% | 99.4% | 96.5% | 99.3% | 94.7% |
| IL | 0.236 | 0.116 | 0.235 | 0.208 | 0.091 | 0.209 |
| DOC | 90.1% | 92.1% | 90.6% | 88.9% | 92.0% | 87.9% |
| IL | 0.164 | 0.070 | 0.165 | 0.157 | 0.067 | 0.157 |
| DOC | 99.1% | 99.2% | 98.8% | 99.1% | 99.2% | 99.3% |
| IL | 0.299 | 0.127 | 0.302 | 0.287 | 0.122 | 0.286 |
| DOC | 97.6% | 97.1% | 97.9% | 98.6% | 98.5% | 97.2% |
| IL | 0.254 | 0.116 | 0.253 | 0.219 | 0.094 | 0.220 |
| DOC | 98.1% | 97.5% | 97.5% | 93.3% | 96.2% | 91.9% |
| IL | 0.226 | 0.115 | 0.226 | 0.204 | 0.094 | 0.205 |
| DOC | 92.4% | 92.9% | 91.8% | 94.5% | 93.8% | 91.2% |
| IL | 0.232 | 0.107 | 0.228 | 0.203 | 0.083 | 0.194 |
| DOC | 99.7% | 99.5% | 99.2% | 99.6% | 99.6% | 98.9% |
| IL | 0.423 | 0.195 | 0.416 | 0.370 | 0.151 | 0.355 |


|  | Case 1 | Case 2 | Case 3 | Case 4 | Case 5 | Case 6 |
| --- | --- | --- | --- | --- | --- | --- |
| DOC | 99.6% | 99.2% | 98.5% | 99.9% | 100% | 99.8% |
| IL | 0.254 | 0.073 | 0.162 | 0.204 | 0.092 | 0.204 |
| DOC | 99.6% | 95.2% | 72.8% | 98.6% | 99.7% | 98.9% |
| IL | 0.234 | 0.075 | 0.172 | 0.196 | 0.092 | 0.193 |
| DOC | 90.1% | 93.9% | 92.2% | 91.1% | 93.5% | 91.4% |
| IL | 0.163 | 0.055 | 0.117 | 0.144 | 0.061 | 0.127 |
| DOC | 99.1% | 99.4% | 99.2% | 99.3% | 99.8% | 99.1% |
| IL | 0.297 | 0.100 | 0.213 | 0.262 | 0.111 | 0.230 |
| DOC | 97.9% | 98.4% | 97.4% | 99.4% | 98.6% | 98.7% |
| IL | 0.253 | 0.073 | 0.162 | 0.194 | 0.078 | 0.147 |
| DOC | 97.4% | 92.7% | 71.7% | 93.9% | 95.2% | 56.7% |
| IL | 0.226 | 0.076 | 0.172 | 0.190 | 0.080 | 0.165 |
| DOC | 91.7% | 94.9% | 91.3% | 93.6% | 94.5% | 90.7% |
| IL | 0.222 | 0.062 | 0.136 | 0.169 | 0.066 | 0.120 |
| DOC | 99.1% | 99.5% | 99.0% | 99.6% | 99.7% | 98.3% |
| IL | 0.405 | 0.113 | 0.248 | 0.309 | 0.121 | 0.219 |


However, the credible interval based on the average of the *K* beta posterior distributions from *K*-fold cross-validation returned somewhat ambivalent results. In 10 of the 30 cases, its degree of confidence fell below 95%. One example is the situation represented by the case of for a perceptron classifier in Table 4, where the degree of confidence was only 63.7%, far below 95%. This can be explained by the fact that the method simply averages the *K* results, and the average can be strongly affected by a single poor result. That is, this method is less robust than a credible interval constructed from the beta posterior distribution inferred by the *K* data sets corresponding to the *K* confusion matrices.
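As a concrete illustration of the two constructions compared above, here is a minimal Python sketch, assuming a uniform Beta(1, 1) prior, equal-tailed 95% intervals, and made-up per-fold confusion-matrix counts (the paper's exact prior and level are not shown here). The first function pools the *K* confusion matrices into a single beta posterior; the second averages the *K* per-fold posterior CDFs and inverts the mixture numerically.

```python
# Sketch of the two credible intervals for precision, assuming a uniform
# Beta(1, 1) prior and equal-tailed 95% intervals; counts are hypothetical.
import numpy as np
from scipy.stats import beta

def pooled_beta_interval(tp_counts, fp_counts, level=0.95):
    """First method: pool TP/FP counts from all K confusion matrices
    and take quantiles of the single resulting beta posterior."""
    a = 1 + sum(tp_counts)                # posterior alpha for precision
    b = 1 + sum(fp_counts)                # posterior beta
    lo = beta.ppf((1 - level) / 2, a, b)
    hi = beta.ppf(1 - (1 - level) / 2, a, b)
    return lo, hi

def averaged_beta_interval(tp_counts, fp_counts, level=0.95, grid=10_000):
    """Second method: average the K per-fold beta posterior CDFs and
    invert the mixture CDF numerically on a grid."""
    x = np.linspace(0.0, 1.0, grid)
    cdf = np.zeros_like(x)
    for tp, fp in zip(tp_counts, fp_counts):
        cdf += beta.cdf(x, 1 + tp, 1 + fp)
    cdf /= len(tp_counts)                 # mixture CDF of the K posteriors
    lo = x[np.searchsorted(cdf, (1 - level) / 2)]
    hi = x[np.searchsorted(cdf, 1 - (1 - level) / 2)]
    return lo, hi

# Hypothetical per-fold true-positive and false-positive counts (K = 5).
tps, fps = [18, 20, 17, 19, 21], [2, 1, 3, 2, 1]
print(pooled_beta_interval(tps, fps))
print(averaged_beta_interval(tps, fps))
```

Because the mixture spreads mass across the fold-to-fold variation, the averaged-posterior interval is typically wider than the pooled one, which is consistent with its behavior in the tables.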

Indeed, the degree of confidence is not the only important consideration when choosing a statistical confidence (credible) interval. Another measure for the confidence (credible) interval is the interval length. From Tables 4, 5, and 6, we can see that the interval length of the confidence interval based on a *K*-fold cross-validated *t* distribution was the shortest among the four credible and confidence intervals. At the same time, however, it had the lowest degree of confidence.

It is thus important to consider how these two factors can be traded off. In general, when the degrees of confidence are comparable, confidence (credible) intervals are compared by their interval lengths. That is, for a given degree of confidence, a fundamental principle for selecting a confidence (credible) interval is to select the one with the shortest interval length (Mao, Wang, & Pu, 2006; Shi, 2008; Shao, 2003). With an acceptable degree of confidence (above 95%), the credible intervals of precision based on the average of the *K* beta posterior distributions had shorter or comparable interval lengths relative to those based on the beta posterior distribution inferred by the *K* data sets. Moreover, the interval lengths of these two credible intervals were both shorter than those based on the corrected *K*-fold cross-validated *t* distribution. Consider, for example, the case of in Table 6, classified using a support vector machine classifier. In this case, the interval length of the confidence interval of precision based on the corrected *K*-fold cross-validated *t* distribution was 0.262, whereas the interval lengths of the credible intervals based on the beta posterior distribution inferred by the *K* data sets and on the average of the *K* posterior distributions were 0.204 and 0.196, respectively.

In particular, when the sample size increased, there was little change to the degree of confidence for the credible and confidence intervals. However, their interval lengths decreased by approximately half.

In the extreme case where precision and recall were 1, the degree of confidence of the proposed credible intervals based on a *K*-fold cross-validated beta distribution was 0 at a confidence level of . This was demonstrated in the case where , and for a support vector machine classifier. This situation arose because the quantile of the beta distribution is strictly less than 1 when . Thus, the credible intervals cannot include the true value of 1.

In fact, in this special case, precision and recall are fixed quantities, not random variables, and thus the credible interval degenerates into the confidence interval of frequentist statistics. Furthermore, because the precision and recall values were all equal to 1 in all replicated experiments, the variance of the estimate was zero. Thus, the traditional symmetrical interval estimate actually degenerates into a point estimate.
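The degeneracy can be checked directly: with every prediction correct, the upper quantile of the beta posterior is strictly below 1, so an equal-tailed credible interval can never cover the true value 1. A minimal sketch, assuming a uniform prior and hypothetical counts:

```python
# Illustration of the degenerate case: with TP = 50, FP = 0 and a uniform
# prior, the precision posterior is Beta(51, 1), whose 97.5% quantile is
# strictly below 1, so the equal-tailed 95% interval cannot contain 1.
from scipy.stats import beta

a, b = 1 + 50, 1 + 0               # posterior after 50 TPs, 0 FPs
upper = beta.ppf(0.975, a, b)      # upper end of the 95% credible interval
print(upper)                       # strictly less than 1
```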

### 4.3 Comparison of Credible and Confidence Intervals on Real Data

Two data sets from the UCI database, the letter recognition data and the MAGIC gamma telescope data, were considered in this section (Frey & Slate, 1991; Heck, Knapp, Capdevielle, & Thouw, 1998). The letter recognition data, for identifying the letters of the Roman alphabet, comprise 20,000 examples described by 16 features. The 26 letters represent 26 categories; following Nadeau and Bengio (2003) and Wang et al. (2014), we turned this into a two-class (A–M versus N–Z) classification problem. In the MAGIC gamma telescope data, 10 features, which depend on the energy of the primary gamma, are used to discriminate statistically the images caused by primary gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper atmosphere (background). We sampled, with replacement, 200 (1000) examples from the 20,000 (13,376) examples available in the letter recognition (MAGIC gamma telescope) data. Repeating this 1000 times, we then computed the degrees of confidence and interval lengths of the four credible and confidence intervals based on the 1000 data sets obtained.
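The evaluation protocol above (resample, build an interval, record coverage and width) can be sketched as follows. Bernoulli draws stand in for a real classifier's per-example outcomes, and the uniform-prior beta interval is an assumption, so this is only a simplified stand-in for the paper's pipeline:

```python
# Simplified sketch of the evaluation protocol: repeatedly resample,
# build an interval, and record whether it covers the "true" precision
# (degree of confidence, DOC) and its width (interval length, IL).
import random
from scipy.stats import beta

random.seed(0)
true_precision = 0.8
B, n = 1000, 200                       # replications, examples per sample
covered, widths = 0, []
for _ in range(B):
    # Bernoulli outcomes stand in for a classifier's correct/incorrect calls.
    tp = sum(random.random() < true_precision for _ in range(n))
    a, b = 1 + tp, 1 + (n - tp)        # beta posterior under a uniform prior
    lo, hi = beta.ppf(0.025, a, b), beta.ppf(0.975, a, b)
    covered += lo <= true_precision <= hi
    widths.append(hi - lo)
doc = covered / B                      # degree of confidence
il = sum(widths) / B                   # average interval length
print(doc, il)
```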

As with the simulated data, the credible intervals constructed from the beta posterior distribution inferred by the *K* data sets corresponding to *K* confusion matrices from *K*-fold cross-validation for precision and recall resulted in a considerable degree of confidence in almost all cases, as shown in Tables 7, 8, 9, and 10. One exception to this occurred when with the perceptron classifier. In this case, the degrees of confidence for recall were merely 83.5% and 87.7% for the letter recognition and the MAGIC gamma telescope data, respectively.

. | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Classification tree classifier | DOC | 99.1% | 98.0% | 90.6% | 98.9% | 98.2% | 95.4% | 93.6% | 99.5% |
Classification tree classifier | IL | 0.238 | 0.221 | 0.168 | 0.306 | 0.237 | 0.215 | 0.220 | 0.403 |
Perceptron classifier | DOC | 99.2% | 97.7% | 94.4% | 99.8% | 98.3% | 93.9% | 95.6% | 99.9% |
Perceptron classifier | IL | 0.234 | 0.218 | 0.168 | 0.306 | 0.234 | 0.213 | 0.221 | 0.404 |
Support vector machine classifier | DOC | 99.7% | 99.5% | 90.6% | 99.2% | 95.7% | 94.1% | 90.5% | 99.4% |
Support vector machine classifier | IL | 0.237 | 0.221 | 0.154 | 0.282 | 0.235 | 0.215 | 0.212 | 0.388 |


. | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Classification tree classifier | DOC | 97.2% | 96.6% | 90.9% | 98.7% | 98.0% | 89.7% | 85.9% | 99.1% |
Classification tree classifier | IL | 0.100 | 0.100 | 0.077 | 0.140 | 0.103 | 0.102 | 0.102 | 0.186 |
Perceptron classifier | DOC | 96.0% | 96.3% | 93.3% | 99.6% | 83.5% | 88.3% | 87.3% | 97.6% |
Perceptron classifier | IL | 0.096 | 0.096 | 0.086 | 0.157 | 0.095 | 0.094 | 0.131 | 0.240 |
Support vector machine classifier | DOC | 98.4% | 99.1% | 83.5% | 97.7% | 96.7% | 97.7% | 91.7% | 99.1% |
Support vector machine classifier | IL | 0.113 | 0.113 | 0.069 | 0.126 | 0.114 | 0.113 | 0.099 | 0.181 |


. | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Classification tree classifier | DOC | 98.9% | 98.5% | 87.8% | 99.2% | 98.6% | 91.0% | 94.2% | 99.7% |
Classification tree classifier | IL | 0.226 | 0.213 | 0.165 | 0.302 | 0.229 | 0.211 | 0.209 | 0.381 |
Perceptron classifier | DOC | 99.3% | 98.0% | 90.9% | 99.0% | 95.3% | 84.5% | 91.1% | 99.1% |
Perceptron classifier | IL | 0.245 | 0.229 | 0.182 | 0.332 | 0.244 | 0.217 | 0.254 | 0.463 |
Support vector machine classifier | DOC | 97.3% | 97.9% | 81.3% | 96.2% | 98.1% | 95.2% | 92.2% | 99.2% |
Support vector machine classifier | IL | 0.240 | 0.224 | 0.164 | 0.300 | 0.241 | 0.219 | 0.208 | 0.379 |


. | . | . | . | . | . | . | . | . | . |
---|---|---|---|---|---|---|---|---|---|
Classification tree classifier | DOC | 97.6% | 97.0% | 88.4% | 98.6% | 95.2% | 95.5% | 91.9% | 99.2% |
Classification tree classifier | IL | 0.096 | 0.097 | 0.074 | 0.135 | 0.103 | 0.103 | 0.095 | 0.174 |
Perceptron classifier | DOC | 98.4% | 98.6% | 93.5% | 99.6% | 87.7% | 85.2% | 93.3% | 99.5% |
Perceptron classifier | IL | 0.105 | 0.106 | 0.083 | 0.152 | 0.105 | 0.103 | 0.139 | 0.253 |
Support vector machine classifier | DOC | 99.6% | 99.5% | 90.3% | 99.4% | 99.3% | 98.7% | 93.3% | 99.9% |
Support vector machine classifier | IL | 0.114 | 0.114 | 0.072 | 0.132 | 0.114 | 0.113 | 0.101 | 0.185 |


For the precision measure, the credible intervals based on the average of the *K* beta posterior distributions from *K*-fold cross-validation had acceptable degrees of confidence for both sample sizes in the two real data sets. For recall, however, the degree of confidence fell below 95% in 7 of 12 cases. Similarly, the confidence interval based on the *K*-fold cross-validated *t* distribution exhibited degraded degrees of confidence.

With an acceptable degree of confidence (above 95%), the credible interval based on the average of the *K* beta posterior distributions had the shortest interval length among the four confidence and credible intervals. In particular, when , with an acceptable degree of confidence, the credible intervals based on the beta posterior distribution inferred by the *K* data sets and those based on the average of the *K* beta posterior distributions were of comparable length, for both the letter recognition data and the MAGIC gamma telescope data. Specifically, the intervals based on the beta distribution were 71.4%, 61.1%, 89.6%, 71.1%, 69.1%, and 86.4% of the interval length of the confidence intervals based on the corrected *K*-fold cross-validated *t* distribution for the classification tree, perceptron, and support vector machine classifiers in the letter recognition and the MAGIC gamma telescope data, respectively.

### 4.4 Average Ranks of Four Credible and Confidence Intervals

To investigate this further, we compared the average ranks of the four credible and confidence intervals with respect to their degree of confidence and interval length in all 27 cases of the simulated and real data sets. Table 11 shows the results based on the simulated data sets (15 cases) and the real data sets (12 cases).

. | Rank | | | |
---|---|---|---|---|
. | . | . | . | . |
Case | . | . | DOC | . |
1 | 1 | 2 | 4 | 3 |
2 | 1 | 3 | 4 | 1 |
3 | 1 | 4 | 3 | 2 |
4 | 2 | 1 | 4 | 3 |
5 | 1 | 1 | 4 | 3 |
6 | 1 | 2 | 4 | 3 |
7 | 1 | 3 | 4 | 2 |
8 | 3 | 1 | 4 | 2 |
9 | 2 | 3 | 4 | 1 |
10 | 1 | 1 | 4 | 3 |
11 | 2 | 3 | 4 | 1 |
12 | 2 | 4 | 3 | 1 |
13 | 1 | 3 | 4 | 2 |
14 | 1 | 3 | 4 | 2 |
15 | 1 | 3 | 4 | 2 |
16 | 1 | 3 | 4 | 2 |
17 | 2 | 3 | 4 | 1 |
18 | 1 | 2 | 4 | 3 |
19 | 2 | 3 | 4 | 1 |
20 | 3 | 2 | 4 | 1 |
21 | 2 | 1 | 4 | 3 |
22 | 2 | 3 | 4 | 1 |
23 | 1 | 3 | 4 | 2 |
24 | 2 | 1 | 4 | 3 |
25 | 2 | 3 | 4 | 1 |
26 | 3 | 2 | 4 | 1 |
27 | 1 | 2 | 4 | 3 |
Average rank | 1.6(1) | 2.4(3) | 3.9(4) | 2(2) |
IL | | | | |
Average rank | 2.5(3) | 2.2(2) | 1(1) | 4(4) |
DOC | | | | |
Average rank | 2.3(2) | 3.1(3) | 3.6(4) | 1.0(1) |
IL | | | | |
Average rank | 2.6(3) | 1.8(2) | 1.4(1) | 4(4) |


The confidence interval based on the *K*-fold cross-validated *t* distribution ranked first for interval length but last among the four methods for degree of confidence. By contrast, the confidence interval based on the corrected *K*-fold cross-validated *t* distribution ranked first for degree of confidence but last for interval length. The two credible intervals proposed in this letter lay between these two extremes. With an acceptable degree of confidence, the average ranks of our methods were first and second, superior to that of the confidence interval based on the corrected *K*-fold cross-validated *t* distribution; the confidence interval based on the *K*-fold cross-validated *t* distribution was excluded from this comparison because its degrees of confidence were all less than 95%.
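The ranking scheme underlying Table 11 can be sketched as follows. Ties share a rank and the subsequent rank is skipped (competition ranking), which matches cases such as 5 and 10 in the table; the DOC values in this sketch are hypothetical:

```python
# Sketch of per-case competition ranking and averaging, as in Table 11.
# Higher DOC is better, so values are negated before ranking.
from scipy.stats import rankdata

cases = [                              # hypothetical DOCs for 4 methods
    [0.996, 0.992, 0.728, 0.985],
    [0.991, 0.994, 0.901, 0.992],
    [0.979, 0.979, 0.917, 0.949],      # a tie for first place
]
ranks = [rankdata([-d for d in doc], method="min") for doc in cases]
avg = [sum(r[i] for r in ranks) / len(ranks) for i in range(4)]
print(ranks, avg)
```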

### 4.5 Choice of

In constructing the credible interval based on the beta posterior distribution inferred by the *K* data sets corresponding to *K* confusion matrices from *K*-fold cross-validation for precision and recall, the choice of is very important. A poor choice of may degrade the degree of confidence and the interval length of the credible interval. Thus, in this section, we experimentally studied how the degree of confidence and the interval length change with the value of . The experimental results are shown in Table 12.

. | Letter Recognition Data, Classification Tree Classifier | | Letter Recognition Data, Support Vector Machine Classifier | | MAGIC Gamma Telescope Data, Classification Tree Classifier | | |
---|---|---|---|---|---|---|---|
. | DOC | IL | DOC | IL | DOC | IL | |
= 0.10 | 100% | 0.504 | 100% | 0.501 | 100% | 0.488 | |
 | 100% | 0.503 | 100% | 0.499 | 100% | 0.490 | |
= 0.45 | 99.8% | 0.262 | 99.9% | 0.261 | 99.8% | 0.250 | |
 | 98.8% | 0.261 | 96.4% | 0.259 | 98.7% | 0.252 | |
= 0.55 | 99.1% | 0.238 | 99.7% | 0.237 | 98.9% | 0.226 | |
 | 98.2% | 0.237 | 95.7% | 0.235 | 98.6% | 0.229 | |
= 0.65 | 98.6% | 0.220 | 99.3% | 0.220 | 98.4% | 0.210 | |
 | 96.8% | 0.219 | 94.0% | 0.218 | 97.4% | 0.213 | |
= 0.75 | 98.2% | 0.205 | 98.5% | 0.205 | 97.5% | 0.196 | |
 | 95.6% | 0.204 | 92.1% | 0.204 | 94.5% | 0.198 | |
= 0.85 | 97.6% | 0.193 | 97.1% | 0.193 | 96.8% | 0.185 | |
 | 92.5% | 0.192 | 89.6% | 0.191 | 91.6% | 0.187 | |
= 0.95 | 96.3% | 0.183 | 97.0% | 0.183 | 95.8% | 0.175 | |
 | 92.7% | 0.182 | 84.9% | 0.182 | 92.8% | 0.177 | |
= 1 | 95.5% | 0.178 | 96.6% | 0.178 | 93.8% | 0.170 | |
 | 91.5% | 0.178 | 87.6% | 0.177 | 92.2% | 0.172 |


Table 12 shows that as increased, the degree of confidence and the interval length of the credible interval both gradually decreased. In general, we prefer to select an such that the credible interval has an acceptable degree of confidence (greater than 95%) and a short interval length. However, the best cannot be expressed in closed form because the correlations of the s, s, and s vary across cases with different classifiers and data sets. For example, the best was for in the case of the letter recognition data, , support vector machine classifier. However, the best s were 0.65 and 0.75 in the cases of the letter recognition data, , classification tree classifier, and the MAGIC gamma telescope data, , classification tree classifier, respectively. Determining the best exactly would require searching the entire interval from to 1, which is computationally expensive. Considering this, we suggested computing through . Although this selection method may not find the best , it provides a closed-form solution that is close to the best and greatly reduces computational cost.
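The selection rule described above (among candidate parameter values, keep the shortest interval whose degree of confidence stays above 95%) can be sketched as follows. The parameter name `lam` and the `evaluate` callback are placeholders, since the paper's symbol is not shown here, and the toy DOC/IL pairs only mimic the trend in Table 12:

```python
# Hypothetical sketch of the grid search described above: among candidate
# values of the tuning parameter (called `lam` here as a placeholder),
# pick the one with the shortest interval whose estimated degree of
# confidence stays above 95%.
def pick_parameter(candidates, evaluate):
    """`evaluate(lam)` returns (doc, il) for one candidate value."""
    best = None
    for lam in candidates:
        doc, il = evaluate(lam)
        if doc >= 0.95 and (best is None or il < best[2]):
            best = (lam, doc, il)
    return best

# Toy stand-in for Table 12's behavior: DOC and IL both shrink as lam grows.
table = {0.45: (0.998, 0.262), 0.65: (0.986, 0.220),
         0.85: (0.976, 0.193), 1.00: (0.938, 0.170)}
print(pick_parameter(sorted(table), table.get))
```

Here the value 1.00 is rejected despite its shortest interval, because its degree of confidence drops below 95%.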

## 5 Conclusion

Considering that the commonly used confidence interval based on a *K*-fold cross-validated *t* distribution suffers from a low degree of confidence, we presented a novel way to construct credible intervals indirectly, based on the posterior distributions of precision and recall. Two credible intervals based on a *K*-fold cross-validated beta posterior distribution were thus proposed.

Furthermore, we compared our proposed credible intervals with existing confidence intervals for precision and recall through simulated and real data experiments. With an acceptable degree of confidence, our methods outperformed these existing methods. Specifically, they exhibited shorter interval lengths in all cases. The first proposed credible interval is particularly recommended, given that it displayed high degrees of confidence and short interval lengths in almost all experiments.

One of the key uses of performance metrics is model (algorithm) selection, which is traditionally straightforward based on point estimates, but how would it be done based on the proposed performance intervals? When credible intervals are used to choose between models *A* and *B*, if the intervals do not overlap, the model with the higher precision (recall) should be selected. However, if the credible interval of the precision of *A* completely contains that of *B*, we cannot directly draw a definitive conclusion and need further analysis. For example, we could select models by directly comparing the right or left endpoints of their intervals. However, is this appropriate? The use of the proposed credible intervals for comparing models is currently being investigated.

In practical applications, we always need to consider both precision and recall. This enables the construction of a utility function that directly captures the value of true positives and negatives versus the cost of false positives and negatives. The ROC curve is a useful tool that facilitates choosing an optimal classification threshold for a given application. To evaluate model performance quantitatively, the AUC measure obtained from the ROC curve is often used. However, the AUC remains a point estimate. Constructing a credible interval for this measure by analyzing the distribution of the AUC is a meaningful direction for future research.

## Acknowledgments

This work was supported by the National Natural Science Foundation of China (61503228, 71503151) and the Special Program for Applied Research on Super Computation of the NSFC-Guangdong Joint Fund. Experiments were supported by the High Performance Computing System of Shanxi University.

## References
