Abstract

A cross-validation method based on m replications of two-fold cross validation is called an m×2 cross validation. An m×2 cross validation is used in estimating the generalization error and comparing algorithms' performance in machine learning. However, the variance of the estimator of the generalization error in m×2 cross validation is easily affected by random partitions. Poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets in m×2 cross validation. This fluctuation results in a large variance in the m×2 cross-validated estimator. The influence of the random partitions on the variance becomes more serious as m increases. Thus, in this study, the partitions with a restricted number of overlapping samples between any two training (test) sets are defined as a block-regularized partition set. The corresponding cross validation is called block-regularized m×2 cross validation (m×2 BCV). It can effectively reduce the influence of random partitions. We prove that the variance of the m×2 BCV estimator of the generalization error is smaller than the variance of the m×2 cross-validated estimator and reaches the minimum in a special situation. An analytical expression of the variance can also be derived in this special situation. This conclusion is validated through simulation experiments. Furthermore, a practical construction method of m×2 BCV based on a two-level orthogonal array is provided. Finally, a conservative estimator is proposed for the variance of the estimator of the generalization error.

1  Introduction

In machine learning research, a cross-validation method is commonly used in model selection, estimation of the generalization error, and comparison of algorithm performance. Several versions of cross validation have been developed: repeated learning-testing (RLT), standard K-fold cross validation, Monte Carlo cross validation, m×2 cross validation, and blocked 3×2 cross validation (Dietterich, 1998; Alpaydin, 1999; Friedman, Hastie, & Tibshirani, 2001; Nadeau & Bengio, 2003; Arlot & Celisse, 2010; Yildiz, 2013; Wang, Wang, Jia, & Li, 2014). Among them, standard two-fold cross validation has received considerable attention because of its simplicity and ease of use. For example, Nason (1996) employed two-fold cross validation and its variants to choose a threshold for wavelet shrinkage. Fan, Guo, and Hao (2012) used two-fold cross validation for variance estimation in an ultra-high-dimensional linear regression model. Stanišić and Tomović (2012) used two-fold cross validation in a frequent item set mining task. In practice, to improve the accuracy of estimation, data partitioning is conducted a number of times (i.e., two-fold cross validation is implemented in multiple replications). The generalization error is then often estimated by the average of the replicated two-fold cross validations.

Cross validation based on m replications of two-fold cross validation is called m×2 cross validation; it is achieved by randomly splitting the data into two equal-sized blocks m times. The m×2 cross validation is widely used in machine learning. Dietterich (1998) provided a t-test for use in the comparison of algorithms based on 5×2 cross validation. Alpaydin (1999) proposed a combined 5×2 cross-validated F-test along the lines of the 5×2 cross-validated t-test and demonstrated its superiority through simulated comparisons. Yildiz (2013) adjusted the cross-validated t-test and conducted comparison experiments on multiple real-life data sets in the UC Irvine Machine Learning Repository of databases widely used by the machine learning community (Lichman, 2013).

However, the performance of the cross-validation method often relies on the quality of data partitioning and the accuracy (variance) of the cross-validated estimator of the generalization error. Traditionally, a data set is randomly split into multiple different training and test data sets of equal size; training sets (test sets) from any two independent partitions contain common samples regardless of how the data set is split. The number of common samples is defined as the number of overlapping samples, which is formally defined in section 2. Markatou, Tian, Biswas, and Hripcsak (2005) theoretically proved that the number of overlapping samples follows a hypergeometric distribution with mathematical expectation n/4 (where n is the size of the data set). Example 17 and plot 2 in Wang et al. (2014) showed that the variance of the estimator of the generalization error increases when the number of overlapping samples deviates from n/4 in the classification situation with a support vector machine classifier. Example 1 further validates the impact of the number of overlapping samples on variance in a simple linear regression situation.
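For intuition, the following minimal Python sketch (not part of the original analysis) repeatedly draws pairs of random equal-sized splits of an index set and tallies the overlap between their first blocks; the empirical mean is close to n/4, in line with the hypergeometric result of Markatou et al. (2005). The sample size and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32          # illustrative sample size (must be even)
reps = 10000    # number of independent pairs of partitions

def random_half(n, rng):
    """Index set T of one random equal-sized split (T, complement of T)."""
    return set(rng.permutation(n)[: n // 2])

overlaps = np.array([
    len(random_half(n, rng) & random_half(n, rng)) for _ in range(reps)
])
print("empirical mean overlap:", overlaps.mean())   # close to n / 4 = 8
```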

Example 1.
Let the data set be one in which the predictor vector x is drawn from a multivariate normal distribution. For the covariance matrix, all diagonal elements are equal to 1, the off-diagonal elements of the fourth column and fourth row of the matrix are equal to , and the other elements are equal to . The response variable is
$$ y = x^{\top}\beta + \varepsilon, \tag{1.1} $$
where $\varepsilon$ is a noise term and $\beta$ is a coefficient vector with the first four coordinates nonzero and 0 elsewhere. The lasso method is used as the learning algorithm, and the squared loss function is used as the loss function (Fan & Lv, 2008). To simulate the covariance with regard to the number of overlapping samples, we set , , , and . The simulation result is depicted in Figure 1.
Figure 1: An example of covariance on a simulated regression data set.

Figure 1 shows that the covariance of any two two-fold cross-validated estimators first decreases and then increases as the number of overlapping samples increases. The covariance reaches its minimum when the number of overlapping samples is n/4. This implies that poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets and thus result in a large variance of the cross-validated estimator.
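The following Python sketch mimics the mechanism of Example 1 under simplified, assumed settings: the dimensions, the lasso penalty, the i.i.d. standard normal design (the correlated design of Example 1 is not reproduced), and the number of replications are illustrative choices, and scikit-learn's Lasso stands in for the lasso learning algorithm. For a fixed overlap k between the first blocks of two partitions, it estimates the covariance of the two two-fold cross-validated estimators over independently drawn data sets.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 40, 8                        # illustrative sizes, not those of Example 1
beta = np.zeros(p)
beta[:4] = 1.0                      # first four coordinates nonzero, 0 elsewhere

def draw_data():
    """Draw one data set; i.i.d. N(0, 1) predictors stand in for the
    correlated design of Example 1, which is not reproduced here."""
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)
    return X, y

def two_fold_cv(X, y, T):
    """Standard two-fold CV estimate of the squared-error risk for the
    split (T, complement of T), with the lasso as the learning algorithm."""
    Tc = np.setdiff1d(np.arange(n), T)
    err = 0.0
    for tr, te in ((T, Tc), (Tc, T)):
        model = Lasso(alpha=0.1).fit(X[tr], y[tr])
        err += np.mean((y[te] - model.predict(X[te])) ** 2)
    return err / 2

half = n // 2
T1 = np.arange(half)                             # first block of partition 1
for k in range(0, half + 1, 5):                  # overlap between T1 and T2
    T2 = np.concatenate([T1[:k], np.arange(half, n)[: half - k]])
    pairs = []
    for _ in range(300):                         # independent data sets
        X, y = draw_data()
        pairs.append((two_fold_cv(X, y, T1), two_fold_cv(X, y, T2)))
    cov = np.cov(np.array(pairs).T)[0, 1]
    print(f"overlap = {k:2d}   estimated covariance = {cov:.4f}")
```

Under such settings, the estimated covariance typically decreases and then increases as the overlap moves away from n/4, in line with the behavior shown in Figure 1.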

Wang, Li, and Li (2015) showed that the quantiles of the maximum deviation of the numbers of overlapping samples from n/4 increase as m increases; thus, the influence of random partitioning on the variance of the estimator of the generalization error intensifies as m increases.

For this reason, Wang et al. (2014) proposed a blocked 3×2 cross-validation method with an equal number of overlapping samples (n/4) in any two training (test) sets and provided an accurate theoretical expression of the variance of the blocked 3×2 cross-validated estimator of the generalization error. However, they did not deeply investigate the optimality of the estimation of the generalization error based on blocked cross validation. In the more general case of arbitrary m, m×2 cross validation with a restricted number of overlapping samples between any two training (test) sets is called block-regularized m×2 cross validation (abbreviated as m×2 BCV). In this letter, we study the properties of the m×2 BCV estimator of the generalization error theoretically and provide a novel construction algorithm of data partitioning for m×2 BCV. Furthermore, we provide an empirical guide for the selection of the replication count m and propose a conservative estimator of the variance of the block-regularized cross-validated estimator.

This letter is organized as follows. Section 2 introduces several basic notations and definitions. Section 3 presents a theoretical analysis of the variance of the m×2 BCV estimator of the generalization error. A construction method of m×2 BCV based on a two-level orthogonal array is provided in section 4. Section 5 discusses the choice of m in m×2 BCV. The developed variance estimators are described in section 6. Section 7 presents the simulation experiments, and section 8 concludes.

2  Notations and Definitions

We assume that data set $D_n$ consists of $n$ samples $(x_i, y_i)$, $i = 1, \dots, n$, that are independently sampled from an unknown distribution, where $x$ is a predictor variable vector and $y$ is a response variable. $\hat{f}_{D_n}$ denotes the prediction model trained on data set $D_n$ by a learning algorithm. We let $L$ be the loss function. In this letter, zero-one loss is used for classification problems and squared loss is used for regression problems. Then the generalization error of the algorithm is defined as
$$ \mu = \mathrm{E}_{D_n}\,\mathrm{E}_{(x_0, y_0)}\Big[ L\big(\hat{f}_{D_n}(x_0),\, y_0\big) \Big], \tag{2.1} $$
where $(x_0, y_0)$ is a new sample drawn from the same distribution, independent of $D_n$.

Generally, the generalization error is estimated by some kind of cross validation in practice. In this study, we consider m×2 cross validation, in which each standard two-fold cross validation is conducted by randomly splitting the entire data set into two equal-sized blocks. Several notations and definitions follow:

Definition 1.

The pair $P_k = (T_k, T_k^c)$ is called a partition of the index set $\{1, \dots, n\}$ of data set $D_n$, where $T_k$ and $T_k^c$ are random index subsets of $\{1, \dots, n\}$ satisfying $T_k \cup T_k^c = \{1, \dots, n\}$, $T_k \cap T_k^c = \emptyset$, and $|T_k| = |T_k^c| = n/2$. Then $P = \{P_1, \dots, P_m\}$ is the set of partitions for m×2 cross validation.

Let $D^{T_k}$ and $D^{T_k^c}$ denote the training and test sets, respectively. Then $D_n = D^{T_k} \cup D^{T_k^c}$. $D^{T_k}$ and $D^{T_k^c}$ serve as training or test sets in a two-fold cross validation.

Remark 1.

For the two-fold cross validation, the training set and the test set are both generally called data blocks.

Definition 2.

For any two partitions $P_i$ and $P_j$ in $P$, $n_{ij} = |T_i \cap T_j|$ is defined as the number of overlapping samples between $P_i$ and $P_j$, where $i \neq j$ and $i, j = 1, \dots, m$. The matrix $N(P) = (n_{ij})_{m \times m}$ can be regarded as a measure of the partition set $P$.

Remark 2.

In fact, the $(i, j)$th and $(j, i)$th elements of $N(P)$ are $n_{ij}$ and $n_{ji}$, respectively. Two partitions $P_i$ and $P_j$ give rise to four numbers of overlapping samples through comparisons of $T_i$ and $T_j$, $T_i$ and $T_j^c$, $T_i^c$ and $T_j$, and $T_i^c$ and $T_j^c$. However, if we let the number of overlapping samples between $T_i$ and $T_j$ be $n_{ij}$, the other three numbers of overlapping samples are equal to $n/2 - n_{ij}$, $n/2 - n_{ij}$, and $n_{ij}$, respectively. Moreover, these four numbers have the same distribution. Therefore, we simply consider $n_{ij}$, the number of overlapping samples between $T_i$ and $T_j$, in definition 2.

Generally, $n_{ij}$ is an integer-valued random variable taking values in $\{0, 1, \dots, n/2\}$. Markatou et al. (2005) proved that $n_{ij}$ follows a hypergeometric distribution, and its expectation is $n/4$. If there are more than two partitions, multiple numbers of overlapping samples should be considered. Furthermore, all differences between these numbers of overlapping samples and $n/4$ should be regularized to reduce the variance of the cross-validated estimator of the generalization error. On the basis of these intuitions, we propose a new partitioning method, block-regularized cross-validation partitions, to control the differences. Our method aims to control the difference between each number of overlapping samples and $n/4$ to be smaller than its expectation in the random-partition situation. This expectation is provided by Wang et al. (2015) and is expressed as
formula
2.2
where $n$ is the data set size.

Based on the above analysis, we propose a definition of the block-regularized cross-validation partitions as follows.

Definition 3.

Given a set of partitions $P = \{P_1, \dots, P_m\}$ of the index set, if the regularized condition $|n_{ij} - n/4| \leq R$ is satisfied for all $i \neq j$, where $R$ is called a regularization parameter, then the partition set is called a block-regularized partition set, and its measure $N(P)$ is block-regularized accordingly. The m×2 cross validation on such a partition set is called block-regularized m×2 cross validation (abbreviated as m×2 BCV). When $R = 0$ and $n$ is divisible by 4, that is, $n_{ij} = n/4$ for all $i \neq j$, the corresponding m×2 BCV is called a balanced m×2 BCV. In this case, the measure degenerates into a constant matrix.
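As a concrete check of definition 3, the following sketch computes the measure N(P) from the first blocks of the partitions and tests the regularized condition |n_ij − n/4| ≤ R; the function names and the representation of a partition by the index array of its first block are our own conventions.

```python
import numpy as np
from itertools import combinations

def overlap_matrix(first_blocks, n):
    """Measure N(P): entry (i, j) is the number of overlapping samples
    between the first blocks T_i and T_j of partitions P_i and P_j."""
    m = len(first_blocks)
    N = np.full((m, m), n // 2, dtype=int)      # diagonal entries |T_i| = n/2
    for i, j in combinations(range(m), 2):
        N[i, j] = N[j, i] = len(set(first_blocks[i]) & set(first_blocks[j]))
    return N

def is_block_regularized(first_blocks, n, R):
    """Regularized condition of definition 3: |n_ij - n/4| <= R for i != j."""
    N = overlap_matrix(first_blocks, n)
    off_diag = N[~np.eye(len(first_blocks), dtype=bool)]
    return bool(np.all(np.abs(off_diag - n / 4) <= R))

# two partitions of {0, ..., 7} whose first blocks overlap in 2 = n/4 samples
print(is_block_regularized([[0, 1, 2, 3], [0, 1, 4, 5]], n=8, R=0))   # True
```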

Remark 3.

The regularization parameter $R$ should not exceed the expectation given in equation 2.2, that is, the expected deviation of the number of overlapping samples from $n/4$ under the random partitioning of m×2 cross validation.

In the following sections, the m×2 cross validation is abbreviated as m×2 CV. Definitions of some estimators of the generalization error (Friedman et al., 2001) are provided in the following paragraphs.

Definition 4.
For a given partition $P_k = (T_k, T_k^c)$, the hold-out estimator (HO estimator) of $\mu$ is defined as
$$ \hat{\mu}^{(k)}_{HO} = \frac{2}{n} \sum_{i \in T_k^c} L\big(\hat{f}_{D^{T_k}}(x_i),\, y_i\big). \tag{2.3} $$
The standard two-fold cross-validated estimator (S2CV estimator) of $\mu$ can be written as
$$ \hat{\mu}^{(k)}_{S2CV} = \frac{1}{2}\Big( \hat{\mu}^{(k)}_{HO} + \hat{\mu}^{(k,c)}_{HO} \Big), \tag{2.4} $$
where $\hat{\mu}^{(k,c)}_{HO}$ is the HO estimator obtained by exchanging the roles of $T_k$ and $T_k^c$.
The m×2 cross-validated estimator (m×2 CV estimator) of $\mu$ can be expressed as
$$ \hat{\mu}_{m\times 2} = \frac{1}{m} \sum_{k=1}^{m} \hat{\mu}^{(k)}_{S2CV}, \tag{2.5} $$
where $\hat{\mu}^{(k)}_{S2CV}$ is the S2CV estimator for partition $P_k$. Accordingly, the estimator of $\mu$ based on a block-regularized partition set is a block-regularized m×2 cross-validated estimator (m×2 BCV estimator) of $\mu$.
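The estimators of definition 4 can be sketched in a few lines of Python; the fit-and-predict interface (train on one block, predict on the other), the ordinary-least-squares learner, and the squared loss are illustrative assumptions matching the letter's regression setting (zero-one loss would replace it for classification).

```python
import numpy as np

def hold_out(fit_predict, X, y, train_idx, test_idx):
    """HO estimator: fit on one block, average the squared loss on the other."""
    y_hat = fit_predict(X[train_idx], y[train_idx], X[test_idx])
    return np.mean((y[test_idx] - y_hat) ** 2)

def s2cv(fit_predict, X, y, T):
    """S2CV estimator: average of the two HO estimators of one partition."""
    Tc = np.setdiff1d(np.arange(len(y)), T)
    return 0.5 * (hold_out(fit_predict, X, y, T, Tc) +
                  hold_out(fit_predict, X, y, Tc, T))

def mx2cv(fit_predict, X, y, partitions):
    """m x 2 CV estimator: average of the m S2CV estimators (equation 2.5)."""
    return np.mean([s2cv(fit_predict, X, y, T) for T in partitions])

# example learner: ordinary least squares via numpy
def ols(X_tr, y_tr, X_te):
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=20)
parts = [np.arange(10), np.arange(5, 15), np.r_[np.arange(5), np.arange(15, 20)]]
print(mx2cv(ols, X, y, parts))   # a 3 x 2 CV estimate of the squared-error risk
```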

3  Theoretical Analysis of the Variance of the m×2 BCV Estimator

The variance of the m×2 CV estimator can be decomposed as
$$ \mathrm{Var}\big(\hat{\mu}_{m\times 2}\big) = \frac{1}{m^{2}} \sum_{k=1}^{m} \mathrm{Var}\big(\hat{\mu}^{(k)}_{S2CV}\big) + \frac{1}{m^{2}} \sum_{i \neq j} \mathrm{Cov}\big(\hat{\mu}^{(i)}_{S2CV},\, \hat{\mu}^{(j)}_{S2CV}\big). \tag{3.1} $$
Strictly, this variance should be expressed as a function of both the sample size and the partition set. But this letter considers the original sample size $n$ fixed and considers only the measure $N(P)$ of the partition set $P$. From the perspective of the measure $N(P)$, the variance terms $\mathrm{Var}\big(\hat{\mu}^{(k)}_{S2CV}\big)$ are the same for each $k$. Because all samples in the data set are i.i.d., the covariance $\mathrm{Cov}\big(\hat{\mu}^{(i)}_{S2CV}, \hat{\mu}^{(j)}_{S2CV}\big)$ depends only on the number of overlapping samples $n_{ij}$. Thus,
formula
3.2
where the variance and covariance are taken with regard to the random samples.
Motivated by experimental design in statistics (Wu & Hamada, 2011), we attempt to design a set of partitions to reduce the effect of the random variables $n_{ij}$. We will prove that when $n_{ij} = n/4$ for all $i \neq j$, the variance reaches its minimum; that is, the balanced m×2 BCV satisfies
formula
3.3

In the expression of the variance in equation 3.2, the key issue is to comprehensively analyze the properties of the covariance as a function of the number of overlapping samples. Lemma 1 characterizes its lower convexity. Lemma 2 characterizes its minimum at $n/4$.

Lemma 1.

We let $L$ be the loss function, let $P_i$ and $P_j$ be two random partitions of the data set, and let $n_{ij}$ be the number of overlapping samples between $T_i$ and $T_j$. We have:

  • For , when , the covariance has the following form:
    formula
    where the coefficients are constants, as shown in Figure 2.
  • Denoting , the covariance can be expressed as a quadratic polynomial function of $n_{ij}$:
    formula
    Therefore, it is a lower convex function with regard to $n_{ij}$ when .

Figure 2: Demo of values of parameters , , , and .

Proof.
From the definition of a partition, we know that four index subsets , , , and can be obtained from and . For , we have
formula
For , the following equation holds:
formula
Obviously we can get
formula
3.5
Then,
formula
3.6
Finally, we obtain
formula
in which is a lower convex function with regard to when .
Remark 4.

In actuality, the parameters of , and have relationships with . Nevertheless, we mainly focus on the values of these parameters at the point of n/4 because the expectation of the number of overlapping samples is n/4.

In order to clearly interpret the condition , we provide some intuitive clarifications of , , and :

  • is the covariance of two loss functions with test samples of each pair in block . Specifically, the first loss function uses training set of and the test sample of , . The second loss function uses as the training set and tests on , and . Given that and do not appear in the two training sets and are independent, merely measures the correlations caused by the two training sets, assuming that the correlation is affected by nothing else except the training and test sets. Moreover, corresponds to .

  • is the covariance of two loss functions with two training sets of and and two test samples of , and , (or ). Given that occurs in the training set of (or ), measures not only the correlation caused by the two training sets but also the correlation caused by the appearance of test sample in training set (or ). Therefore, is greater than .

  • is the covariance of two loss functions with two test samples of , , and , . The first loss function uses the training set and , as a test sample, and the second loss function uses the training set and , as a test sample. Test samples and both occur in the other’s training set. Therefore, measures the correlations caused by the two training sets and the appearance of both test samples in the other’s training sets. Intuitively, is greater than .

Furthermore, actually measures the differences between and . Specifically, it measures how the increment of covariance of two loss functions changes with the occurrence of one partition’s test sample in the other partition’s training set. indicates the increment of covariances caused by only one test sample occurring in the training set, and indicates the increment of covariances caused by both test samples appearing in the training sets. Therefore, the intuitive interpretation of is , that is, the increase from to is nonlinear and the increment between and is larger than that between and .

Remark 5.

Proving that the condition holds for broad families of loss functions, algorithms, and data populations is difficult. However, with the squared loss function, it is possible to prove that the condition holds for mean regression and multivariate regression. The detailed proofs are in the appendix. Moreover, some simulation results are presented in section 7.2 to illustrate that this condition is true.

Lemma 2.

We let and be two S2CV estimators of on partitions and , and . Then, for any , . Function has the following two properties:

  • Symmetry: .

  • Boundedness: .

Proof.
If , then
formula
From the definition of S2CV estimator, we have
formula
3.7
According to the definition of , we have
formula
and
formula
Thus,
formula
3.8

Obviously, , that is, is a symmetric function, and its symmetry axis is . In particular, .

According to the property of covariance, we have
formula
3.9
Together with the fact that and the symmetric property of , we have
formula
According to the lower convex property of (lemma 1), based on Jensen's inequality, we can easily obtain , that is,
formula
where, .

In the simulated experiments in section 7.3, we provide some simulated curves of the function and their approximations in the parameters , and in a neighborhood of n/4.

Theorem 1.
Given a set of partitions of n samples, from the perspective of the measure $N(P)$ of the partition set, the variance of an m×2 CV estimator of the generalization error satisfies
formula
3.10
where
formula
3.11
  • $\sigma^2$ is the variance of HO estimators with a training set of size $n/2$.

  • $\rho_1$ is the correlation coefficient between two HO estimators within an S2CV estimator.

  • $\rho_2$ is the correlation coefficient of any two S2CV estimators in an m×2 BCV estimator.

Proof.
We can easily derive equation 3.10 from lemma 2. Specifically, to prove the first inequality in equation 3.10, we introduce a random variable in which . Because , we can obtain . Then, by employing Jensen's inequality on , we can obtain
formula
3.12
Using the symmetric property of clarifies that
formula
3.13

Thus, the first inequality holds. The second inequality can be derived directly because the function reaches its minimum at n/4.

Furthermore, the variance of a balanced m×2 BCV estimator can be decomposed into combinations of hold-out estimators as follows:
formula
Corollary 1.
For any two partition sets of a balanced m×2 BCV,
formula
Corollary 2.

The variance of a balanced m×2 BCV estimator obviously decreases as m increases. As m increases, the proportion of the second part of the variance becomes large.

4  Nested Construction Algorithm of the Partition Set for m×2 BCV

Although m×2 BCV has good properties, it will see little practical use if it cannot be constructed easily. A classical construction method for a partition set for m×2 BCV is provided in McCarthy (1976). That construction employs the rows of an orthogonal array: the data set is divided into blocks based on the columns of the orthogonal array, and a partition is then derived by combining the blocks according to the levels of each row of the array. A weakness of this construction is that it is not nested; that is, the partition set for a larger m does not include the previous partition set for a smaller m; thus, it must be reconstructed from the beginning to create an m×2 BCV with a larger m. Accordingly, models must be retrained and retested for every different m.

In this section, we propose a nested construction algorithm of the partition set for m×2 BCV. The algorithm can extend an existing partition set as m increases. The underlying construction and its theoretical guarantee are presented in theorem 2:

Theorem 2.

Assuming that a data set of size n can be split into $2^k$ (k is a given value) disjoint and almost equal-sized blocks (such that the maximum difference of the sizes of any two blocks is one), a partition set can be constructed by using a two-level orthogonal array according to the following two steps:

  • The j-th column of the orthogonal array corresponds to partition $P_j$, where $T_j$ is the union of the blocks indexed by the rows at the "+" level of the j-th column. Similarly, the blocks indexed by the rows at the "−" level of the j-th column form the set $T_j^c$.

  • According to step i, by taking all columns of the orthogonal array, we obtain the partitions $P_j$ of the partition set.

Then, the result is a block-regularized partition set with regularized condition $|n_{ij} - n/4| \leq 2^{k-2}$ for any $i \neq j$.

Proof.

The orthogonal array corresponds to a matrix with $2^k$ rows and $2^k - 1$ columns. The elements of the matrix consist of "+" and "−", which are called levels in statistics. For any two columns of the array, there are only four level combinations, and each combination is replicated an equal number of times. This means that each combination is replicated $2^{k-2}$ times; that is, any two columns share the "+" level in exactly $2^{k-2}$ rows. Thus, the replications of two-fold cross validation constructed by the above operation form the partition set.

Since the maximum difference of the sizes of any two of the $2^k$ disjoint and almost equal-sized blocks is one, and the training (test) sets from any two independent partitions contain $2^{k-2}$ common blocks, $|n_{ij} - n/4| \leq 2^{k-2}$ for any $i \neq j$.
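A minimal sketch of the construction in theorem 2 follows, using the Sylvester Hadamard matrix from scipy as one convenient source of a saturated two-level orthogonal array (the letter's Table 1 may use a different but equivalent column ordering); the function names are ours.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import hadamard

def oa_partitions(indices, num_blocks):
    """Split `indices` into `num_blocks` nearly equal blocks and turn every
    non-constant column of a Sylvester Hadamard matrix into one partition:
    blocks on the +1 rows form T_j, the remaining blocks form its complement."""
    blocks = np.array_split(np.asarray(indices), num_blocks)
    H = hadamard(num_blocks)              # num_blocks must be a power of two
    partitions = []
    for j in range(1, num_blocks):        # skip the all-ones column
        T = np.concatenate([b for b, lev in zip(blocks, H[:, j]) if lev == 1])
        partitions.append(np.sort(T))
    return partitions

# 7 partitions of {0, ..., 31}; every pairwise overlap equals 8 = n/4
parts = oa_partitions(range(32), num_blocks=8)
print({len(set(a) & set(b)) for a, b in combinations(parts, 2)})   # {8}
```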

Example 2.

This example illustrates the construction process of a partition set for a 7×2 BCV. The index set of the data set is split into eight blocks. The orthogonal array $L_8(2^7)$ is employed (see Table 1). Then the partition set of the 7×2 BCV is constructed as in Table 2. For the sample size considered here, the expectation in equation 2.2 under random m×2 CV is about 3.98. However, our construction algorithm can constrain the deviation to 2.

Table 1:
Orthogonal Array $L_8(2^7)$.
Table 2:
Mapping between Blocks and Partitions for the Partition Set of the 7×2 BCV.
Remark 6.

For a data set with sample size n, according to the construction method of theorem 2, the maximum value of m in the constructed partition set is one less than the number of blocks because the employed array is a saturated orthogonal array (Wu & Hamada, 2011).

Remark 7.

The blocked 3×2 cross validation provided by Wang et al. (2014) is a special case of the proposed m×2 BCV with m = 3. In fact, the construction method of blocked 3×2 cross validation is in accordance with our method based on the orthogonal array $L_4(2^3)$.

The construction of the partition set of the 7×2 BCV is intuitively related to that of the 3×2 BCV. In data partitioning for the 7×2 BCV, each of the four blocks from the 3×2 BCV is split further into two equal-sized subblocks. These eight blocks can then be used to construct the partition set of the 7×2 BCV. In essence, the partitions for the 7×2 BCV include the partitions for the 3×2 BCV.

Generally, the partition set of an m×2 BCV with a larger m can be constructed based on that of an m×2 BCV with a smaller m; specifically, the larger partition set is expanded from the smaller one. In this letter, this construction method is called the nested construction algorithm. It is formulated as follows:

  1. Construct a two-level orthogonal array of doubled size from the current one (Wu & Hamada, 2011). Specifically, the current array corresponds to a Hadamard matrix $H$. Then the matrix $\begin{pmatrix} H & H \\ H & -H \end{pmatrix}$ is still a Hadamard matrix, which corresponds to the doubled orthogonal array.

  2. Split each block used in the current partition set into two nearly equal-sized subblocks; for any i, the original i-th block is split evenly into two subblocks.

  3. Generate the j-th partition in the expanded partition set by employing step i of theorem 2 on the j-th column of the doubled orthogonal array and the subblocks of step ii, as sketched in the code example after this list.
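The sketch below shows one step of the nested construction under our own block-ordering convention (the first halves of the original blocks followed by their second halves), which keeps the earlier partitions unchanged; the helper names are hypothetical, and the letter's own numbering of the subblocks is not reproduced.

```python
import numpy as np
from scipy.linalg import hadamard

def partitions_from_hadamard(blocks, H):
    """Each non-constant column of H gives one partition: the blocks on the
    +1 rows form T_j; the remaining blocks form its complement."""
    return [np.sort(np.concatenate(
                [b for b, lev in zip(blocks, H[:, j]) if lev == 1]))
            for j in range(1, H.shape[0])]

def nested_double(blocks, H):
    """One expansion step: double the Hadamard matrix by the Sylvester rule
    and split every block into two halves, ordered (all first halves, then
    all second halves) so that the partitions already obtained from
    (blocks, H) reappear unchanged among the new ones."""
    H2 = np.kron(np.array([[1, 1], [1, -1]]), H)     # [[H, H], [H, -H]]
    first = [b[: len(b) // 2] for b in blocks]
    second = [b[len(b) // 2:] for b in blocks]
    return first + second, H2

# expand a 3 x 2 BCV of 32 samples into a 7 x 2 BCV that contains it
blocks4 = [np.arange(32)[i::4] for i in range(4)]    # illustrative 4-block split
P3 = partitions_from_hadamard(blocks4, hadamard(4))
blocks8, H8 = nested_double(blocks4, hadamard(4))
P7 = partitions_from_hadamard(blocks8, H8)
assert all(np.array_equal(a, b) for a, b in zip(P3, P7[:3]))   # nesting holds
```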

The following example illustrates the nested construction of the partition set of a 7×2 BCV based on that of a 3×2 BCV.

Example 3.
The partition set of the 3×2 BCV is based on the orthogonal array $L_4(2^3)$ (see Table 3) and four blocks. The upper left-hand corner subarray in Table 1 is identical to $L_4(2^3)$. Next, the four blocks are split into eight subblocks using the following rules:
formula

Finally, the partitions of the 7×2 BCV are derived using the last four columns of $L_8(2^7)$ and the eight subblocks. All the partitions in the 3×2 BCV and the 7×2 BCV partition sets are compared in Table 4. Their first three partitions are identical.

Table 3:
Orthogonal Array $L_4(2^3)$.
Table 4:
Mapping of Blocks and Partitions between the 3×2 BCV and the 7×2 BCV.

Corollary 2 indicates that an increase in m in m×2 BCV reduces the variance of the estimator of the generalization error. Thus, continually adding partitions on the basis of the previous cross-validated estimators to form the next m×2 BCV is very useful in practical experiments.

5  Selection of m

In practical applications, providing a selection method for m is necessary. Equation 3.11 of theorem 1 shows that the variance decreases as m increases. However, as m increases, the magnitude of the variance reduction declines as well, although the variance gradually decreases. We therefore consider the reduction rate of variance,
formula
5.1

If this value is very small, such as smaller than a prescribed threshold, additional replications are not required. Hence, the problem of determining m can be solved with this idea.

However, $\rho_1$ and $\rho_2$ in the reduction rate of variance are unknown. Their values should be related to the sample size, the model (algorithm) used, and so on; they should not be related to m. Based on the large number of simulation experiments conducted by Wang et al. (2014), we believe that the values of $\rho_1$ and $\rho_2$ should lie in (0, 1).

We let ARRV denote the averaged reduction rate of variance over the ranges of $\rho_1$ and $\rho_2$, regardless of the model used. Hence, we recommend determining m by limiting the ARRV to be smaller than a prescribed threshold:
formula
5.2
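This selection rule can be sketched numerically. The following code assumes the correlation structure stated in theorem 1 (variance σ² of an HO estimator, correlation ρ1 within a partition, correlation ρ2 between partitions) and derives the resulting variance of a balanced m×2 BCV estimator from it; this derived expression and the grid of (ρ1, ρ2) values are our assumptions, not the letter's equation 3.11 or Table 5, so the printed numbers need not match Table 5.

```python
import numpy as np

def var_mx2(m, rho1, rho2, sigma2=1.0):
    """Variance of a balanced m x 2 BCV estimator implied by the correlation
    structure of theorem 1: the estimator averages m S2CV estimators, each
    with variance sigma2 * (1 + rho1) / 2 and pairwise correlation rho2.
    (A derivation sketch, not the letter's equation 3.11.)"""
    return sigma2 * (1 + rho1) / (2 * m) * (1 + (m - 1) * rho2)

def arrv(m, grid=np.linspace(0.01, 0.99, 50)):
    """Averaged reduction rate of variance when moving from m to m + 1,
    averaged over an assumed grid of (rho1, rho2) values in (0, 1)."""
    r1, r2 = np.meshgrid(grid, grid)
    rate = (var_mx2(m, r1, r2) - var_mx2(m + 1, r1, r2)) / var_mx2(m, r1, r2)
    return float(rate.mean())

# pick the smallest m whose averaged reduction rate falls below a threshold
for m in range(3, 10):
    print(m, round(arrv(m), 4))
```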

Table 5 shows the ARRV of the 3×2 BCV provided by Wang et al. (2014). If one wishes the averaged reduction rate of variance to be smaller, one should make m at least larger than 5. This may provide an explanation as to why 5×2 cross validation is empirically recommended by several researchers in the comparison of algorithm performance (Dietterich, 1998; Alpaydin, 1999; Yildiz, 2013). Furthermore, an even smaller averaged reduction rate of variance requires an even larger m.

Table 5:
Averaged Reduction Rate of Variance with Regard to m.
m      ARRV      Scheme of Cross Validation
       0.1552
       0.0984