## Abstract

A cross-validation method based on $m$ replications of two-fold cross validation is called an $m\times 2$ cross validation. An $m\times 2$ cross validation is used in estimating the generalization error and comparing algorithms' performance in machine learning. However, the variance of the estimator of the generalization error in $m\times 2$ cross validation is easily affected by random partitions. Poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets in $m\times 2$ cross validation. This fluctuation results in a large variance in the $m\times 2$ cross-validated estimator. The influence of the random partitions on the variance becomes serious as $m$ increases. Thus, in this study, the partitions with a restricted number of overlapping samples between any two training (test) sets are defined as a block-regularized partition set. The corresponding cross validation is called block-regularized $m\times 2$ cross validation ($m\times 2$ BCV). It can effectively reduce the influence of random partitions. We prove that the variance of the $m\times 2$ BCV estimator of the generalization error is smaller than the variance of the $m\times 2$ cross-validated estimator and reaches the minimum in a special situation. An analytical expression of the variance can also be derived in this special situation. This conclusion is validated through simulation experiments. Furthermore, a practical construction method of $m\times 2$ BCV by a two-level orthogonal array is provided. Finally, a conservative estimator is proposed for the variance of the estimator of the generalization error.

## 1 Introduction

In machine learning research, cross-validation methods are commonly used in model selection, estimation of the generalization error, and comparison of algorithm performances. Several versions of cross validation have been developed: repeated learning-testing (RLT), standard $K$-fold cross validation, Monte Carlo cross validation, $m\times 2$ cross validation, and blocked $3\times 2$ cross validation (Dietterich, 1998; Alpaydin, 1999; Friedman, Hastie, & Tibshirani, 2001; Nadeau & Bengio, 2003; Arlot & Celisse, 2010; Yildiz, 2013; Wang, Wang, Jia, & Li, 2014). Among them, the standard two-fold cross validation has received considerable attention because of its simplicity and ease of use. For example, Nason (1996) employed two-fold cross validation and its variants to choose a threshold for wavelet shrinkage. Fan, Guo, and Hao (2012) used two-fold cross validation for variance estimation in an ultra-high-dimensional linear regression model. Stanišić and Tomović (2012) used two-fold cross validation in a frequent-item-set mining task. In practice, to improve the accuracy of estimation, data partitioning is conducted a number of times (i.e., two-fold cross validation is implemented in multiple replications). The generalization error is then often estimated by the average of the replicated two-fold cross validations.

Cross validation based on $m$ replications of two-fold cross validation is called $m\times 2$ cross validation; it is achieved by randomly splitting the data into two equal-sized blocks $m$ times. The $m\times 2$ cross validation is widely used in machine learning. Dietterich (1998) provided a $5\times 2$ cross-validated $t$-test for use in the comparison of algorithms. Alpaydin (1999) proposed a combined $5\times 2$ cross-validated $F$-test along the line of the $5\times 2$ cross-validated $t$-test and demonstrated its superiority through simulated comparisons. Yildiz (2013) adjusted the $5\times 2$ cross-validated $t$-test and conducted comparison experiments on multiple real-life data sets in the UC Irvine Machine Learning Repository, a collection of databases widely used by the machine learning community (Lichman, 2013).

However, the performance of the cross-validation method often relies on the quality of data partitioning and the accuracy (variance) of the cross-validated estimator of the generalization error. Traditionally, a data set is randomly split into multiple different training and test data sets of equal size; the training sets (test sets) from any two independent partitions contain common samples regardless of how the data set is split. The number of common samples is defined as the number of overlapping samples, as formalized in section 2. Markatou, Tian, Biswas, and Hripcsak (2005) theoretically proved that the number of overlapping samples follows a hypergeometric distribution with mathematical expectation $n/4$ (where $n$ is the size of the data set). An example and figure 2 in Wang et al. (2014) showed that the variance of the estimator of the generalization error increases when the number of overlapping samples deviates from $n/4$ in a classification setting with the support vector machine classifier. Example 1 further validates the impact of the number of overlapping samples on the variance in a simple linear regression setting.
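The hypergeometric behavior of the overlap count is easy to check empirically. The following sketch (illustrative Python, not from the original study; `overlap_count` is an invented helper) draws many pairs of random equal-sized splits and compares the average overlap with $n/4$:

```python
import random

def overlap_count(n, trials=20000, seed=0):
    """Estimate the mean number of overlapping samples between the first
    halves of two independent random equal-sized partitions of n indices."""
    rng = random.Random(seed)
    idx = list(range(n))
    counts = []
    for _ in range(trials):
        a = set(rng.sample(idx, n // 2))  # first half of partition 1
        b = set(rng.sample(idx, n // 2))  # first half of partition 2
        counts.append(len(a & b))
    return sum(counts) / trials

# The sample mean should be close to n/4 (here 100/4 = 25),
# consistent with the hypergeometric expectation.
print(overlap_count(100))
```

Individual overlap counts fluctuate around $n/4$; it is exactly this fluctuation that the block-regularized partitions of this letter suppress.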

Figure 1 shows that the covariance of any two two-fold cross-validated estimators first decreases and subsequently increases as the number of overlapping samples increases. The covariance reaches its minimum when the number of overlapping samples is $n/4$. This implies that poor data partitioning may cause a large fluctuation in the number of overlapping samples between any two training (test) sets and thus result in a large variance of the cross-validated estimator.

Wang, Li, and Li (2015) showed that the quantiles of the maximum deviation of the number of overlapping samples from $n/4$ increase as $m$ increases; thus, the influence of random partitioning on the variance of the estimator of the generalization error intensifies as $m$ increases.

For this reason, Wang et al. (2014) proposed a new blocked $3\times 2$ cross-validation method with an equal number of overlapping samples, $n/4$, in any two training (test) sets and provided an accurate theoretical expression of the variance of the blocked cross-validated estimator of the generalization error. However, they did not deeply investigate the optimal property of the estimator of the generalization error based on blocked cross validation. In the more general case of arbitrary $m$, $m\times 2$ cross validation with a restricted number of overlapping samples is called block-regularized $m\times 2$ cross validation (abbreviated as $m\times 2$ BCV). In this letter, we study the property of the $m\times 2$ BCV estimator of the generalization error theoretically and provide a novel construction algorithm of data partitioning for an $m\times 2$ BCV. Furthermore, we provide an empirical guide for the selection of the replication count $m$ and propose a conservative estimator of the variance of the block-regularized cross-validated estimator.

This letter is organized as follows. Section 2 introduces several basic notations and definitions. Section 3 presents a theoretical analysis of the variance of the $m\times 2$ BCV estimator of the generalization error. A construction method of $m\times 2$ BCV based on a two-level orthogonal array is provided in section 4. Section 5 discusses the choice of $m$ in $m\times 2$ BCV. The developed variance estimators are described in section 6. Section 7 presents the simulation experiments, and section 8 concludes.

## 2 Notations and Definitions

Generally, the generalization error is estimated by some kind of cross validation in practice. In this study, we consider $m\times 2$ cross validation, in which each standard two-fold cross validation is conducted by randomly splitting the entire data set into two equal-sized blocks. Several notations and definitions follow:

The pair $P_k=(I_k^{(1)}, I_k^{(2)})$ is called a partition of the index set $I=\{1,2,\dots,n\}$ of data set $D_n$, where $I_k^{(1)}$ and $I_k^{(2)}$ are random index sets from $I$. $I_k^{(1)}$ and $I_k^{(2)}$ satisfy $I_k^{(1)}\cup I_k^{(2)}=I$, $I_k^{(1)}\cap I_k^{(2)}=\emptyset$, and $|I_k^{(1)}|=|I_k^{(2)}|=n/2$. Then $P=\{P_k: k=1,\dots,m\}$ is the set of partitions for $m\times 2$ cross validation.

Let $T_k$ and $V_k$ denote the training and test sets indexed by $I_k^{(1)}$ and $I_k^{(2)}$, respectively. Then $T_k\cup V_k=D_n$. $T_k$ and $V_k$ alternately serve as training or test sets in a two-fold cross validation.

For two-fold cross validation, the training set $T_k$ and the test set $V_k$ are both generally called data blocks.

For any two partitions $P_k$ and $P_{k'}$ in $P$, $a_{kk'}=|I_k^{(1)}\cap I_{k'}^{(1)}|$ is defined as the number of overlapping samples between $P_k$ and $P_{k'}$, where $k,k'=1,\dots,m$ and $k\neq k'$. The matrix $A=(a_{kk'})$ can be regarded as a measure of the partition set $P$.

In fact, the $(k,k')$th and $(k',k)$th elements of $A$ are $a_{kk'}$ and $a_{k'k}$, respectively. Two partitions can yield four numbers of overlapping samples through a comparison of $I_k^{(1)}$ and $I_{k'}^{(1)}$, $I_k^{(1)}$ and $I_{k'}^{(2)}$, $I_k^{(2)}$ and $I_{k'}^{(1)}$, and $I_k^{(2)}$ and $I_{k'}^{(2)}$. However, if we let the number of overlapping samples between $I_k^{(1)}$ and $I_{k'}^{(1)}$ be $a_{kk'}$, the other three numbers of overlapping samples are equal to $n/2-a_{kk'}$, $n/2-a_{kk'}$, and $a_{kk'}$. Moreover, these four numbers have the same distribution. Therefore, we simply consider $a_{kk'}$, the number of overlapping samples between $I_k^{(1)}$ and $I_{k'}^{(1)}$, in definition 4.
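This bookkeeping can be verified directly. In the sketch below (illustrative Python; the random index sets stand in for $I_k^{(1)}$, $I_k^{(2)}$, $I_{k'}^{(1)}$, and $I_{k'}^{(2)}$), fixing the first overlap count determines the other three by set algebra:

```python
import random

n = 16
idx = set(range(n))
rng = random.Random(1)
S1 = set(rng.sample(sorted(idx), n // 2)); S2 = idx - S1   # partition k
T1 = set(rng.sample(sorted(idx), n // 2)); T2 = idx - T1   # partition k'

a = len(S1 & T1)
# The remaining three overlap counts are determined by a:
assert len(S1 & T2) == n // 2 - a
assert len(S2 & T1) == n // 2 - a
assert len(S2 & T2) == a
```

Because the assertions hold for every realization of the random splits, it suffices to track a single overlap count per pair of partitions.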

Based on the above analysis, we propose a definition of the block-regularized cross-validation partitions as follows.

Given a set of partitions $P=\{P_k\}$ of the index set $I$ for all $k$ ($k=1,\dots,m$), if the regularized condition $|a_{kk'}-n/4|\le\delta$ holds for all $k\neq k'$, in which $\delta$ is called a regularization parameter, then the partition set is called a block-regularized partition set, denoted as $P_B$, with measure $A_B$, accordingly. The $m\times 2$ cross validation on $P_B$ is called block-regularized $m\times 2$ cross validation (abbreviated as $m\times 2$ BCV). When $n$ is divisible by 4 and $\delta=0$, that is, $a_{kk'}=n/4$ for all $k\neq k'$, the corresponding BCV is called a balanced $m\times 2$ BCV. In this case, the measure for the balanced $m\times 2$ BCV degenerates into a constant matrix.

The regularization parameter $\delta$ should not exceed $n/4$, the expectation of $a_{kk'}$ under random $m\times 2$ cross validation, which is defined in equation 2.2.
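The regularized condition can be checked mechanically. Below is a minimal sketch (illustrative Python; the helper `is_block_regularized` and the toy partitions `P1` and `P2` are inventions for this example, not part of the letter):

```python
from itertools import combinations

def is_block_regularized(partitions, n, delta):
    """Check the regularized condition |a_kk' - n/4| <= delta, where a_kk'
    is the overlap of the first halves of partitions k and k'."""
    for (p, _), (q, _) in combinations(partitions, 2):
        if abs(len(p & q) - n / 4) > delta:
            return False
    return True

# Two toy partitions of I = {0, ..., 7} with overlap a = 2 = n/4:
I = set(range(8))
P1 = ({0, 1, 2, 3}, {4, 5, 6, 7})
P2 = ({0, 1, 4, 5}, {2, 3, 6, 7})
print(is_block_regularized([P1, P2], n=8, delta=0))  # balanced: overlap is exactly n/4
```

With $\delta=0$ the checker accepts only balanced partition sets; larger $\delta$ admits the looser block-regularized sets of the definition above.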

In the following sections, the $m\times 2$ cross validation is abbreviated as $m\times 2$ CV. Definitions of some estimators of the generalization error (Friedman et al., 2001) are provided in the following paragraphs.

## 3 Theoretical Analysis of the Variance of the $m\times 2$ BCV Estimator

In the expression of the variance in equation 3.2, the key issue is to comprehensively analyze the properties of the covariance of two two-fold cross-validated estimators as a function of the number of overlapping samples $a$. Lemma 9 characterizes the lower convex property of this covariance function. Lemma 12 characterizes its minimum property at $a=n/4$.

We let $L$ be the loss function, let $P_k$ and $P_{k'}$ be two random partitions of $I$, and let $a$ be the number of overlapping samples between $P_k$ and $P_{k'}$. We have:

In actuality, the parameters $\gamma_1$, $\gamma_2$, and $\gamma_3$ have relationships with $a$. Nevertheless, we mainly focus on the values of these parameters at the point $a=n/4$ because the expectation of the number of overlapping samples is $n/4$.

In order to clearly interpret the condition $\gamma_3-\gamma_2\ge\gamma_2-\gamma_1$, we provide some intuitive clarifications of $\gamma_1$, $\gamma_2$, and $\gamma_3$:

- $\gamma_1$ is the covariance of two loss functions whose test samples occur in neither of the two training sets. Specifically, the first loss function uses the training set $T_k$ of $P_k$ and a test sample $z_i\in V_k$ with $z_i\notin T_{k'}$; the second loss function uses $T_{k'}$ as the training set and tests on $z_j\in V_{k'}$ with $z_j\notin T_k$. Given that $z_i$ and $z_j$ do not appear in the two training sets and are independent, $\gamma_1$ merely measures the correlation caused by the two training sets, assuming that the correlation is affected by nothing else except the training and test sets.

- $\gamma_2$ is the covariance of two loss functions with the two training sets $T_k$ and $T_{k'}$ and two test samples $z_i\in V_k$ and $z_j\in V_{k'}$, where exactly one test sample occurs in the other training set, say $z_j\in T_k$ (or $z_i\in T_{k'}$). Given that $z_j$ occurs in the training set $T_k$ (or $z_i$ in $T_{k'}$), $\gamma_2$ measures not only the correlation caused by the two training sets but also the correlation caused by the appearance of a test sample in a training set. Therefore, $\gamma_2$ is greater than $\gamma_1$.

- $\gamma_3$ is the covariance of two loss functions with two test samples $z_i\in V_k$ and $z_j\in V_{k'}$ that both occur in the other's training set. The first loss function uses the training set $T_k$ and $z_i$ as a test sample, and the second loss function uses the training set $T_{k'}$ and $z_j$ as a test sample, with $z_i\in T_{k'}$ and $z_j\in T_k$. Therefore, $\gamma_3$ measures the correlations caused by the two training sets and by the appearance of both test samples in the other's training sets. Intuitively, $\gamma_3$ is greater than $\gamma_2$.

Furthermore, the condition actually measures the differences among $\gamma_1$, $\gamma_2$, and $\gamma_3$. Specifically, it measures how the increment of the covariance of two loss functions changes with the occurrence of one partition's test sample in the other partition's training set. $\gamma_2-\gamma_1$ indicates the increment of covariance caused by only one test sample occurring in a training set, and $\gamma_3-\gamma_1$ indicates the increment of covariance caused by both test samples appearing in the training sets. Therefore, the intuitive interpretation of the condition is $\gamma_3-\gamma_2\ge\gamma_2-\gamma_1$; that is, the increase from $\gamma_1$ to $\gamma_3$ is nonlinear, and the increment between $\gamma_2$ and $\gamma_3$ is larger than that between $\gamma_1$ and $\gamma_2$.

Proving that the condition $\gamma_3-\gamma_2\ge\gamma_2-\gamma_1$ holds in broad families of loss functions, algorithms, and data populations is difficult. However, with the squared loss function, proving that the condition holds for mean regression and multivariate regression is possible. The detailed proofs are in the appendix. Moreover, some simulation results are presented in section 7.2 to illustrate that this condition is true.

We let $\hat\mu_k$ and $\hat\mu_{k'}$ be two standard two-fold cross-validation (S2CV) estimators of the generalization error on partitions $P_k$ and $P_{k'}$, and let $a$ be the number of overlapping samples between them. Then, for any $a$, $\mathrm{Cov}(\hat\mu_k,\hat\mu_{k'})=g(a)$. The function $g(a)$ has the following two properties:

Symmetry: $g(a)=g(n/2-a)$.

Boundedness: $g(a)$ is bounded for $0\le a\le n/2$.

Obviously, $g(a)=g(n/2-a)$; that is, $g$ is a symmetric function, and its symmetry axis is $a=n/4$. In particular, $g(0)=g(n/2)$.

In the simulated experiments in section 7.3, we provide some simulated curves of the function $g(a)$ and their approximations in the parameters $\gamma_1$, $\gamma_2$, and $\gamma_3$ in a neighborhood of $a=n/4$.

- $\sigma^2$ is the variance of the hold-out (HO) estimator with a training set of size $n/2$.

- $\rho_1$ is the correlation coefficient between the two HO estimators within an S2CV estimator.

- $\rho_2$ is the correlation coefficient of any two S2CV estimators in an $m\times 2$ BCV estimator.

The proof relies on lemma 12. Specifically, to prove the first inequality in equation 3.10, we consider the random variable $g(a)$, in which $a$ is the number of overlapping samples under random partitioning. Due to the hypergeometric distribution of $a$, we can obtain $E(a)=n/4$. Then, by employing Jensen's inequality on the lower convex function $g$, we can obtain $E[g(a)]\ge g(E(a))=g(n/4)$.

Thus, the first inequality holds. The second inequality can be derived directly because $g(a)$ reaches its minimum at $a=n/4$.
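The Jensen step can be illustrated numerically. The sketch below (illustrative Python) uses a toy convex stand-in for $g$, since the true covariance curve depends on the learning algorithm, and checks that $E[g(a)]\ge g(n/4)$ when $a$ is the overlap count of random partitions:

```python
import random

n = 40

def g(a):
    # A toy convex function symmetric about n/4, standing in for the
    # covariance curve g(a); its exact form is NOT the one in the letter.
    return (a - n / 4) ** 2 + 1.0

rng = random.Random(0)
idx = list(range(n))
samples = []
for _ in range(20000):
    a_set = set(rng.sample(idx, n // 2))  # first half of partition 1
    b_set = set(rng.sample(idx, n // 2))  # first half of partition 2
    samples.append(g(len(a_set & b_set)))

mean_g = sum(samples) / len(samples)
# Jensen: E[g(a)] >= g(E[a]) = g(n/4) for convex g, so random partitioning
# cannot do better than holding the overlap fixed at a = n/4.
print(mean_g, g(n / 4))
assert mean_g >= g(n / 4)
```

The gap between `mean_g` and `g(n/4)` is exactly the excess variance that the balanced BCV construction removes.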

The variance of a balanced $m\times 2$ BCV estimator obviously decreases with the increment of $m$. As $m$ increases, the proportion of the second part of the variance becomes large.

## 4 Nested Construction Algorithm of $P_B$ for $m\times 2$ BCV

Although $m\times 2$ BCV has good properties, it cannot see widespread practical use unless it can be easily constructed. A classical construction method of $P_B$ for $m\times 2$ BCV is provided in McCarthy (1976). That method employs the rows of a two-level orthogonal array. Specifically, the data set is divided into blocks based on the columns of the orthogonal array. Then, a partition in $P_B$ can be derived by combining the blocks according to the levels of each row of the orthogonal array. A weakness of this construction method is that it is not nested; that is, the new $P_B$ for a larger $m$ does not include the previous $P_B$ for a smaller $m$. Thus, $P_B$ must be reconstructed from the beginning to create a BCV with a larger $m$. Accordingly, model training and testing must be redone for any different $m$ of BCV.

In this section, we propose a nested construction algorithm of $P_B$ for $m\times 2$ BCV. The algorithm can construct $P_B$ for a larger $m\times 2$ BCV along with an increment of $m$. The nested construction algorithm and its theoretical guarantee are presented in theorem 16:

Assuming that a data set $D_n$ of size $n$ can be split into $2^k$ (for a given $k$) disjoint and almost equal-sized blocks (such that the maximum difference of the sizes of any two blocks is one), a partition set can be constructed by using a two-level orthogonal array according to the following two steps:

The $j$th column of the orthogonal array corresponds to the partition $P_j$, where $I_j^{(1)}$ is the union of the blocks indexed by the rows at the "+" level of the $j$th column. Similarly, the blocks indexed by the rows at the "-" level of the $j$th column form the set $I_j^{(2)}$.

According to step i, by taking $j$ over all columns of the orthogonal array, we can obtain the partitions $P_j$ of $P_B$ for $m\times 2$ BCV.

Then, $P_B$ is a block-regularized partition set with the regularized condition $|a_{kk'}-n/4|\le 2^{k-2}$ for any $k\neq k'$.

The orthogonal array corresponds to a matrix with $2^k$ rows and $2^k-1$ columns. The elements of the matrix consist of "+" and "-", which are called levels in statistics. For any two columns of the array, there are only four level combinations, $(+,+)$, $(+,-)$, $(-,+)$, and $(-,-)$, with equal replicated times for each combination. This condition means that the replicated time of each combination is $2^{k-2}$; that is, any two columns share $2^{k-2}$ common blocks at the same level. Thus, the $2^k-1$ replications of two-fold cross validation constructed by the above operation form the $P_B$ of $(2^k-1)\times 2$ BCV.

Since the maximum difference of the sizes of any two of the $2^k$ disjoint and almost equal-sized blocks is one, and the training (test) sets from any two partitions contain $2^{k-2}$ common blocks, $|a_{kk'}-n/4|\le 2^{k-2}$ for any $k\neq k'$.
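A minimal sketch of this construction (illustrative Python; a Sylvester Hadamard matrix stands in for a general $L_{2^k}(2^{2^k-1})$ array, and $n$ divisible by $2^k$ is assumed so that every pairwise overlap is exactly $n/4$):

```python
import itertools

def hadamard(k):
    """Sylvester construction of a 2^k x 2^k Hadamard matrix with +/-1 entries."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def build_partitions(n, k):
    """Split indices 0..n-1 into 2^k equal blocks and use the 2^k - 1
    non-constant columns of H_{2^k} to form the partitions (step i)."""
    nb = 2 ** k
    blocks = [list(range(n))[i::nb] for i in range(nb)]
    H = hadamard(k)
    partitions = []
    for col in range(1, nb):                # skip the all-ones column 0
        half1 = [i for b in range(nb) if H[b][col] == 1 for i in blocks[b]]
        partitions.append(set(half1))
    return partitions

parts = build_partitions(32, 3)             # n = 32, 8 blocks, 7 partitions
for p, q in itertools.combinations(parts, 2):
    assert len(p & q) == 32 // 4            # every pairwise overlap equals n/4
```

Because any two non-constant Hadamard columns share the "+" level in exactly $2^{k-2}$ rows, every pair of partitions overlaps in exactly $2^{k-2}$ blocks, giving the regularized condition above.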

This example illustrates the construction process of $P_B$ of $7\times 2$ BCV. The index set $I$ of data set $D_n$ is split into eight blocks. The orthogonal array $L_8(2^7)$ is employed (see Table 1). Then the $P_B$ of $7\times 2$ BCV is constructed as shown in Table 2. Under random partitioning, the expected maximum deviation of the number of overlapping samples from $n/4$ in $7\times 2$ CV is about 3.98. However, our construction algorithm can constrain the deviation to at most 2.

| Row Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 | + | + | + | + | + | + | + |
| 2 | - | + | - | + | - | + | - |
| 3 | + | - | - | + | + | - | - |
| 4 | - | - | + | + | - | - | + |
| 5 | + | + | + | - | - | - | - |
| 6 | - | + | - | - | + | - | + |
| 7 | + | - | - | - | - | + | + |
| 8 | - | - | + | - | + | + | - |


For a data set with sample size $n$ partitioned into $2^k$ blocks, according to the construction method of theorem 16, the maximum value of $m$ in the $P_B$ of $m\times 2$ BCV should be $2^k-1$ (7 in the example above) because the employed $L_{2^k}(2^{2^k-1})$ is a saturated orthogonal array (Wu & Hamada, 2011).

The blocked $3\times 2$ cross validation provided by Wang et al. (2014) is a special case of the proposed $m\times 2$ BCV with $m=3$. In fact, the construction method of blocked $3\times 2$ cross validation is in accordance with our method based on $L_4(2^3)$.

The construction of the $P_B$ of $7\times 2$ BCV is intuitively related to the $P_B$ of $3\times 2$ BCV. In the data partitioning for the $P_B$ of $7\times 2$ BCV, each of the four blocks from the $P_B$ of $3\times 2$ BCV is split further into two equal-sized subblocks. These eight blocks can also be used to construct the $P_B$ of $3\times 2$ BCV. In essence, the partitions for the $P_B$ of $7\times 2$ BCV include the partitions for the $P_B$ of $3\times 2$ BCV.

Generally, the $P_B$ of $(2^{k+1}-1)\times 2$ BCV can be constructed based on the $P_B$ of $(2^k-1)\times 2$ BCV. Specifically, the $L_{2^{k+1}}(2^{2^{k+1}-1})$ of the former is expanded from the $L_{2^k}(2^{2^k-1})$ of the latter. In this letter, this construction method is called the nested construction algorithm. It is formulated as follows:

Construct an orthogonal array $L_{2^{k+1}}(2^{2^{k+1}-1})$ based on $L_{2^k}(2^{2^k-1})$ (Wu & Hamada, 2011). Specifically, the $L_{2^k}(2^{2^k-1})$ corresponds to a Hadamard matrix $H_{2^k}$. Then the matrix $\begin{pmatrix} H_{2^k} & H_{2^k} \\ H_{2^k} & -H_{2^k} \end{pmatrix}$ is still a Hadamard matrix, which corresponds to $L_{2^{k+1}}(2^{2^{k+1}-1})$.

Split all of the blocks used in the $P_B$ of $(2^k-1)\times 2$ BCV into two nearly equal-sized subblocks. For any $j$, the original $j$th block should be split evenly, with its halves denoted as the $j$th subblock and the $(j+2^k)$th subblock.

Generate the $j$th partition in the $P_B$ of $(2^{k+1}-1)\times 2$ BCV by employing step i of theorem 16 on the $j$th column of the $L_{2^{k+1}}(2^{2^{k+1}-1})$ and the subblocks of step ii.

The following example illustrates the nested construction of the $P_B$ of $7\times 2$ BCV based on the $P_B$ of $3\times 2$ BCV.

Finally, the partitions $P_4,\dots,P_7$ in the $P_B$ of $7\times 2$ BCV are derived using the last four columns of $L_8(2^7)$ and the eight subblocks. All the partitions in the $P_B$ of $3\times 2$ BCV and of $7\times 2$ BCV are compared in Table 4. Their first three partitions are shown to be identical.
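The nesting property can be checked in a few lines (illustrative Python, using Sylvester Hadamard matrices as stand-ins for the orthogonal arrays; the interleaved block indexing splits coarse block $j$ into subblocks $j$ and $j+4$):

```python
def hadamard(k):
    """Sylvester construction of a 2^k x 2^k Hadamard matrix."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def partitions_from(n, k):
    """Partitions from the 2^k - 1 non-constant columns of H_{2^k},
    with indices 0..n-1 split into 2^k interleaved blocks."""
    nb = 2 ** k
    blocks = [list(range(n))[i::nb] for i in range(nb)]
    H = hadamard(k)
    return [frozenset(i for b in range(nb) if H[b][c] == 1 for i in blocks[b])
            for c in range(1, nb)]

small = partitions_from(32, 2)   # 3 partitions from 4 blocks
large = partitions_from(32, 3)   # 7 partitions from 8 blocks

# The doubling H_{2^{k+1}} = [[H, H], [H, -H]] keeps the first 2^k - 1
# columns consistent with H_{2^k}, so the first three partitions of the
# 7x2 scheme coincide with the 3x2 partitions: no retraining is needed.
assert large[:3] == small
```

This is exactly the practical payoff of the nested algorithm: the models fitted for the $3\times 2$ BCV can be reused unchanged when the design is extended to $7\times 2$.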

Corollary 15 indicates that an increase in $m$ in $m\times 2$ BCV reduces the variance of the estimator of the generalization error. Thus, continually adding partitions on the basis of the previous cross-validated estimators to form the next BCV is very useful in practical experiments.

## 5 Selection of $m$

Theorem 13 shows that the variance of the $m\times 2$ BCV estimator decreases as $m$ increases. However, as $m$ increases, the magnitude of the variance reduction declines as well. We therefore consider the reduction rate of variance from $m$ to $m+1$ replications.

If this value is very small (smaller than a given threshold), additional replications are not required. Hence, the problem of determining $m$ can be solved with this idea.

However, $\rho_1$ and $\rho_2$ in the reduction rate of variance are unknown. The values of $\rho_1$ and $\rho_2$ should be related to the sample size, the model (algorithm) used, and so on; they should not be related to $m$. Based on the large number of simulation experiments conducted by Wang et al. (2014), we believe that the values of $\rho_1$ and $\rho_2$ lie in a limited range.
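As a numerical illustration of this diminishing return, the sketch below (illustrative Python) computes the variance of the balanced $m\times 2$ BCV estimator under an equicorrelation model implied by the definitions of $\rho_1$ and $\rho_2$ in section 3; the parameter values $\sigma^2=1$, $\rho_1=0.6$, and $\rho_2=0.5$ are made up for the example, not estimates from the letter:

```python
def var_m2bcv(m, sigma2=1.0, rho1=0.6, rho2=0.5):
    """Variance of the m x 2 BCV estimator under an equicorrelation model:
    each S2CV estimator has variance sigma2 * (1 + rho1) / 2, and any two
    S2CV estimators have correlation rho2 (illustrative values only)."""
    var_s2cv = sigma2 * (1 + rho1) / 2
    return var_s2cv * (1 + (m - 1) * rho2) / m

for m in range(1, 10):
    v, v_next = var_m2bcv(m), var_m2bcv(m + 1)
    print(m, round(v, 4), round((v - v_next) / v, 4))  # reduction rate shrinks with m
```

Under this model the variance tends to the floor $\rho_2\,\sigma^2(1+\rho_1)/2$ as $m$ grows, so the reduction rate decays quickly and most of the achievable gain is realized by moderate $m$.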

Table 5 reports the averaged reduction rate of variance (ARRV) of the blocked $3\times 2$ BCV provided by Wang et al. (2014). If one wishes the ARRV to be small, one should make $m$ at least larger than 5. This may provide an explanation as to why $5\times 2$ cross validation is empirically recommended by several researchers in the comparison of algorithm performance (Dietterich, 1998; Alpaydin, 1999; Yildiz, 2013). Furthermore, if one wishes the ARRV to be even smaller, one must make $m$ larger still.