## Abstract

In learning theory, the training and test sets are assumed to be drawn from the same probability distribution. This assumption is also followed in practical situations, where matching the training and test distributions is considered desirable. Contrary to conventional wisdom, we show that mismatched training and test distributions in supervised learning can in fact outperform matched distributions in terms of the bottom line, the out-of-sample performance, independent of the target function in question. This surprising result has theoretical and algorithmic ramifications that we discuss.

## 1 Introduction

A basic assumption in learning theory is that the training and test sets are drawn from the same probability distribution. Indeed, adjustments to the theory become necessary when there is a mismatch between training and test distributions. As we discuss, a significant body of work introduces techniques that transform mismatched training and test sets in order to create matched versions. However, the fact that the theory requires a matched distribution assumption to go through does not necessarily mean that matched distributions will lead to better performance, just that they lead to theoretically more predictable performance. The question of whether they do lead to better performance has not been addressed in the case of supervised learning, perhaps because of an intuitive expectation that the answer would be yes.

The result we report here is that, surprisingly, mismatched distributions can outperform matched distributions. Specifically, the expected out-of-sample performance in supervised learning can be better if the test set is drawn from a probability distribution that is different from the probability distribution from which the training data had been drawn, and vice versa. In the case of active learning, this would not be so surprising since active learning algorithms deliberately alter the training distribution as more information is gathered about where the decision boundary of the target function is, for example. In our case of supervised learning, we deal with an unknown target function where the decision boundary can be anywhere. Nonetheless, we show that a mismatched distribution, unrelated to any decision boundary, can still outperform the matched distribution, a surprising fact that runs against the conventional wisdom in supervised learning. We first put our result in the context of previous matching work and then discuss the result from theoretical and empirical points of view.

In many practical situations, the assumption that the training and test sets are drawn from the same probability distribution does not hold. Examples where this mismatch has required corrections can be found in natural language processing (Jiang & Zhai, 2007), speech recognition (Blitzer, Dredze, & Pereira, 2007), and recommender systems, among others. The problem is referred to as data set shift and is sometimes subdivided into covariate shift and sample selection bias, as described in Quiñonero-Candela, Sugiyama, Schwaighofer, and Lawrence (2009). Various methods have been devised to correct this problem, and this effort is part of the ongoing work on domain adaptation and transfer learning. The numerous methods can be roughly divided into four types (Margolis, 2011).

The first type is referred to as instance weighting for covariate shift, in which weights are given to points in the training set so that the two distributions become effectively matched. Some of these methods include discriminative approaches, as in Bickel, Brückner, and Scheffer (2007, 2009); others make assumptions regarding the source of the bias and explicitly model a selection bias variable (Zadrozny, 2004); others try to match the two distributions in some reproducing kernel Hilbert space, as in kernel mean matching (Huang, Smola, Gretton, Borgwardt, & Schölkopf, 2007); others estimate the weights directly using criteria such as the Kullback-Leibler divergence, as in KLIEP (Sugiyama, Nakajima, Kashima, Von Buenau, & Kawanabe, 2008), or least squares deviation, as in LSIF (Kanamori, Hido, & Sugiyama, 2009). Additional approaches are given in Rosset, Zhu, Zou, and Hastie (2004); Cortes, Mohri, Riley, and Rostamizadeh (2008); and Ren, Shi, Fan, and Yu (2008). All of these methods rely on finding weights, which is not trivial since the actual distributions are not known; furthermore, the addition of weights reduces the effective sample size of the training set, hurting the out-of-sample performance (Shimodaira, 2000). Cross-validation is also an issue and is addressed in methods like importance-weighted cross-validation (Sugiyama et al., 2008). Learning bounds for the instance-weighting setting are shown in Cortes, Mansour, and Mohri (2010) and Zhang, Zhang, and Ye (2012). Further theoretical results in a more general setting of learning from different domains are given in Ben-David et al. (2010).
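As a minimal sketch of the instance-weighting idea, consider the toy setup below (entirely our own choices: the density ratio is computed from the true densities, whereas in practice it must be estimated, for example by kernel mean matching or KLIEP). The target is quadratic while the model is a line, so the weighting genuinely changes the best fit toward the test region.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def weighted_lsq(X, y, w):
    """Closed-form argmin of sum_i w_i * (y_i - x_i . beta)^2."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Toy covariate shift: training inputs near 0, test inputs near 1.
x_tr = rng.normal(0.0, 0.5, size=500)
X_tr = np.column_stack([np.ones_like(x_tr), x_tr])
y_tr = x_tr ** 2                      # target y = x^2, model is a line

# True density ratio p_test / p_train (hypothetical shortcut; in practice
# this ratio must be estimated from data).
w = gauss_pdf(x_tr, 1.0, 0.5) / gauss_pdf(x_tr, 0.0, 0.5)

beta_plain = weighted_lsq(X_tr, y_tr, np.ones_like(x_tr))
beta_weighted = weighted_lsq(X_tr, y_tr, w)

x_te = rng.normal(1.0, 0.5, size=50_000)
err = lambda b: np.mean((x_te ** 2 - (b[0] + b[1] * x_te)) ** 2)
print(err(beta_plain), err(beta_weighted))  # weighting should help here
```

Under model misspecification, the unweighted fit minimizes error where the training mass is, while the weighted fit trades that for accuracy where the test mass is.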

The second type of methods uses self-labeling or co-training techniques, in which unlabeled samples from the test set are introduced into the training set in order to match the distributions; they are labeled using a model trained on the labeled data, and a final model is then reestimated with these new points. Some of these methods are described in Blum and Mitchell (1998), Leggetter and Woodland (1995), and Digalakis, Rtischev, and Neumeyer (1995). A third approach is to change the feature representation, so that features are selected, discarded, or transformed in an effort to make the training and test distributions similar. This idea is explored in various methods, including Blitzer et al. (2007), Blitzer, McDonald, and Pereira (2006), Ben-David, Blitzer, Crammer, and Pereira (2007), and Pan, Kwok, and Yang (2008), among many others. Finally, cluster-based methods rely on the assumption that the decision boundaries lie in low-density regions (Gao, Fan, Jiang, & Han, 2008) and hence try to label new data in regions that are underrepresented in the training set through clustering, as proposed in Blum (2001) and Ng, Jordan, and Weiss (2002). (For a more substantial review of these and other methods, refer to Margolis, 2011, and Sugiyama & Kawanabe, 2012.)

However, while great effort has been spent trying to match the training and test distributions, a thorough analysis of the need for matching has not been carried out. This letter shows that mismatched distributions can in fact outperform matched distributions. This is important not only from a theoretical point of view but also for practical reasons. The methods that have been proposed for matching the distributions not only increase the computational complexity of the learning algorithms but also may result in an effective sample size reduction due to the sampling or weighting mechanisms used for matching. Recognizing that the system may perform better under a scenario of mismatched distributions can influence the need for, and the extent of, matching techniques, as well as the quantitative objective of matching algorithms.

In our analysis, we show that a mismatched distribution can be better than a matched distribution in two directions:

- For a given training distribution *P _{R}*, the best test distribution *P _{S}* can be different from *P _{R}*.
- For a given test distribution *P _{S}*, the best training distribution *P _{R}* can be different from *P _{S}*.

The justifications for these two directions, as well as their implications, are quite
different. In a practical setting, the test distribution is usually fixed, so the second
direction reflects the practical learning problem about what to do with the training data if
they are drawn from a different distribution from that of the test environment. One of the
ramifications of this direction is the new notion of a dual distribution. This is a training
distribution *P _{R}* that is optimal to use when the test
distribution is

*P _{S}*. A dual distribution serves as a new target distribution for matching algorithms. Instead of matching the training distribution to the test distribution, it is matched to a dual of the test distribution for optimal performance. The dual distribution depends only on the test distribution, not on the particular target function of the problem.

The organization of this letter is as follows. Section 2 describes extensive simulations that give an empirical answer to the key questions, along with a discussion of those empirical results. The theoretical analysis follows in section 3, where analytical tools are used to exhibit particular mismatched training and test distributions that lead to better out-of-sample performance in a general regression case. The notion of a dual distribution is discussed in section 4. Section 5 explains how the results presented and the dual distribution concept differ from related ideas in active learning, followed by the conclusion in section 6.

## 2 Empirical Results

Consider the scenario where the data set *R* used for training by the
learning algorithm is drawn from probability distribution *P _{R}*,
while the data set *S* that the algorithm will be tested on is drawn from distribution *P _{S}*. We show here that the performance of the learning algorithm in terms of the out-of-sample error can be better when *P _{R}* ≠ *P _{S}*, averaging over target functions and data set realizations. The empirical evidence, which is statistically significant, is based on an elaborate Monte Carlo simulation that involves various target functions and probability distributions. The details of that simulation follow, and the results are illustrated in Figures 1 and 3.

We consider a one-dimensional input space, the interval [−1, 1]. There is no loss of generality in limiting our domain because in any practical situation the data have a finite domain and can be rescaled to the desired interval. We run the learning algorithm for different target functions and different training and test distributions, average the out-of-sample error over a large number of data sets generated by those distributions and over target functions, and then compare the results for matched and mismatched distributions.

### 2.1 Simulation Setup

#### 2.1.1 Distributions

The following distributions are used to generate *R* and *S*: 1 uniform distribution on [−1, 1]; 10 truncated gaussian distributions, with the standard deviation increased in steps of 0.3; 10 truncated exponential distributions, with the time constant likewise increased in steps of 0.3; and 10 truncated mixtures of two gaussians, with the corresponding scale parameter increased in steps of 0.25. By truncating the distributions, we mean that if *X* has a truncated gaussian distribution on [−1, 1] and *Y* has the corresponding untruncated gaussian distribution with density *f _{Y}*, then the density of *X* is *f _{Y}*(*x*)/*Z* for *x* ∈ [−1, 1] and 0 elsewhere, where *Z* is a normalizing constant (the probability that *Y* lands in [−1, 1]). This applies as well to the truncated exponential and mixture of gaussian distributions.
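Such truncated distributions can be sampled by simple rejection, which is equivalent to the density division just described; a minimal sketch (function names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_truncated(sampler, n, lo=-1.0, hi=1.0):
    """Draw n points from the distribution `sampler` truncated to [lo, hi].

    Rejection sampling: keeping only draws that land inside [lo, hi] is
    equivalent to dividing the parent density by the normalizing constant
    Z = P(lo <= Y <= hi)."""
    out = np.empty(0)
    while out.size < n:
        batch = sampler(2 * n)                       # oversample, then filter
        out = np.concatenate([out, batch[(batch >= lo) & (batch <= hi)]])
    return out[:n]

# Example: a truncated gaussian with sigma = 0.3 on [-1, 1].
x = sample_truncated(lambda m: rng.normal(0.0, 0.3, m), 10_000)
print(x.min(), x.max())  # both inside [-1, 1]
```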

#### 2.1.2 Data Sets

For each pair of probability distributions, we carry out the simulation generating 1000
different target functions, running the learning algorithm, comparing the out-of-sample
performance, and then averaging over 100 different data set realizations. That is, each
point in Figures 1 and 3 is an average over 100,000 runs with the same pair of
distributions but with different combinations of target functions and training and test
sets. The sizes of the data sets are *N _{R}* and *N _{S}* (300 is among the values used), where *N _{R}* and *N _{S}* are the number of points in the training and test sets *R* and *S*, respectively.

#### 2.1.3 Target Functions

The target functions were generated by taking the sign of a polynomial in the desired interval. The polynomials were formed by choosing at random one to five roots in the interval [−1, 1]. The learning algorithm minimized a squared loss function using a nonlinear transformation of the input space as features. The nonlinear transformation used powers of the input variable up to the number of roots of the polynomial, plus a sinusoidal feature, which allows the model to learn a function that is close, but not identical, to the target. This choice of target functions allows the decision boundaries to vary in both number and location in each realization. Hence, the results presented do not depend on a particular target function, and the distributions cannot favor the regions around the boundaries, as these change in each realization. Notice that there is no added stochastic noise, so the two classes could be perfectly separated with an appropriate hypothesis set.
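A sketch of this target generator (reconstructed from the description above; details such as the root distribution being uniform are our assumptions):

```python
import numpy as np

def random_target(rng):
    """Sign of a polynomial with 1 to 5 random roots in [-1, 1]."""
    roots = rng.uniform(-1.0, 1.0, size=rng.integers(1, 6))
    return lambda x: np.sign(np.prod([x - r for r in roots], axis=0))

rng = np.random.default_rng(2)
f = random_target(rng)
x = np.linspace(-1, 1, 1001)
y = f(x)
print(np.sum(y[1:] != y[:-1]))  # number of boundary crossings (at most 5)
```

Because the roots are redrawn in every realization, the decision boundaries move around, which is what prevents any fixed distribution from favoring boundary regions.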

**Out-of-Sample Error.** The expected out-of-sample error in this classification task is estimated using the test set generated according to each of the *P _{S}*. It is computed as the misclassification 0-1 loss, that is, *E*_{out} = **E**_{*x*}[[[*h*(*x*) ≠ *f*(*x*)]]], where **E**_{*x*} denotes the expected value with respect to the distribution of the random variable *x* (here *P _{S}*), [[*a*]] denotes the indicator function of expression *a*, *f* is the target function, *R* is the training data set generated according to *P _{R}*, and *h* is the learned function.
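The estimator just described can be written directly; the target and hypothesis below are stand-ins of our own choosing, used only to exercise the 0-1 loss:

```python
import numpy as np

def out_of_sample_error(h, f, x_test):
    """Monte Carlo estimate of E_x[[h(x) != f(x)]] (0-1 loss), where
    x_test is drawn from the test distribution P_S."""
    return float(np.mean(h(x_test) != f(x_test)))

# Stand-in target and hypothesis (illustration only).
f = lambda x: np.sign(x)
h = lambda x: np.sign(x - 0.1)
x_test = np.random.default_rng(3).uniform(-1, 1, 100_000)
print(out_of_sample_error(h, f, x_test))  # ≈ 0.05, the mass of (0, 0.1)
```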

### 2.2 Fixing the Training Distribution

Figure 1 shows the result of the simulation. Each entry in the matrix corresponds to a pair of distributions *P _{R}* and *P _{S}*. We fix *P _{R}* and evaluate the percentage of runs where using *P _{S}* ≠ *P _{R}* yields better out-of-sample performance than using *P _{S}* = *P _{R}*; that is, each entry computes the quantity in equation 2.3.

The matrix places families of distributions together, in increasing order of standard deviation or time constant. The result that immediately stands out is that in a significant number of entries, more than 50% of the runs have better performance when mismatched distributions are used, as indicated by the yellow, orange, and red regions, which constitute a sizable fraction of all combinations of the probability distributions used.

A number of interesting patterns are worth noting in this plot. The first row, which corresponds to the uniform *P _{R}*, falls under the category of better performance for mismatched distributions for almost any other *P _{S}* used. There is also a block structure in the plot, which is no accident, given the way the families of distributions are grouped. Among these blocks, the lower triangular part of the blocks on the diagonal corresponds to cases where the distributions are mismatched but out-of-sample performance is better. We also note that the blocks in the upper-right and lower-left corners show the same pattern in their lower triangular parts.

Perhaps it is already clear to readers why this direction of our result is not particularly surprising, and in fact it is not all that significant in practice either. In the setup depicted in this part of the simulation, if we are able to choose a test distribution, then we might as well choose a distribution that concentrates on the region that the system learned best. Such regions are likely to correspond to areas where large concentrations of training data are available. This can be expressed in terms of lower-entropy test distributions, which are overconcentrated around the areas of higher density of training points. Such concentration results in a better average out-of-sample performance than that of the matched case *P _{S}* = *P _{R}*.

Figure 2 illustrates the entropy of the different distributions. We plot *H*(*P _{S}*) versus *H*(*P _{R}*), where *H* denotes the entropy of the distribution, marking the cases where using *P _{S}* ≠ *P _{R}* resulted in better out-of-sample performance of the algorithm. As is clear from the plot, these cases occur when *H*(*P _{S}*) < *H*(*P _{R}*).
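The entropy ordering invoked here can be checked numerically; the sketch below (our own discretization) computes the differential entropy of a truncated gaussian on [−1, 1] and confirms that narrower distributions have lower entropy:

```python
import numpy as np

def trunc_gauss_entropy(sigma, n=100_001):
    """Differential entropy H = -sum(p log p dx) of N(0, sigma^2)
    truncated to [-1, 1], via a simple Riemann sum."""
    x, dx = np.linspace(-1.0, 1.0, n, retstep=True)
    p = np.exp(-0.5 * (x / sigma) ** 2)
    p /= p.sum() * dx            # normalize: divide by the constant Z
    return float(-(p * np.log(p)).sum() * dx)

# Narrower truncated gaussians have lower entropy; the uniform, at log 2,
# is the maximum-entropy distribution on [-1, 1].
for s in (0.3, 0.9, 3.0):
    print(round(trunc_gauss_entropy(s), 3))
```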

A simple way to think of the problem is the following: suppose we could freely choose a test distribution, and our learning algorithm outputs the learned parameters that minimize some loss function on a training data set *R*. Then, to minimize the out-of-sample error, we would choose *P _{S}*(*x*) = δ(*x* − *x*₀), where δ is the Dirac delta function and *x*₀ is the point in the input space where the minimum out-of-sample error occurs.

Results similar to those shown in Figure 1 are found for the other data set sizes used.

### 2.3 Fixing the Test Distribution

Figure 3 shows the result of the simulation in the
other direction. Each entry in the matrix again corresponds to a pair of distributions *P _{R}* and

*P _{S}*. However, this time we fix *P _{S}* and evaluate the percentage of runs where using *P _{R}* ≠ *P _{S}* yields better out-of-sample performance than using *P _{R}* = *P _{S}*. More precisely, once again each entry computes the quantity in equation 2.3. Notice that this is the case that occurs in practice, where the distribution the system will be tested on is fixed by the problem statement. However, the training set might have been generated with a different distribution, and we would like to determine whether training with a data set coming from *P _{S}* would have resulted in better out-of-sample performance. If the answer is yes, then one can use the matching algorithms that we mentioned to transform the training set into what would have been generated using *P _{S}* rather than the alternate distribution that generated the training set.

The simulation result is quite surprising, as once again there is a significant number of entries where more than 50% of the runs have better performance when mismatched distributions are used. For a substantial fraction of the entries, a mismatch between *P _{R}* and *P _{S}* results in lower out-of-sample error, as indicated by the light green, yellow, orange, and red entries in the matrix.

In this case, although the block structure is still present, there is no longer a clear pattern relating the entropies of the training and test distributions that easily explains the result, as in the previous simulation. Notice that there are cases where the mismatch is better when we choose *P _{R}* of either lower or higher entropy than the given *P _{S}*. This is clear in the plot, since the indicated regions in the block structure are no longer lower triangular but occupy both sides of the diagonal. This effect is analyzed further from a theoretical point of view in the following section. Since analyzing it theoretically is intractable in the case of classification tasks due to the nonlinearities, we carry out the analysis in a regression setting, noting that the Monte Carlo simulations provide empirical evidence that the result also holds in the classification setting.

## 3 Theoretical Results

We now move to a theoretical approach to the above questions. We have shown empirical
evidence that a mismatch in distributions can lead to better out-of-sample performance in
the classification setting, and now we focus on the regression setting to cover the other
major class of learning problems. In this section, we derive expressions for the expected
out-of-sample error as a function of *x*, a general test point in the input
space , and *R*, the training set,
averaging over target functions and noise realizations. We will derive closed-form solutions
as well as bounds that show the existence of mismatched distributions *P _{R}* ≠ *P _{S}* with better out-of-sample performance than the matched case *P _{R}* = *P _{S}*.

The target function *f* is more complex than the elements of the hypothesis set, so *f* cannot be reproduced exactly by the model; hence the deterministic noise.

Using this formula for the target function allows for a wide variety of functions, since *C* can be as large as desired and we can use an arbitrary nonlinear transformation. Indeed, almost every function on the interval [−1, 1] can be expressed this way. For example, we could take the set of nonlinear transformations to be the harmonics of the Fourier series, so that with a large enough *C*, any function *f* that satisfies the Dirichlet conditions can be represented this way as a truncated Fourier series. Figure 4 shows just a few examples of the class of functions that can be represented using such a nonlinear transformation.

Finally, we make the usual independence assumption about the noise: the stochastic noise has a diagonal covariance matrix σ²*I*, where σ² is the noise variance and *I* is the identity matrix. Similarly, we assume that the energy of the features not included in the model is finite. For example, choosing Fourier harmonics as the nonlinear transformations guarantees a diagonal covariance matrix.
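As a concrete sketch of this regression setting (all parameter choices below are our own): the target is a truncated sine series with *C* = 21 harmonics, the model fits only the first *M* = 5, and the unmodeled harmonics play the role of deterministic noise, while the sine features are orthogonal under the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(4)
C, M, N = 21, 5, 100        # target harmonics, model harmonics, sample size

def features(x, k):
    """Sine harmonics sin(pi*j*x), j = 1..k; orthogonal under U(-1, 1)."""
    return np.column_stack([np.sin(np.pi * j * x) for j in range(1, k + 1)])

w_target = rng.normal(size=C)
x_tr = rng.uniform(-1, 1, N)
y_tr = features(x_tr, C) @ w_target + rng.normal(0, 0.1, N)  # stochastic noise

# Least-squares fit in the truncated feature space (M < C harmonics).
w_hat, *_ = np.linalg.lstsq(features(x_tr, M), y_tr, rcond=None)

# Squared error on test points: the unmodeled harmonics remain, acting
# as deterministic noise that no amount of data can remove.
x_te = rng.uniform(-1, 1, 10_000)
mse = np.mean((features(x_te, M) @ w_hat - features(x_te, C) @ w_target) ** 2)
print(mse)
```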

Notice that expression 3.11 is independent of the target function as well as of the noise; the only remaining randomness in the expression comes from generating *R*, which determines *Z _{M}*, and from *z*, the point chosen to test the error, making the analysis very general.

The simulation shown in section 2.3, although in a
classification setting, suggests that this is the case. For completeness, we run the same
Monte Carlo simulation in this regression setting. The advantage is that the closed-form
expression found already averages over target functions and noise, allowing us to run more combinations of *P _{R}* and *P _{S}* in a shorter time, so that we only need to Monte Carlo the matrix *Z _{M}*. The expectation over *z* can also be taken analytically with the closed-form expression found. In this case, we consider the same families of distributions, but we vary the standard deviation of the distributions in smaller steps to obtain a finer grid.


Notice that as shown in Figure 3, the cases where
mismatched distributions outperform matched ones cannot be explained using an entropy
argument, as was the case in section 2.2. Notice also
that there are now combinations of *P _{R}* and *P _{S}* where almost all of the simulations returned lower out-of-sample error for mismatched distributions, especially when *P _{S}* was a truncated gaussian with small standard deviation or when *P _{S}* was a mixture of two gaussians. In addition, we note the similarity between this simulation and the one shown for the classification setting in Figure 3.

We varied the size of *N* in order to see the effect of the sample size and observed very little variation in the results. Holding the other parameters constant, we obtain a very similar result: for the larger sample sizes used, we obtain an affirmative answer to the question posed in equation 3.12 in a comparable fraction of the cases where *P _{R}* ≠ *P _{S}*, so the result does not change from what we obtained before; for the smallest sample size, the percentage is even higher. Hence, although the number of combinations of distributions for which a mismatch between training and test distributions is beneficial is larger for smaller *N*, the result still holds as *N* grows. Notice that in the simulations the target function has 21 parameters. Hence, roughly, for the smallest sample size there are effectively 5 samples per parameter, while for the largest there are 150 samples per parameter. The latter is quite a large sample size given the complexity of the target function.

If *P _{R}* is a uniform distribution over [−1, 1] or a gaussian distribution truncated to this interval, then the relevant quantity can be evaluated: the result is trivial for the uniform distribution case and can be easily computed with numerical integration for the truncated gaussians. This yields the bound for the matched case. Now instead, pick the training points to be distributed according to a distribution parameterized by *a*.

Figure 6 shows the closed-form bound for various choices of *a*, choosing *P _{S}* to be a truncated gaussian. The dotted line shows the bound for the matched case *P _{R}* = *P _{S}*. As is clear from the plot, there are various choices of *a* for which equation 3.12 is satisfied.

To verify, we run the simulation generating a training set *R* according to *P _{S}*, while a second training set is generated according to the mismatched distribution parameterized by *a*. Notice that we use the value of *a* that results in the lowest error bound from Figure 6. Averaging over realizations of *R* and the noise, we obtain a lower expected out-of-sample error for the mismatched distribution. Hence, we have a concrete example of a distribution *P _{R}* that is different from *P _{S}* (see Figure 7) that leads to better out-of-sample performance, averaging over noise realizations and target functions. The existence of such distributions leads to the concept of a dual distribution, which we define in the following section.

## 4 Dual Distributions

We illustrate the concept of a dual distribution with an example where it can be readily found. Assume again that we want to solve a regression problem, but for simplicity, let us assume that only stochastic noise is present in the problem. Furthermore, we use a discrete input space of *d* points, so that *P _{R}* and *P _{S}* are vectors, transforming the functional minimization problem into an optimization problem in *d* dimensions.

For a given data set *R*, we can compute the expected out-of-sample error with respect to *P _{S}*, the noise, and the target functions. In this case, there are *d ^{N}* possible data sets of size *N* (allowing for repetition of points in the data set) that could be obtained for any given *P _{R}*. To simplify the notation, since the input space is finite, we assign each of the points a number from 1 to *d*, and we index the out-of-sample error of each data set by *i*₁, …, *i _{N}*, where *i _{k}* indicates the element number that corresponds to the *k*th data point in *R*.

We can then write the expected out-of-sample error as a function of *P _{R}*, where all the terms can be found with equation 4.2. Therefore, the dual distribution is the solution to the optimization problem given in equation 4.4. For illustration purposes, we solve a small instance: the resulting dual distribution yields a lower expected out-of-sample error than that obtained when *R* is generated according to *P _{S}*. Clearly there is a gain by training with the dual distribution. When running the optimization, for data sets that have repeated points that result in undefined out-of-sample error, we conservatively take their error to be the maximum finite out-of-sample error over all combinations of possible data sets. Figure 8 shows the dual distribution found, along with the given *P _{S}*.

Hence, if a minimum is found, it is the global optimum, with a corresponding dual distribution. This problem can be solved with any convex optimization package. Furthermore, in most applications, *P _{S}* is unknown and is estimated by binning the data, obtaining a discrete version of *P _{S}*. Hence, this discrete formulation is appropriate for finding dual distributions in such settings.

The existence of a dual distribution has the direct implication that the algorithms mentioned in section 1 should be used to match *P _{R}* to the dual distribution rather than to *P _{S}*. This applies even to cases where *P _{R}* is in fact equal to *P _{S}*, as it is conceivable that there will be gains if we now match to a dual distribution, using it as the quantitative objective for the matching algorithms. Hence, this new concept applies to every supervised learning scenario in the batch setting, not only to scenarios where there is a mismatch between training and test distributions.

## 5 Difference with Active Learning

The concept of a dual distribution in supervised learning is somewhat related to similar ideas in active learning and experimental design. In particular, the methods of batch active learning, where a design distribution is found in order to minimize the error, seem to be solving a problem similar to that of the dual distribution. However, the fundamental difference is that active learning finds such an optimal distribution given a particular target function. Hence, most methods rely on the information given by the target function in order to find a better training distribution. A common example is when distributions give more weight to points around the decision boundaries of the target function. Yet the problem of finding the dual distribution is independent of the target function: the Monte Carlo simulations presented, as well as the bounds shown, average over different realizations of target functions.

For example, Kanamori and Shimodaira (2003) describe
an algorithm to find an appropriate design distribution that will lower the out-of-sample
error. In the algorithm proposed, a first parameter is estimated with *s* data points, and with this parameter, the optimal design distribution is found. Having a new
design distribution, *T*−*s* points are sampled from it, and a
final parameter is then estimated. Notice, however, that the optimal design distribution is
dependent on the target function. In the results we present, if a dual distribution is found
given a particular test distribution, such distribution is optimal independent of the target
function.

Other papers in the active learning community that focus on linear regression (e.g., Sugiyama, 2006) seem closely related to our work. In those works, the results apply to linear regression only and consider the out-of-sample error conditioned on a given training set. The nice property of the out-of-sample error in linear regression is that it is independent of the target function. This is the reason that, even in the active learning setting, the dependence on the target function disappears and the mathematical analysis looks similar to the one we present. Yet although our analysis is done with linear regression and hence uses similar mathematical formulas, our approach is based on averaging over realizations of training sets and of target functions in the supervised learning scenario, rather than on the cases addressed in Kanamori and Shimodaira (2003) and Sugiyama (2006). Furthermore, the problem of finding the dual distribution and the results presented can be applied to learning algorithms other than linear regression, for both classification and regression problems in the supervised learning setting.

Another difference that may stand out is the way the design distribution is used once it is
found in the active learning papers, as opposed to how we propose to use the dual
distribution here. In the active learning scenario, points are sampled from the design
distribution, but in order to avoid obtaining a biased estimator, as shown in Shimodaira
(2000), the loss function for these points is weighted by the ratio of the test density to the design density, where, following their notation, the test distribution corresponds to our *P _{S}* and the design distribution is the one found by the algorithm. Notice that in the simulations presented in section 3, we do not reweight the points but instead explicitly allow a mismatch between *P _{S}* and *P _{R}*. Furthermore, in the supervised learning setting, where the training set is fixed and we are not allowed to sample new points, we propose that matching algorithms, such as the ones described in section 1, be used to match the given training set to the dual distribution. In this case, the objective is to find weights such that the training set appears distributed as the dual distribution; these weights are actually the inverse of those used in the active learning algorithms described. Although we are aware that the estimator computed in the linear regression setting will be biased when we use the dual distribution, we are concerned with minimizing the out-of-sample error, which takes into account both bias and variance; hence, we may obtain a biased estimator but improve the mean-squared error performance, as shown both analytically and through the simulation in section 3.

Furthermore, the results shown in Shimodaira (2000)
hold only in the asymptotic case, and since we are dealing with the supervised learning
scenario where only a finite training sample is available, the same assumptions are not
valid. Thus, it is no longer optimal to use the mentioned weighting mechanism when *N* is not sufficiently large, as also shown in Shimodaira (2000). In the active learning setting, it is desirable that as more
points are sampled, the proposed algorithms have performance guarantees. Hence, the
algorithms are designed to satisfy conditions such as the consistency of the estimator and
unbiasedness in the asymptotic case, which explains why the active learning algorithms use
the above-mentioned weighting mechanism. In our setting, minimizing the out-of-sample error with a fixed-size training set is our main objective, which is why the two approaches differ.

## 6 Conclusion

We have demonstrated through both empirical evidence and analytical bounds that in a learning scenario, in both classification and regression settings, using a distribution to generate the training data that is different from the distribution of the test data can lead to better out-of-sample performance, regardless of the target function considered. The empirical results show that this event is not rare, and the theoretical bounds allow us to find concrete cases where this occurs.

This introduces the idea of a dual distribution, namely, a distribution *P _{R}* different from a given *P _{S}* that leads to the minimum out-of-sample error. Finding this dual corresponds to solving a functional optimization problem, which can be reduced to a convex *d*-dimensional optimization problem if we consider a discrete input space.

The importance of this result is that the extensive literature that proposes methods to match training and test distributions in the cases where *P _{R}* ≠ *P _{S}* can be modified so that *P _{R}* is matched to a dual distribution of *P _{S}*. This means that those methods may be useful even in cases where *P _{R}* = *P _{S}*.

## References

*Proceedings of the 23rd National Conference on Artificial Intelligence*