Abstract
In learning theory, the training and test sets are assumed to be drawn from the same probability distribution. This assumption is also followed in practical situations, where matching the training and test distributions is considered desirable. Contrary to conventional wisdom, we show that mismatched training and test distributions in supervised learning can in fact outperform matched distributions in terms of the bottom line, the out-of-sample performance, independent of the target function in question. This surprising result has theoretical and algorithmic ramifications that we discuss.
1 Introduction
A basic assumption in learning theory is that the training and test sets are drawn from the same probability distribution. Indeed, adjustments to the theory become necessary when there is a mismatch between training and test distributions. As we discuss, a significant body of work introduces techniques that transform mismatched training and test sets in order to create matched versions. However, the fact that the theory requires a matched distribution assumption to go through does not necessarily mean that matched distributions will lead to better performance, just that they lead to theoretically more predictable performance. The question of whether they do lead to better performance has not been addressed in the case of supervised learning, perhaps because of an intuitive expectation that the answer would be yes.
The result we report here is that, surprisingly, mismatched distributions can outperform matched distributions. Specifically, the expected out-of-sample performance in supervised learning can be better if the test set is drawn from a probability distribution that is different from the probability distribution from which the training data had been drawn, and vice versa. In the case of active learning, this would not be so surprising since active learning algorithms deliberately alter the training distribution as more information is gathered about where the decision boundary of the target function is, for example. In our case of supervised learning, we deal with an unknown target function where the decision boundary can be anywhere. Nonetheless, we show that a mismatched distribution, unrelated to any decision boundary, can still outperform the matched distribution, a surprising fact that runs against the conventional wisdom in supervised learning. We first put our result in the context of previous matching work and then discuss the result from theoretical and empirical points of view.
In many practical situations, the assumption that the training and test sets are drawn from the same probability distribution does not hold. Examples where this mismatch has required corrections can be found in natural language processing (Jiang & Zhai, 2007), speech recognition (Blitzer, Dredze, & Pereira, 2007), and recommender systems, among others. The problem is referred to as data set shift and is sometimes subdivided into covariate shift and sample selection bias, as described in Quiñonero-Candela, Sugiyama, Schwaighofer, and Lawrence (2009). Various methods have been devised to correct this problem, and they form part of the ongoing work on domain adaptation and transfer learning. The numerous methods can be roughly divided into four types (Margolis, 2011).
The first type is referred to as instance weighting for covariate shift, in which weights are given to points in the training set such that the two distributions become effectively matched. Some of these methods include discriminative approaches, as in Bickel, Brückner, and Scheffer (2007, 2009); others make assumptions regarding the source of the bias and explicitly model a selection bias variable (Zadrozny, 2004); others try to match the two distributions in some reproducing kernel Hilbert space, as in kernel mean matching (Huang, Smola, Gretton, Borgwardt, & Schölkopf, 2007); still others estimate the weights directly using criteria such as the Kullback–Leibler divergence, as in KLIEP (Sugiyama, Nakajima, Kashima, Von Buenau, & Kawanabe, 2008), or least-squares deviation, as in LSIF (Kanamori, Hido, & Sugiyama, 2009). Additional approaches are given in Rosset, Zhu, Zou, and Hastie (2004); Cortes, Mohri, Riley, and Rostamizadeh (2008); and Ren, Shi, Fan, and Yu (2008). All of these methods rely on finding weights, which is not trivial since the actual distributions are not known; furthermore, the addition of weights reduces the effective sample size of the training set, hurting the out-of-sample performance (Shimodaira, 2000). Cross-validation is also an issue and is addressed in methods like importance-weighted cross-validation (Sugiyama et al., 2008). Learning bounds for the instance weighting setting are shown in Cortes, Mansour, and Mohri (2010) and Zhang, Zhang, and Ye (2012). Further theoretical results in the more general setting of learning from different domains are given in Ben-David et al. (2010).
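To make the instance-weighting idea concrete, here is a minimal sketch of covariate-shift correction by importance weighting. The sketch is ours, not taken from any of the cited papers, and it assumes the training and test densities are known truncated gaussians on [−1, 1]; methods such as KMM, KLIEP, and LSIF instead estimate the ratio PS(x)/PR(x) from samples.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

def trunc_gauss(loc, scale):
    # Gaussian truncated to [-1, 1]; an assumed stand-in for the true densities.
    return truncnorm((-1 - loc) / scale, (1 - loc) / scale, loc=loc, scale=scale)

p_r = trunc_gauss(0.0, 0.6)   # training density P_R
p_s = trunc_gauss(0.4, 0.3)   # test density P_S

f = lambda x: np.sin(3 * x)   # an arbitrary target, for illustration only
x_r = p_r.rvs(200, random_state=rng)
y_r = f(x_r) + 0.1 * rng.standard_normal(200)

# Importance weights w_i = P_S(x_i) / P_R(x_i) and polynomial features.
w = p_s.pdf(x_r) / p_r.pdf(x_r)
Z = np.vander(x_r, 6)

# Weighted least squares: minimize sum_i w_i (z_i^T beta - y_i)^2.
sw = np.sqrt(w)
beta, *_ = np.linalg.lstsq(sw[:, None] * Z, sw * y_r, rcond=None)

# Evaluate under the test distribution.
x_s = p_s.rvs(10000, random_state=rng)
mse = np.mean((np.vander(x_s, 6) @ beta - f(x_s)) ** 2)
print(f"test error of weighted fit: {mse:.4f}")
```

Setting the weights to 1 recovers the unweighted fit, so the same script can compare uncorrected and importance-weighted training under the shifted test distribution.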
The second type of method uses self-labeling or cotraining techniques: samples from the test set, which are unlabeled, are introduced into the training set in order to match the distributions, and are labeled using the labeled data. A final model is then reestimated with these new points. Some of these methods are described in Blum and Mitchell (1998), Leggetter and Woodland (1995), and Digalakis, Rtischev, and Neumeyer (1995). A third approach is to change the feature representation, so that features are selected, discarded, or transformed in an effort to make the training and test distributions similar. This idea is explored in various methods, including Blitzer et al. (2007), Blitzer, McDonald, and Pereira (2006), Ben-David, Blitzer, Crammer, and Pereira (2007), and Pan, Kwok, and Yang (2008), among many others. Finally, cluster-based methods rely on the assumption that the decision boundaries lie in low-density regions (Gao, Fan, Jiang, & Han, 2008), and hence try to label new data in regions that are underrepresented in the training set through clustering, as proposed in Blum (2001) and Ng, Jordan, and Weiss (2002). (For a more substantial review of these and other methods, refer to Margolis, 2011, and Sugiyama & Kawanabe, 2012.)
However, while great effort has been spent trying to match the training and test distributions, a thorough analysis of the need for matching has not been carried out. This letter shows that mismatched distributions can in fact outperform matched distributions. This is important not only from a theoretical point of view but also for practical reasons. The methods that have been proposed for matching the distributions not only increase the computational complexity of the learning algorithms but also may result in an effective sample size reduction due to the sampling or weighting mechanisms used for matching. Recognizing that the system may perform better under a scenario of mismatched distributions can influence the need for, and the extent of, matching techniques, as well as the quantitative objective of matching algorithms.
In our analysis, we show that a mismatched distribution can be better than a matched distribution in two directions:
- For a given training distribution PR, the best test distribution PS can be different from PR.
- For a given test distribution PS, the best training distribution PR can be different from PS.
The justifications for these two directions, as well as their implications, are quite different. In a practical setting, the test distribution is usually fixed, so the second direction reflects the practical learning problem of what to do with the training data if they are drawn from a distribution different from that of the test environment. One of the ramifications of this direction is the new notion of a dual distribution: a training distribution PR that is optimal to use when the test distribution is PS. A dual distribution serves as a new target distribution for matching algorithms. Instead of matching the training distribution to the test distribution, it is matched to a dual of the test distribution for optimal performance. The dual distribution depends only on the test distribution and not on the particular target function of the problem.
The organization of this letter is as follows. Section 2 describes extensive simulations that give an empirical answer to the key questions, together with a discussion of those empirical results. The theoretical analysis follows in section 3, where analytical tools are used to exhibit particular mismatched training and test distributions that lead to better out-of-sample performance in a general regression case. The notion of a dual distribution is discussed in section 4. Section 5 explains how the results presented and the dual distribution concept differ from related ideas in active learning, followed by the conclusion in section 6.
2 Empirical Results
Consider the scenario where the data set R used for training by the
learning algorithm is drawn from probability distribution PR,
while the data set S that the algorithm will be tested on is drawn from
distribution PS. We show here that the performance of the
learning algorithm in terms of the out-of-sample error can be better when PR ≠ PS, averaging over target functions and data set realizations. The
empirical evidence, which is statistically significant, is based on an elaborate Monte Carlo
simulation that involves various target functions and probability distributions. The details
of that simulation follow, and the results are illustrated in Figures 1 and 3.
Figure 1: Summary of Monte Carlo simulation. The plot indicates, for each combination of probability distributions, the percentage of runs in which a mismatched pair PR ≠ PS yields better out-of-sample performance than the matched case.
We consider a one-dimensional input space X = [−1, 1]. There is no
loss of generality by limiting our domain because in any practical situation, the data have
a finite domain and can be rescaled to the desired interval. We run the learning algorithm
for different target functions and different training and test distributions, and we average
the out-of-sample error over a large number of data sets generated by those distributions
and over target functions; then we compare the results for matched and mismatched
distributions.
2.1 Simulation Setup
2.1.1 Distributions
(The equations defining the families of training and test distributions, all supported on [−1, 1], are omitted here; as the discussion below indicates, they include truncated gaussians and mixtures of gaussians of varying standard deviation, as well as exponential-type distributions of varying time constant.)
2.1.2 Data Sets
For each pair of probability distributions, we carry out the simulation generating 1000
different target functions, running the learning algorithm, comparing the out-of-sample
performance, and then averaging over 100 different data set realizations. That is, each
point in Figures 1 and 3 is an average over 100,000 runs with the same pair of
distributions but with different combinations of target functions and training and test
sets. The sizes of the data sets are and 300
and
, where NR and NS are the number of points in the training and test
sets R and S.
2.1.3 Target Functions
The target functions were generated by taking the sign of a polynomial in the desired interval. The polynomials were formed by choosing at random one to five roots in the interval [−1, 1]. The learning algorithm minimized a squared loss function using a nonlinear transformation of the input space as features. The nonlinear transformation used powers of the input variable up to the number of roots of the polynomial, plus a sinusoidal feature, which allows the model to learn a function that is close, but not identical, to the target. This choice of target functions allows the decision boundaries to vary in both number and location in each realization. Hence, the results presented do not depend on a particular target function, and the distributions cannot favor the regions around the boundaries, as these change in each realization. Notice that there is no added stochastic noise, so the two classes can be perfectly separated with an appropriate hypothesis set.
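The sketch below shows the shape of one cell of this Monte Carlo matrix. It is a reconstruction under stated assumptions: the distribution family (truncated gaussians on [−1, 1]) and its parameters are ours, chosen to mirror the description above rather than to reproduce the paper's exact setup.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

def trunc_gauss(loc, scale):
    # Assumed distribution family; the paper's full set of families is omitted above.
    return truncnorm((-1 - loc) / scale, (1 - loc) / scale, loc=loc, scale=scale)

def features(x, k):
    # Powers of x up to k plus a sinusoidal feature, as described in the text.
    return np.column_stack([x ** p for p in range(k + 1)] + [np.sin(np.pi * x)])

def one_run(p_r, p_s, n_r=300, n_s=300):
    k = int(rng.integers(1, 6))                      # one to five roots
    roots = rng.uniform(-1, 1, size=k)
    target = lambda x: np.sign(np.prod(x[:, None] - roots, axis=1))
    x_tr = p_r.rvs(n_r, random_state=rng)
    w, *_ = np.linalg.lstsq(features(x_tr, k), target(x_tr), rcond=None)  # squared loss
    x_te = p_s.rvs(n_s, random_state=rng)
    return np.mean(np.sign(features(x_te, k) @ w) != target(x_te))

p_r = trunc_gauss(0.0, 0.5)
pairs = [(one_run(p_r, p_r), one_run(p_r, trunc_gauss(0.0, 0.2)))
         for _ in range(200)]
matched, mismatched = np.mean(pairs, axis=0)
print(f"avg test error, matched: {matched:.3f}  mismatched: {mismatched:.3f}")
```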
2.2 Fixing the Training Distribution
The matrix places families of distributions together, in increasing order of standard deviation or time constant. The result that immediately stands out is that in a significant number of entries, more than 50% of the runs have better performance when mismatched distributions are used, as indicated by the yellow, orange, and red regions, which constitute … of all combinations of the probability distributions used.
A number of interesting patterns are worth noting in this plot. The first row, which corresponds to PR = …, falls under the category of better performance for mismatched distributions for almost any other PS used.
There is also a block structure in the plot, which is no accident, given the way the families of distributions are grouped. Among these blocks, the lower triangular part of the blocks on the diagonal corresponds to cases where the distributions are mismatched but the out-of-sample performance is better. We also note that the blocks in the upper-right and lower-left corners show the same pattern in their lower triangular parts.
Perhaps it is already clear to readers why this direction of our result is not particularly surprising, and in fact it is not all that significant in practice either. In the setup depicted in this part of the simulation, if we are able to choose a test distribution, then we might as well choose a distribution that concentrates on the region that the system learned best. Such regions are likely to correspond to areas where large concentrations of training data are available. This can be expressed in terms of lower-entropy test distributions, which are overconcentrated around the areas of higher density of training points. Such concentration results in a better average out-of-sample performance than that of the matched case PS = PR.
Figure 2 illustrates the entropy of different distributions. We plot HS versus HR, where H denotes the entropy, HR = H(PR), and HS = H(PS), marking the cases where using PS ≠ PR resulted in better out-of-sample performance of the algorithm. As is clear from the plot, these cases occur when HS < HR.
Figure 2: HS versus HR: characterization, using entropy, of why out-of-sample performance is better when there is a mismatch in distributions and PR is fixed.
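The entropy comparison behind Figure 2 is easy to reproduce numerically. The sketch below is illustrative only: it uses an assumed truncated-gaussian family rather than the paper's full grid of distributions, and computes differential entropies on [−1, 1] by numerical integration so that pairs with HS < HR can be flagged.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import truncnorm

def trunc_gauss(loc, scale):
    # Assumed family, for illustration.
    return truncnorm((-1 - loc) / scale, (1 - loc) / scale, loc=loc, scale=scale)

def entropy(dist):
    # Differential entropy H = -int p(x) log p(x) dx over [-1, 1].
    return quad(lambda x: -dist.pdf(x) * np.log(dist.pdf(x)), -1, 1)[0]

p_r = trunc_gauss(0.0, 0.5)
h_r = entropy(p_r)
for scale in (0.1, 0.3, 0.5, 0.8):
    h_s = entropy(trunc_gauss(0.0, scale))
    mark = "  <- H_S < H_R" if h_s < h_r else ""
    print(f"scale={scale:.1f}  H_S={h_s:+.3f}  H_R={h_r:+.3f}{mark}")
```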
A simple way to think of the problem is the following: suppose we could freely choose a test distribution, and our learning algorithm outputs the learned parameters that minimize some loss function on a training data set R. Then, to minimize the out-of-sample error, we would choose PS(x) = δ(x − x*), where δ is the Dirac delta function and x* is the point in the input space where the minimum out-of-sample error occurs.
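In symbols, with gR denoting the hypothesis learned from the training set R, the argument reads:

$$
\int_{-1}^{1}\big(g_R(x)-f(x)\big)^2\,P_S(x)\,dx \;\ge\; \min_{x^\ast\in[-1,1]}\big(g_R(x^\ast)-f(x^\ast)\big)^2,
$$

with equality when PS(x) = δ(x − x*), so a freely chosen test distribution should simply collapse onto the point the system learned best.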
Results similar to those shown in Figure 1 are found when ….
2.3 Fixing the Test Distribution
Figure 3 shows the result of the simulation in the other direction. Each entry in the matrix again corresponds to a pair of distributions PR and PS. However, this time we fix PS and evaluate the percentage of runs where using PR ≠ PS yields better out-of-sample performance than PR = PS. More precisely, once again, each entry computes the quantity in equation 2.3.
Notice that this is the case that occurs in practice, where the distribution the system will be tested on is fixed by the problem statement. However, the training set might have been generated from a different distribution, and we would like to determine whether training with a data set drawn from PS would have resulted in better out-of-sample performance. If the answer is yes, then one can consider the matching algorithms mentioned earlier to transform the training set into one resembling what would have been generated from PS, rather than from the alternate distribution that actually generated it.
Figure 3: Summary of Monte Carlo simulation. The plot indicates, for each combination of probability distributions, the percentage of runs in which PR ≠ PS yields better out-of-sample performance than PR = PS.
The simulation result is quite surprising: once again, there is a significant number of entries where more than 50% of the runs have better performance when mismatched distributions are used. For … of the entries, a mismatch between PR and PS results in lower out-of-sample error, as indicated by the light green, yellow, orange, and red entries in the matrix.
In this case, although the block structure is still present, there is no longer a clear pattern relating the entropies of the training and test distributions that would explain the result as easily as in the previous simulation. Notice that there are cases where the mismatch is better when we choose a PR of either lower or higher entropy than the given PS. This is clear in the plot, since the indicated regions in the block structure are no longer lower triangular but occupy both sides of the diagonal. This effect is analyzed further from a theoretical point of view in the following section. Since analyzing it theoretically is intractable in the case of classification tasks due to the nonlinearities, we carry out the analysis in a regression setting, noting that the Monte Carlo simulations provide empirical evidence that the result also holds in the classification setting.
3 Theoretical Results
We now move to a theoretical approach to the above questions. We have shown empirical evidence that a mismatch in distributions can lead to better out-of-sample performance in the classification setting, and we now focus on the regression setting to cover the other major class of learning problems. In this section, we derive expressions for the expected out-of-sample error as a function of x, a general test point in the input space X = [−1, 1], and R, the training set, averaging over target functions and noise realizations. We derive closed-form solutions as well as bounds that show the existence of training distributions PR ≠ PS with better out-of-sample performance than the matched choice PR = PS.
(The equations setting up the regression problem are omitted here. As the following discussion indicates, the target function is expressed as a linear combination of C nonlinear transformations of the input, and the learning algorithm fits a model on a subset of M of these features.)
Using this formula for the target function allows for a wide variety of functions, since C can be as large as desired and we can use an arbitrary nonlinear transformation. Indeed, almost every function on the interval can be expressed this way. For example, we could take the set of nonlinear transformations to be the harmonics of the Fourier series, so that with a large enough C, any function f that satisfies the Dirichlet conditions can be represented as a truncated Fourier series. Figure 4 shows just a few examples of the class of functions that can be represented using such a nonlinear transformation.
Figure 4: Sample realizations of targets generated with a truncated Fourier series of 10 harmonics.
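Targets of this form are straightforward to generate. In the sketch below, the gaussian coefficient distribution is our assumption; note that C = 10 harmonics give 1 + 2 · 10 = 21 parameters, the count mentioned later in this section.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_fourier_target(C=10):
    # f(x) = a0 + sum_{c=1}^{C} [a_c cos(pi c x) + b_c sin(pi c x)] on [-1, 1].
    # Gaussian coefficients are an assumption; the paper does not specify
    # the coefficient distribution here.
    a = rng.standard_normal(C + 1)
    b = rng.standard_normal(C)
    harmonics = np.pi * np.arange(1, C + 1)
    def f(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return (a[0]
                + np.cos(np.outer(x, harmonics)) @ a[1:]
                + np.sin(np.outer(x, harmonics)) @ b)
    return f

f = random_fourier_target()
print(f(np.linspace(-1, 1, 5)))   # one sample realization on a small grid
```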
Finally, we make the usual independence assumption about the noise: the stochastic noise has a diagonal covariance matrix σ²I, where σ² is the noise variance and I is the identity matrix. Similarly, we assume the energy of the features not included in the model is finite. For example, choosing Fourier harmonics as the nonlinear transformations guarantees a diagonal covariance matrix.
Notice that expression 3.11 is independent of the target function as well as of the noise; the only remaining randomness in the expression comes from generating R, which determines ZM, and from z, the point chosen to test the error, making the analysis very general.
The simulation shown in section 2.3, although in a classification setting, suggests that this is the case. For completeness, we run the same Monte Carlo simulation in this regression setting. The advantage is that the closed-form expression found already averages over target functions and noise, allowing us to run more combinations of PR and PS in a shorter time, since we only need Monte Carlo sampling for the matrix Z. The expectation over z can also be taken analytically with the closed-form expression found. In this case, we consider the same families of distributions, but we vary the standard deviation of the distribution in smaller steps to obtain a finer grid.
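The sketch below illustrates the structure of this computation. Because expression 3.11 is not reproduced in this version, the objective is a stand-in: for ordinary least squares on noisy targets, the learnable part of the expected error at a test point with feature vector z is σ² zᵀ(ZᵀZ)⁻¹z conditional on the training feature matrix Z, which we Monte Carlo over Z drawn from PR and average over test points drawn from PS.

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(3)

def trunc_gauss(loc, scale):
    return truncnorm((-1 - loc) / scale, (1 - loc) / scale, loc=loc, scale=scale)

def feat(x, M=5):
    # Fourier-harmonic features, as in the text.
    h = np.pi * np.arange(1, M + 1)
    return np.column_stack([np.ones_like(x),
                            np.cos(np.outer(x, h)), np.sin(np.outer(x, h))])

def expected_error(p_r, p_s, n=100, trials=300, sigma2=1.0):
    # sigma^2 * E_z[z^T (Z^T Z)^{-1} z], Monte Carlo over Z ~ P_R, z ~ P_S.
    # A stand-in for expression 3.11; approximation-error terms are omitted.
    z_test = feat(p_s.rvs(5000, random_state=rng))
    total = 0.0
    for _ in range(trials):
        Z = feat(p_r.rvs(n, random_state=rng))
        A = np.linalg.inv(Z.T @ Z)
        total += sigma2 * np.mean(np.einsum('ij,jk,ik->i', z_test, A, z_test))
    return total / trials

p_s = trunc_gauss(0.0, 0.2)
for scale in (0.2, 0.4, 0.8):         # scale 0.2 is the matched case
    err = expected_error(trunc_gauss(0.0, scale), p_s)
    print(f"P_R scale={scale:.1f}  expected error: {err:.4f}")
```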
Notice that, as shown in Figure 3, the cases where mismatched distributions outperform matched ones cannot be explained using an entropy argument, as was possible in section 2.2. Notice also that there are now combinations of PR and PS where almost … of the simulations returned lower out-of-sample error for mismatched distributions, especially when PS was a truncated gaussian with small standard deviation or when PS was a mixture of two gaussians with …. In addition, we note the similarity between this simulation and the one shown for the classification setting in Figure 3.
We varied N in order to see the effect of the sample size and observe very little variation in the results. Holding the other parameters constant, we obtain a very similar result. For N = … and …, we obtain an affirmative answer to the question posed in equation 3.12 in … and … of the cases where PR ≠ PS, respectively, so the result does not change from what we obtained in the N = … case. For N = …, the percentage is even higher, at …. Hence, although the number of combinations of distributions for which a mismatch between training and test distributions helps is larger for smaller N, the result still holds as N grows. Notice that in the simulations, the target function has 21 parameters. Hence, roughly, for N = …, there are effectively 5 samples per parameter, while for N = …, there are 150 samples per parameter. The latter is quite a large sample size given the complexity of the target function.
Figure 6 shows the closed-form bound for various choices of a and …, choosing PS to be a truncated gaussian with …. The dotted line shows the bound for the matched case. As is clear from the plot, there are various choices of a for which equation 3.12 is satisfied.
Figure 6: Pair of distributions such that the expected out-of-sample error is lower when R is generated according to PR rather than according to PS, for a regression problem in the domain [−1, 1].
4 Dual Distributions
We illustrate the concept of a dual distribution with an example where the dual can be readily found. Assume again that we want to solve a regression problem, but for simplicity, let us assume that only stochastic noise is present. Furthermore, we use a discrete input space, so that PR and PS are vectors, transforming the functional minimization problem into an optimization problem in a finite number of dimensions.
(The equations formulating the search for the dual distribution as a convex optimization problem over the probability vector PR, subject to the simplex constraints, are omitted here.)
Figure 7: Probability mass functions for a given PS and its dual in a regression problem with stochastic noise and a discrete input space.
Hence, if a minimum is found, it is the global optimum, with a corresponding dual distribution. This problem can be solved with any convex optimization package. Furthermore, in most applications, PS is unknown and is estimated by binning the data, obtaining a discrete version of PS. Hence, this discrete formulation is appropriate for finding dual distributions in such settings.
The existence of a dual distribution has the direct implication that the algorithms mentioned in section 1 should be used to match PR to the dual distribution rather than to PS. This applies even to cases where PR is in fact equal to PS, since it is conceivable that there will be gains if we now match to a dual distribution, using it as the quantitative target for the matching algorithms. Hence, this new concept applies to every learning scenario in the supervised learning batch setting, not only to scenarios where there is a mismatch between the training and test distributions.
5 Differences from Active Learning
The concept of a dual distribution in supervised learning is related to ideas in active learning and experimental design. In particular, the methods of batch active learning, where a design distribution is found in order to minimize the error, appear to solve a problem similar to that of finding the dual distribution. The fundamental difference, however, is that active learning finds such an optimal distribution given a particular target function. Hence, most methods rely on information provided by the target function in order to find a better training distribution; a common example is a distribution that gives more weight to points around the decision boundary of the target function. The problem of finding the dual distribution, in contrast, is independent of the target function: the Monte Carlo simulations presented, as well as the bounds shown, average over different realizations of target functions.
For example, Kanamori and Shimodaira (2003) describe an algorithm for finding a design distribution that lowers the out-of-sample error. In their algorithm, a first parameter estimate is obtained from s data points, and from this estimate the optimal design distribution is computed. The remaining T − s points are then sampled from this design distribution, and a final parameter estimate is computed from all the points. Notice, however, that the optimal design distribution depends on the target function. In the results we present, if a dual distribution is found for a particular test distribution, that distribution is optimal independent of the target function.
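A schematic rendering of this two-stage flow may help; the design step below is a deliberately crude placeholder (it merely upweights inputs where the stage-1 fit is large in magnitude), since the actual optimal design distribution of Kanamori and Shimodaira (2003) depends on the estimated target in a more principled way. All numbers and the target function are hypothetical.

```python
# Schematic two-stage flow: estimate, derive a design distribution from the
# estimate, sample the remaining points from it, and re-estimate.
import numpy as np

rng = np.random.default_rng(1)
T, s = 100, 30
grid = np.linspace(-1.0, 1.0, 21)             # discrete candidate inputs

def f(x):                                     # hypothetical target (unknown in practice)
    return np.sin(np.pi * x)

def fit(x, y):                                # least squares on features (1, x)
    Phi = np.stack([np.ones_like(x), x], axis=1)
    return np.linalg.lstsq(Phi, y, rcond=None)[0]

# Stage 1: first parameter estimate from s uniformly sampled points.
x1 = rng.choice(grid, size=s)
y1 = f(x1) + rng.normal(0.0, 0.2, size=s)
b1 = fit(x1, y1)

# Stage 2: placeholder design distribution derived from the stage-1 estimate,
# then sample the remaining T - s points from it and refit on all T points.
scores = np.abs(b1[0] + b1[1] * grid) + 1e-3  # crude stand-in for the real criterion
q = scores / scores.sum()
x2 = rng.choice(grid, size=T - s, p=q)
y2 = f(x2) + rng.normal(0.0, 0.2, size=T - s)
b = fit(np.concatenate([x1, x2]), np.concatenate([y1, y2]))
print("final estimate (intercept, slope):", np.round(b, 3))
```

The point to notice is that q is computed from b1, that is, from information about the target; the dual distribution, by contrast, is computed from the distributions alone.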
Other papers in the active learning community that focus on linear regression (e.g., Sugiyama, 2006) seem closely related to our work. There, however, the results apply to linear regression only and consider the out-of-sample error conditioned on a given training set. A convenient property of the out-of-sample error in linear regression is that it is independent of the target function; this is why, even in the active learning setting, the dependence on the target function disappears and the mathematical analysis looks similar to ours. Although our analysis is also carried out with linear regression and hence uses similar formulas, our approach averages over realizations of both training sets and target functions in the supervised learning scenario, unlike the settings addressed in Kanamori and Shimodaira (2003) and Sugiyama (2006). Furthermore, the problem of finding the dual distribution and the results presented can be applied to learning algorithms beyond linear regression, for both classification and regression in the supervised learning setting.
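To see why the target dependence drops out in this case, recall the standard computation for ordinary least squares under a correctly specified linear model (a textbook sketch, not reproduced from the papers cited): with training inputs X and targets y = Xw* + ε, where E[ε] = 0 and Var(ε) = σ²I,

\[
\hat{w} = (X^\top X)^{-1} X^\top y
\quad\Longrightarrow\quad
\hat{w} - w^\ast = (X^\top X)^{-1} X^\top \varepsilon,
\]
\[
\mathbb{E}_{\varepsilon}\!\left[\big(x^\top \hat{w} - x^\top w^\ast\big)^2\right]
= \sigma^2\, x^\top (X^\top X)^{-1} x,
\]

so that, conditioned on the training inputs, the expected out-of-sample error at any test point x involves only the input geometry and the noise level, not w*.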
Another difference that stands out is how the design distribution is used once it is found in the active learning papers, as opposed to how we propose to use the dual distribution here. In the active learning scenario, points are sampled from the design distribution, but in order to avoid a biased estimator, as shown in Shimodaira (2000), the loss function at these points is weighted by the ratio of the test distribution (PS in our notation) to the design distribution that was found. Notice that in the simulations presented in section 3, we do not reweight the points; instead, we explicitly allow a mismatch between PS and PR.
Furthermore, in the supervised learning setting, where the training set is fixed and we are not allowed to sample new points, we propose that matching algorithms, such as the ones described in section 1, be used to match the given training set to the dual distribution. In this case, the objective is to find weights under which the training set appears to be distributed according to the dual distribution. These weights are in fact the inverse of those used in the active learning algorithms described above. Although we are aware that the estimator computed in the linear regression setting will be biased when we use the dual distribution, we are concerned with minimizing the out-of-sample error, which takes into account both bias and variance; hence, we may obtain a biased estimator and yet improve the mean-squared-error performance, as shown both analytically and through the simulations in section 3.
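The following is a minimal sketch of this reweighting step, assuming binned estimates of PS and a dual distribution already in hand (the dual used below is a hypothetical placeholder, as is the data-generating process); the weights are the ratio of the dual to PS, the inverse of the active-learning weights discussed above.

```python
# Minimal sketch: reweight a fixed training set toward a dual distribution,
# then fit weighted least squares. p_dual is a hypothetical placeholder; in
# practice it would come from the convex optimization described earlier.
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-1.0, 1.0, size=N)                 # training inputs drawn from PS
y = 1.5 * x + 0.3 + rng.normal(0.0, 0.5, size=N)   # illustrative noisy targets

bins = np.linspace(-1.0, 1.0, 6)                   # 5 bins over the input space
idx = np.clip(np.digitize(x, bins) - 1, 0, 4)      # bin index of each point

p_S = np.bincount(idx, minlength=5) / N            # binned estimate of PS
p_S = np.maximum(p_S, 1e-12)                       # guard against empty bins
p_dual = np.array([0.30, 0.15, 0.10, 0.15, 0.30])  # hypothetical dual distribution

w = p_dual[idx] / p_S[idx]                         # importance weights: dual over PS

# Weighted least squares on features (1, x): minimize sum_i w_i (y_i - phi_i^T b)^2.
Phi = np.stack([np.ones(N), x], axis=1)
b = np.linalg.solve(Phi.T @ (w[:, None] * Phi), Phi.T @ (w * y))
print("weighted fit (intercept, slope):", np.round(b, 3))
```

The resulting estimator is biased with respect to PS, but, as argued above, what matters is the mean-squared out-of-sample error rather than unbiasedness.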
Furthermore, the results shown in Shimodaira (2000) hold only in the asymptotic case; since we are dealing with the supervised learning scenario, where only a finite training sample is available, the same assumptions are not valid. It is then no longer optimal to use the above weighting mechanism when N is not sufficiently large, as also shown in Shimodaira (2000). In the active learning setting, it is desirable that the proposed algorithms carry performance guarantees as more points are sampled; hence, the algorithms are designed to satisfy conditions such as consistency and asymptotic unbiasedness of the estimator, which explains why they use the weighting mechanism above. In our setting, the main objective is minimizing the out-of-sample error with a fixed-size training set, which is why the two approaches differ.
6 Conclusion
We have demonstrated, through both empirical evidence and analytical bounds, that in both classification and regression settings, using a distribution to generate the training data that is different from the distribution of the test data can lead to better out-of-sample performance, regardless of the target function considered. The empirical results show that this phenomenon is not rare, and the theoretical bounds allow us to identify concrete cases where it occurs.
This introduces the idea of a dual distribution, namely, a distribution PR, different from a given PS, that leads to the minimum out-of-sample error. Finding this dual corresponds to solving a functional optimization problem, which reduces to a convex d-dimensional optimization problem when we consider a discrete input space with d points.
The importance of this result is that the extensive literature proposing methods to match training and test distributions in the cases where PR ≠ PS can be modified so that PR is matched to a dual distribution of PS instead. This means that those methods may bring gains even in cases where PR = PS.