Abstract
Ideally pattern recognition machines provide constant output when the inputs are transformed under a group G of desired invariances. These invariances can be achieved by enhancing the training data to include examples of inputs transformed by elements of G, while leaving the corresponding targets unchanged. Alternatively the cost function for training can include a regularization term that penalizes changes in the output when the input is transformed under the group. This paper relates the two approaches, showing precisely the sense in which the regularized cost function approximates the result of adding transformed examples to the training data. We introduce the notion of a probability distribution over the group transformations, and use this to rewrite the cost function for the enhanced training data. Under certain conditions, the new cost function is equivalent to the sum of the original cost function plus a regularizer. For unbiased models, the regularizer reduces to the intuitively obvious choice—a term that penalizes changes in the output when the inputs are transformed under the group. For infinitesimal transformations, the coefficient of the regularization term reduces to the variance of the distortions introduced into the training data. This correspondence provides a simple bridge between the two approaches.