Multiple measures have been developed to quantify the similarity between two spike trains. These measures have been used to quantify the mismatch between neuron models and experiments, as well as to classify neuronal responses in neuroprosthetic devices and electrophysiological experiments. Frequently, only a few spike trains are available in each class. We derive analytical expressions for the small-sample bias that arises when comparing estimators of the time-dependent firing intensity. We then exploit analogies between the comparison of firing intensities and previously used spike train metrics, and show that improved spike train measures can be used successfully for fitting neuron models to experimental data, for comparing spike trains, and for classifying spike train data. In classification tasks, the improved similarity measures can increase the recovered information. We demonstrate that when similarity measures are used for fitting mathematical models, all previous methods systematically underestimate the noise. Finally, we show a striking implication of this deterministic bias by reevaluating the results of the single-neuron prediction challenge.
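The small-sample bias can be illustrated with a minimal simulation. This is a sketch under simplifying assumptions and not the analytical treatment summarized above: both classes are homogeneous Poisson processes with the same rate, the firing intensity is estimated by a binned, trial-averaged histogram, and similarity is measured by the mean squared difference between the two estimates. The function names (`psth`, `mean_sq_distance`, `poisson_bias`) and all parameter values are hypothetical choices made for illustration. Because the true intensities are identical, the true distance is zero, and any positive value is bias.

```python
import numpy as np

def psth(n_trials, rate_hz, rng, duration_s=1.0, bin_s=0.01):
    """Trial-averaged binned rate estimate (Hz) from homogeneous Poisson trials."""
    n_bins = int(round(duration_s / bin_s))
    counts = rng.poisson(rate_hz * bin_s, size=(n_trials, n_bins))
    return counts.mean(axis=0) / bin_s

def mean_sq_distance(n_trials, rate_hz=20.0, bin_s=0.01, n_repeats=500, seed=0):
    """Average squared distance between two estimates of the SAME intensity."""
    rng = np.random.default_rng(seed)
    d = np.empty(n_repeats)
    for i in range(n_repeats):
        a = psth(n_trials, rate_hz, rng, bin_s=bin_s)
        b = psth(n_trials, rate_hz, rng, bin_s=bin_s)
        d[i] = np.mean((a - b) ** 2)
    return d.mean()

def poisson_bias(n_trials, rate_hz=20.0, bin_s=0.01):
    """Expected bias in this toy setting: the variances of the two estimates add.

    Each binned rate estimate has variance rate / (n_trials * bin_s).
    """
    return 2.0 * rate_hz / (n_trials * bin_s)

if __name__ == "__main__":
    for n in (2, 5, 20, 80):
        print(n, mean_sq_distance(n), poisson_bias(n))
```

In this toy setting the estimated distance decays roughly as one over the number of trials, so with only a few spike trains per class a naive comparison reports a substantial mismatch even between statistically identical responses.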