Performance on the test sets under different BERT training setups. The best score obtained by our models for each dataset under each metric is marked by †. The overall best scores are highlighted. Each score is the average of three runs with different random initializations. Previous state-of-the-art results are given when available; all come from Pouran Ben Veyseh et al. (2019), except the MAE score on UW, which comes from Stanovsky et al. (2017).