Table 4: Automatic evaluation results. The best performance is highlighted in bold. The results for the golden story are in italics. Perplexity scores marked N/A are not comparable with ours because the corresponding models tokenize stories by words rather than by the byte pair encodings used in GPT-2.
| Models | PPL | BLEU-1 | BLEU-2 | Coverage | Repetition-4 (%) | Distinct-4 (%) |
|---|---|---|---|---|---|---|
| ConvS2S | N/A | 0.312 | 0.132 | 13.64 | 22.87 | 72.78 |
| Fusion | N/A | 0.322 | 0.137 | 12.02 | 24.23 | 72.82 |
| Plan&Write | N/A | 0.308 | 0.126 | 13.38 | 17.06 | 67.20 |
| SKRL | N/A | 0.267 | 0.088 | 10.82 | 18.34 | 69.42 |
| DSRL | N/A | 0.293 | 0.117 | 10.38 | 15.36 | 73.08 |
| GPT-2 (Scratch) | 11.82 | 0.311 | 0.134 | 10.76 | 22.87 | 73.33 |
| GPT-2 (Pretrain) | 33.50 | 0.257 | 0.085 | 8.04 | 39.22 | 64.99 |
| GPT-2 (Fine-tune) | 7.96 | 0.322 | 0.141 | 12.40 | 29.41 | 73.85 |
| Ours | 7.85 | 0.326 | 0.143 | 18.48 | 21.93 | 78.96 |
| w/o Pretrain | 11.04 | 0.316 | 0.134 | 16.33 | 21.52 | 77.17 |
| w/o Knowledge | 7.70 | 0.314 | 0.136 | 13.95 | 25.08 | 73.24 |
| w/o Multi-task | 8.04 | 0.324 | 0.140 | 17.19 | 24.40 | 79.43 |
| Golden Story | N/A | N/A | N/A | 19.28 | 7.64 | 89.51 |
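
The Repetition-4 and Distinct-4 columns are n-gram statistics over the generated stories. The table itself does not define them, so the following is only a minimal sketch under assumed definitions: Repetition-4 is taken to be the percentage of stories that repeat at least one 4-gram, and Distinct-4 the percentage of distinct 4-grams among all generated 4-grams. The function names and the whitespace tokenization are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_n(stories, n=4):
    """Assumed definition: % of stories containing at least one repeated n-gram."""
    if not stories:
        return 0.0
    repeated = sum(
        1 for tokens in stories
        if any(count > 1 for count in Counter(ngrams(tokens, n)).values())
    )
    return 100.0 * repeated / len(stories)


def distinct_n(stories, n=4):
    """Assumed definition: % of distinct n-grams among all n-grams in the corpus."""
    all_ngrams = [gram for tokens in stories for gram in ngrams(tokens, n)]
    if not all_ngrams:
        return 0.0
    return 100.0 * len(set(all_ngrams)) / len(all_ngrams)


if __name__ == "__main__":
    # Toy corpus with whitespace tokenization (an assumption for illustration).
    stories = [
        "a b c d a b c d".split(),  # repeats the 4-gram (a, b, c, d)
        "a b c d e".split(),        # no repeated 4-gram
    ]
    print(f"Repetition-4: {repetition_n(stories):.2f}%")
    print(f"Distinct-4:   {distinct_n(stories):.2f}%")
```

Under these assumed definitions, lower Repetition-4 and higher Distinct-4 indicate less redundant and more diverse text, which matches the direction of the Golden Story row (7.64 and 89.51).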