Table 5: Manual evaluation results. The scores indicate the percentages of Win, Lose, or Tie when our model is compared with a baseline. κ denotes Fleiss’ kappa (all values indicate fair or moderate agreement). Scores marked with * have p-value < 0.05 and scores marked with ** have p-value < 0.01 in a sign test.
                                Grammaticality                       Logicality
Models                          Win (%)  Lose (%)  Tie (%)  κ        Win (%)  Lose (%)  Tie (%)  κ
Ours vs. Fusion                 50.0**   27.0      23.0     0.421    57.0**   28.0      15.0     0.455
Ours vs. DSRL                   58.0**   24.0      18.0     0.441    58.0**   29.0      12.0     0.475

Ours vs. GPT-2 (Scratch)        54.0**   24.5      21.5     0.385    54.0**   26.0      20.0     0.304
Ours vs. GPT-2 (Pretrain)       52.0**   31.5      16.5     0.483    56.5**   32.5      11.0     0.493
Ours vs. GPT-2 (Fine-tune)      42.0**   28.0      30.0     0.344    51.0**   27.5      21.5     0.371

Ours vs. Ours w/o Pretrain      51.0**   31.0      18.0     0.378    56.0**   28.0      16.0     0.375
Ours vs. Ours w/o Knowledge     46.0**   23.0      21.0     0.289    48.0**   29.0      23.0     0.314
Ours vs. Ours w/o Multi-task    37.5     31.0      31.5     0.313    48.5**   25.5      26.0     0.297
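
The significance markers in Table 5 come from a sign test on the Win and Lose outcomes. As a rough illustration of how such a test works, the sketch below computes a two-sided exact sign test from win and loss counts, discarding ties. The function name `sign_test_p_value` and the total of 200 pairwise judgments are assumptions made only for this example; the number of judgments behind each percentage is not stated here, so the resulting p-values are illustrative rather than a reproduction of the reported ones.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test on win/loss counts (ties already discarded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * tail)


# ASSUMPTION: 200 pairwise judgments per comparison (hypothetical, not from the table).
ASSUMED_TOTAL = 200

# "Ours vs. Fusion", grammaticality: 50.0% win, 27.0% lose.
p_fusion = sign_test_p_value(round(0.500 * ASSUMED_TOTAL), round(0.270 * ASSUMED_TOTAL))

# "Ours vs. Ours w/o Multi-task", grammaticality: 37.5% win, 31.0% lose.
p_multitask = sign_test_p_value(round(0.375 * ASSUMED_TOTAL), round(0.310 * ASSUMED_TOTAL))

print(f"Fusion: p = {p_fusion:.4g}; w/o Multi-task: p = {p_multitask:.4g}")
```

Under this assumed sample size, the first comparison falls below 0.01 and the second stays above 0.05, which matches the pattern of markers in the table (`**` on the former, none on the latter).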