Table 7: Human evaluation on 1600 generated FaithDial responses (200 × 8) from different models on the test data. * and ** indicate that a result is significantly different from the best result in that column (bolded), with p-value < 0.05 and < 0.01, respectively. ‘Coop.’, ‘Abst.’, and ‘Enga.’ denote cooperativeness, abstractiveness, and engagingness, respectively.

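| | Models | Interpretable | Hallucination | Faithfulness: Coop. | Faithfulness: Abst. | Faithfulness: Enga. | Generic |
|---|---|---|---|---|---|---|---|
| WoW | T5 | 93.2% | 55.8%** | 2.97* | 1.95* | 1.72* | 2.2% |
| | T5-CTRL | 95.2% | 44.2%* | 1.97* | 0.92* | 1.33* | 0.9% |
| | T5-LossTruncation | 94.3% | 42.5%** | 2.87* | 1.87* | 1.83* | 1.2% |
| FaithDial | T5 | 94.4% | 23.2%* | 3.63 | 2.43* | 2.33 | 1.4% |
| | T5-WoW | 95.2% | 20.9%* | 3.59 | 2.44 | 2.37 | 1.0% |
| | T5-CTRL | 96.7% | 20.8%* | 2.55* | 1.42* | 2.10* | 1.0% |
| | T5-LossTruncation | 94.2% | 24.2%* | 3.59 | 2.42* | 2.03* | 0.9% |
| | T5-InfoNCE | 97.2% | 19.9% | 3.79 | 2.92 | 2.60 | 0.9% |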